PostgreSQL

mirror of https://github.com/postgres/postgres.git synced 2025-06-02 00:01:40 -04:00

Author	SHA1	Message	Date
Robert Haas	3ae5133b1c	Teach SLRU code to avoid replacing I/O-busy pages. Patch by me; review by Tom Lane and others.	2012-04-08 23:05:55 -04:00
Robert Haas	b736aef2ec	Publish checkpoint timing information to pg_stat_bgwriter. Greg Smith, Peter Geoghegan, and Robert Haas	2012-04-05 14:04:37 -04:00
Simon Riggs	68219aaf6b	Correct epoch of txid_current() when executed on a Hot Standby server. Initialise ckptXidEpoch from starting checkpoint and maintain the correct value as we roll forwards. This allows GetNextXidAndEpoch() to return the correct epoch when executed during recovery. Backpatch to 9.0 when the problem is first observable by a user. Bug report from Daniel Farina	2012-03-29 14:55:30 +01:00
Peter Eisentraut	e684ab5e1e	Add additional safety check against invalid backup label file It was already checking for invalid data after "BACKUP FROM", but would possibly crash if "BACKUP FROM" was missing altogether. found by Coverity	2012-03-14 22:41:50 +02:00
Heikki Linnakangas	d93f209f48	Silence warning about unused variable, when building without assertions.	2012-03-08 11:10:02 +02:00
Robert Haas	bc97c38115	Typo fix. Fujii Masao	2012-03-06 08:23:51 -05:00
Heikki Linnakangas	e587e2e3e3	Make the comments more clear on the fact that UpdateFullPageWrites() is not safe to call concurrently from multiple processes.	2012-03-06 10:45:58 +02:00
Heikki Linnakangas	7714c63829	Remove extra copies of LogwrtResult. This simplifies the code a little bit. The new rule is that to update XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas before we had two copies, one that was protected by WALWriteLock and another protected by info_lck. The code that updates them was already holding both locks, so merging the two is trivial. The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it was used in AdvanceXLInsertBuffer to update the backend-local copy, before acquiring the info_lck to read the up-to-date value. But the value of that seems dubious; at best it's saving one spinlock acquisition per completed WAL page, which is not significant compared to all the other work involved. And in practice, it's probably not saving even that much.	2012-03-06 10:18:33 +02:00
Heikki Linnakangas	3b682df326	Simplify the way changes to full_page_writes are logged. It's harmless to do full page writes even when not strictly necessary, so when turning full_page_writes on, we can set the global flag first, and then call XLogInsert. Likewise, when turning it off, we can write the WAL record first, and then clear the flag. This way XLogInsert doesn't need any special handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated enough already, so anything we can keep away from there is a good thing. Actually I don't think the atomicity of the shared memory flag matters, anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery, when there are no concurrent WAL insertions going on. But might as well make it safe, in case we allow changing full_page_writes on the fly in the future.	2012-03-06 09:48:30 +02:00
Magnus Hagander	141b89826d	More carefully validate xlog location string inputs Now that we have validate_xlog_location, call it from the previously existing functions taking xlog locatoins as a string input. Suggested by Fujii Masao	2012-03-04 12:25:47 +01:00
Magnus Hagander	bc5ac36865	Add function pg_xlog_location_diff to help comparisons Comparing two xlog locations are useful for example when calculating replication lag. Euler Taveira de Oliveira, reviewed by Fujii Masao, and some cleanups from me	2012-03-04 12:22:38 +01:00
Heikki Linnakangas	1a01560cbb	Rename LWLockWaitUntilFree to LWLockAcquireOrWait. LWLockAcquireOrWait makes it more clear that the lock is acquired if it's free.	2012-02-08 09:17:13 +02:00
Tom Lane	c6d76d7c82	Add locking around WAL-replay modification of shared-memory variables. Originally, most of this code assumed that no Postgres backends could be running concurrently with it, and so no locking could be needed. That assumption fails in Hot Standby. While it's still true that Hot Standby backends should never change values like nextXid, they can examine them, and consistency is important in some cases such as when computing a snapshot. Therefore, prudence requires that WAL replay code obtain the relevant locks when modifying such variables, even though it can examine them without taking a lock. We were following that coding rule in some places but not all. This commit applies the coding rule uniformly to all updates of ShmemVariableCache and MultiXactState fields; a search of the replay routines did not find any other cases that seemed to be at risk. In addition, this commit fixes a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway. The additional locking seems to be more in the nature of future-proofing than fixing any live bug, so I am not going to back-patch it. The NEXTOID fix will be back-patched separately.	2012-02-06 12:34:10 -05:00
Tom Lane	17118825b8	Fix transient clobbering of shared buffers during WAL replay. RestoreBkpBlocks was in the habit of zeroing and refilling the target buffer; which was perfectly safe when the code was written, but is unsafe during Hot Standby operation. The reason is that we have coding rules that allow backends to continue accessing a tuple in a heap relation while holding only a pin on its buffer. Such a backend could see transiently zeroed data, if WAL replay had occasion to change other data on the page. This has been shown to be the cause of bug #6425 from Duncan Rance (who deserves kudos for developing a sufficiently-reproducible test case) as well as Bridget Frey's re-report of bug #6200. It most likely explains the original report as well, though we don't yet have confirmation of that. To fix, change the code so that only bytes that are supposed to change will change, even transiently. This actually saves cycles in RestoreBkpBlocks, since it's not writing the same bytes twice. Also fix seq_redo, which has the same disease, though it has to work a bit harder to meet the requirement. So far as I can tell, no other WAL replay routines have this type of bug. In particular, the index-related replay routines, which would certainly be broken if they had to meet the same standard, are not at risk because we do not have coding rules that allow access to an index page when not holding a buffer lock on it. Back-patch to 9.0 where Hot Standby was added.	2012-02-05 15:49:17 -05:00
Heikki Linnakangas	9b38d46d9f	Make group commit more effective. When a backend needs to flush the WAL, and someone else is already flushing the WAL, wait until it releases the WALInsertLock and check if we still need to do the flush or if the other backend already did the work for us, before acquiring WALInsertLock. This helps group commit, because when the WAL flush finishes, all the backends that were waiting for it can be woken up in one go, and the can all concurrently observe that they're done, rather than waking them up one by one in a cascading fashion. This is based on a new LWLock function, LWLockWaitUntilFree(), which has peculiar semantics. If the lock is immediately free, it grabs the lock and returns true. If it's not free, it waits until it is released, but then returns false without grabbing the lock. This is used in XLogFlush(), so that when the lock is acquired, the backend flushes the WAL, but if it's not, the backend first checks the current flush location before retrying. Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although this patch as committed ended up being very different from that.	2012-01-30 16:53:48 +02:00
Tom Lane	ad10853b30	Assorted comment fixes, mostly just typos, but some obsolete statements. YAMAMOTO Takashi	2012-01-29 19:23:56 -05:00
Simon Riggs	8366c7803e	Allow pg_basebackup from standby node with safety checking. Base backup follows recommended procedure, plus goes to great lengths to ensure that partial page writes are avoided. Jun Ishizuka and Fujii Masao, with minor modifications	2012-01-25 18:02:04 +00:00
Simon Riggs	5530623d03	Correctly initialise shared recoveryLastRecPtr in recovery. Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao	2012-01-13 13:02:44 +00:00
Heikki Linnakangas	1b9dea04b5	Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always passed as 'true'.	2012-01-11 11:01:47 +02:00
Heikki Linnakangas	9c808f89c2	Refactor XLogInsert a bit. The rdata entries for backup blocks are now constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way.	2012-01-11 11:01:47 +02:00
Robert Haas	33aaa139e6	Make the number of CLOG buffers adaptive, based on shared_buffers. Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available. Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.	2012-01-06 14:32:18 -05:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Simon Riggs	64233902d2	Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.	2011-12-31 13:30:26 +00:00
Tom Lane	d0024cd188	Avoid crashing when we have problems unlinking files post-commit. smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit 3396000684b41e7e9467d1abc67152b39e697035, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.	2011-12-20 15:00:36 -05:00
Tom Lane	dd45d3ad33	Fix some long-obsolete references to XLogOpenRelation. These were missed in commit a213f1ee6c5a1bbe1f074ca201975e76ad2ed50c, which removed that function.	2011-12-17 18:26:52 -05:00
Tom Lane	8daeb5ddd6	Add SP-GiST (space-partitioned GiST) index access method. SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable. Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane	2011-12-17 16:42:30 -05:00
Tom Lane	2dd9322ba6	Move BKP_REMOVABLE bit from individual WAL records to WAL page headers. Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.	2011-12-12 16:22:14 -05:00
Heikki Linnakangas	9f0d2bdc88	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery, we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.	2011-12-09 15:21:12 +02:00
Heikki Linnakangas	1e616f6391	During recovery, if we reach consistent state and still have entries in the invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao	2011-12-02 10:49:54 +02:00
Robert Haas	ed0b409d22	Move "hot" members of PGPROC into a separate PGXACT array. This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas	2011-11-25 08:02:10 -05:00
Simon Riggs	4de82f7d7c	Wakeup WALWriter as needed for asynchronous commit performance. Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.	2011-11-13 09:00:57 +00:00
Simon Riggs	a030bfa6e4	Move user functions related to WAL into xlogfuncs.c	2011-11-04 09:37:17 +00:00
Simon Riggs	750f70b0fe	Update more comments about checkpoints being done by bgwriter	2011-11-02 17:15:35 +00:00
Simon Riggs	18fb9d8d21	Reduce checkpoints and WAL traffic on low activity database server Previously, we skipped a checkpoint if no WAL had been written since last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written at least one enough WAL to switch the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.	2011-11-02 15:26:33 +00:00
Simon Riggs	9aceb6ab3c	Refactor xlog.c to create src/backend/postmaster/startup.c Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c	2011-11-02 14:25:01 +00:00
Simon Riggs	86e3364899	Derive oldestActiveXid at correct time for Hot Standby. There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop	2011-11-02 08:54:56 +00:00
Simon Riggs	f8409b39d1	Fix timing of Startup CLOG and MultiXact during Hot Standby Patch by me, bug report by Chris Redekop, analysis by Florian Pflug	2011-11-02 08:07:44 +00:00
Simon Riggs	f3ebaad45b	Comment changes to show bgwriter no longer performs checkpoints.	2011-11-01 18:48:47 +00:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are not security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	aa90e148ca	Suppress -Wunused-result warnings about write() and fwrite(). This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.	2011-10-18 21:37:51 -04:00
Tom Lane	fa56a0c3e0	Fix uninitialized-variable bug.	2011-10-04 17:08:18 -04:00
Alvaro Herrera	09e196e453	Use callbacks in SlruScanDirectory for the actual action Previously, the code assumed that the only possible action to take was to delete files behind a certain cutoff point. The async notify code was already a crock: it used a different "pagePrecedes" function for truncation than for regular operation. By allowing it to pass a callback to SlruScanDirectory it can do cleanly exactly what it needs to do. The clog.c code also had its own use for SlruScanDirectory, which is made a bit simpler with this.	2011-10-04 14:03:23 -03:00
Tom Lane	d56b3afc03	Restructure error handling in reading of postgresql.conf. This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf. Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so. The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't. In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting. Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane	2011-10-02 16:50:04 -04:00
Tom Lane	57eb009092	Allow snapshot references to still work during transaction abort. In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi. While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too. I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.	2011-09-26 22:25:28 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Simon Riggs	df383b03e6	Partially revoke attempt to improve performance with many savepoints. Maintain difference between subtransaction release and commit introduced by earlier patch.	2011-09-07 12:11:26 +01:00
Alvaro Herrera	56a9ed92b6	Adjust translator comment format to xgettext expectations	2011-09-05 19:04:30 -03:00
Alvaro Herrera	b64f18c583	Mark some untranslatable messages with errmsg_internal	2011-09-05 17:48:07 -03:00
Tom Lane	1609797c25	Clean up the #include mess a little. walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again. This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.	2011-09-04 01:13:16 -04:00
Peter Eisentraut	f1e4f3d44f	Whitespace adjustment for consistency in the file	2011-09-03 01:28:05 +03:00

... 2 3 4 5 6 ...

1083 Commits