This reverts commit c11130690d6dca64267201a169cfb38c1adec5ef in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.
This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)
Diagnosed by me and independently by Andres Freund. Back-patch to all
branches supporting hot standby.
If wal_level = hot_standby we update the checkpoint nextxid,
though in the case where a wraparound occurred half-way through
a checkpoint we would neglect updating the epoch also. Updating
the nextxid is arguably the wrong thing to do, but changing that
may introduce subtle bugs into hot standby startup, while updating
the value doesn't cause any known bugs yet. Minimal fix now to
HEAD and backbranches, wider fix later in HEAD.
Bug reported in #6291 by Daniel Farina and slightly differently in
Cause analysis and recommended fixes from Tom Lane and Andres Freund.
Applied patch is minimal version of Andres Freund's work.
This was apparently a typo, which caused recovery to think that it
immediately reached the end of backup, and allowed the database to start
up too early.
Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.
When startup process opens a WAL segment after replaying part of it, it
validates the first page on the WAL segment, even though the page it's
really interested in later in the file. As part of the validation, it checks
that the TLI on the page header is >= the TLI it saw on the last page it
read. If the segment contains a timeline switch, and we have already
replayed it, and then re-open the WAL segment (because of streaming
replication got disconnected and reconnected, for example), the TLI check
will fail when the first page is validated. Fix that by relaxing the TLI
check when re-opening a WAL segment.
Backpatch to 9.0. Earlier versions had the same code, but before standby
mode was introduced in 9.0, recovery never tried to re-read a segment after
partially replaying it.
Reported by Amit Kapila, while testing a new feature.
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
When the startup process restores a WAL file from the archive, it deletes
any old file with the same name and renames the new file in its place. On
Windows, however, when a file is deleted, it still lingers as long as a
process holds a file handle open on it. With cascading replication, a
walsender process can hold the old file open, so the rename() in the startup
process would fail. To fix that, rename the old file to a temporary name, to
make the original file name available for reuse, before deleting the old
file.
Give the correct name of the GUC parameter being complained of.
Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE,
not the default INTERNAL_ERROR).
Gurjeet Singh, errcode adjustment by me
The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.
This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.
This prevents spurious attempts to archive xlog files after promotion of
standby, a bug introduced by cascading replication patch in 9.2.
Fujii Masao, simplified and extended to cover streaming by Simon Riggs
Cascading replication copied the incoming file into pg_xlog but
didn't set path correctly, so the first attempt to open file failed
causing it to loop around and look for file in pg_xlog. So the
earlier coding worked, but accidentally rather than by design.
Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs
mdinit() was misusing IsBootstrapProcessingMode() to decide whether to
create an fsync pending-operations table in the current process. This led
to creating a table not only in the startup and checkpointer processes as
intended, but also in the bgwriter process, not to mention other auxiliary
processes such as walwriter and walreceiver. Creation of the table in the
bgwriter is fatal, because it absorbs fsync requests that should have gone
to the checkpointer; instead they just sit in bgwriter local memory and are
never acted on. So writes performed by the bgwriter were not being fsync'd
which could result in data loss after an OS crash. I think there is no
live bug with respect to walwriter and walreceiver because those never
perform any writes of shared buffers; but the potential is there for
future breakage in those processes too.
To fix, make AuxiliaryProcessMain() export the current process's
AuxProcType as a global variable, and then make mdinit() test directly for
the types of aux process that should have a pendingOpsTable. Having done
that, we might as well also get rid of the random bool flags such as
am_walreceiver that some of the aux processes had grown. (Note that we
could not have fixed the bug by examining those variables in mdinit(),
because it's called from BaseInit() which is run by AuxiliaryProcessMain()
before entering any of the process-type-specific code.)
Back-patch to 9.2, where the problem was introduced by the split-up of
bgwriter and checkpointer processes. The bogus pendingOpsTable exists
in walwriter and walreceiver processes in earlier branches, but absent
any evidence that it causes actual problems there, I'll leave the older
branches alone.
This bug was introduced by commit 20d98ab6e4110087d1816cd105a40fcc8ce0a307,
so backpatch this to 9.0-9.2 like that one.
This fixes bug #6710, reported by Tarvi Pillessaar
This reverts commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724. Per
discussion, it does not seem like a good idea to allow committed changes to
go un-checkpointed indefinitely, as could happen in a low-traffic server;
that makes us entirely reliant on the WAL stream with no redundancy that
might aid data recovery in case of disk failure.
This re-introduces the original problem of hot-standby setups generating a
small continuing stream of WAL traffic even when idle, but there are other
ways to address that without compromising crash recovery, so we'll revisit
that issue in a future release cycle.
WALSender now woken up after each background flush by WALwriter, avoiding
multi-second replication delay for an all-async commit workload.
Replication delay reduced from 7s with default settings to 200ms and often
much less, allowing significantly reduced data loss at failover.
Andres Freund and Simon Riggs
Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk. The
original version of the walwriter hibernation patch broke that. Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption. It reverts to
normal as soon as it's done any successful flushes. It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.
Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for
reducing the server's power consumption when idle.
In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch. Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.
Peter Geoghegan, somewhat simplified by Tom
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds. By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)
The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time
The first four of these are new in 9.2, so there is no compatibility issue
from changing them. The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
Initialise ckptXidEpoch from starting checkpoint and maintain the correct
value as we roll forwards. This allows GetNextXidAndEpoch() to return the
correct epoch when executed during recovery. Backpatch to 9.0 when the
problem is first observable by a user.
Bug report from Daniel Farina
This simplifies the code a little bit. The new rule is that to update
XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas
before we had two copies, one that was protected by WALWriteLock and another
protected by info_lck. The code that updates them was already holding both
locks, so merging the two is trivial.
The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it
was used in AdvanceXLInsertBuffer to update the backend-local copy, before
acquiring the info_lck to read the up-to-date value. But the value of that
seems dubious; at best it's saving one spinlock acquisition per completed
WAL page, which is not significant compared to all the other work involved.
And in practice, it's probably not saving even that much.
It's harmless to do full page writes even when not strictly necessary, so
when turning full_page_writes on, we can set the global flag first, and then
call XLogInsert. Likewise, when turning it off, we can write the WAL record
first, and then clear the flag. This way XLogInsert doesn't need any special
handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated
enough already, so anything we can keep away from there is a good thing.
Actually I don't think the atomicity of the shared memory flag matters,
anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery,
when there are no concurrent WAL insertions going on. But might as well make
it safe, in case we allow changing full_page_writes on the fly in the
future.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed. That
assumption fails in Hot Standby. While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot. Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock. We were following that coding rule in some
places but not all. This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.
In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.
The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it. The NEXTOID
fix will be back-patched separately.
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation. The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer. Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200. It most likely explains the
original report as well, though we don't yet have confirmation of that.
To fix, change the code so that only bytes that are supposed to change will
change, even transiently. This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.
Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.
So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.
Back-patch to 9.0 where Hot Standby was added.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.
This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.
Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.
Jun Ishizuka and Fujii Masao, with minor modifications
Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.
Fujii Masao
constructed before acquiring WALInsertLock, which slightly reduces the time
the lock is held. Although I could not measure any benefit in benchmarks,
the code is more readable this way.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.
When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks. The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit. In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.
Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.
In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.
Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.
Fujii Masao
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
Previously, we skipped a checkpoint if no WAL had been written since
last checkpoint, though this does not appear in user documentation.
As of now, we skip a checkpoint until we have written at least one
enough WAL to switch the next WAL file. This greatly reduces the
level of activity and number of WAL messages generated by a very
low activity server. This is safe because the purpose of a checkpoint
is to act as a starting place for a recovery, in case of crash.
This patch maintains minimal WAL volume for replay in case of crash,
thus maintaining very low crash recovery time.