RelationAllowsEarlyPruning() performed a catalog scan, but is used
in two contexts where that was a bad idea:
1. In heap_page_prune_opt(), which runs very frequently in some large
scans. This caused major performance problems in a field report
that was easy to reproduce.
2. In TestForOldSnapshot(), which runs while we hold a buffer content
lock. It's not clear if this was guaranteed to be free of buffer
deadlock risk.
The check was introduced in commit 2cc41acd8 and defended against a
real problem: 9.6's hash indexes have no page LSN and so we can't
allow early pruning (ie the snapshot-too-old feature). We can remove
the check from all later releases though: hash indexes are now logged,
and there is no way to create UNLOGGED indexes on regular logged
tables.
If a future release allows such a combination, it might need to put
a similar check in place, but it'll need some more thought.
Back-patch to 10.
Author: Thomas Munro
Reviewed-by: Tom Lane, who spotted the second problem
Discussion: https://postgr.es/m/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1%2BWy%2BN4eE5zPm765h68LrkWc3Biu_8rzzi%2BOYX4j%2BiHRw%40mail.gmail.com
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. B-tree has the same problem, and has had since time immemorial,
but let's at least fix this in GiST, where this is new.
Backpatch to v12, where GiST page deletion was introduced.
Reviewed-by: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/835A15A5-F1B4-4446-A711-BF48357EB602%40yandex-team.ru
The code in tqual.c is largely heap specific. Due to the upcoming
pluggable storage work, it therefore makes sense to move it into
access/heap/ (as the file's header notes, the tqual name isn't very
good).
But the various statically allocated snapshot and snapshot
initialization functions are now (see previous commit) generic and do
not depend on functions declared in tqual.h anymore. Therefore move.
Also move XidInMVCCSnapshot as that's useful for future AMs, and
already used outside of tqual.c.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
For normal commits and aborts we already reset PgXact->xmin,
so we can simply avoid running SnapshotResetXmin() twice.
During performance tests by Alexander Korotkov, diagnosis
by Andres Freund showed PgXact array as a bottleneck. After
manual analysis by me of the code paths that touch those
memory locations, I was able to identify extraneous code
in the main transaction commit path.
Avoiding touching highly contented shmem improves concurrent
performance slightly on all workloads, confirmed by tests
run by Ashutosh Sharma and Alexander Korotkov.
Simon Riggs
Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
Per buildfarm. Maybe some of the other xmin variables in snapmgr.h
ought to get this too, but for the moment I'm just interested in
un-breaking the buildfarm.
Twiddle the replication-related code so that its timestamp variables
are declared TimestampTz, rather than the uninformative "int64" that
was previously used for meant-to-be-always-integer timestamps.
This resolves the int64-vs-TimestampTz declaration inconsistencies
introduced by commit 7c030783a, though in the opposite direction to
what was originally suggested.
This required including datatype/timestamp.h in a couple more places
than before. I decided it would be a good idea to slim down that
header by not having it pull in <float.h> etc, as those headers are
no longer at all relevant to its purpose. Unsurprisingly, a small number
of .c files turn out to have been depending on those inclusions, so add
them back in the .c files as needed.
Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
Discussion: https://postgr.es/m/27694.1487456324@sss.pgh.pa.us
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h. This avoids
having to manually maintain these prototypes across a random variety of
header files. It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning. Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.
Reported by Andres Freund, who also sketched out what the fix should
look like. Patch by me, reviewed by Amit Kapila.
Since indexes are created without valid LSNs, an index created
while a snapshot older than old_snapshot_threshold existed could
cause queries to return incorrect results when those old snapshots
were used, if any relevant rows had been subject to early pruning
before the index was built. Prevent usage of a newly created index
until all such snapshots are released, for relations where this can
happen.
Questions about the interaction of "snapshot too old" with index
creation were initially raised by Andres Freund.
Reviewed by Robert Haas.
Without a few entries beyond old_snapshot_threshold, the lookup
would often fail, resulting in the more aggressive pruning or
vacuum being skipped often enough to matter. This was very clearly
shown by a python test script posted by Ants Aasma, and was likely
a factor in an earlier but somewhat less clear-cut test case posted
by Jeff Janes.
This patch makes no change to the logic, per se -- it just makes
the array of mapping entries big enough to make lookup misses based
on timing much less likely. An occasional miss is still possible
if a thread stalls for more than 10 minutes, but that does not
create any problem with correctness of behavior. Besides, if
things are so busy that a thread is stalling for more than 10
minutes, it is probably OK to skip the more aggressive cleanup at
that particular point in time.
Revert commit 7cb1db1d9599f0a09d6920d2149d956ef6d88b0e, which represented
a misunderstanding of the problem (if snapmgr.h weren't already included
in bufmgr.h, things wouldn't compile anywhere). Instead install what
I think is the real fix.
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.
Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT. The data does not
leave the server so there are not security issues. A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.
I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.
Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction. This is the root cause of bug #6218 from Yamamoto
Takashi. While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources. However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner. Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward. predicate.c's dependencies on TopTransactionResourceOwner
go away too.
I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
With this patch, portals, SQL functions, and SPI all agree that there
should be only a CommandCounterIncrement between the queries that are
generated from a single SQL command by rule expansion. Fetching a whole
new snapshot now happens only between original queries. This is equivalent
to the existing behavior of EXPLAIN ANALYZE, and it was judged to be the
best choice since it eliminates one source of concurrency hazards for
rules. The patch should also make things marginally faster by reducing the
number of snapshot push/pop operations.
The patch removes pg_parse_and_rewrite(), which is no longer used anywhere.
There was considerable discussion about more aggressive refactoring of the
query-processing functions exported by postgres.c, but for the moment
nothing more has been done there.
I also took the opportunity to refactor snapmgr.c's API slightly: the
former PushUpdatedSnapshot() has been split into two functions.
Marko Tiikkaja, reviewed by Steve Singer and Tom Lane
Partially revert the previous patch I installed and replace it with a more
general fix: any time a snapshot is pushed as Active, we need to ensure that it
will not be modified in the future. This means that if the same snapshot is
used as CurrentSnapshot, it needs to be copied separately. This affects
serializable transactions only, because CurrentSnapshot has already been copied
by RegisterSnapshot and so PushActiveSnapshot does not think it needs another
copy. However, CommandCounterIncrement would modify CurrentSnapshot, whereas
ActiveSnapshots must not have their command counters incremented.
I say "partially" because the regression test I added for the previous bug
has been kept.
(This restores 8.3 behavior, because before snapmgr.c existed, any snapshot set
as Active was copied.)
Per bug report from Stuart Bishop in
6bc73d4c0910042358k3d1adff3qa36f8df75198ecea@mail.gmail.com
The old coding was using a regular snapshot, referenced elsewhere, that was
subject to having its command counter updated. Fix by creating a private copy
of the snapshot exclusively for the cursor.
Backpatch to 8.4, which is when the bug was introduced during the snapshot
management rewrite.
non-writable large objects need to have their snapshots registered on the
transaction resowner, not the current portal's, because it must persist until
the large object is closed (which the portal does not). Also, ensure that the
serializable snapshot is recorded by the transaction resource owner too, even
when a subtransaction has changed the current resource owner before
serializable is taken.
Per bug reports from Pavan Deolasee.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment. This is all done automatically now.
As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.