Split ProcSleep into two functions: JoinWaitQueue and ProcSleep.
JoinWaitQueue is called while holding the partition lock, and inserts
the current process to the wait queue, while ProcSleep() does the
actual sleeping. ProcSleep() is now called without holding the
partition lock, and it no longer re-acquires the partition lock before
returning. That makes the wakeup a little cheaper. Once upon a time,
re-acquiring the partition lock was needed to prevent a signal handler
from longjmping out at a bad time, but these days our signal handlers
just set flags, and longjmping can only happen at points where we
explicitly run CHECK_FOR_INTERRUPTS().
If JoinWaitQueue detects an "early deadlock" before even joining the
wait queue, it returns without changing the shared lock entry, leaving
the cleanup of the shared lock entry to the caller. This makes the
handling of an early deadlock the same as the dontWait=true case.
One small user-visible side-effect of this refactoring is that we now
only set the 'ps' title to say "waiting" when we actually enter the
sleep, not when the lock is skipped because dontWait=true, or when a
deadlock is detected early before entering the sleep.
This eliminates the 'lockAwaited' global variable in proc.c, which was
largely redundant with 'awaitedLock' in lock.c
Note: Updating the local lock table is now the caller's responsibility.
JoinWaitQueue and ProcSleep are now only responsible for modifying the
shared state. Seems a little nicer that way.
Based on Thomas Munro's earlier patch and observation that ProcSleep
doesn't really need to re-acquire the partition lock.
Reviewed-by: Maxim Orlov
Discussion: https://www.postgresql.org/message-id/7c2090cd-a72a-4e34-afaa-6dd2ef31440e@iki.fi
Replace the fixed-size array of fast-path locks with arrays, sized on
startup based on max_locks_per_transaction. This allows using fast-path
locking for workloads that need more locks.
The fast-path locking introduced in 9.2 allowed each backend to acquire
a small number (16) of weak relation locks cheaply. If a backend needs
to hold more locks, it has to insert them into the shared lock table.
This is considerably more expensive, and may be subject to contention
(especially on many-core systems).
The limit of 16 fast-path locks was always rather low, because we have
to lock all relations - not just tables, but also indexes, views, etc.
For planning we need to lock all relations that might be used in the
plan, not just those that actually get used in the final plan. So even
with rather simple queries and schemas, we often need significantly more
than 16 locks.
As partitioning gets used more widely, and the number of partitions
increases, this limit is trivial to hit. Complex queries may easily use
hundreds or even thousands of locks. For workloads doing a lot of I/O
this is not noticeable, but for workloads accessing only data in RAM,
the access to the shared lock table may be a serious issue.
This commit removes the hard-coded limit of the number of fast-path
locks. Instead, the size of the fast-path arrays is calculated at
startup, and can be set much higher than the original 16-lock limit.
The overall fast-path locking protocol remains unchanged.
The variable-sized fast-path arrays can no longer be part of PGPROC, but
are allocated as a separate chunk of shared memory and then references
from the PGPROC entries.
The fast-path slots are organized as a 16-way set associative cache. You
can imagine it as a hash table of 16-slot "groups". Each relation is
mapped to exactly one group using hash(relid), and the group is then
processed using linear search, just like the original fast-path cache.
With only 16 entries this is cheap, with good locality.
Treating this as a simple hash table with open addressing would not be
efficient, especially once the hash table gets almost full. The usual
remedy is to grow the table, but we can't do that here easily. The
access would also be more random, with worse locality.
The fast-path arrays are sized using the max_locks_per_transaction GUC.
We try to have enough capacity for the number of locks specified in the
GUC, using the traditional 2^n formula, with an upper limit of 1024 lock
groups (i.e. 16k locks). The default value of max_locks_per_transaction
is 64, which means those instances will have 64 fast-path slots.
The main purpose of the max_locks_per_transaction GUC is to size the
shared lock table. It is often set to the "average" number of locks
needed by backends, with some backends using significantly more locks.
This should not be a major issue, however. Some backens may have to
insert locks into the shared lock table, but there can't be too many of
them, limiting the contention.
The only solution is to increase the GUC, even if the shared lock table
already has sufficient capacity. That is not free, especially in terms
of memory usage (the shared lock table entries are fairly large). It
should only happen on machines with plenty of memory, though.
In the future we may consider a separate GUC for the number of fast-path
slots, but let's try without one first.
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
According to the commit message in 8ec569479, we must have all variables
in header files marked with PGDLLIMPORT. In commit d3cc5ffe81f6 some
variables were moved from launch_backend.c file to several header files.
This adds PGDLLIMPORT to moved variables.
Author: Sofia Kopikova <s.kopikova@postgrespro.ru>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/e0b17014-5319-4dd6-91cd-93d9c8fc9539%40postgrespro.ru
This fixes warnings from -Wmissing-variable-declarations (not yet part
of the standard warning options) under EXEC_BACKEND. The
NON_EXEC_STATIC variables need a suitable declaration in a header file
under EXEC_BACKEND.
Also fix the inconsistent application of the volatile qualifier for
PMSignalState, which was revealed by this change.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
Since commit 5764f611e1, we've been using the ilist.h functions for
handling the linked list. There's no need for 'links' to be the first
element of the struct anymore, except for one call in InitProcess
where we used a straight cast from the 'dlist_node *' to PGPROC *,
without the dlist_container() macro. That was just an oversight in
commit 5764f611e1, fix it.
There no imminent need to move 'links' from being the first field, but
let's be tidy.
Reviewed-by: Aleksander Alekseev, Andres Freund
Discussion: https://www.postgresql.org/message-id/22aa749e-cc1a-424a-b455-21325473a794@iki.fi
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
We don't determine the position at which a process waiting for a lock
should insert itself into the wait queue until we reach ProcSleep(),
and we may at that point discover that we must insert ourselves ahead
of everyone who wants a conflicting lock, in which case we obtain the
lock immediately. Up until now, a no-wait lock acquisition would fail
in such cases, erroneously claiming that the lock couldn't be obtained
immediately. Fix that by trying ProcSleep even in the no-wait case.
No back-patch for now, because I'm treating this as an improvement to
the existing no-wait feature. It could instead be argued that it's a
bug fix, on the theory that there should never be any case whatsoever
where no-wait fails to obtain a lock that would have been obtained
immediately without no-wait, but I'm reluctant to interpret the
semantics of no-wait that strictly.
Robert Haas and Jingxian Li
Discussion: http://postgr.es/m/CA+TgmobCH-kMXGVpb0BB-iNMdtcNkTvcZ4JBxDJows3kYM+GDg@mail.gmail.com
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.
The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.
One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
Previously, backend ID was an index into the ProcState array, in the
shared cache invalidation manager (sinvaladt.c). The entry in the
ProcState array was reserved at backend startup by scanning the array
for a free entry, and that was also when the backend got its backend
ID. Things become slightly simpler if we redefine backend ID to be the
index into the PGPROC array, and directly use it also as an index to
the ProcState array. This uses a little more memory, as we reserve a
few extra slots in the ProcState array for aux processes that don't
need them, but the simplicity is worth it.
Aux processes now also have a backend ID. This simplifies the
reservation of BackendStatusArray and ProcSignal slots.
You can now convert a backend ID into an index into the PGPROC array
simply by subtracting 1. We still use 0-based "pgprocnos" in various
places, for indexes into the PGPROC array, but the only difference now
is that backend IDs start at 1 while pgprocnos start at 0. (The next
commmit will get rid of the term "backend ID" altogether and make
everything 0-based.)
There is still a 'backendId' field in PGPROC, now part of 'vxid' which
encapsulates the backend ID and local transaction ID together. It's
needed for prepared xacts. For regular backends, the backendId is
always equal to pgprocno + 1, but for prepared xact PGPROC entries,
it's the ID of the original backend that processed the transaction.
Reviewed-by: Andres Freund, Reid Thompson
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
This commit adds timeout that is expected to be used as a prevention
of long-running queries. Any session within the transaction will be
terminated after spanning longer than this timeout.
However, this timeout is not applied to prepared transactions.
Only transactions with user connections are affected.
Discussion: https://postgr.es/m/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com
Author: Andrey Borodin <amborodin@acm.org>
Author: Japin Li <japinli@hotmail.com>
Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Nikolay Samokhvalov <samokhvalov@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: bt23nguyent <bt23nguyent@oss.nttdata.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again.
In other words, it tells us which blocks need to copied in case of an
incremental backup covering that range of WAL records. But this
doesn't yet add the capability to actually perform an incremental
backup; the next patch will do that.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar and Andres Freund.
Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter
Eisentraut, and Álvaro Herrera.
Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
We've had repeated bugs in the area of handling SLRU wraparound in the past,
some of which have caused data loss. Switching to an indexing system for SLRUs
that does not wrap around should allow us to get rid of a whole bunch
of problems and improve the overall reliability of the system.
This particular patch however only changes the indexing and doesn't address
the wraparound per se. This is going to be done in the following patches.
Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev
Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov
Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov
Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund
Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev
Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com
Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
This provides a way to reserve connection slots for non-superusers.
The slots reserved via the new GUC are available only to users who
have the new predefined role pg_use_reserved_connections.
superuser_reserved_connections remains as a final reserve in case
reserved_connections has been exhausted.
Patch by Nathan Bossart. Reviewed by Tushar Ahuja and by me.
Discussion: http://postgr.es/m/20230119194601.GA4105788@nathanxps13
Until now LWLockDequeueSelf() sequentially searched the list of waiters to see
if the current proc is still is on the list of waiters, or has already been
removed. In extreme workloads, where the wait lists are very long, this leads
to a quadratic behavior. #backends iterating over a list #backends
long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in
the first place also increases with the increased length of the wait queue, as
it becomes more likely that a lock is released while waiting for the wait list
lock, which is held for longer during lock release.
Due to the exponential back-off in perform_spin_delay() this is surprisingly
hard to detect. We should make that easier, e.g. by adding a wait event around
the pg_usleep() - but that's a separate patch.
The fix is simple - track whether a proc is currently waiting in the wait list
or already removed but waiting to be woken up in PGPROC->lwWaiting.
In some workloads with a lot of clients contending for a small number of
lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput.
As the quadratic behavior arguably is a bug, we might want to decide to
backpatch this fix in the future.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
This is a slight, convenient semantics change from what commit
0f0cfb494004 ("Fix parallel operations that prevent oldest xmin from
advancing") introduced that lets us simplify the coding in the one place
where it is used.
Backpatch to 13. This is related to commit 6fea65508a1a ("Tighten
ComputeXidHorizons' handling of walsenders") rewriting the code site
where this is used, which has not yet been backpatched, but it may well
be in the future.
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/202204191637.eldwa2exvguw@alvherre.pgsql
ComputeXidHorizons (nee GetOldestXmin) thought that it could identify
walsenders by checking for proc->databaseId == 0. Perhaps that was
safe when the code was written, but it's been wrong at least since
autovacuum was invented. Background processes that aren't connected
to any particular database, such as the autovacuum launcher and
logical replication launcher, look like that too.
This imprecision is harmful because when such a process advertises an
xmin, the result is to hold back dead-tuple cleanup in all databases,
though it'd be sufficient to hold it back in shared catalogs (which
are the only relations such a process can access). Aside from being
generally inefficient, this has recently been seen to cause regression
test failures in the buildfarm, as a consequence of the logical
replication launcher's startup transaction preventing VACUUM from
marking pages of a user table as all-visible.
We only want that global hold-back effect for the case where a
walsender is advertising a hot standby feedback xmin. Therefore,
invent a new PGPROC flag that says that a process' xmin should be
considered globally, and check that instead of using the incorrect
databaseId == 0 test. Currently only a walsender sets that flag,
and only if it is not connected to any particular database. (This is
for bug-compatibility with the undocumented behavior of the existing
code, namely that feedback sent by a client who has connected to a
particular database would not be applied globally. I'm not sure this
is a great definition; however, such a client is capable of issuing
plain SQL commands, and I don't think we want xmins advertised for
such commands to be applied globally. Perhaps this could do with
refinement later.)
While at it, I rewrote the comment in ComputeXidHorizons, and
re-ordered the commented-upon if-tests, to make them match up
for intelligibility's sake.
This is arguably a back-patchable bug fix, but given the lack of
complaints I think it prudent to let it age awhile in HEAD first.
Discussion: https://postgr.es/m/1346227.1649887693@sss.pgh.pa.us
Before commit 412ad7a55639516f284cd0ef9757d6ae5c7abd43, delayChkpt
was a Boolean. Now it's an integer. Extensions using it need to be
appropriately updated, so let's rename the field to make sure that
a hard compilation failure occurs.
Replacing delayChkpt with delayChkptFlags made a few comments extend
past 80 characters, so I reflowed them and changed some wording very
slightly.
The back-branches will need a different change to restore compatibility
with existing minor releases; this is just for master.
Per suggestion from Tom Lane.
Discussion: http://postgr.es/m/a7880f4d-1d74-582a-ada7-dad168d046d1@enterprisedb.com
Up until now, we've had a policy of only marking certain variables
in the PostgreSQL header files with PGDLLIMPORT, but now we've
decided to mark them all. This means that extensions running on
Windows should no longer operate at a disadvantage as compared to
extensions running on Linux: if the variable is present in a header
file, it should be accessible.
Discussion: http://postgr.es/m/CA+TgmoYanc1_FSfimhgiWSqVyP5KKmh5NP2BWNwDhO8Pg2vGYQ@mail.gmail.com
If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.
Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.
Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
Instead of referring to target backends by pid, use pgprocno. This
means that we don't have to scan the ProcArray and we can drop some
special case code for dealing with the startup process.
Discussion: https://postgr.es/m/CA%2BhUKGLYRyDaneEwz5Uya_OgFLMx5BgJfkQSD%3Dq9HmwsfRRb-w%40mail.gmail.com
Reviewed-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Reviewed-by: Ashwin Agrawal <ashwinstar@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
While determining xid horizons, we skip over backends that are running
Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently
for the purposes of computing Xmin for Vacuum. But we were not setting the
flags corresponding to these operations when they are performed in
parallel which was preventing Xid horizon from advancing.
The optimization related to skipping Create Index Concurrently, or Reindex
Concurrently operations was implemented in PG-14 but the fix is the same
for the Parallel Vacuum as well so back-patched till PG-13.
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com
Also "make reformat-dat-files".
The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
This commit changes WAL archiver process so that it's treated as
an auxiliary process and can use shared memory. This is an infrastructure
patch required for upcoming shared-memory based stats collector patch
series. These patch series basically need any processes including archiver
that can report the statistics to access to shared memory. Since this patch
itself is useful to simplify the code and when users monitor the status of
archiver, it's committed separately in advance.
This commit simplifies the code for WAL archiving. For example, previously
backends need to signal to archiver via postmaster when they notify
archiver that there are some WAL files to archive. On the other hand,
this commit removes that signal to postmaster and enables backends to
notify archier directly using shared latch.
Also, as the side of this change, the information about archiver process
becomes viewable at pg_stat_activity view.
Author: Kyotaro Horiguchi
Reviewed-by: Andres Freund, Álvaro Herrera, Julien Rouhaud, Tomas Vondra, Arthur Zakirov, Fujii Masao
Discussion: https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.
This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.
Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.
The first attempt of this patch (commit 3b733fcd04) caused the buildfarm
member "rorqual" (built with --disable-atomics --disable-spinlocks) to report
the failure of the regression test. It was reverted by commit 890d2182a2.
The cause of this failure was that the atomic variable for "waitstart"
in the dummy process entry created at the end of prepare transaction was
not initialized. This second attempt fixes that issue.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.
This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.
Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
This is like commit c98763bf51bf, but for REINDEX CONCURRENTLY. To wit:
this flags indicates that the current process is safe to ignore for the
purposes of waiting for other snapshots, when doing CREATE INDEX
CONCURRENTLY and REINDEX CONCURRENTLY. This helps two processes doing
either of those things not deadlock, and also avoids spurious waits.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Hamid Akhtar <hamid.akhtar@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/20201130195439.GA24598@alvherre.pgsql
This GUC variable works much like idle_in_transaction_session_timeout,
in that it kills sessions that have waited too long for a new client
query. But it applies when we're not in a transaction, rather than
when we are.
Li Japin, reviewed by David Johnston and Hayato Kuroda, some
fixes by me
Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
Reverts 27838981be9d (some comments are kept). Per discussion, it does
not seem safe to relax the lock level used for this; in order for it to
be safe, there would have to be memory barriers between the point we set
the flag and the point we set the trasaction Xid, which perhaps would
not be so bad; but there would also have to be barriers at the readers'
side, which from a performance perspective might be bad.
Now maybe this analysis is wrong and it *is* safe for some reason, but
proof of that is not trivial.
Discussion: https://postgr.es/m/20201118190928.vnztes7c2sldu43a@alap3.anarazel.de
In the various waiting phases of CREATE INDEX CONCURRENTLY (CIC) and
REINDEX CONCURRENTLY (RC), we wait for other processes to release their
snapshots; this is necessary in general for correctness. However,
processes doing CIC in other tables cannot possibly affect CIC or RC
done in "this" table, so we don't need to wait for those. This commit
adds a flag in MyProc->statusFlags to indicate that the current process
is doing CIC, so that other processes doing CIC or RC can ignore it when
waiting.
Note that this logic is only valid if the index does not access other
tables. For simplicity we avoid setting the flag if the index has a
column that's an expression, or has a WHERE predicate. (It is possible
to have expressional or partial indexes that do not access other tables,
but figuring that out would require more work.)
This flag can potentially also be used by processes doing REINDEX
CONCURRENTLY to be skipped; and by VACUUM to ignore processes in CIC or
RC for the purposes of computing an Xmin. That's left for future
commits.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Dimitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20200810233815.GA18970@alvherre.pgsql
We don't actually need a lock to set PGPROC->statusFlags itself; what we
do need is a shared lock on either XidGenLock or ProcArrayLock in order to
ensure MyProc->pgxactoff keeps still while we modify the mirror array in
ProcGlobal->statusFlags. Some places were using an exclusive lock for
that, which is excessive. Relax those to use shared lock only.
procarray.c has a couple of places with somewhat brittle assumptions
about PGPROC changes: ProcArrayEndTransaction uses only shared lock, so
it's permissible to change MyProc only. On the other hand,
ProcArrayEndTransactionInternal also changes other procs, so it must
hold exclusive lock. Add asserts to ensure those assumptions continue
to hold.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20201117155501.GA13805@alvherre.pgsql
Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.
As this removes the last member of PGXACT, there is no need to keep it
around anymore.
On a larger 2 socket machine this and the two preceding commits result
in a ~1.07x performance increase in read-only pgbench. For read-heavy
mixed r/w workloads without row level contention, I see about 1.1x.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).
This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.
As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in l2 cache. In
read-mostly workloads scanning the xids[] array will sufficient to
build a snapshot, as most backends will not have an xid assigned.
To keep the xid array dense ProcArrayRemove() needs to move entries
behind the to-be-removed proc's one further up in the array. Obviously
moving array entries cannot happen while a backend sets it
xid. I.e. locking needs to prevent that array entries are moved while
a backend modifies its xid.
To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.
Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.
While the new dense array is very good when needing to look at all
xids it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend had acquired an
xid.
The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.
After the changes in dc7420c2c92, this is a very straight-forward change.
For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';. The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.
Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.
The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.
The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions. These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.
When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession. As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.
To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.
This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.
Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de