Actually, the correct fix for this is "we don't need this at all",
because this program isn't dealing in any non-ASCII data. The
dependency on Data::Dumper seems to be a leftover too.
Discussion: https://postgr.es/m/3287943.1680922997@sss.pgh.pa.us
Support GSSAPI/Kerberos credentials being delegated to the server by a
client. With this, a user authenticating to PostgreSQL using Kerberos
(GSSAPI) credentials can choose to delegate their credentials to the
PostgreSQL server (which can choose to accept them, or not), allowing
the server to then use those delegated credentials to connect to
another service, such as postgres_fdw or dblink, or in principle any
other service that can authenticate using Kerberos.
Both postgres_fdw and dblink are changed to allow non-superuser
password-less connections, but only when GSSAPI credentials have been
delegated to the server by the client and GSSAPI is used to
authenticate to the remote system.
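As a sketch of the postgres_fdw side (object and host names here are
hypothetical), the user mapping needs no password when the delegated
credentials are used for the outbound GSSAPI connection:

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'remote.example.com', dbname 'appdb');
    -- no password option: the delegated Kerberos credentials are used
    CREATE USER MAPPING FOR alice SERVER remote_pg;

On the client side, delegation is requested through the new libpq
connection parameter, e.g. gssdelegation=1.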
Authors: Stephen Frost, Peifeng Qiu
Reviewed-By: David Christensen
Discussion: https://postgr.es/m/CO1PR05MB8023CC2CB575E0FAAD7DF4F8A8E29@CO1PR05MB8023.namprd05.prod.outlook.com
a9c70b46dbe and 8aaa04b32 added counting of IO operations to a new view,
pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to
pg_stat_io as well.
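For instance, with track_io_timing enabled, the new timing columns can be
read alongside the counts (a sketch against the pg_stat_io view):

    -- timings stay zero unless track_io_timing is on
    SELECT backend_type, object, context,
           reads, read_time, writes, write_time,
           extends, extend_time, fsyncs, fsync_time
    FROM pg_stat_io
    WHERE reads > 0 OR writes > 0;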
This combines the tracking for pgBufferUsage with the tracking for pg_stat_io
into a new function pgstat_count_io_op_time(). This should make it a bit
easier to avoid the somewhat costly instr_time conversion done for
pgBufferUsage.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
Show a detailed description of the page offset number arrays that appear
in certain nbtree WAL records.
Also brings nbtree desc routines in line with the guidelines established
by recent commit 7d8219a4.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
Similar to 8dff2f224, this prevents the Kerberos library from using DNS
to locate the KDC and the realm while the Kerberos tests are running.
In some environments, these lookups can take a long time and end up
timing out and causing tests to fail. Further, since this isn't really
our domain, we shouldn't be sending out these DNS requests during our
tests.
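Concretely, the krb5.conf generated for the tests now carries something
like the standard MIT Kerberos settings for this (a sketch):

    [libdefaults]
        dns_lookup_realm = false
        dns_lookup_kdc = false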
Add helper functions that output arrays in a standard format, and use
the functions inside heapdesc routines. This allows tools like
pg_walinspect to show a detailed description of the page offset number
arrays for records like PRUNE and VACUUM (unless there was an FPI).
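For example, one might inspect those descriptions like this (a sketch;
the LSN range is hypothetical):

    SELECT start_lsn, resource_manager, record_type, description
    FROM pg_get_wal_records_info('0/1000028', '0/2000000')
    WHERE resource_manager IN ('Heap', 'Heap2');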
Also document the conventions that desc routines should follow. Only
the heapdesc routines follow the conventions for now, so they're just
guidelines for the time being.
Based on a suggestion from Andres Freund.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
This modernized version of Soundex works significantly better than
the original, particularly for non-English names.
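A sketch of it in use, via the fuzzystrmatch extension's new
daitch_mokotoff() function (which can return several candidate codes
for a single name):

    CREATE EXTENSION fuzzystrmatch;
    SELECT daitch_mokotoff('Schwartz');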
Dag Lem, reviewed by quite a few people along the way
Discussion: https://postgr.es/m/yger1atbgfy.fsf@sid.nimrod.no
A few buildfarm animals recently started complaining about the "child"
relation already existing. e056c557aef added a new child table to inherit.sql,
but triggers.sql, running in the same parallel group, also uses a child table.
Rename the new table to inh_child. It may be worth renaming child and parent
in other tests as well, but that's work for another day.
Discussion: https://postgr.es/m/20230407204530.52q3v5cu5x6dj676@awork3.anarazel.de
Both pgstat_database and pgBufferUsage count IO timing for reads of temporary
relation blocks into local buffers. However, both failed to count write IO
timing for flushes of dirty local buffers. Fix.
Additionally, FlushRelationBuffers() seems to have omitted counting write
IO (both count and timing) stats for both pgstat_database and
pgBufferUsage. Fix.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
This breaks out the background and interactive psql functionality into a
new class, PostgreSQL::Test::BackgroundPsql. Sessions are still initiated
via PostgreSQL::Test::Cluster, but once started they can be manipulated by
the new helper functions which intend to make querying easier. A sample
session for a command which can be expected to finish at a later time can
be seen below.
my $session = $node->background_psql('postgres');
$session->query_until(qr/start/, q(
    \echo start
    CREATE INDEX CONCURRENTLY idx ON t(a);
));
$session->quit;
Patch by Andres Freund with some additional hacking by me.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/20230130194350.zj5v467x4jgqt3d6@awork3.anarazel.de
It was pointed out that pg_buffercache_summary()'s report of
the overall average usage count isn't that useful, and what
would be more helpful in many cases is to report totals for
each possible usage count. Add a new function to do it like
that. Since pg_buffercache 1.4 is already new for v16,
we don't need to create a new extension version; we'll just
define this as part of 1.4.
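A sketch of the new function in use:

    CREATE EXTENSION pg_buffercache;
    -- one row per possible usage count, with buffer totals for each
    SELECT * FROM pg_buffercache_usage_counts();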
Nathan Bossart
Discussion: https://postgr.es/m/20230130233040.GA2800702@nathanxps13
We now create pg_constraint rows for NOT NULL constraints with
contype='n'.
We propagate these constraints during operations such as adding
inheritance relationships, creating and attaching partitions, creating
tables LIKE other tables. We mostly follow the well-known rules of
conislocal and coninhcount that we have for CHECK constraints, with some
adaptations; for example, as opposed to CHECK constraints, we don't
match NOT NULL ones by name when descending a hierarchy to alter it;
instead we match by column number. This means we don't require the
constraint names to be identical across a hierarchy.
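For illustration, a sketch of what now shows up in the catalog (table
names hypothetical):

    CREATE TABLE parent_t (a int NOT NULL);
    CREATE TABLE child_t () INHERITS (parent_t);
    -- both relations carry a contype='n' row; the child's is inherited
    SELECT conrelid::regclass, conname, contype, conislocal, coninhcount
    FROM pg_constraint WHERE contype = 'n';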
For now, we omit them from system catalogs. Maybe this is worth
reconsidering. We don't support NOT VALID nor DEFERRABLE clauses
either; these can be added as separate features later (this patch is
already large and complicated enough.)
This has been very long in the making. The first patch was written by
Bernd Helmle in 2010 to add a new pg_constraint.contype value ('n'),
which I (Álvaro) then hijacked in 2011 and 2012, until that one was
killed by the realization that we ought to use contype='c' instead:
manufactured CHECK constraints. However, later SQL standard
development, as well as nonobvious emergent properties of that design
(mostly, failure to distinguish them from "normal" CHECK constraints as
well as the performance implication of having to test the CHECK
expression) led us to reconsider this choice, so now the current
implementation uses contype='n' again.
In 2016 Vitaly Burovoy also worked on this feature[1] but found no
consensus for his proposed approach, which was claimed to be closer to
the letter of the standard, requiring additional pg_attribute columns to
track the OID of the NOT NULL constraint for that column.
[1] https://postgr.es/m/CAKOSWNkN6HSyatuys8xZxzRCR-KL1OkHS5-b9qd9bf1Rad3PLA@mail.gmail.com
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Bernd Helmle <mailings@oopsware.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/CACA0E642A0267EDA387AF2B%40%5B172.26.14.62%5D
Discussion: https://postgr.es/m/AANLkTinLXMOEMz+0J29tf1POokKi4XDkWJ6-DDR9BKgU@mail.gmail.com
Discussion: https://postgr.es/m/20110707213401.GA27098@alvh.no-ip.org
Discussion: https://postgr.es/m/1343682669-sup-2532@alvh.no-ip.org
Discussion: https://postgr.es/m/CAKOSWNkN6HSyatuys8xZxzRCR-KL1OkHS5-b9qd9bf1Rad3PLA@mail.gmail.com
Discussion: https://postgr.es/m/20220817181249.q7qvj3okywctra3c@alvherre.pgsql
The old wording described these as being multiplied by max_connections
plus max_prepared_transactions, which hasn't been exactly right for
some time thanks to the addition of various auxiliary processes.
Moreover, exactness here is a bit pointless given that the lock tables
can expand into the initially-unallocated "slop" space in shared
memory. Rather than trying to track exactly what the code is doing,
let's just use the term "server processes".
Likewise adjust these GUCs' description strings in guc_tables.c.
Wang Wei, reviewed by Nathan Bossart and myself
Discussion: https://postgr.es/m/OS3PR01MB6275BDD09C9B875C65FCC5AB9EA39@OS3PR01MB6275.jpnprd01.prod.outlook.com
psql parses the interval argument of \watch with locale-dependent
strtod(). In commit 00beecfe8 I added a test case that exercises
a fractional interval, but I hard-coded 0.01 which doesn't work
in locales where the radix point isn't ".". We don't want to
change this longstanding parsing behavior, so fix the test case
to generate a suitably locale-aware spelling.
Report and patch by Alexander Korotkov.
Discussion: https://postgr.es/m/CAPpHfdv+10Uk6FWjsh3+ju7kHYr76LaRXbYayXmrM7FBU-=Hgg@mail.gmail.com
The same condition was accidentally copied to both branches. Manual testing
confirms that otherwise the error recovery path works fine.
Found while reviewing the logical-decoding-on-standby patch.
This commit adds a limit to the size of an XLogRecord, 1020MB, based
on a suggestion by Heikki Linnakangas. This accounts for the overhead
needed by the XLogReader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace(), based on the record size. An
assertion is added to check that additions in the XLogReader facilities
do not cause any overflows. If that's ever the case, the upper bound
allowed would need to be adjusted.
Before this, it was possible for an external module to create WAL
records large enough to be assembled but not replayable, causing
failures when replaying such WAL records on standbys. One case
mentioned where this is possible is the in-core function
pg_logical_emit_message() (wrapper for LogLogicalMessage), which allows
emitting WAL records with an arbitrary amount of data, potentially larger
than the replay limit of approximately 1GB (the limit of a palloc, minus
the overhead needed by an XLogReader).
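For reference, a sketch of how such an oversized record could be
produced:

    -- with a payload close to 1GB, the assembled record previously could
    -- be written but not replayed; it is now rejected at insertion
    SELECT pg_logical_emit_message(true, 'my_prefix',
                                   repeat('x', 1024 * 1024 * 1023));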
This commit is a follow-up of ffd1b6b, which added similar protections
for the block-level data. Here, the checks are extended to the whole
record length, mainrdata_len being extended from uint32 to uint64 with
the routines registering buffer and record data still limited to uint32
to minimize the checks when assembling a record. All the error messages
related to overflow checks are improved to provide more context about
the error happening.
Author: Matthias van de Meent
Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
Instead of extending the relation block-by-block, use ExtendBufferedRelTo(),
introduced in 31966b151e6. This is faster and simpler.
This also somewhat reduces the danger that disconnected segments pose (which
can be "discovered" once the previous segment reaches SEGSIZE), as
ExtendBufferedRelTo() won't extend past the block it has been asked to
extend to. However,
the risk of the content of such a disconnected segment being invalid
remains.
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
1cbbee033 added BUFFER_USAGE_LIMIT to the VACUUM and ANALYZE commands, so
here we permit that option to be specified in vacuumdb.
In passing, adjust the documentation for vacuum_buffer_usage_limit and the
BUFFER_USAGE_LIMIT VACUUM option to mention "kB" rather than "KB". Do the
same for the ERROR message in ExecVacuum() and
check_vacuum_buffer_usage_limit(). Without that we might tell a user that
the valid minimum value is 128 KB only to reject that because we accept
only "kB" and not "KB".
Also, add a small reminder comment in vacuum.h to prompt anyone adding new
fields to VacuumParams to consider whether vacuumdb needs to grow a new
option too.
Author: Melanie Plageman
Reviewed-by: Justin Pryzby
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/ZAzTg3iEnubscvbf@telsasoft.com
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock; the
relation itself was still extended one block at a time. Bulk extension was
also triggered solely by contention, not by the amount of data inserted.
To address this, use ExtendBufferedRelBy(), introduced in 31966b151e6, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:
1) The caller tells RelationGetBufferForTuple() that it will need multiple
pages. For now that's only used by heap_multi_insert(), see commit FIXME.
2) If there is contention on the extension lock, use the number of waiters for
the lock as a multiplier for the number of blocks to extend by. This is
similar to what we already did. Previously we additionally multiplied the
number of waiters by 20, but with the new relation extension
infrastructure I could not see a benefit in doing so.
Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.
One complication with using the FSM to record empty pages is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.
This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single-threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb1). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock is rarely the primary bottleneck.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Add new options to the VACUUM and ANALYZE commands called
BUFFER_USAGE_LIMIT to allow users more control over how large to make the
buffer access strategy that is used to limit the usage of buffers in
shared buffers. Larger rings can allow VACUUM to run more quickly but
have the drawback of VACUUM possibly evicting more buffers from shared
buffers that might be useful for other queries running on the database.
Here we also add a new GUC named vacuum_buffer_usage_limit which controls
how large to make the access strategy when it's not specified in the
VACUUM/ANALYZE command. This defaults to 256KB, which is the same size as
the access strategy was prior to this change. This setting also
controls how large to make the buffer access strategy for autovacuum.
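A sketch of both forms (relation name hypothetical):

    -- per-command ring size
    VACUUM (BUFFER_USAGE_LIMIT '2MB', ANALYZE) big_table;
    -- default when the command doesn't specify one; also controls autovacuum
    SET vacuum_buffer_usage_limit = '512kB';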
Per idea by Andres Freund.
Author: Melanie Plageman
Reviewed-by: David Rowley
Reviewed-by: Andres Freund
Reviewed-by: Justin Pryzby
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb@awork3.anarazel.de
A future commit will use this information to determine how aggressively to
extend the relation. In heap_multi_insert() we know accurately how many
pages we need once we need to extend the relation, providing an accurate lower
bound for how much to extend.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.
Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
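For example, lowering the delay now takes effect within the currently
vacuumed table rather than only at the next one:

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    SELECT pg_reload_conf();
    -- busy autovacuum workers notice the change roughly once per block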
In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.
Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.
Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
Vacuum code run both by autovacuum workers and a backend doing
VACUUM/ANALYZE previously inspected VacuumCostLimit and VacuumCostDelay,
which are the global variables backing the GUCs vacuum_cost_limit and
vacuum_cost_delay.
Autovacuum workers needed to override these variables with their
own values, derived from autovacuum_vacuum_cost_limit and
autovacuum_vacuum_cost_delay and worker cost limit balancing logic.
This led to confusing code which, in some cases, both derived and
set a new value of VacuumCostLimit from VacuumCostLimit.
In preparation for refreshing these GUC values more often, introduce
new, independent global variables and add a function to update them
using the GUCs and existing logic.
Per suggestion by Kyotaro Horiguchi
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
While vacuuming a table in failsafe mode, VacuumCostActive should
not be re-enabled. This currently isn't a problem because vacuum
cost parameters are only refreshed in between vacuuming tables and
failsafe status is reset for every table.
In preparation for allowing vacuum cost parameters to be updated
more frequently, elevate LVRelState->failsafe_active to a global,
VacuumFailsafeActive, which will be checked when determining whether
or not to re-enable vacuum cost-related delays.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
The tests added by commits 029dea882 et al turn out to produce
different output under -DRANDOMIZE_ALLOCATED_MEMORY. This is
not a bug exactly: that flag causes coerce_type() to invoke
the input function twice when coercing an unknown-type literal
to a specific type. So you get tsqueryin's bleat about an empty
tsquery twice. Revise the test query to avoid that.
Discussion: https://postgr.es/m/20230406213813.uep7plg6lvcywujo@awork3.anarazel.de
Make the \g, \o, \w, and \copy commands set these variables
when closing a pipe. We missed doing this in commit b0d8f2d98,
but it seems like a good idea.
There are some remaining places in psql that intentionally don't
update these variables after running a child program:
* pager invocations
* backtick evaluation within a prompt
* \e (edit query buffer)
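Assuming the variables introduced by b0d8f2d98, SHELL_ERROR and
SHELL_EXIT_CODE, a session might look like:

    SELECT 1 \g | grep -q 1
    \echo :SHELL_ERROR :SHELL_EXIT_CODE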
Corey Huinker and Tom Lane
Discussion: https://postgr.es/m/CADkLM=eSKwRGF-rnRqhtBORRtL49QsjcVUCa-kLxKTqxypsakw@mail.gmail.com
tsquery's GETQUERY() macro is only safe to apply to a tsquery
that is known non-empty; otherwise it gives a pointer to garbage.
Before commit 5a617d75d, ts_headline() avoided this pitfall, but
only in a very indirect, nonobvious way. (hlCover could not reach
its TS_execute call, because if the query contains no lexemes
then hlFirstIndex would surely return -1.) After that commit,
it fell into the trap, resulting in weird errors such as
"unrecognized operator" and/or valgrind complaints. In HEAD,
fix this by not calling TS_execute_locations() at all for an
empty query. In the back branches, add a defensive check to
hlCover() --- that's not fixing any live bug, but I judge the
code a bit too fragile as-is.
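The trap is reachable with a simple empty query (a sketch; tsqueryin's
notice about the empty tsquery is expected):

    SELECT ts_headline('some sample text', ''::tsquery);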
Also, both mark_hl_fragments() and mark_hl_words() were careless
about the possibility of empty search text: in the cases where
no match has been found, they'd end up telling mark_fragment() to
mark from word indexes 0 to 0 inclusive, even when there is no
word 0. This is harmless since we over-allocated the prs->words
array, but it does annoy valgrind. Fix so that the end index is -1
and thus mark_fragment() will do nothing in such cases.
Bottom line is that this fixes a live bug in HEAD, but in the
back branches it's only getting rid of a valgrind nitpick.
Back-patch anyway.
Per report from Alexander Lakhin.
Discussion: https://postgr.es/m/c27f642d-020b-01ff-ae61-086af287c4fd@gmail.com
Starting with commit 7db0cd2145f, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid doing
IO while holding a content lock. And until 14f98e0af99, that happened while
holding the relation extension lock.
Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.
We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.
While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.
To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
GetVisibilityMapPins() insisted on buffer1/buffer2 being in a specific
order. This required checks at the callsite. As a subsequent patch will add
another callsite, move related logic into GetVisibilityMapPins().
Discussion: https://postgr.es/m/20230403190030.fk2frxv6faklrseb@awork3.anarazel.de
\watch can now be told to stop after N executions of the query.
With the idea that we might want to add more options to \watch
in future, this patch generalizes the command's syntax to a list
of name=value options, with the interval allowed to omit the name
for backwards compatibility.
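For example (a sketch of the generalized syntax):

    -- run every 2 seconds, stop after 5 executions
    SELECT now() \watch interval=2 count=5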
Andrey Borodin, reviewed by Kyotaro Horiguchi, Nathan Bossart,
Michael Paquier, Yugo Nagata, and myself
Discussion: https://postgr.es/m/CAAhFRxiZ2-n_L1ErMm9AZjgmUK=qS6VHb+0SaMn8sqqbhF7How@mail.gmail.com
zstd compression supports a special mode for finding matches in the distant
past, which may result in a better compression ratio, at the expense of
using more memory (the window size is 128MB).
To enable this optional mode, use the "long" keyword when specifying the
compression method (--compress=zstd:long).
Author: Justin Pryzby
Reviewed-by: Tomas Vondra, Jacob Champion
Discussion: https://postgr.es/m/20230224191840.GD1653@telsasoft.com
Discussion: https://postgr.es/m/20220327205020.GM28503@telsasoft.com
postgres_fdw aborts remote (sub)transactions opened on remote server(s)
in a local (sub)transaction one by one when the local (sub)transaction
aborts. This patch allows it to abort the remote (sub)transactions in
parallel to improve performance. This is enabled by the server option
"parallel_abort". The default is false.
Etsuro Fujita, reviewed by David Zhang.
Discussion: http://postgr.es/m/CAPmGK15FuPVGx3TGHKShsbPKKtF1y58-ZLcKoxfN-nqLj1dZ%3Dg%40mail.gmail.com
vacuum() is used for both the VACUUM command and for autovacuum. There
were many prechecks being done inside vacuum() that were just not relevant
to autovacuum. Let's move the bulk of these into ExecVacuum() so that
they're only executed when running the VACUUM command. This removes a
small amount of overhead when autovacuum vacuums a table.
While we are at it, allocate VACUUM's BufferAccessStrategy in ExecVacuum()
and pass it into vacuum() instead of expecting vacuum() to make it if it's
not already made by the calling function. To make this work, we need to
create the vacuum memory context slightly earlier, so we now need to pass
that down to vacuum() so that it's available for use in other memory
allocations.
Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20230405211534.4skgskbilnxqrmxg@awork3.anarazel.de
A few places are not converted. Some because they are tackled in later
commits (e.g. hio.c, xlogutils.c), some because they are more
complicated (e.g. brin_pageops.c). Having a few users of ReadBuffer(P_NEW) is
good anyway, to ensure the backward compat path stays working.
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
32fbe0239 changed things so we didn't bother allocating the
BufferAccessStrategy during VACUUM (ONLY_DATABASE_STATS) and VACUUM
(FULL). However, it forgot to consider that VACUUM (FULL, ANALYZE) is a
possible combination. That change would have resulted in such a command
allowing ANALYZE to make full use of shared buffers, which wasn't
intended, so fix that.
Reported-by: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_bJRKe+v_=OqwC+5sA3j5qv8rqdAwy3+yHaO3wmtfrCRg@mail.gmail.com
pg_stat_statements relies on EState->es_processed to count the number of
rows processed by ExecutorRun(). This proves to be a problem under the
extended query protocol when the result of a query is fetched through
more than one call of ExecutorRun(), as es_processed is reset each time
ExecutorRun() is called. This causes pg_stat_statements to report the
number of rows calculated in the last execute fetch, rather than the
global sum of all the rows processed.
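The symptom is visible in the view's rows column (a sketch):

    SELECT query, calls, rows
    FROM pg_stat_statements
    ORDER BY calls DESC LIMIT 5;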
As pquery.c tells us, this is a problem when a portal does not use
holdStore. For example, DMLs with RETURNING would report a correct
tuple count as these do one execution cycle when the query is first
executed to fill in the portal's store with one ExecutorRun(), feeding
on the portal's store for each follow-up execute fetch depending on the
fetch size requested by the client.
The fix proposed for this issue is simple with the addition of an extra
counter in EState that's preserved across multiple ExecutorRun() calls,
incremented with the value calculated in es_processed. This approach is
not back-patchable, unfortunately.
Note that libpq does not currently give any way to control the fetch
size when using the extended v3 protocol, meaning that in-core testing
is not possible yet. This issue can be easily verified with the JDBC
driver, though, with *autocommit disabled*. Hence, having in-core tests
requires more features, left for future discussion:
- At least two new libpq routines splitting PQsendQueryGuts(), one for
the bind/describe and a second for a series of execute fetches with a
custom fetch size, likely in a fashion similar to what JDBC does.
- A psql meta-command for the execute phase. This part is not strictly
mandatory, still it could be handy.
Reported-by: Andrew Dunstan (original discovery by Simon Siggs)
Author: Sami Imseih
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/EBE6C507-9EB6-4142-9E4D-38B1673363A7@amazon.com
Discussion: https://postgr.es/m/c90890e7-9c89-c34f-d3c5-d5c763a34bd8@dunslane.net
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers to ReadBuffer() need to acquire the extension lock. To address that,
this commit moves the responsibility for acquiring the extension lock into
bufmgr.c functions. That allows the relation extension lock to be acquired
for just the required time. This will also allow us to improve relation
extension further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (this largely works, except on CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2) by using the
recently introduced smgrzeroextend(), which extends the relation without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
This adds a new option to libpq's sslrootcert, "system", which will load
the system trusted CA roots for certificate verification. This is a more
convenient way to achieve this than pointing to the system CA roots manually,
since their location can differ by installation and can be adjusted locally
via environment variables in OpenSSL.
When sslrootcert is set to system, sslmode is forced to verify-full,
as weaker modes don't provide much security for public CAs.
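A sketch of a connection string using the new value (host and database
names hypothetical):

    host=db.example.com dbname=app user=alice sslrootcert=system

With that, certificate verification uses the system CA store and
sslmode=verify-full is implied.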
Changing the location of the system roots by setting environment vars is
not supported by LibreSSL so the tests will use a heuristic to determine
if the system being tested is LibreSSL or OpenSSL.
The workaround in .cirrus.yml is required to handle a strange interaction
between homebrew and the openssl@3 formula; hopefully this can be removed
in the near future.
The original patch was written by Thomas Habets and later revived
by Jacob Champion.
Author: Jacob Champion <jchampion@timescale.com>
Author: Thomas Habets <thomas@habets.se>
Reviewed-by: Jelte Fennema <postgres@jeltef.nl>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Magnus Hagander <magnus@hagander.net>
Discussion: https://www.postgresql.org/message-id/flat/CA%2BkHd%2BcJwCUxVb-Gj_0ptr3_KZPwi3%2B67vK6HnLFBK9MzuYrLA%40mail.gmail.com
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.
This commit increases the size of the ResourceOwnerData struct, which appears
to have a just about measurable overhead in very extreme workloads. Medium
term we are planning to substantially shrink the size of
ResourceOwnerData. Short term the increase is small enough to not worry about
it for now.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
Merge and hash joins can support antijoin with the non-nullable input
on the right, using very simple combinations of their existing logic
for right join and anti join. This gives the planner more freedom
about how to order the join. It's particularly useful for hash join,
since we may now have the option to hash the smaller table instead
of the larger.
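For instance, an anti-joinable NOT EXISTS can now hash the smaller side
(table names hypothetical; "Hash Right Anti Join" is the resulting plan
shape):

    EXPLAIN (COSTS OFF)
    SELECT * FROM big_tab b
    WHERE NOT EXISTS (SELECT 1 FROM small_tab s WHERE s.id = b.id);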
Richard Guo, reviewed by Ronan Dunklau and myself
Discussion: https://postgr.es/m/CAMbWs48xh9hMzXzSy3VaPzGAz+fkxXXTUbCLohX1_L8THFRm2Q@mail.gmail.com
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations,
needing to hold the extension lock while acquiring a victim buffer is
painful. But it also creates a bottleneck for workloads that just involve
reads.
Now we instead first acquire a victim buffer and write it out, if
necessary. Then we remove that buffer from the old partition with just the old
partition's partition lock held and insert it into the new partition with just
that partition's lock held.
By separating out the victim buffer acquisition, future commits will be able
to change relation extensions to scale better.
On my workstation, a micro-benchmark exercising buffered reads strenuously
and under a lot of concurrency sees a >2x improvement.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de