Consistently prevent nbtree array advancement from treating a scankey as
required when operating in pstate.forcenonrequired mode. Otherwise, we
risk a NULL pointer dereference. This was possible in the path where
_bt_check_compare is called to recheck a tuple that advanced all of the
scan's arrays to matching values: its continuescan=false handling
expects _bt_advance_array_keys to have been called with a valid pstate,
but it'll always be NULL during sktrig_required=false calls (which is
how _bt_advance_array_keys must be called when pstate.forcenonrequired).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKsn2W=gPBmj7p6MjQFvxB+zZDBkwTSg0o3f5Hh8rkRrsA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
The check for whether another primitive scan is required after all,
once the next leaf page is reached, was moved from _bt_checkkeys to its
_bt_readpage caller by commit 9a2e2a28. Update a comment that
incorrectly described the recheck mechanism as something that takes
place in _bt_checkkeys.
Also fix an older typo in related code comments.
_bt_check_compare neglected to handle a case that can arise when the
scan's keys are temporarily treated as nonrequired, as an optimization:
whenever a NULL tuple value was encountered that had a skip array whose
current element wasn't already NULL, _bt_check_compare failed to advance
the array to the NULL element. This allowed _bt_check_compare to fail
to return matching tuples containing a NULL value (though only with an
array column that came before a skip array column with NULLs, and only
during _bt_readpage calls that set pstate.forcenonrequired=true on a
page where the higher-order column also had to advance).
To fix, teach _bt_check_compare to handle this case just like any other
case where a skip array key is unsatisfied and must be advanced directly
(due to the key being considered a nonrequired key).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons with skip arrays.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKtLFWZcjr87hMH0hYDHgcifu4Tj7iHz-xh8qsJREt5cqA@mail.gmail.com
Don't allow nbtree scans with skip arrays to end any primitive scan on
its first leaf page without giving some consideration to how many times
the scan's arrays advanced while changing at least one skip array
(though we continue to ignore array advancements that only affected
SAOP arrays, even during skip scans that also use SAOP arrays).
Now when a scan performs more than 3 such array advancements in the
course of reading a single leaf page, it is taken as a signal that the
next page is unlikely to be skippable. We'll therefore continue the
ongoing primitive index scan, at least until we can perform a recheck
against the next page's finaltup.
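
The counting rule described above can be sketched as follows. This is a
toy model only: the ToyPageState struct, its field, and both function
names are hypothetical, and the threshold simply mirrors the "more than
3" wording; it is not the actual nbtree code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold, taken from the "more than 3" wording above */
#define TOY_SKIP_ADVANCE_THRESHOLD 3

/* Hypothetical per-page state (not the real nbtree page state struct) */
typedef struct ToyPageState
{
    int nskipadvances;  /* advancements that changed >= 1 skip array */
} ToyPageState;

/* Record one array advancement seen while reading the current page */
void
toy_note_advance(ToyPageState *ps, bool changed_skip_array)
{
    assert(ps != NULL);
    if (changed_skip_array)
        ps->nskipadvances++;    /* SAOP-only advancements are ignored */
}

/*
 * On the primitive scan's first leaf page: should the scan continue to
 * the next page (subject to a later recheck) rather than end here?
 */
bool
toy_first_page_should_continue(const ToyPageState *ps)
{
    assert(ps != NULL);
    return ps->nskipadvances > TOY_SKIP_ADVANCE_THRESHOLD;
}
```

In this model, three skip array advancements on the page are not enough
to keep the primitive scan going; the fourth one is.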
Testing has shown that this new heuristic occasionally makes all the
difference with skip scans that were expected to rely on the "passed
first page" heuristic added by commit 9a2e2a28. Without it, there is a
remaining risk that certain kinds of skip scans will never quite manage
to clear the initial hurdle of performing a primitive scan that lasts
beyond its first leaf page (or that such a skip scan will only clear
that initial hurdle when it has already wasted noticeably-many cycles
due to inefficient primitive scan scheduling).
Follow-up to commits 92fe23d9 and 9a2e2a28.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
Postgres 17 commit e0b1ee17 added two complementary optimizations to
nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage
was made to avoid needlessly evaluating keys that are guaranteed to be
satisfied by applying page-level context. "prechecked" did this for
keys required in the current scan direction, while "firstmatch" did it
for keys required in the opposite-to-scan direction only.
The "prechecked" design had a number of notable issues. It didn't
account for the fact that an = array scan key's sk_argument field might
need to advance at the point of the page precheck (it didn't check the
precheck tuple against the key's array, only the key's sk_argument,
which needlessly made it ineffective in cases involving stepping to a
page having advanced the scan's arrays using a truncated high key).
"prechecked" was also completely ineffective when only one scan key
wasn't guaranteed to be satisfied by every tuple (it didn't recognize
that it was still safe to avoid evaluating other, earlier keys).
The "firstmatch" optimization had similar limitations. It could only be
applied after _bt_readpage found its first matching tuple, regardless of
why any earlier tuples failed to satisfy the scan's index quals. This
allowed unsatisfied non-required scan keys to impede the optimization.
Replace both optimizations with a new optimization, without any of these
limitations: the "startikey" optimization. Affected _bt_readpage calls
generate a page-level key offset ("startikey"), that their _bt_checkkeys
calls can then start at. This is an offset to the first key that isn't
known to be satisfied by every tuple on the page.
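
To illustrate the idea of a page-level key offset, here is a simplified
sketch that assumes integer columns, keys sorted by index column, and a
page whose tuples are sorted lexicographically. All names (ToyKey,
toy_startikey, and so on) are hypothetical, and the real nbtree logic
handles many more cases. The sketch relies on one observation: when the
page's first and last tuples agree on every higher-order column, a key
satisfied by both of them is satisfied by every tuple in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_LT, TOY_LE, TOY_EQ, TOY_GE, TOY_GT } ToyOp;

/* Hypothetical scan key: "column attno <op> arg" */
typedef struct ToyKey
{
    int   attno;    /* 0-based index column number */
    ToyOp op;
    int   arg;
} ToyKey;

static bool
toy_key_satisfied(const ToyKey *key, const int *tuple)
{
    int v = tuple[key->attno];

    switch (key->op)
    {
        case TOY_LT: return v < key->arg;
        case TOY_LE: return v <= key->arg;
        case TOY_EQ: return v == key->arg;
        case TOY_GE: return v >= key->arg;
        case TOY_GT: return v > key->arg;
    }
    return false;
}

/*
 * Return the offset of the first key that is NOT known to be satisfied
 * by every tuple on the page (nkeys when all keys are known satisfied).
 */
int
toy_startikey(const ToyKey *keys, int nkeys,
              const int *firsttup, const int *lasttup)
{
    assert(keys != NULL && firsttup != NULL && lasttup != NULL);

    for (int ikey = 0; ikey < nkeys; ikey++)
    {
        const ToyKey *key = &keys[ikey];

        /* A column is only ordered while all earlier columns are equal */
        for (int att = 0; att < key->attno; att++)
        {
            if (firsttup[att] != lasttup[att])
                return ikey;
        }

        /* Both ends of the page must satisfy the key */
        if (!toy_key_satisfied(key, firsttup) ||
            !toy_key_satisfied(key, lasttup))
            return ikey;
    }
    return nkeys;
}
```

A per-tuple check in this model would then begin at keys[startikey]
instead of keys[0].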
Although this is independently useful work, its main goal is to avoid
performance regressions with index scans that use skip arrays, but still
never manage to skip over irrelevant leaf pages. We must avoid wasting
CPU cycles on overly granular skip array maintenance in these cases.
The new "startikey" optimization helps with this by selectively
disabling array maintenance for the duration of a _bt_readpage call.
This has no lasting consequences for the scan's array keys (they'll
still reliably track the scan's progress through the index's key space
whenever the scan is "between pages").
Skip scan adds skip arrays during preprocessing using simple, static
rules, and decides how best to navigate/apply the scan's skip arrays
dynamically, at runtime. The "startikey" optimization enables this
approach. As a result of all this, the planner doesn't need to generate
distinct, competing index paths (one path for skip scan, another for an
equivalent traditional full index scan). The overall effect is to make
scan runtime close to optimal, even when the planner works off an
incorrect cardinality estimate. Scans will also perform well given a
skipped column with data skew: individual groups of pages with many
distinct values (in respect of a skipped column) can be read about as
efficiently as before -- without the scan being forced to give up on
skipping over other groups of pages that are provably irrelevant.
Many scans that cannot possibly skip will still benefit from the use of
skip arrays, since they'll allow the "startikey" optimization to be as
effective as possible (by allowing preprocessing to mark all the scan's
keys as required). A scan that uses a skip array on "a" for a qual
"WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now,
even when every tuple read by the scan has its own distinct "a" value.
However, there are still some remaining regressions, affecting certain
trickier cases.
Scans whose index quals have several range skip arrays, each on some
high cardinality column, can still be slower than they were before the
introduction of skip scan -- even with the new "startikey" optimization.
There are also known regressions affecting very selective index scans
that use a skip array. The underlying issue with such selective scans
is that they never get as far as reading a second leaf page, and so will
never get a chance to consider applying the "startikey" optimization.
In principle, all regressions could be avoided by teaching preprocessing
to not add skip arrays whenever they aren't expected to help, but it
seems best to err on the side of robust performance.
Follow-up to commit 92fe23d9, which added nbtree skip scan.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns. When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan. The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays. Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys. This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.
Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan). In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.
The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index. Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument. These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
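
The behavior of these sentinel values can be sketched with a toy
three-way comparison. The ToySkipKey struct and its boolean fields are
hypothetical stand-ins for the NEXT and PRIOR flags; the sketch assumes
a single integer column and is not the actual nbtree comparison code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical skip array scan key for a single integer column */
typedef struct ToySkipKey
{
    int  arg;    /* last real value the array was positioned on */
    bool next;   /* toy NEXT flag: key means "just after arg" */
    bool prior;  /* toy PRIOR flag: key means "just before arg" */
} ToySkipKey;

/*
 * Compare an index value against the key's position.
 * Returns <0 when the value sorts before the key, 0 when equal, and
 * >0 when it sorts after.  A sentinel key never compares as equal.
 */
int
toy_compare(int tupval, const ToySkipKey *key)
{
    assert(key != NULL);
    assert(!(key->next && key->prior));

    if (key->next)
        return (tupval <= key->arg) ? -1 : 1;
    if (key->prior)
        return (tupval >= key->arg) ? 1 : -1;
    return (tupval > key->arg) - (tupval < key->arg);
}
```

With a NEXT key on 5, the value 5 sorts before the key and the value 6
sorts after it, so the key can only locate the next group of tuples.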
A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However,
improved performance isn't guaranteed. Much depends on physical index
characteristics.
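
As a rough illustration of a range-constrained skip array whose elements
are generated on demand, consider the sketch below. It assumes an
integer column with inclusive bounds derived from the inequality keys;
ToySkipArray and its functions are hypothetical names, not the real
skip support interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical skip array over an integer column */
typedef struct ToySkipArray
{
    int low;    /* inclusive lower bound, e.g. from "x >= 1" */
    int high;   /* inclusive upper bound, e.g. from "x <= 2" */
    int cur;    /* current array element */
} ToySkipArray;

/* Position the array on its first element for a forward scan */
void
toy_skiparray_start(ToySkipArray *arr, int low, int high)
{
    assert(arr != NULL);
    assert(low <= high);
    arr->low = low;
    arr->high = high;
    arr->cur = low;
}

/*
 * Generate the next element by incrementing the current one.
 * Returns false (leaving cur unchanged) once the array is exhausted.
 * Testing against high first also avoids integer overflow.
 */
bool
toy_skiparray_increment(ToySkipArray *arr)
{
    assert(arr != NULL);
    if (arr->cur >= arr->high)
        return false;
    arr->cur++;
    return true;
}
```

For bounds 1 and 2, this model yields the elements 1 and 2 and then
reports exhaustion, much like an explicit two-element array would.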
B-Tree preprocessing is optimistic about skipping working out: it
applies static, generic rules when determining where to generate skip
arrays, which assumes that the runtime overhead of maintaining skip
arrays will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
Add a new scheduling heuristic: don't end the ongoing primitive index
scan immediately (at the point where _bt_advance_array_keys notices that
the next set of matching tuples must be on a later page) if the primscan
already managed to step right/left from its first leaf page. Schedule a
recheck against the next sibling leaf page's finaltup instead.
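
A minimal sketch of this decision, assuming a forward scan over a single
integer column. ToyScanState, ToySchedule, and both functions are
hypothetical names; the real scheduling logic weighs more state than
this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_END_PRIMSCAN, TOY_STEP_AND_RECHECK } ToySchedule;

/* Hypothetical scan-level state */
typedef struct ToyScanState
{
    bool passed_first_page;  /* primscan already stepped to a sibling? */
    bool recheck_pending;    /* recheck next page's finaltup on arrival */
} ToyScanState;

/*
 * Called when array advancement finds that the next matching tuples
 * must be on a later page.
 */
ToySchedule
toy_schedule(ToyScanState *scan)
{
    assert(scan != NULL);
    if (scan->passed_first_page)
    {
        scan->recheck_pending = true;
        return TOY_STEP_AND_RECHECK;
    }
    return TOY_END_PRIMSCAN;
}

/*
 * Called on arriving at the next page.  Returns true when the page is
 * worth reading, false when the page's final tuple is still before the
 * current array value (so a new primitive scan should start instead).
 */
bool
toy_recheck(ToyScanState *scan, int finaltup_val, int array_val)
{
    assert(scan != NULL && scan->recheck_pending);
    scan->recheck_pending = false;
    return finaltup_val >= array_val;
}
```

The recheck is what lets this model back out cheaply: stepping to the
sibling page costs one page read, not a fresh descent of the index.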
The new heuristic tends to avoid scenarios where the top-level scan
repeatedly starts and ends primitive index scans that each read only one
leaf page from a group of neighboring leaf pages. Affected top-level
scans will now tend to step forward (or backward) through the index
instead, without wasting cycles on descending the index anew.
The recheck mechanism isn't exactly new. But up until now it has only
been used to deal with edge cases involving high key finaltups with one
or more truncated -inf attributes that _bt_advance_array_keys deemed
"provisionally satisfied" (satisfied for the purposes of allowing the
scan to step onto the next page, subject to recheck once on that page).
The mechanism was added by commit 5bf748b8, which invented the general
concept of primitive scan scheduling. It was later enhanced by commit
79fa7b3b, which taught it about cases involving -inf attributes that
satisfy inequality scan keys required in the opposite-to-scan direction
only (arguably, they should have been covered by the earliest version).
Now the recheck mechanism can be applied based on scan-level heuristics,
which have nothing to do with truncated high keys. Now rechecks might
be performed by _bt_readpage when scanning in _either_ scan direction.
The theory behind the new heuristic is that any primitive scan that
makes it past its first leaf page is one that is already likely to have
arrays whose key values match index tuples that are closely clustered
together in the index. The rules that determine whether we ever get
past the first page are still conservative (that'll still only happen
when pstate.finaltup strongly suggests that it's the right thing to do).
Surviving past the first leaf page is a strong signal in itself.
Preparation for an upcoming patch that will add skip scan optimizations
to nbtree. That'll work by adding skip arrays, which behave similarly
to SAOP arrays, but generate their elements procedurally and on-demand.
Note that this commit isn't specifically concerned with skip arrays; the
scheduling logic doesn't (and won't) condition anything on whether the
scan uses skip arrays, SAOP arrays, or some combination of the two
(which seems like a good general principle for _bt_advance_array_keys).
While the problems that this commit ameliorates are more likely with
skip arrays (at least in practice), SAOP arrays (at least those with
very dense, contiguous array elements) are also affected.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com
Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro
(which accepts a "page" argument) into an object-like macro that can be
used from code that doesn't have convenient access to an nbtree page.
Preparation for an upcoming patch that adds skip scan to nbtree.
Parallel index scans that use skip scan will serialize datums (not just
SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize
will be used by btestimateparallelscan to determine how much DSM to
request.
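
The difference between the two macro styles can be sketched with a toy
calculation. Every constant below is an assumption chosen for
illustration (an 8KB block, 8-byte maximum alignment, and made-up
TOY_ struct sizes), and toy_parallel_scan_space is a hypothetical
helper; none of this is copied from the nbtree headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for illustration only */
#define TOY_BLCKSZ            8192
#define TOY_MAXIMUM_ALIGNOF   8
#define TOY_PAGE_HEADER_SIZE  24
#define TOY_ITEMID_SIZE       4
#define TOY_BT_OPAQUE_SIZE    16
#define TOY_ITEMPOINTER_SIZE  6

#define TOY_MAXALIGN(len) \
    (((size_t) (len) + (TOY_MAXIMUM_ALIGNOF - 1)) & \
     ~((size_t) (TOY_MAXIMUM_ALIGNOF - 1)))
#define TOY_MAXALIGN_DOWN(len) \
    ((size_t) (len) & ~((size_t) (TOY_MAXIMUM_ALIGNOF - 1)))

/*
 * Object-like macro: depends only on compile-time constants, so it can
 * be used without a page in hand.
 */
#define TOY_BT_MAX_ITEM_SIZE \
    (TOY_MAXALIGN_DOWN((TOY_BLCKSZ - \
                        TOY_MAXALIGN(TOY_PAGE_HEADER_SIZE + \
                                     3 * TOY_ITEMID_SIZE) - \
                        TOY_MAXALIGN(TOY_BT_OPAQUE_SIZE)) / 3) - \
     TOY_MAXALIGN(TOY_ITEMPOINTER_SIZE))

/* Hypothetical estimate: one worst-case datum per skip array */
size_t
toy_parallel_scan_space(int nskiparrays)
{
    assert(nskiparrays >= 0);
    return (size_t) nskiparrays * TOY_BT_MAX_ITEM_SIZE;
}
```

With these assumed sizes the macro evaluates to 2704 bytes, so a caller
that only knows how many skip arrays exist can still size its request.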
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com
Move nbtree's detection of RowCompare quals that are unsatisfiable due
to having a NULL in their first row element: rather than detecting these
cases at the point where _bt_first builds its insertion scan key, do so
earlier, during preprocessing proper. This brings the RowCompare case
in line with every other case involving an unsatisfiable-due-to-NULL qual.
nbtree now consistently detects such unsatisfiable quals -- even when
they happen to involve a key that isn't examined by _bt_first at all.
Affected cases thereby avoid useless full index scans that cannot
possibly return any matching rows.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmySVXst2hFrOATC-zw1Byg1XC-jYUS314=mzuqsNwk+Q@mail.gmail.com
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc, which means fewer cacheline accesses.
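
To give a feel for how a handful of per-column fields can fit in 16
bytes, here is a hedged sketch. The ToyCompactAttr struct and its field
selection are illustrative assumptions (not the definitive layout), and
the 16-byte size holds on typical ABIs where bool is one byte and
int32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative compact per-column descriptor (field set is assumed) */
typedef struct ToyCompactAttr
{
    int32_t attcacheoff;    /* cached offset within tuple, or -1 */
    int16_t attlen;         /* fixed length, or negative for varlena */
    bool    attbyval;
    bool    attispackable;
    bool    atthasmissing;
    bool    attisdropped;
    bool    attgenerated;
    bool    attnotnull;
    uint8_t attalignby;     /* alignment in bytes: 1, 2, 4 or 8 */
} ToyCompactAttr;

/* Round an offset up to the column's alignment (a power of two) */
size_t
toy_align_offset(size_t off, const ToyCompactAttr *att)
{
    size_t a;

    assert(att != NULL);
    a = att->attalignby;
    assert(a != 0 && (a & (a - 1)) == 0);
    return (off + a - 1) & ~(a - 1);
}
```

Four such descriptors fit in a single 64-byte cacheline, which is the
kind of density that makes a tight deformation loop cheaper.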
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
Consistently reset so->scanBehind at the beginning of nbtree array
advancement, even during sktrig_required=false calls (calls where array
advancement is triggered by an unsatisfied non-required array scan key).
Otherwise, it's possible for queries to fail to return all relevant
tuples to the scan given a low-order required scan key that was
previously deemed "satisfied" by a truncated high key attribute value.
This only happened at the point where a later non-required array scan
key needed to be "advanced" once on the next leaf page (that is, once
the right sibling of the truncated high key page was reached).
The underlying issue was that later code within _bt_advance_array_keys
assumed that the so->scanBehind flag must have been set using the
current page's high key (not the previous page's high key). Any later
successful recheck call to _bt_check_compare would therefore spuriously
be prevented from making _bt_advance_array_keys return true, based on
the faulty belief that the truncated attribute must be from the scan's
current tuple (i.e. the non-pivot tuple at the start of the next page).
_bt_advance_array_keys would return false for the tuple, ultimately
resulting in _bt_checkkeys failing to return a matching tuple.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkJKncfqyAUTeuB5GgRhT1vhsWO2q11dbZNqKmvjopP_g@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
Spell out how a = key associated with a SAOP array renders a > key
against the same index column redundant at the relevant point inside
_bt_preprocess_keys.
Follow-up to commit 5bf748b8.
Commit 1bd4bc85, which refactored nbtree sibling link traversal, made
_bt_parallel_seize reset the scan's currPos so that things were
consistent with the state of a serial backend moving between pages.
This overlooked the fact that _bt_readnextpage relied on the existing
currPos state to decide when to end the scan -- even though it came from
before the scan was seized. As a result of all this, parallel nbtree
scans could needlessly behave like full index scans.
To fix, teach _bt_readnextpage to explicitly allow the use of an already
read page's so->currPos when deciding whether to end the scan -- even
during parallel index scans (allow it consistently now). This requires
moving _bt_readnextpage's seizure of the scan to earlier in its loop.
That way _bt_readnextpage either deals with the true so->currPos state,
or an initialized-by-_bt_parallel_seize currPos state set from when the
scan was seized. Now _bt_steppage (the most important _bt_readnextpage
caller) takes the same uniform approach to setting up its call using
details taken from so->currPos -- regardless of whether the scan happens
to be parallel or serial.
The new loop structure in _bt_readnextpage is prone to getting confused
by P_NONE blknos set when the rightmost or leftmost page was reached.
We could avoid that by adding an explicit check, but that would be ugly.
Avoid this problem by teaching _bt_parallel_seize to end the parallel
scan instead of returning a P_NONE next block/blkno. Doing things this
way was arguably a missed opportunity for commit 1bd4bc85. It allows us
to remove a similar "blkno == P_NONE" check from _bt_first.
Oversight in commit 1bd4bc85, which refactored sibling link traversal
(as part of optimizing nbtree backward scan locking).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Diagnosed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Discussion: https://postgr.es/m/f8efb9c0f8d1a71b44fd7f8e42e49c25@oss.nttdata.com
Strictly speaking, we only need to make sure to leave the scan's array
keys in their final positions (final for the current scan direction) to
handle SAOP array exhaustion because btgettuple might only return a
subset of the items for the final page (final for the current scan
direction), before the scan changes direction. While it's typical for
so->currPos to be invalidated shortly after the scan's arrays are first
exhausted, and while so->currPos invalidation does obviate the need to
leave the scan's arrays in any particular state, we can't rely on any of
that actually happening when handling array exhaustion. Adjust comments
to make all of that a lot clearer.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
A bug in nbtree's handling of primitive index scan scheduling could lead
to wrong answers when a scrollable cursor was used with an index scan
that had a SAOP index qual. Wrong answers were only possible when the
scan direction changed after a primitive scan was scheduled, but before
_bt_next was asked to fetch the next tuple in line (i.e. for things to
break, _bt_next had to be denied the opportunity to step off the page in
the same direction as the one used when the primscan was scheduled).
Furthermore, the issue only occurred when the page in question happened
to be the first page to be visited by the entire top-level scan; the
issue hinged upon the cursor backing up to the absolute beginning of the
key space that it returns tuples from (fetching in the opposite scan
direction across a "primitive scan boundary" always worked correctly).
To fix, make _bt_next unset the "needs primitive index scan" flag when
it detects that the current scan direction is not the one that was used
by _bt_readpage back when the primitive scan in question was scheduled.
This fixes the cases that are known to be faulty, and also seems like a
good idea on general robustness grounds.
Affected scrollable cursor cases now avoid a spurious primitive index
scan when they fetch backwards to the absolute start of the key space to
be visited by their cursor. Fetching backwards now only returns those
tuples at the start of the scan, as expected. It'll also be okay to
once again fetch forwards from the start at that point, since the scan
will be left in a state that's exactly consistent with the state it was
in before any tuples were ever fetched, as expected.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com
Backpatch: 17-, where commit 5bf748b8 first appears.
Make nbtree backwards scans optimistically access the next page to be
read to the left by following a prevPage block number that's now stashed
in currPos when the leaf page is first read. This approach matches the
one taken during forward scans, which follow a symmetric nextPage block
number from currPos. We stash both a prevPage and a nextPage, since the
scan direction might change (when fetching from a scrollable cursor).
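
The stashed sibling links can be pictured with the sketch below. The
ToyCurrPos struct and all function names are hypothetical, and the
validity check is only a simplified model of how a backwards scan might
notice that the left sibling changed underneath it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyBlockNumber;

typedef enum { TOY_FORWARD, TOY_BACKWARD } ToyScanDir;

/* Hypothetical per-scan position state */
typedef struct ToyCurrPos
{
    ToyBlockNumber currPage;  /* page the items were read from */
    ToyBlockNumber prevPage;  /* left sibling, saved at read time */
    ToyBlockNumber nextPage;  /* right sibling, saved at read time */
} ToyCurrPos;

/* Save both sibling links while the leaf page is still locked */
void
toy_save_siblings(ToyCurrPos *pos, ToyBlockNumber curr,
                  ToyBlockNumber left, ToyBlockNumber right)
{
    assert(pos != NULL);
    pos->currPage = curr;
    pos->prevPage = left;
    pos->nextPage = right;
}

/* Pick the next block to visit without relocking the current page */
ToyBlockNumber
toy_next_block(const ToyCurrPos *pos, ToyScanDir dir)
{
    assert(pos != NULL);
    return (dir == TOY_FORWARD) ? pos->nextPage : pos->prevPage;
}

/*
 * Backwards only: after reading the saved left sibling, confirm that
 * its right link still points at the page we came from.  If not, a
 * concurrent split or deletion happened and recovery is needed.
 */
bool
toy_left_sibling_unchanged(const ToyCurrPos *pos,
                           ToyBlockNumber left_page_rightlink)
{
    assert(pos != NULL);
    return left_page_rightlink == pos->currPage;
}
```

Keeping both links means a direction change costs nothing extra in this
model: the scan simply consults the other saved block number.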
Backwards scans will no longer need to lock the same page twice, except
in rare cases where the scan detects a concurrent page split (or page
deletion). Testing has shown this optimization to be particularly
effective during parallel index-only backwards scans: ~12% reductions in
query execution time are quite possible.
We're much better off being optimistic; concurrent left sibling page
splits are rare in general. It's possible that we'll need to lock more
pages than the pessimistic approach would have, but only when there are
_multiple_ concurrent splits of the left sibling page we now start at.
If there's just a single concurrent left sibling page split, the new
approach to scanning backwards will at least break even relative to the
old one (we'll acquire the same number of leaf page locks as before).
The optimization from this commit has long been contemplated by comments
added by commit 2ed5b87f96, which changed the rules for locking/pinning
during nbtree index scans. The approach that that commit introduced to
leaf level link traversal when scanning forwards is now more or less
applied all the time, regardless of the direction we're scanning in.
Following uniform conventions around sibling link traversal is simpler.
The only real remaining difference between our forward and backwards
handling is that our backwards handling must still detect and recover
from any concurrent left sibling splits (and concurrent page deletions),
as documented in the nbtree README. That is structured as a single,
isolated extra step that takes place in _bt_readnextpage.
Also use this opportunity to further simplify the functions that deal
with reading pages and traversing sibling links on the leaf level, and
to document their preconditions and postconditions (with respect to
things like buffer locks, buffer pins, and seizing the parallel scan).
This enhancement completely supersedes the one recently added by commit
3f44959f.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when
and how to start the next primitive index scan based on physical index
characteristics. This included rules for deciding whether to start a
new primitive index scan (or whether to move onto the right sibling leaf
page instead) that specifically consider truncated lower-order columns
(-inf columns) from leaf page high keys.
These omitted columns were treated as satisfying the scan's required
scan keys, though only for scan keys marked required in the current scan
direction (forward). Scan keys that didn't get this behavior (those
marked required in the backwards direction only) usually didn't give the
scan reasonable cause to reposition itself to a later leaf page (via
another descent of the index in _bt_first), but _bt_advance_array_keys
would nevertheless always give up by forcing another call to _bt_first.
_bt_advance_array_keys was unwilling to allow the scan to continue onto
the next leaf page, to reconsider whether we really should start another
primitive scan based on the details of the sibling page's tuples. This
didn't match its behavior with similar cases involving keys required in
the current scan direction (forward), which seems unprincipled. It led
to an excessive number of primitive scans/index descents for queries
with a higher-order = array scan key (with dense, contiguous values)
mixed with a lower-order required > or >= scan key.
Bring > and >= strategy scan keys in line with other required scan key
types: treat truncated -inf scan keys as having satisfied scan keys
required in either scan direction (forwards and backwards alike) during
array advancement. That way affected scans can continue to the right
sibling leaf page. Advancement must now schedule an explicit recheck of
the right sibling page's high key in cases involving > or >= scan keys.
The recheck gives the scan a way to back out and start another primitive
index scan (we can't just rely on _bt_checkkeys with > or >= scan keys).
This work can be considered a standalone optimization on top of the
work from commit 5bf748b8. But it was written in preparation for an
upcoming patch that will add skip scan to nbtree. In practice scans
that use "skip arrays" will tend to be much more sensitive to any
implementation deficiencies in this area.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
The array->scan_key references fixed up at the end of preprocessing
start out as offsets into the arrayKeyData[] array (the array returned
by _bt_preprocess_array_keys at the start of preprocessing that involves
array scan keys). Offsets into the arrayKeyData[] array are no longer
guaranteed to be valid offsets into our original scan->keyData[] input
scan key array, but comments describing the array->scan_key references
still talked about scan->keyData[]. Update those comments.
Oversight in commit b5249741.
Teach _bt_preprocess_array_keys to eliminate redundant array equality
scan keys directly, rather than just marking them as redundant. Its
_bt_preprocess_keys caller is no longer required to ignore input scan
keys that were marked redundant in this way. Oversights like the one
fixed by commit f22e17f7 are no longer possible.
The new scheme also makes it easier for _bt_preprocess_keys to output a
so->keyData[] scan key array with _more_ scan keys than it was passed in
its scan->keyData[] input scan key array. An upcoming patch that adds
skip scan optimizations to nbtree will take advantage of this.
In passing, remove and rename certain _bt_preprocess_keys variables to
make the difference between our input scan key array and our output scan
key array clearer.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
Add bounds checking to nbtree's lookahead/skip-within-a-page mechanism.
Otherwise it's possible for cases with lots of before-array-keys tuples
to overflow an int16 variable, causing the mechanism to generate an out
of bounds page offset number.
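
One way to picture this kind of bounds checking is shown below. The
doubling policy, the TOY_MAX_TUPLES_PER_PAGE value, and both function
names are assumptions made for illustration; the sketch only shows the
general technique of capping an int16 distance and clamping the
resulting offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the number of tuples per page */
#define TOY_MAX_TUPLES_PER_PAGE 407

/*
 * Grow the lookahead distance by doubling, but stop growing before the
 * result could exceed the per-page bound (or overflow an int16).
 */
int16_t
toy_next_distance(int16_t distance)
{
    assert(distance > 0);
    if (distance < TOY_MAX_TUPLES_PER_PAGE / 2)
        return (int16_t) (distance * 2);
    return distance;
}

/*
 * Compute the page offset to jump to, using int arithmetic and
 * clamping the result to the page's valid offset range.
 */
int
toy_target_offset(int curoff, int16_t distance, int minoff, int maxoff)
{
    int target;

    assert(minoff <= maxoff);
    target = curoff + distance;
    if (target > maxoff)
        target = maxoff;
    if (target < minoff)
        target = minoff;
    return target;
}
```

However many times the distance is grown in this model, it stays within
the page bound, and the computed offset never leaves the page.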
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/6c68ac42-bbb5-8b24-103e-af0e279c536f@gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
_bt_advance_array_keys didn't take sufficient care at the point where it
decides whether to start a new primitive index scan based on a call to
_bt_check_compare against finaltup (a call with the scan direction
flipped around). The final decision was conditioned on rules about how
the scan key offset sktrig that initially triggered array advancement
(passed to _bt_advance_array_keys from its _bt_checkkeys caller)
compared to the offset set by its own _bt_check_compare finaltup call.
This approach was faulty, in that it allowed _bt_advance_array_keys to
incorrectly start a new primitive index scan that landed on the same
leaf page (on assert-enabled builds, this led to an assertion failure).
In general, scans with array keys are expected to never have to read the
same leaf page more than once (barring cases involving cursors, and
cases where the scan restores a marked position for the inner side of a
merge join). This principle was established by commit 5bf748b8.
To fix, make the final decision based on whether the scan key offset set
by the _bt_check_compare finaltup call is an offset to an inequality
strategy scan key. An unsatisfied required inequality strategy scan key
indicates that all of the scan's required equality strategy scan keys
must also be satisfied by finaltup (not just by caller's tuple), and
that there is a decent chance that _bt_first will be able to reposition
the scan to a position many leaf pages ahead of the current leaf page.
Oversight in commit 5bf748b8.
Discussion: https://postgr.es/m/CAH2-Wz=DyHbcg7o6zXqzyiin8WE8vzk4tvU8Lrnh-a=EAvO0TQ@mail.gmail.com
Certain cases involving the use of cursors had assertion failures within
_bt_preprocess_keys's recently added no-op return path. The assertion
in question made the faulty assumption that a second or third call to
_bt_preprocess_keys (within the same btrescan) could only happen when
another scheduled primitive index scan was just about to begin.
It would be possible to address the problem by only allowing scans that
have array keys to take the new no-op path, forcing affected cases to
perform redundant preprocessing work. It seems simpler to just remove
the assertion, and reframe the no-op path as a more general mechanism.
Take this simpler approach.
The important underlying principle is that we only need to perform
preprocessing once per btrescan (at most). This is expected regardless
of whether or not the scan happens to have array keys.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/ef0f7c8b-a6fa-362e-6fd6-054950f947ca@gmail.com
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
Preprocessing for nbtree index scans allowed array "input" scan keys
already marked eliminated during array-specific preprocessing to be
"fixed up" during preprocessing proper. This allowed eliminated scan
keys on DESC index columns to spuriously have their strategy commuted,
causing assertion failures.
To fix, teach _bt_fix_scankey_strategy to ignore these scan keys. This
brings it in line with its only caller, _bt_preprocess_keys.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Donghang Lin <donghanglin@gmail.com>
Discussion: https://postgr.es/m/CAA=D8a2sHK6CAzZ=0CeafC-Y-MFXbYxnRSHvZTi=+JHu6kAa8Q@mail.gmail.com
One of the assertions was the subject of a false positive complaint from
Coverity, but none of the assertions added much, so get rid of them.
Reported-By: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
nbtree index scans with SAOP inequalities (but no SAOP equalities)
performed extra ORDER proc lookups for any remaining equality strategy
scan keys. This could waste cycles, and caused assertion failures.
Keeping around a separate ORDER proc is only necessary for a scan's
non-array/non-SAOP equality scan keys when the scan has at least one
other SAOP equality strategy key (a SAOP inequality shouldn't count).
To fix, replace _bt_preprocess_array_keys_final's assertion with a test
that makes the function return early when the scan has no SAOP equality
scan keys.
Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient. All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value. Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan. Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
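As a rough illustration of advancing an array by binary search, the sketch below finds the first array element that is greater than or equal to a tuple's attribute value during a forward scan. It assumes a sorted array of plain integers; the function name and return convention are hypothetical and do not reflect nbtree's real interfaces.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the first element in the
 * sorted array that is >= tupval.  Returns nelems when every element is
 * less than tupval, meaning the array is exhausted for a forward scan.
 */
size_t
advance_array_to(const int *elems, size_t nelems, int tupval)
{
    size_t low = 0;
    size_t high = nelems;

    assert(elems != NULL || nelems == 0);

    while (low < high)
    {
        size_t mid = low + (high - low) / 2;

        if (elems[mid] < tupval)
            low = mid + 1;      /* element is behind the scan; discard it */
        else
            high = mid;         /* element may be the best match */
    }
    return low;
}
```

In this sketch an exhausted required array would be the point at which a higher-order array rolls over or the scan ends, while a non-required array would simply stop matching.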
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem). They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions. Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
as determined by include-what-you-use (IWYU)
While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that. In some cases, a more
specific #include replaces another less specific one.
Some manual adjustments of the automatic result:
- IWYU currently doesn't know about includes that provide global
  variable declarations (like -Wmissing-variable-declarations), so
  those includes are being kept manually.
- All includes for port(ability) headers are being kept for now, to
  play it safe.
- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
  patch from exploding in size.
Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.
As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.
Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
Commit e0b1ee17dc introduced an optimization for matching B-tree scan keys
required for the directional scan. However, it incorrectly assumed that all
keys required for the opposite-direction scan are satisfied by _bt_first().
It has been shown that with multiple scan keys over the same column, a lesser
one (according to the scan direction) could win, leaving the other unsatisfied.
Instead of relying on _bt_first() this commit introduces code that memorizes
whether there was at least one match on the page. If that's true we know that
keys required for the opposite-direction scan are satisfied as long as the
corresponding values are not NULL.
Also, this commit simplifies the description for the optimization of keys
required for the current direction scan. Now the flag used for this is named
continuescanPrechecked and means exactly that *continuescan flag is known
to be true for the last item on the page.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ%2BKOK-WZ%2BQtTQ%40mail.gmail.com
Reviewed-by: Pavel Borisov
Teach _bt_binsrch (and related helper routines like _bt_search and
_bt_compare) about the initial positioning requirements of backward
scans. Routines like _bt_binsrch already know all about "nextkey"
searches, so it seems natural to teach them about "goback"/backward
searches, too. These concepts are closely related, and are much easier
to understand when discussed together.
Now that certain implementation details are hidden from _bt_first, it's
straightforward to add a new optimization: backward scans using the <
strategy now avoid extra leaf page accesses in certain "boundary cases".
Consider the following example, which uses the tenk1 table (and its
tenk1_hundred index) from the standard regression tests:
SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;
Before this commit, nbtree would scan two leaf pages, even though it was
only really necessary to scan one leaf page. We'll now descend straight
to the leaf page containing a (12, -inf) high key instead. The scan
will locate matching non-pivot tuples with "hundred" values starting
from the value 11. The scan won't waste a page access on the right
sibling leaf page, which cannot possibly contain any matching tuples.
You can think of the optimization added by this commit as disabling an
optimization (the _bt_compare "!pivotsearch" behavior that was added to
Postgres 12 in commit dd299df8) for a small subset of cases where it was
always counterproductive.
Equivalently, you can think of the new optimization as extending the
"pivotsearch" behavior that page deletion by VACUUM has long required
(since the aforementioned Postgres 12 commit went in) to other, similar
cases. Obviously, this isn't strictly necessary for these new cases
(unlike VACUUM, _bt_first is prepared to move the scan to the left once
on the leaf level), but the underlying principle is the same.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
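The relationship between "nextkey" and "goback" searches can be sketched with a binary search over a sorted integer array standing in for a leaf page. This is an assumption-laden illustration: the function name, the 0-based indexes, and the -1 return value are invented for the example and are not nbtree's conventions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: find where a scan should start on a sorted "page".
 *
 * nextkey = false locates the first item >= key.
 * nextkey = true  locates the first item >  key.
 * goback  = true  steps back to the item just before that position,
 *                 which is what a backward scan wants.
 *
 * Returns a 0-based index.  With goback, -1 means no item qualifies;
 * without goback, nitems means the position is past the last item.
 */
int
binsrch_start(const int *items, int nitems, int key, bool nextkey, bool goback)
{
    int low = 0;
    int high = nitems;

    assert(nitems >= 0);

    while (low < high)
    {
        int  mid = low + (high - low) / 2;
        bool go_right = nextkey ? (items[mid] <= key) : (items[mid] < key);

        if (go_right)
            low = mid + 1;
        else
            high = mid;
    }
    return goback ? low - 1 : low;
}
```

Under these assumptions, a backward scan for "hundred < 12" uses nextkey = false with goback = true, and so lands directly on the last item whose value is 11.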
Currently, B-tree code matches every scan key to every item on the page.
Imagine an ordered B-tree scan for a query like this:
SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;
The (col > 'a') scan key will always be matched once we find the location to
start the scan. The (col < 'b') scan key will match every item on the page
as long as it matches the last item on the page.
This patch implements prechecking of the scan keys required for the
directional scan at the beginning of the page scan. If the precheck succeeds,
we can skip checking these scan keys for the items on the page. That could
lead to significant acceleration, especially if the comparison operator is
expensive.
Idea from patch by Konstantin Knizhnik.
Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan, Pavel Borisov
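The precheck idea can be sketched on a sorted array of integers that stands in for the items on one leaf page, with a single upper-bound key (value < upper) and a forward scan. The function name and the comparison counter are hypothetical, added only to make the saving visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the items on a sorted "page" that satisfy
 * (item < upper).  The key is first checked against the last item on the
 * page; if that passes, every earlier item must pass too, so the
 * per-item comparisons are skipped.  *ncompares reports how many
 * comparisons were actually performed.
 */
size_t
count_matches_with_precheck(const int *items, size_t nitems, int upper,
                            size_t *ncompares)
{
    size_t nmatches = 0;
    size_t i;
    bool   prechecked = false;

    assert(ncompares != NULL);
    *ncompares = 0;
    if (nitems == 0)
        return 0;

    /* Precheck: one comparison against the last item on the page */
    (*ncompares)++;
    if (items[nitems - 1] < upper)
        prechecked = true;

    for (i = 0; i < nitems; i++)
    {
        if (!prechecked)
        {
            (*ncompares)++;
            if (!(items[i] < upper))
                break;          /* key unsatisfied: the scan can stop */
        }
        nmatches++;
    }
    return nmatches;
}
```

When the precheck fails, the sketch falls back to per-item comparisons, so the only cost is the single extra comparison against the last item.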
nbtree's mark/restore processing failed to correctly handle an edge case
involving array key advancement and related search-type scan key state.
Scans with ScalarArrayOpExpr quals requiring mark/restore
processing (for a merge join) could incorrectly conclude that an
affected array/scan key must not have advanced during the time between
marking and restoring the scan's position.
As a result of all this, array key handling within btrestrpos could skip
a required call to _bt_preprocess_keys(). This confusion allowed later
primitive index scans to overlook tuples matching the true current array
keys. The scan's search-type scan keys would still have spurious values
corresponding to the final array element(s) -- not values matching the
first/now-current array element(s).
To fix, remember that "array key wraparound" has taken place during the
ongoing btrescan in a flag variable stored in the scan's state, and use
that information at the point where btrestrpos decides if another call
to _bt_preprocess_keys is required.
Oversight in commit 70bc5833, which taught nbtree to handle array keys
during mark/restore processing, but missed this subtlety. That commit
was itself a bug fix for an issue in commit 9e8da0f7, which taught
nbtree to handle ScalarArrayOpExpr quals natively.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
Split nbtree's _bt_getbuf function in two: code that read locks or write
locks existing pages remains in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function called
_bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no
longer necessary for them to pass a heaprel argument. Many of the
changes to nbtree from commit 61b313e4 can be reverted. This minimizes
the divergence between HEAD/PostgreSQL 16 and earlier release branches.
_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to
_bt_getbuf. There are only 3 affected call sites, all of which continue
to pass a heaprel for recovery conflict purposes. Note that nbtree's
use of P_NEW was superficial; nbtree never actually relied on the P_NEW
code paths in bufmgr.c, so this change is strictly mechanical.
GiST already took the same approach; it has a dedicated function for
allocating new pages called gistNewBuffer(). That factor allowed commit
61b313e4 to make much more targeted changes to GiST.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
This is done in preparation for logical decoding on standby, which needs
visibility-affecting WAL records to indicate whether they are about a (user)
catalog table. That is only known for the table, not for its indexes.
It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
Because we added StaticAssertStmt() before StaticAssertDecl(),
some uses as well as the instructions in c.h are now a bit backwards
from the "native" way static assertions are meant to be used in C.
This updates the guidance and moves some static assertions to better
places.
Specifically, since the addition of StaticAssertDecl(), we can put
static assertions at the file level. This moves a number of static
assertions out of function bodies, where they might have been stuck
out of necessity, to perhaps better places at the file level or in
header files.
Also, when the static assertion appears in a position where a
declaration is allowed, then using StaticAssertDecl() is more native
than StaticAssertStmt().
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
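The distinction between the declaration form and the statement form can be illustrated with plain C11 _Static_assert, used here as a stand-in for the StaticAssertDecl() and StaticAssertStmt() macros. The DemoHeader struct and demo_header_size() function are hypothetical, and the size check assumes a typical ABI with no padding in this layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-disk header used only for this illustration */
typedef struct DemoHeader
{
    uint32_t magic;
    uint16_t version;
    uint16_t flags;
} DemoHeader;

/*
 * Declaration form at file level: the check sits next to the type it
 * guards, and does not need to live inside any function body.
 */
_Static_assert(sizeof(DemoHeader) == 8, "DemoHeader must be 8 bytes");

int
demo_header_size(void)
{
    /*
     * The same construct inside a function body.  Before a file-level
     * form was available, checks like the one above had to be placed
     * somewhere like this.
     */
    _Static_assert(sizeof(uint16_t) == 2, "uint16_t must be 2 bytes");

    return (int) sizeof(DemoHeader);
}
```

Keeping the assertion beside the declaration means it is checked wherever the type is compiled, regardless of which functions happen to use it.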
This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018.
There is no longer agreement that introducing this function
was the right way to address the problem. The consensus now
seems to favor trying to make a correct value for MaxBackends
available to modules executing their _PG_init() functions.
Nathan Bossart
Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
This makes the code more consistent with SpGiST, GiST and GIN, which
already use this style. The idea is to make it easier to introduce
of more sanity checks for each of these AM-specific macros. BRIN uses a
different set of macros to get a page's type and flags, so it has no
need for something similar.
Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
Previously, it was really easy to write code that accessed MaxBackends
before we'd actually initialized it, especially when coding up an
extension. To make this less error-prone, introduce a new function
GetMaxBackends() which should be used to obtain the correct value.
This will ERROR if called too early. Demote the global variable to
a file-level static, so that nobody can peek at it directly.
Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane,
by Michael Paquier (who had doubts about the approach), and by me.
Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de
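The accessor pattern described above can be sketched in standalone C. This is only an illustration: the demo_ names are hypothetical, and where the real function raises an ERROR when called too early, this sketch reports failure through its return value so that it can run outside the server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* File-level statics: no other file can read these directly */
static int  demo_max_backends = 0;
static bool demo_max_backends_initialized = false;

/* Hypothetical initializer, expected to be called once during startup */
void
demo_initialize_max_backends(int value)
{
    assert(value > 0);
    demo_max_backends = value;
    demo_max_backends_initialized = true;
}

/*
 * Hypothetical accessor: stores the value in *result and returns true,
 * or returns false when called before initialization.
 */
bool
demo_get_max_backends(int *result)
{
    assert(result != NULL);
    if (!demo_max_backends_initialized)
        return false;
    *result = demo_max_backends;
    return true;
}
```

Routing every read through the accessor is what makes a too-early call detectable, which a bare global variable cannot do.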