Mirror of https://github.com/postgres/postgres.git (synced 2025-05-20 00:03:14 -04:00)

Write some real documentation about the index access method API.

Parent: 67ff8009cf
Commit: c6521b1b93
@@ -1,6 +1,6 @@
 <!--
 Documentation of the system catalogs, directed toward PostgreSQL developers
-$PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.95 2005/01/05 23:42:03 tgl Exp $
+$PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.96 2005/02/13 03:04:15 tgl Exp $
 -->

 <chapter id="catalogs">
@@ -289,9 +289,10 @@
 </indexterm>

 <para>
-The catalog <structname>pg_am</structname> stores information about index access
-methods. There is one row for each index access method supported by
-the system.
+The catalog <structname>pg_am</structname> stores information about index
+access methods. There is one row for each index access method supported by
+the system. The contents of this catalog are discussed in detail in
+<xref linkend="indexam">.
 </para>

 <table>
@@ -453,20 +454,6 @@
 </tgroup>
 </table>

-<para>
-An index access method that supports multiple columns (has
-<structfield>amcanmulticol</structfield> true) <emphasis>must</>
-support indexing null values in columns after the first, because the planner
-will assume the index can be used for queries on just the first
-column(s). For example, consider an index on (a,b) and a query with
-<literal>WHERE a = 4</literal>. The system will assume the index can be used to scan for
-rows with <literal>a = 4</literal>, which is wrong if the index omits rows where <literal>b</> is null.
-It is, however, OK to omit rows where the first indexed column is null.
-(GiST currently does so.)
-<structfield>amindexnulls</structfield> should be set true only if the
-index access method indexes all rows, including arbitrary combinations of null values.
-</para>
-
 </sect1>


@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.41 2005/01/10 00:04:38 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.42 2005/02/13 03:04:15 tgl Exp $ -->

 <!entity history SYSTEM "history.sgml">
 <!entity info SYSTEM "info.sgml">
@@ -77,7 +77,7 @@
 <!entity catalogs SYSTEM "catalogs.sgml">
 <!entity geqo SYSTEM "geqo.sgml">
 <!entity gist SYSTEM "gist.sgml">
-<!entity indexcost SYSTEM "indexcost.sgml">
+<!entity indexam SYSTEM "indexam.sgml">
 <!entity nls SYSTEM "nls.sgml">
 <!entity plhandler SYSTEM "plhandler.sgml">
 <!entity protocol SYSTEM "protocol.sgml">

doc/src/sgml/indexam.sgml (new file, 837 lines)
@@ -0,0 +1,837 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.1 2005/02/13 03:04:15 tgl Exp $
-->

<chapter id="indexam">
<title>Index Access Method Interface Definition</title>

<para>
This chapter defines the interface between the core
<productname>PostgreSQL</productname> system and <firstterm>index access
methods</>, which manage individual index types. The core system
knows nothing about indexes beyond what is specified here, so it is
possible to develop entirely new index types by writing add-on code.
</para>

<para>
All indexes in <productname>PostgreSQL</productname> are what are known
technically as <firstterm>secondary indexes</>; that is, the index is
physically separate from the table file that it describes. Each index
is stored as its own physical <firstterm>relation</> and so is described
by an entry in the <structname>pg_class</> catalog. The contents of an
index are entirely under the control of its index access method. In
practice, all index access methods divide indexes into standard-size
pages so that they can use the regular storage manager and buffer manager
to access the index contents. (All the existing index access methods
furthermore use the standard page layout described in <xref
linkend="storage-page-layout">, and they all use the same format for index
tuple headers; but these decisions are not forced on an access method.)
</para>

<para>
An index is effectively a mapping from some data key values to
<firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
(tuples) in the index's parent table. A TID consists of a
block number and an item number within that block (see <xref
linkend="storage-page-layout">). This is sufficient
information to fetch a particular row version from the table.
Indexes are not directly aware that under MVCC, there may be multiple
extant versions of the same logical row; to an index, each tuple is
an independent object that needs its own index entry. Thus, an
update of a row always creates all-new index entries for the row, even if
the key values did not change. Index entries for dead tuples are
reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.
</para>
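<para>
The mapping described above can be pictured with a small standalone C sketch.
It models a TID as a block number plus an item number and shows that two
versions of the same logical row each get their own index entry. All names
here (<literal>ToyTid</>, <literal>ToyIndex</>, and the helper functions) are
hypothetical and do not correspond to PostgreSQL's actual data structures.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tuple identifier: block number + item number. */
typedef struct ToyTid {
    uint32_t block;   /* heap block (page) number */
    uint16_t item;    /* item number within that block */
} ToyTid;

/* One index entry: a key value mapped to the TID of one row version. */
typedef struct ToyIndexEntry {
    int    key;
    ToyTid tid;
} ToyIndexEntry;

#define TOY_INDEX_CAPACITY 16

typedef struct ToyIndex {
    ToyIndexEntry entries[TOY_INDEX_CAPACITY];
    size_t        count;
} ToyIndex;

/* Start with an empty index. */
void toy_index_init(ToyIndex *index)
{
    index->count = 0;
}

/* Add an entry for one row version; the index knows nothing about logical rows. */
void toy_index_insert(ToyIndex *index, int key, ToyTid tid)
{
    assert(index->count < TOY_INDEX_CAPACITY);
    index->entries[index->count].key = key;
    index->entries[index->count].tid = tid;
    index->count++;
}

/* Count entries carrying a given key; several row versions may share it. */
size_t toy_index_count_key(const ToyIndex *index, int key)
{
    size_t matches = 0;
    for (size_t i = 0; i < index->count; i++)
        if (index->entries[i].key == key)
            matches++;
    return matches;
}
```

<para>
In this model, an update that leaves the key unchanged still calls
<literal>toy_index_insert</> again with the new row version's TID, so the
same key appears twice until the old version is reclaimed.
</para>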

<sect1 id="index-catalog">
<title>Catalog Entries for Indexes</title>

<para>
Each index access method is described by a row in the
<structname>pg_am</structname> system catalog (see
<xref linkend="catalog-pg-am">). The principal contents of a
<structname>pg_am</structname> row are references to
<link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
entries that identify the index access
functions supplied by the access method. The APIs for these functions
are defined later in this chapter. In addition, the
<structname>pg_am</structname> row specifies a few fixed properties of
the access method, such as whether it can support multi-column indexes.
There is not currently any special support
for creating or deleting <structname>pg_am</structname> entries;
anyone able to write a new access method is expected to be competent
to insert an appropriate row for themselves.
</para>

<para>
To be useful, an index access method must also have one or more
<firstterm>operator classes</> defined in
<link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
<link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
<link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
These entries allow the planner
to determine what kinds of query qualifications can be used with
indexes of this access method. Operator classes are described
in <xref linkend="xindex">, which is prerequisite material for reading
this chapter.
</para>

<para>
An individual index is defined by a
<link linkend="catalog-pg-class"><structname>pg_class</structname></link>
entry that describes it as a physical relation, plus a
<link linkend="catalog-pg-index"><structname>pg_index</structname></link>
entry that shows the logical content of the index &mdash; that is, the set
of index columns it has and the semantics of those columns, as captured by
the associated operator classes. The index columns (key values) can be
either simple columns of the underlying table or expressions over the table
rows. The index access method normally has no interest in where the index
key values come from (it is always handed precomputed key values) but it
will be very interested in the operator class information in
<structname>pg_index</structname>. Both of these catalog entries can be
accessed as part of the <structname>Relation</> data structure that is
passed to all operations on the index.
</para>

<para>
Some of the flag columns of <structname>pg_am</structname> have nonobvious
implications. The requirements of <structfield>amcanunique</structfield>
are discussed in <xref linkend="index-unique-checks">, and those of
<structfield>amconcurrent</structfield> in <xref linkend="index-locking">.
The <structfield>amcanmulticol</structfield> flag asserts that the
access method supports multi-column indexes, while
<structfield>amindexnulls</structfield> asserts that index entries are
created for NULL key values. Since most indexable operators are
strict and hence cannot return TRUE for NULL inputs,
it is at first sight attractive to not store index entries for NULLs:
they could never be returned by an index scan anyway. However, this
argument fails for a full-table index scan (one with no scan keys);
such a scan should include null rows. In practice this means that
indexes that support ordered scans (have <structfield>amorderstrategy</>
nonzero) must index nulls, since the planner might decide to use such a
scan as a substitute for sorting. Another restriction is that an index
access method that supports multiple index columns <emphasis>must</>
support indexing null values in columns after the first, because the planner
will assume the index can be used for queries on just the first
column(s). For example, consider an index on (a,b) and a query with
<literal>WHERE a = 4</literal>. The system will assume the index can be
used to scan for rows with <literal>a = 4</literal>, which is wrong if the
index omits rows where <literal>b</> is null.
It is, however, OK to omit rows where the first indexed column is null.
(GiST currently does so.) Thus,
<structfield>amindexnulls</structfield> should be set true only if the
index access method indexes all rows, including arbitrary combinations of
null values.
</para>
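<para>
The <literal>(a,b)</> example above can be checked with a small standalone C
model. It compares two hypothetical null-handling policies for a two-column
index and counts how many rows a scan for <literal>a = 4</> could return under
each. The types and function names are invented for illustration; this is a
sketch of the reasoning, not of any real access method.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-column row; the *_is_null flags model SQL NULL. */
typedef struct ToyRow {
    int  a;
    bool a_is_null;
    int  b;
    bool b_is_null;
} ToyRow;

typedef enum ToyNullPolicy {
    TOY_SKIP_IF_FIRST_NULL,  /* acceptable: omit rows whose first column is NULL */
    TOY_SKIP_IF_ANY_NULL     /* unsafe for a multi-column index */
} ToyNullPolicy;

/* Would an index built under the given policy hold an entry for this row? */
bool toy_row_is_indexed(const ToyRow *row, ToyNullPolicy policy)
{
    assert(row != NULL);
    if (row->a_is_null)
        return false;
    if (policy == TOY_SKIP_IF_ANY_NULL && row->b_is_null)
        return false;
    return true;
}

/* Rows an index scan for "a = value" can return under the policy. */
size_t toy_index_scan_a_eq(const ToyRow *rows, size_t nrows, int value,
                           ToyNullPolicy policy)
{
    size_t found = 0;
    for (size_t i = 0; i < nrows; i++)
        if (toy_row_is_indexed(&rows[i], policy) && rows[i].a == value)
            found++;
    return found;
}

/* Rows that actually satisfy "a = value" in the table (NULL never matches). */
size_t toy_table_scan_a_eq(const ToyRow *rows, size_t nrows, int value)
{
    size_t found = 0;
    for (size_t i = 0; i < nrows; i++)
        if (!rows[i].a_is_null && rows[i].a == value)
            found++;
    return found;
}
```

<para>
With a row such as <literal>(4, NULL)</> in the table, the
<literal>TOY_SKIP_IF_ANY_NULL</> policy returns fewer rows than the table
scan, which is exactly the wrong answer the text warns about.
</para>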

</sect1>

<sect1 id="index-functions">
<title>Index Access Method Functions</title>

<para>
The index construction and maintenance functions that an index access
method must provide are:
</para>

<para>
<programlisting>
void
ambuild (Relation heapRelation,
         Relation indexRelation,
         IndexInfo *indexInfo);
</programlisting>
Build a new index. The index relation has been physically created,
but is empty. It must be filled in with whatever fixed data the
access method requires, plus entries for all tuples already existing
in the table. Ordinarily the <function>ambuild</> function will call
<function>IndexBuildHeapScan()</> to scan the table for existing tuples
and compute the keys that need to be inserted into the index.
</para>

<para>
<programlisting>
InsertIndexResult
aminsert (Relation indexRelation,
          Datum *datums,
          char *nulls,
          ItemPointer heap_tid,
          Relation heapRelation,
          bool check_uniqueness);
</programlisting>
Insert a new tuple into an existing index. The <literal>datums</> and
<literal>nulls</> arrays give the key values to be indexed, and
<literal>heap_tid</> is the TID to be indexed.
If the access method supports unique indexes (its
<structname>pg_am</>.<structfield>amcanunique</> flag is true) then
<literal>check_uniqueness</> may be true, in which case the access method
must verify that there is no conflicting row; this is the only situation in
which the access method normally needs the <literal>heapRelation</>
parameter. See <xref linkend="index-unique-checks"> for details.
The result is a struct that must be pfree'd by the caller. (The result
struct is really quite useless and should be removed...)
</para>

<para>
<programlisting>
IndexBulkDeleteResult *
ambulkdelete (Relation indexRelation,
              IndexBulkDeleteCallback callback,
              void *callback_state);
</programlisting>
Delete tuple(s) from the index. This is a <quote>bulk delete</> operation
that is intended to be implemented by scanning the whole index and checking
each entry to see if it should be deleted.
The passed-in <literal>callback</> function may be called, in the style
<literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
to determine whether any particular index entry, as identified by its
referenced TID, is to be deleted. The function must return either NULL or
a palloc'd struct containing statistics about the effects of the deletion
operation.
</para>
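<para>
The callback protocol described above can be sketched in standalone C as
follows. The sketch scans every entry of a toy in-memory index, asks the
callback whether each referenced TID is dead, and compacts the survivors.
Every name is hypothetical, and the statistics struct is returned by value
purely for brevity; a real <function>ambulkdelete</> returns NULL or a
palloc'd <literal>IndexBulkDeleteResult</>.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TID: block number + item number. */
typedef struct ToyTid {
    uint32_t block;
    uint16_t item;
} ToyTid;

/* Same calling style as in the text: callback(TID, callback_state) returns bool. */
typedef bool (*ToyBulkDeleteCallback)(const ToyTid *tid, void *callback_state);

typedef struct ToyBulkDeleteStats {
    size_t tuples_removed;
    size_t tuples_remaining;
} ToyBulkDeleteStats;

/* Scan the whole toy index, dropping every entry the callback marks as deletable. */
ToyBulkDeleteStats toy_bulk_delete(ToyTid *entries, size_t *count,
                                   ToyBulkDeleteCallback callback,
                                   void *callback_state)
{
    ToyBulkDeleteStats stats = {0, 0};
    size_t kept = 0;

    assert(callback != NULL);
    for (size_t i = 0; i < *count; i++) {
        if (callback(&entries[i], callback_state))
            stats.tuples_removed++;
        else
            entries[kept++] = entries[i];   /* compact surviving entries */
    }
    *count = kept;
    stats.tuples_remaining = kept;
    return stats;
}

/* Example callback: treat every TID in one particular heap block as dead. */
bool toy_tid_in_dead_block(const ToyTid *tid, void *callback_state)
{
    const uint32_t *dead_block = (const uint32_t *) callback_state;
    return tid->block == *dead_block;
}
```

<para>
The access method never decides by itself which entries are dead; it only
reports what the caller's callback tells it to remove.
</para>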

<para>
<programlisting>
IndexBulkDeleteResult *
amvacuumcleanup (Relation indexRelation,
                 IndexVacuumCleanupInfo *info,
                 IndexBulkDeleteResult *stats);
</programlisting>
Clean up after a <command>VACUUM</command> operation (one or more
<function>ambulkdelete</> calls). An index access method does not have
to provide this function (if it does not, the entry in
<structname>pg_am</> must be zero). If it is provided, it is typically
used for bulk cleanup such as reclaiming empty index pages. <literal>info</>
provides some additional arguments such as a message level for statistical
reports, and <literal>stats</> is whatever the last
<function>ambulkdelete</> call returned. <function>amvacuumcleanup</>
may replace or modify this struct before returning it. If the result
is not NULL it must be a palloc'd struct. The statistics it contains
will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
</para>

<para>
The purpose of an index, of course, is to support scans for tuples matching
an indexable <literal>WHERE</> condition, often called a
<firstterm>qualifier</> or <firstterm>scan key</>. The semantics of
index scanning are described more fully in <xref linkend="index-scanning">,
below. The scan-related functions that an index access method must provide
are:
</para>

<para>
<programlisting>
IndexScanDesc
ambeginscan (Relation indexRelation,
             int nkeys,
             ScanKey key);
</programlisting>
Begin a new scan. The <literal>key</> array (of length <literal>nkeys</>)
describes the scan key(s) for the index scan. The result must be a
palloc'd struct. For implementation reasons the index access method
<emphasis>must</> create this struct by calling
<function>RelationGetIndexScan()</>. In most cases
<function>ambeginscan</> itself does little beyond making that call;
the interesting parts of indexscan startup are in <function>amrescan</>.
</para>

<para>
<programlisting>
bool
amgettuple (IndexScanDesc scan,
            ScanDirection direction);
</programlisting>
Fetch the next tuple in the given scan, moving in the given
direction (forward or backward in the index). Returns TRUE if a tuple was
obtained, FALSE if no matching tuples remain. In the TRUE case the tuple
TID is stored into the <literal>scan</> structure. Note that
<quote>success</> means only that the index contains an entry that matches
the scan keys, not that the tuple necessarily still exists in the heap or
will pass the caller's snapshot test.
</para>

<para>
<programlisting>
void
amrescan (IndexScanDesc scan,
          ScanKey key);
</programlisting>
Restart the given scan, possibly with new scan keys (to continue using
the old keys, NULL is passed for <literal>key</>). Note that it is not
possible for the number of keys to be changed. In practice the restart
feature is used when a new outer tuple is selected by a nestloop join
and so a new key comparison value is needed, but the scan key structure
remains the same. This function is also called by
<function>RelationGetIndexScan()</>, so it is used for initial setup
of an indexscan as well as rescanning.
</para>

<para>
<programlisting>
void
amendscan (IndexScanDesc scan);
</programlisting>
End a scan and release resources. The <literal>scan</> struct itself
should not be freed, but any locks or pins taken internally by the
access method must be released.
</para>

<para>
<programlisting>
void
ammarkpos (IndexScanDesc scan);
</programlisting>
Mark current scan position. The access method need only support one
remembered scan position per scan.
</para>

<para>
<programlisting>
void
amrestrpos (IndexScanDesc scan);
</programlisting>
Restore the scan to the most recently marked position.
</para>

<para>
<programlisting>
void
amcostestimate (Query *root,
                RelOptInfo *rel,
                IndexOptInfo *index,
                List *indexQuals,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>
Estimate the costs of an index scan. This function is described fully
in <xref linkend="index-cost-estimation">, below.
</para>

<para>
By convention, the <literal>pg_proc</literal> entry for any index
access method function should show the correct number of arguments,
but declare them all as type <type>internal</> (since most of the arguments
have types that are not known to SQL, and we don't want users calling
the functions directly anyway). The return type is declared as
<type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
</para>

</sect1>

<sect1 id="index-scanning">
<title>Index Scanning</title>

<para>
In an index scan, the index access method is responsible for regurgitating
the TIDs of all the tuples it has been told about that match the
<firstterm>scan keys</>. The access method is <emphasis>not</> involved in
actually fetching those tuples from the index's parent table, nor in
determining whether they pass the scan's time qualification test or other
conditions.
</para>

<para>
A scan key is the internal representation of a <literal>WHERE</> clause of
the form <replaceable>index_key</> <replaceable>operator</>
<replaceable>constant</>, where the index key is one of the columns of the
index and the operator is one of the members of the operator class
associated with that index column. An index scan has zero or more scan
keys, which are implicitly ANDed &mdash; the returned tuples are expected
to satisfy all the indicated conditions.
</para>

<para>
The operator class may indicate that the index is <firstterm>lossy</> for a
particular operator; this implies that the index scan will return all the
entries that pass the scan key, plus possibly additional entries that do
not. The core system's indexscan machinery will then apply that operator
again to the heap tuple to verify whether or not it really should be
selected. For non-lossy operators, the index scan must return exactly the
set of matching entries, as there is no recheck.
</para>
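<para>
The division of labor for a lossy operator can be illustrated with a
standalone C toy. The hypothetical index below stores only a coarse
signature of each key (the key divided by ten), so an equality probe may
return false positives but never misses a true match; a separate recheck
against the heap value then filters the candidates. None of these names come
from PostgreSQL.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the toy index stores for a key: a coarse signature, not the key itself. */
int toy_lossy_signature(int key)
{
    return key / 10;
}

/* Index-level test: equal keys always have equal signatures, so a real match
 * is never missed, but unequal keys can collide (false positives). */
bool toy_index_may_match(int stored_signature, int search_key)
{
    return stored_signature == toy_lossy_signature(search_key);
}

/* Recheck done outside the index, against the actual heap value. */
bool toy_heap_recheck(int heap_value, int search_key)
{
    return heap_value == search_key;
}

/* Run a lossy equality scan; report index candidates and confirmed matches. */
size_t toy_lossy_scan(const int *heap_values, size_t nvalues, int search_key,
                      size_t *candidates)
{
    size_t matches = 0;

    assert(candidates != NULL);
    *candidates = 0;
    for (size_t i = 0; i < nvalues; i++) {
        int stored = toy_lossy_signature(heap_values[i]);

        if (!toy_index_may_match(stored, search_key))
            continue;                 /* the index rules this entry out */
        (*candidates)++;              /* the index returns this entry */
        if (toy_heap_recheck(heap_values[i], search_key))
            matches++;                /* the recheck confirms it */
    }
    return matches;
}
```

<para>
If the operator were declared non-lossy, the recheck step would not happen,
so the index-level test alone would have to be exact.
</para>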

<para>
Note that it is entirely up to the access method to ensure that it
correctly finds all and only the entries passing all the given scan keys.
Also, the core system will simply hand off all the <literal>WHERE</>
clauses that match the index keys and operator classes, without any
semantic analysis to determine whether they are redundant or
contradictory. As an example, given
<literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
indexed column, it is left to the b-tree <function>amrescan</> function
to realize that the first scan key is redundant and can be discarded.
The extent of preprocessing needed during <function>amrescan</> will
depend on the extent to which the index access method needs to reduce
the scan keys to a <quote>normalized</> form.
</para>
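<para>
A minimal sketch of this kind of scan key preprocessing, written as
standalone C, is shown below. It handles only strict inequalities on a single
integer column, keeps the tightest lower and upper bound, and flags an
obviously contradictory pair. The types and the function
<literal>toy_normalize_keys</> are hypothetical; the real b-tree code is
considerably more general.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scan key strategies: "x < constant" and "x > constant". */
typedef enum ToyStrategy { TOY_LESS, TOY_GREATER } ToyStrategy;

typedef struct ToyScanKey {
    ToyStrategy strategy;
    int         constant;
} ToyScanKey;

/* Normalized form: at most one lower bound and one upper bound. */
typedef struct ToyBounds {
    bool has_lower;
    int  lower;          /* x > lower */
    bool has_upper;
    int  upper;          /* x < upper */
    bool contradictory;  /* true if no value can satisfy all keys */
} ToyBounds;

/* Reduce implicitly ANDed keys to the tightest bounds, discarding redundant ones. */
ToyBounds toy_normalize_keys(const ToyScanKey *keys, size_t nkeys)
{
    ToyBounds bounds = { false, 0, false, 0, false };

    assert(keys != NULL || nkeys == 0);
    for (size_t i = 0; i < nkeys; i++) {
        if (keys[i].strategy == TOY_GREATER) {
            /* keep the larger lower bound: x > 14 makes x > 4 redundant */
            if (!bounds.has_lower || keys[i].constant > bounds.lower) {
                bounds.has_lower = true;
                bounds.lower = keys[i].constant;
            }
        } else {
            /* keep the smaller upper bound */
            if (!bounds.has_upper || keys[i].constant < bounds.upper) {
                bounds.has_upper = true;
                bounds.upper = keys[i].constant;
            }
        }
    }
    /* Conservative check: x > lower AND x < upper is impossible if upper <= lower. */
    bounds.contradictory = bounds.has_lower && bounds.has_upper &&
                           bounds.upper <= bounds.lower;
    return bounds;
}
```

<para>
For the example in the text, the two keys collapse to the single lower bound
<literal>x &gt; 14</>, and a scan with contradictory keys can stop without
touching the index at all.
</para>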

<para>
The <function>amgettuple</> function has a <literal>direction</> argument,
which can be either <literal>ForwardScanDirection</> (the normal case)
or <literal>BackwardScanDirection</>. If the first call after
<function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
set of matching index entries is to be scanned back-to-front rather than in
the normal front-to-back direction, so <function>amgettuple</> must return
the last matching tuple in the index, rather than the first one as it
normally would. (This will only occur for access
methods that advertise they support ordered scans by setting
<structname>pg_am</>.<structfield>amorderstrategy</> nonzero.) After the
first call, <function>amgettuple</> must be prepared to advance the scan in
either direction from the most recently returned entry.
</para>

<para>
The access method must support <quote>marking</> a position in a scan
and later returning to the marked position. The same position may be
restored multiple times. However, only one position need be remembered
per scan; a new <function>ammarkpos</> call overrides the previously
marked position.
</para>
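<para>
The direction and mark/restore rules above can be modeled with a small
standalone C sketch. The scan below walks a fixed array of matching entries,
starts from the last entry when the first call is backward, moves relative to
the most recently returned entry afterwards, and remembers a single marked
position. All names are hypothetical, and the sketch ignores concurrent
changes to the index.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyDirection { TOY_BACKWARD = -1, TOY_FORWARD = 1 } ToyDirection;

typedef struct ToyScan {
    const int *matches;       /* matching entries, in index order */
    size_t     nmatches;
    long       pos;           /* position of the most recently returned entry */
    bool       started;       /* false until the first entry is returned */
    long       marked_pos;    /* the single remembered position */
    bool       marked_started;
} ToyScan;

/* (Re)start the scan; the first fetch decides which end to start from. */
void toy_rescan(ToyScan *scan, const int *matches, size_t nmatches)
{
    scan->matches = matches;
    scan->nmatches = nmatches;
    scan->pos = 0;
    scan->started = false;
    scan->marked_pos = 0;
    scan->marked_started = false;
}

/* Fetch the next entry in the given direction; false if none remains. */
bool toy_gettuple(ToyScan *scan, ToyDirection direction, int *result)
{
    long next;

    assert(scan != NULL && result != NULL);
    if (!scan->started)
        next = (direction == TOY_FORWARD) ? 0 : (long) scan->nmatches - 1;
    else
        next = scan->pos + (long) direction;

    if (next < 0 || next >= (long) scan->nmatches)
        return false;         /* position is left on the last returned entry */

    scan->pos = next;
    scan->started = true;
    *result = scan->matches[next];
    return true;
}

/* Remember the current position, replacing any earlier mark. */
void toy_markpos(ToyScan *scan)
{
    scan->marked_pos = scan->pos;
    scan->marked_started = scan->started;
}

/* Return to the marked position; may be called any number of times. */
void toy_restrpos(ToyScan *scan)
{
    scan->pos = scan->marked_pos;
    scan->started = scan->marked_started;
}
```

<para>
Keeping the mark as a plain copy of the scan position is what makes it cheap
to restore the same position repeatedly.
</para>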

<para>
Both the scan position and the mark position (if any) must be maintained
consistently in the face of concurrent insertions or deletions in the
index. It is OK if a freshly-inserted entry is not returned by a scan that
would have found the entry if it had existed when the scan started. It is
also OK if the scan returns such an entry upon rescanning or backing
up even though it had not been returned the first time through. Similarly,
a concurrent delete may or may not be reflected in the results of a scan.
What is important is that insertions or deletions not cause the scan to
miss or multiply return entries that were not themselves being inserted or
deleted. (For an index type that does not set
<structname>pg_am</>.<structfield>amconcurrent</>, it is sufficient to
handle these cases for insertions or deletions performed by the same
backend that's doing the scan. But when <structfield>amconcurrent</> is
true, insertions or deletions from other backends must be handled as well.)
</para>

</sect1>

<sect1 id="index-locking">
<title>Index Locking Considerations</title>

<para>
An index access method can choose whether it supports concurrent updates
of the index by multiple processes. If the method's
<structname>pg_am</>.<structfield>amconcurrent</> flag is true, then
the core <productname>PostgreSQL</productname> system obtains
<literal>AccessShareLock</> on the index during an index scan, and
<literal>RowExclusiveLock</> when updating the index. Since these lock
types do not conflict, the access method is responsible for handling any
fine-grained locking it may need. An exclusive lock on the index as a whole
will be taken only during index creation, destruction, or
<literal>REINDEX</>. When <structfield>amconcurrent</> is false,
<productname>PostgreSQL</productname> still obtains
<literal>AccessShareLock</> during index scans, but it obtains
<literal>AccessExclusiveLock</> during any update. This ensures that
updaters have sole use of the index. Note that this implicitly assumes
that index scans are read-only; an access method that might modify the
index during a scan will still have to do its own locking to handle the
case of concurrent scans.
</para>
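<para>
The lock choices described above can be summarized as a small decision table
in standalone C. The enum and function names are hypothetical and only mirror
the three lock modes named in the text. The text says only that <quote>an
exclusive lock</> is taken for creation, destruction, and
<literal>REINDEX</>; mapping that to the toy access-exclusive mode is an
assumption of this sketch.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Toy versions of the three index-level lock modes mentioned in the text. */
typedef enum ToyLockMode {
    TOY_ACCESS_SHARE_LOCK,
    TOY_ROW_EXCLUSIVE_LOCK,
    TOY_ACCESS_EXCLUSIVE_LOCK
} ToyLockMode;

typedef enum ToyIndexOperation {
    TOY_INDEX_SCAN,
    TOY_INDEX_UPDATE,
    TOY_INDEX_REBUILD      /* creation, destruction, or REINDEX */
} ToyIndexOperation;

/* Which index-level lock the core system takes for an operation. */
ToyLockMode toy_index_lock_mode(bool amconcurrent, ToyIndexOperation operation)
{
    switch (operation) {
    case TOY_INDEX_SCAN:
        return TOY_ACCESS_SHARE_LOCK;
    case TOY_INDEX_UPDATE:
        return amconcurrent ? TOY_ROW_EXCLUSIVE_LOCK
                            : TOY_ACCESS_EXCLUSIVE_LOCK;
    case TOY_INDEX_REBUILD:
        /* assumption: "an exclusive lock" is modeled as access exclusive */
        return TOY_ACCESS_EXCLUSIVE_LOCK;
    }
    assert(!"unknown index operation");
    return TOY_ACCESS_EXCLUSIVE_LOCK;
}

/* Among these three modes, only the access-exclusive mode conflicts with others. */
bool toy_locks_conflict(ToyLockMode held, ToyLockMode requested)
{
    return held == TOY_ACCESS_EXCLUSIVE_LOCK ||
           requested == TOY_ACCESS_EXCLUSIVE_LOCK;
}
```

<para>
The table makes the trade-off visible: a concurrent access method lets scans
and updates overlap and must therefore do its own fine-grained locking, while
a non-concurrent one has updates serialized against scans by other backends.
</para>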

<para>
Recall that a backend's own locks never conflict; therefore, even a
non-concurrent index type must be prepared to handle the case where
a backend is inserting or deleting entries in an index that it is itself
scanning. (This is of course necessary to support an <command>UPDATE</>
that uses the index to find the rows to be updated.)
</para>

<para>
Building an index type that supports concurrent updates usually requires
extensive and subtle analysis of the required behavior. For the b-tree
and hash index types, you can read about the design decisions involved in
<filename>src/backend/access/nbtree/README</> and
<filename>src/backend/access/hash/README</>.
</para>

<para>
Aside from the index's own internal consistency requirements, concurrent
updates create issues about consistency between the parent table (the
<firstterm>heap</>) and the index. Because
<productname>PostgreSQL</productname> separates accesses
and updates of the heap from those of the index, there are windows in
which the index may be inconsistent with the heap. We handle this problem
with the following rules:

<itemizedlist>
<listitem>
<para>
A new heap entry is made before making its index entries. (Therefore
a concurrent index scan is likely to fail to see the heap entry.
This is okay because the index reader would be uninterested in an
uncommitted row anyway. But see <xref linkend="index-unique-checks">.)
</para>
</listitem>
<listitem>
<para>
When a heap entry is to be deleted (by <command>VACUUM</>), all its
index entries must be removed first.
</para>
</listitem>
<listitem>
<para>
For concurrent index types, an indexscan must maintain a pin
on the index page holding the item last returned by
<function>amgettuple</>, and <function>ambulkdelete</> cannot delete
entries from pages that are pinned by other backends. The need
for this rule is explained below.
</para>
</listitem>
</itemizedlist>

If an index is concurrent then it is possible for an index reader to
see an index entry just before it is removed by <command>VACUUM</>, and
then to arrive at the corresponding heap entry after that was removed by
<command>VACUUM</>. (With a nonconcurrent index, this is not possible
because of the conflicting index-level locks that will be taken out.)
This creates no serious problems if that item
number is still unused when the reader reaches it, since an empty
item slot will be ignored by <function>heap_fetch()</>. But what if a
third backend has already re-used the item slot for something else?
When using an MVCC-compliant snapshot, there is no problem because
the new occupant of the slot is certain to be too new to pass the
snapshot test. However, with a non-MVCC-compliant snapshot (such as
<literal>SnapshotNow</>), it would be possible to accept and return
a row that does not in fact match the scan keys. We could defend
against this scenario by requiring the scan keys to be rechecked
against the heap row in all cases, but that is too expensive. Instead,
we use a pin on an index page as a proxy to indicate that the reader
may still be <quote>in flight</> from the index entry to the matching
heap entry. Making <function>ambulkdelete</> block on such a pin ensures
that <command>VACUUM</> cannot delete the heap entry before the reader
is done with it. This solution costs little in runtime, and adds blocking
overhead only in the rare cases where there actually is a conflict.
</para>

<para>
|
||||||
|
This solution requires that index scans be <quote>synchronous</>: we have
|
||||||
|
to fetch each heap tuple immediately after scanning the corresponding index
|
||||||
|
entry. This is expensive for a number of reasons. An
|
||||||
|
<quote>asynchronous</> scan in which we collect many TIDs from the index,
|
||||||
|
and only visit the heap tuples sometime later, requires much less index
|
||||||
|
locking overhead and may allow a more efficient heap access pattern.
|
||||||
|
Per the above analysis, we must use the synchronous approach for
|
||||||
|
non-MVCC-compliant snapshots, but an asynchronous scan would be safe
|
||||||
|
for a query using an MVCC snapshot. This possibility is not exploited
|
||||||
|
as of <productname>PostgreSQL</productname> 8.0, but it is likely to be
|
||||||
|
investigated soon.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
</sect1>

<sect1 id="index-unique-checks">
 <title>Index Uniqueness Checks</title>

 <para>
  <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
  using <firstterm>unique indexes</>, which are indexes that disallow
  multiple entries with identical keys. An access method that supports this
  feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
  (At present, only b-tree supports it.)
 </para>

 <para>
  Because of MVCC, it is always necessary to allow duplicate entries to
  exist physically in an index: the entries might refer to successive
  versions of a single logical row. The behavior we actually want to
  enforce is that no MVCC snapshot could include two rows with equal
  index keys. This breaks down into the following cases that must be
  checked when inserting a new row into a unique index:

  <itemizedlist>
   <listitem>
    <para>
     If a conflicting valid row has been deleted by the current transaction,
     it's okay. (In particular, since an <command>UPDATE</> always deletes
     the old row version before inserting the new version, this will allow
     an <command>UPDATE</> on a row without changing the key.)
    </para>
   </listitem>
   <listitem>
    <para>
     If a conflicting row has been inserted by an as-yet-uncommitted
     transaction, the would-be inserter must wait to see if that transaction
     commits. If it rolls back then there is no conflict. If it commits
     without deleting the conflicting row again, there is a uniqueness
     violation. (In practice we just wait for the other transaction to
     end and then redo the visibility check in toto.)
    </para>
   </listitem>
   <listitem>
    <para>
     Similarly, if a conflicting valid row has been deleted by an
     as-yet-uncommitted transaction, the would-be inserter must wait
     for that transaction to commit or abort, and then repeat the test.
    </para>
   </listitem>
  </itemizedlist>
 </para>
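
 <para>
  The cases listed above, together with the two outcomes they imply but do
  not spell out (a live committed duplicate is a violation, and a duplicate
  whose inserter aborted is no conflict), can be sketched as a small decision
  function. This is only an illustration under stated assumptions: the names
  <literal>ToyXactState</literal>, <literal>ToyUniqueResult</literal> and
  <literal>toy_check_unique</literal> are hypothetical and not part of the
  access method API, and a real access method would learn the transaction
  states by examining the heap row rather than receiving them as arguments.
 </para>

```c
#include <assert.h>

/* Hypothetical summary of a transaction's state as seen by the inserter. */
typedef enum ToyXactState
{
    XACT_NONE,          /* no such transaction (row was never deleted) */
    XACT_CURRENT,       /* the inserting backend's own transaction */
    XACT_COMMITTED,     /* some other transaction, already committed */
    XACT_ABORTED,       /* some other transaction, already rolled back */
    XACT_IN_PROGRESS    /* some other transaction, still running */
} ToyXactState;

typedef enum ToyUniqueResult
{
    UNIQUE_OK,              /* no conflict, insertion may proceed */
    UNIQUE_WAIT_AND_RETRY,  /* wait for the other transaction, then recheck */
    UNIQUE_VIOLATION        /* report a uniqueness violation */
} ToyUniqueResult;

/*
 * Decide what to do about one heap row whose index key equals the key
 * being inserted.  "inserter" and "deleter" describe the transactions
 * that created and (possibly) deleted that row.
 */
ToyUniqueResult
toy_check_unique(ToyXactState inserter, ToyXactState deleter)
{
    /* Every row was inserted by some transaction. */
    assert(inserter != XACT_NONE);

    /* The conflicting row never became valid: no conflict. */
    if (inserter == XACT_ABORTED)
        return UNIQUE_OK;

    /* Inserted by an as-yet-uncommitted transaction: wait and recheck. */
    if (inserter == XACT_IN_PROGRESS)
        return UNIQUE_WAIT_AND_RETRY;

    /* The row is valid; what matters now is whether it has been deleted. */
    switch (deleter)
    {
        case XACT_CURRENT:      /* deleted by us, e.g. by our own UPDATE */
        case XACT_COMMITTED:    /* deleted by a committed transaction */
            return UNIQUE_OK;
        case XACT_IN_PROGRESS:  /* deletion might still be rolled back */
            return UNIQUE_WAIT_AND_RETRY;
        case XACT_NONE:         /* never deleted */
        case XACT_ABORTED:      /* deletion was rolled back */
            return UNIQUE_VIOLATION;
    }
    return UNIQUE_VIOLATION;    /* not reached; keeps compilers quiet */
}
```

 <para>
  In this sketch, <literal>toy_check_unique(XACT_COMMITTED, XACT_CURRENT)</literal>
  corresponds to an update that does not change the key and yields
  <literal>UNIQUE_OK</literal>. Either wait case yields
  <literal>UNIQUE_WAIT_AND_RETRY</literal>, after which the whole check is
  repeated.
 </para>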

 <para>
  We require the index access method to apply these tests itself, which
  means that it must reach into the heap to check the commit status of
  any row that is shown to have a duplicate key according to the index
  contents. This is without a doubt ugly and non-modular, but it saves
  redundant work: if we did a separate probe then the index lookup for
  a conflicting row would be essentially repeated while finding the place to
  insert the new row's index entry. What's more, there is no obvious way
  to avoid race conditions unless the conflict check is an integral part
  of insertion of the new index entry.
 </para>

 <para>
  The main limitation of this scheme is that it has no convenient way
  to support deferred uniqueness checks.
 </para>

</sect1>

<sect1 id="index-cost-estimation">
 <title>Index Cost Estimation Functions</title>

 <para>
  The amcostestimate function is given a list of WHERE clauses that have
  been determined to be usable with the index. It must return estimates
  of the cost of accessing the index and the selectivity of the WHERE
  clauses (that is, the fraction of parent-table rows that will be
  retrieved during the index scan). For simple cases, nearly all the
  work of the cost estimator can be done by calling standard routines
  in the optimizer; the point of having an amcostestimate function is
  to allow index access methods to provide index-type-specific knowledge,
  in case it is possible to improve on the standard estimates.
 </para>

 <para>
  Each amcostestimate function must have the signature:

<programlisting>
void
amcostestimate (Query *root,
                RelOptInfo *rel,
                IndexOptInfo *index,
                List *indexQuals,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>

  The first four parameters are inputs:

  <variablelist>
   <varlistentry>
    <term>root</term>
    <listitem>
     <para>
      The query being processed.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>rel</term>
    <listitem>
     <para>
      The relation the index is on.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>index</term>
    <listitem>
     <para>
      The index itself.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>indexQuals</term>
    <listitem>
     <para>
      List of index qual clauses (implicitly ANDed);
      a NIL list indicates no qualifiers are available.
      Note that the list contains expression trees, not ScanKeys.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </para>

 <para>
  The last four parameters are pass-by-reference outputs:

  <variablelist>
   <varlistentry>
    <term>*indexStartupCost</term>
    <listitem>
     <para>
      Set to the cost of index start-up processing.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexTotalCost</term>
    <listitem>
     <para>
      Set to the total cost of index processing.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexSelectivity</term>
    <listitem>
     <para>
      Set to the index selectivity.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexCorrelation</term>
    <listitem>
     <para>
      Set to the correlation coefficient between the index scan order and
      the underlying table's order.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </para>

 <para>
  Note that cost estimate functions must be written in C, not in SQL or
  any available procedural language, because they must access internal
  data structures of the planner/optimizer.
 </para>

 <para>
  The index access costs should be computed in the units used by
  <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential disk block fetch
  has cost 1.0, a nonsequential fetch has cost random_page_cost, and
  the cost of processing one index row should usually be taken as
  cpu_index_tuple_cost (which is a user-adjustable optimizer parameter).
  In addition, an appropriate multiple of cpu_operator_cost should be charged
  for any comparison operators invoked during index processing (especially
  evaluation of the indexQuals themselves).
 </para>
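
 <para>
  As a worked illustration of these units, the following sketch adds up an
  index access cost from page and row counts. It is not planner code:
  <literal>ToyCostParams</literal> and <literal>toy_index_access_cost</literal>
  are hypothetical names, the parameters are passed in a struct rather than
  read from the planner's global variables, and the numbers used in the
  note that follows are illustrative values, not a statement of the defaults.
 </para>

```c
#include <assert.h>

/* Hypothetical holder for the planner parameters named in the text. */
typedef struct ToyCostParams
{
    double random_page_cost;      /* cost of one nonsequential page fetch */
    double cpu_index_tuple_cost;  /* cost of processing one index row */
    double cpu_operator_cost;     /* cost of one operator evaluation */
} ToyCostParams;

/*
 * Add up an index access cost in costsize.c units.  A sequential page
 * fetch is the unit of cost (1.0); everything else is scaled by the
 * corresponding parameter.
 */
double
toy_index_access_cost(const ToyCostParams *params,
                      double seq_pages,
                      double random_pages,
                      double index_rows,
                      double operators_per_row)
{
    double io_cost;
    double cpu_cost;

    assert(seq_pages >= 0.0 && random_pages >= 0.0);
    assert(index_rows >= 0.0 && operators_per_row >= 0.0);

    /* Disk cost: sequential fetches cost 1.0 each. */
    io_cost = seq_pages * 1.0 + random_pages * params->random_page_cost;

    /* CPU cost: per-row overhead plus the operators run against each row. */
    cpu_cost = index_rows *
        (params->cpu_index_tuple_cost +
         operators_per_row * params->cpu_operator_cost);

    return io_cost + cpu_cost;
}
```

 <para>
  With the illustrative settings 4.0, 0.001 and 0.0025, a scan that reads
  10 pages sequentially and 2 pages nonsequentially, and applies one operator
  to each of 1000 index rows, costs 10 + 8 + 3.5 = 21.5 units.
 </para>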

 <para>
  The access costs should include all disk and CPU costs associated with
  scanning the index itself, but <emphasis>not</> the costs of retrieving or
  processing the parent-table rows that are identified by the index.
 </para>

 <para>
  The <quote>start-up cost</quote> is the part of the total scan cost that must be expended
  before we can begin to fetch the first row. For most indexes this can
  be taken as zero, but an index type with a high start-up cost might want
  to set it nonzero.
 </para>

 <para>
  The indexSelectivity should be set to the estimated fraction of the parent
  table rows that will be retrieved during the index scan. In the case
  of a lossy index, this will typically be higher than the fraction of
  rows that actually pass the given qual conditions.
 </para>

 <para>
  The indexCorrelation should be set to the correlation (ranging between
  -1.0 and 1.0) between the index order and the table order. This is used
  to adjust the estimate for the cost of fetching rows from the parent
  table.
 </para>
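
 <para>
  To make the role of the correlation concrete, the sketch below shows one
  way a caller could use it: interpolating between a best-case and a
  worst-case heap fetch cost by the square of the correlation. The text
  above says only that the value adjusts the estimate; the quadratic
  interpolation is an assumption, modeled on our reading of
  <function>cost_index()</function> in <filename>costsize.c</filename>, and
  <literal>toy_heap_io_cost</literal> is a hypothetical name.
 </para>

```c
#include <assert.h>

/*
 * Interpolate the heap I/O cost between the perfectly correlated case
 * (min_io_cost) and the uncorrelated case (max_io_cost).  Squaring the
 * correlation makes -1.0 (reverse order) as good as +1.0.
 */
double
toy_heap_io_cost(double min_io_cost,
                 double max_io_cost,
                 double index_correlation)
{
    double csquared;

    assert(index_correlation >= -1.0 && index_correlation <= 1.0);
    assert(min_io_cost <= max_io_cost);

    csquared = index_correlation * index_correlation;
    return max_io_cost + csquared * (min_io_cost - max_io_cost);
}
```

 <para>
  Under this assumption, a correlation of zero charges the full worst-case
  cost, which is why zero is the conservative value to report when the
  correlation is unknown.
 </para>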

 <procedure>
  <title>Cost Estimation</title>
  <para>
   A typical cost estimator will proceed as follows:
  </para>

  <step>
   <para>
    Estimate and return the fraction of parent-table rows that will be visited
    based on the given qual conditions. In the absence of any index-type-specific
    knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:

<programlisting>
*indexSelectivity = clauselist_selectivity(root, indexQuals,
                                           rel->relid, JOIN_INNER);
</programlisting>
   </para>
  </step>

  <step>
   <para>
    Estimate the number of index rows that will be visited during the
    scan. For many index types this is the same as indexSelectivity times
    the number of rows in the index, but it might be more. (Note that the
    index's size in pages and rows is available from the IndexOptInfo struct.)
   </para>
  </step>

  <step>
   <para>
    Estimate the number of index pages that will be retrieved during the scan.
    This might be just indexSelectivity times the index's size in pages.
   </para>
  </step>

  <step>
   <para>
    Compute the index access cost. A generic estimator might do this:

<programlisting>
    /*
     * Our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     * Also, we charge for evaluation of the indexquals at each index row.
     * All the costs are assumed to be paid incrementally during the scan.
     */
    cost_qual_eval(&index_qual_cost, indexQuals);
    *indexStartupCost = index_qual_cost.startup;
    *indexTotalCost = numIndexPages +
        (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
</programlisting>
   </para>
  </step>

  <step>
   <para>
    Estimate the index correlation. For a simple ordered index on a single
    field, this can be retrieved from pg_statistic. If the correlation
    is not known, the conservative estimate is zero (no correlation).
   </para>
  </step>
 </procedure>
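
 <para>
  The procedure above can be put together as a single self-contained sketch.
  It uses stand-in types so that it compiles outside the backend:
  <literal>ToyIndexInfo</literal>, <literal>ToyQualCost</literal>,
  <literal>ToyEstimate</literal> and <literal>toy_generic_estimate</literal>
  are hypothetical names, the selectivity is taken as an argument in place of
  the <function>clauselist_selectivity()</function> call, and the qual cost is
  taken as an argument in place of <function>cost_qual_eval()</function>.
  It should be read as an outline of the arithmetic, not as the estimator
  that any real access method uses.
 </para>

```c
#include <assert.h>

/* Stand-in for the size fields the text says IndexOptInfo provides. */
typedef struct ToyIndexInfo
{
    double pages;   /* index size in pages */
    double tuples;  /* index size in rows */
} ToyIndexInfo;

/* Stand-in for the result of evaluating the cost of the index quals. */
typedef struct ToyQualCost
{
    double startup;    /* one-time cost */
    double per_tuple;  /* cost per index row examined */
} ToyQualCost;

/* The four outputs of a cost estimator, gathered into one struct. */
typedef struct ToyEstimate
{
    double startup_cost;
    double total_cost;
    double selectivity;
    double correlation;
} ToyEstimate;

ToyEstimate
toy_generic_estimate(const ToyIndexInfo *index,
                     double selectivity,
                     ToyQualCost qual_cost,
                     double cpu_index_tuple_cost)
{
    ToyEstimate est;
    double      num_index_tuples;
    double      num_index_pages;

    assert(selectivity >= 0.0 && selectivity <= 1.0);

    /* Step 1: fraction of parent-table rows visited (supplied by caller). */
    est.selectivity = selectivity;

    /* Step 2: index rows visited, assuming no index-specific overhead. */
    num_index_tuples = selectivity * index->tuples;

    /* Step 3: index pages fetched, using the same proportional guess. */
    num_index_pages = selectivity * index->pages;

    /* Step 4: pages read sequentially at 1.0 each, plus per-row CPU cost. */
    est.startup_cost = qual_cost.startup;
    est.total_cost = num_index_pages +
        (cpu_index_tuple_cost + qual_cost.per_tuple) * num_index_tuples;

    /* Step 5: correlation unknown, so report the conservative value. */
    est.correlation = 0.0;

    return est;
}
```

 <para>
  For an index of 100 pages and 10000 rows, a selectivity of 0.1, a per-row
  qual cost of 0.0025 and a per-row index cost of 0.001 (all illustrative
  values), the sketch visits 1000 rows and 10 pages, for a total cost of
  10 + 0.0035 * 1000 = 13.5 units.
 </para>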

 <para>
  Examples of cost estimator functions can be found in
  <filename>src/backend/utils/adt/selfuncs.c</filename>.
 </para>
</sect1>
</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->
@ -1,285 +0,0 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/indexcost.sgml,v 2.19 2005/01/22 22:06:17 momjian Exp $
-->

<chapter id="indexcost">
 <title>Index Cost Estimation Functions</title>

 <note>
  <title>Author</title>

  <para>
   Written by Tom Lane (<email>tgl@sss.pgh.pa.us</email>) on 2000-01-24
  </para>
 </note>

 <note>
  <para>
   This must eventually become part of a much larger chapter about
   writing new index access methods.
  </para>
 </note>

 <para>
  Every index access method must provide a cost estimation function for
  use by the planner/optimizer. The procedure OID of this function is
  given in the <literal>amcostestimate</literal> field of the access
  method's <literal>pg_am</literal> entry.

  <note>
   <para>
    Prior to <productname>PostgreSQL</productname> 7.0, a different
    scheme was used for registering
    index-specific cost estimation functions.
   </para>
  </note>
 </para>

 <para>
  The amcostestimate function is given a list of WHERE clauses that have
  been determined to be usable with the index. It must return estimates
  of the cost of accessing the index and the selectivity of the WHERE
  clauses (that is, the fraction of main-table rows that will be
  retrieved during the index scan). For simple cases, nearly all the
  work of the cost estimator can be done by calling standard routines
  in the optimizer; the point of having an amcostestimate function is
  to allow index access methods to provide index-type-specific knowledge,
  in case it is possible to improve on the standard estimates.
 </para>

 <para>
  Each amcostestimate function must have the signature:

<programlisting>
void
amcostestimate (Query *root,
                RelOptInfo *rel,
                IndexOptInfo *index,
                List *indexQuals,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>

  The first four parameters are inputs:

  <variablelist>
   <varlistentry>
    <term>root</term>
    <listitem>
     <para>
      The query being processed.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>rel</term>
    <listitem>
     <para>
      The relation the index is on.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>index</term>
    <listitem>
     <para>
      The index itself.
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>indexQuals</term>
    <listitem>
     <para>
      List of index qual clauses (implicitly ANDed);
      a NIL list indicates no qualifiers are available.
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </para>

 <para>
  The last four parameters are pass-by-reference outputs:

  <variablelist>
   <varlistentry>
    <term>*indexStartupCost</term>
    <listitem>
     <para>
      Set to cost of index start-up processing
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexTotalCost</term>
    <listitem>
     <para>
      Set to total cost of index processing
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexSelectivity</term>
    <listitem>
     <para>
      Set to index selectivity
     </para>
    </listitem>
   </varlistentry>

   <varlistentry>
    <term>*indexCorrelation</term>
    <listitem>
     <para>
      Set to correlation coefficient between index scan order and
      underlying table's order
     </para>
    </listitem>
   </varlistentry>
  </variablelist>
 </para>

 <para>
  Note that cost estimate functions must be written in C, not in SQL or
  any available procedural language, because they must access internal
  data structures of the planner/optimizer.
 </para>

 <para>
  The index access costs should be computed in the units used by
  <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential disk block fetch
  has cost 1.0, a nonsequential fetch has cost random_page_cost, and
  the cost of processing one index row should usually be taken as
  cpu_index_tuple_cost (which is a user-adjustable optimizer parameter).
  In addition, an appropriate multiple of cpu_operator_cost should be charged
  for any comparison operators invoked during index processing (especially
  evaluation of the indexQuals themselves).
 </para>

 <para>
  The access costs should include all disk and CPU costs associated with
  scanning the index itself, but NOT the costs of retrieving or processing
  the main-table rows that are identified by the index.
 </para>

 <para>
  The <quote>start-up cost</quote> is the part of the total scan cost that must be expended
  before we can begin to fetch the first row. For most indexes this can
  be taken as zero, but an index type with a high start-up cost might want
  to set it nonzero.
 </para>

 <para>
  The indexSelectivity should be set to the estimated fraction of the main
  table rows that will be retrieved during the index scan. In the case
  of a lossy index, this will typically be higher than the fraction of
  rows that actually pass the given qual conditions.
 </para>

 <para>
  The indexCorrelation should be set to the correlation (ranging between
  -1.0 and 1.0) between the index order and the table order. This is used
  to adjust the estimate for the cost of fetching rows from the main
  table.
 </para>

 <procedure>
  <title>Cost Estimation</title>
  <para>
   A typical cost estimator will proceed as follows:
  </para>

  <step>
   <para>
    Estimate and return the fraction of main-table rows that will be visited
    based on the given qual conditions. In the absence of any index-type-specific
    knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:

<programlisting>
*indexSelectivity = clauselist_selectivity(root, indexQuals,
                                           rel->relid, JOIN_INNER);
</programlisting>
   </para>
  </step>

  <step>
   <para>
    Estimate the number of index rows that will be visited during the
    scan. For many index types this is the same as indexSelectivity times
    the number of rows in the index, but it might be more. (Note that the
    index's size in pages and rows is available from the IndexOptInfo struct.)
   </para>
  </step>

  <step>
   <para>
    Estimate the number of index pages that will be retrieved during the scan.
    This might be just indexSelectivity times the index's size in pages.
   </para>
  </step>

  <step>
   <para>
    Compute the index access cost. A generic estimator might do this:

<programlisting>
    /*
     * Our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     * Also, we charge for evaluation of the indexquals at each index row.
     * All the costs are assumed to be paid incrementally during the scan.
     */
    cost_qual_eval(&index_qual_cost, indexQuals);
    *indexStartupCost = index_qual_cost.startup;
    *indexTotalCost = numIndexPages +
        (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
</programlisting>
   </para>
  </step>

  <step>
   <para>
    Estimate the index correlation. For a simple ordered index on a single
    field, this can be retrieved from pg_statistic. If the correlation
    is not known, the conservative estimate is zero (no correlation).
   </para>
  </step>
 </procedure>

 <para>
  Examples of cost estimator functions can be found in
  <filename>src/backend/utils/adt/selfuncs.c</filename>.
 </para>

 <para>
  By convention, the <literal>pg_proc</literal> entry for an
  <literal>amcostestimate</literal> function should show
  eight arguments all declared as <type>internal</> (since none of them have
  types that are known to SQL), and the return type is <type>void</>.
 </para>
</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->
@ -1,5 +1,5 @@
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp $
+$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.74 2005/02/13 03:04:15 tgl Exp $
 -->

 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
@ -235,7 +235,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.73 2005/01/10 00:04:38 tgl Exp
 &nls;
 &plhandler;
 &geqo;
-&indexcost;
+&indexam;
 &gist;
 &storage;
 &bki;
@ -1,5 +1,5 @@
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.38 2005/01/23 00:30:18 momjian Exp $
+$PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.39 2005/02/13 03:04:15 tgl Exp $
 -->

 <sect1 id="xindex">
@ -43,7 +43,7 @@ $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.38 2005/01/23 00:30:18 momjian E
 described in <classname>pg_am</classname>. It is possible to add a
 new index method by defining the required interface routines and
 then creating a row in <classname>pg_am</classname> — but that is
-far beyond the scope of this chapter.
+beyond the scope of this chapter (see <xref linkend="indexam">).
 </para>

 <para>
@ -514,7 +514,7 @@ CREATE OPERATOR < (
 <listitem>
 <para>
 Although <productname>PostgreSQL</productname> can cope with
-functions having the same name as long as they have different
+functions having the same SQL name as long as they have different
 argument data types, C can only cope with one global function
 having a given name. So we shouldn't name the C function
 something simple like <filename>abs_eq</filename>. Usually it's
@ -525,14 +525,12 @@ CREATE OPERATOR < (

 <listitem>
 <para>
-We could have made the <productname>PostgreSQL</productname> name
+We could have made the SQL name
 of the function <filename>abs_eq</filename>, relying on
 <productname>PostgreSQL</productname> to distinguish it by
-argument data types from any other
-<productname>PostgreSQL</productname> function of the same name.
+argument data types from any other SQL function of the same name.
 To keep the example simple, we make the function have the same
-names at the C level and <productname>PostgreSQL</productname>
-level.
+names at the C level and SQL level.
 </para>
 </listitem>
 </itemizedlist>