mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-09-30 18:41:16 +02:00
7e2f906201
The only use we have had for amindexnulls is in determining whether an index is safe to cluster on; but since the addition of the amclusterable flag, that usage is pretty redundant. In passing, clean up assorted sloppiness from the last patch that touched pg_am.h: Natts_pg_am was wrong, and ambuildempty was not documented.
1113 lines
47 KiB
Plaintext
<!-- doc/src/sgml/indexam.sgml -->
|
|
|
|
<chapter id="indexam">
|
|
<title>Index Access Method Interface Definition</title>
|
|
|
|
<para>
|
|
This chapter defines the interface between the core
|
|
<productname>PostgreSQL</productname> system and <firstterm>index access
|
|
methods</>, which manage individual index types. The core system
|
|
knows nothing about indexes beyond what is specified here, so it is
|
|
possible to develop entirely new index types by writing add-on code.
|
|
</para>
|
|
|
|
<para>
|
|
All indexes in <productname>PostgreSQL</productname> are what are known
|
|
technically as <firstterm>secondary indexes</>; that is, the index is
|
|
physically separate from the table file that it describes. Each index
|
|
is stored as its own physical <firstterm>relation</> and so is described
|
|
by an entry in the <structname>pg_class</> catalog. The contents of an
|
|
index are entirely under the control of its index access method. In
|
|
practice, all index access methods divide indexes into standard-size
|
|
pages so that they can use the regular storage manager and buffer manager
|
|
to access the index contents. (All the existing index access methods
|
|
furthermore use the standard page layout described in <xref
|
|
linkend="storage-page-layout">, and they all use the same format for index
|
|
tuple headers; but these decisions are not forced on an access method.)
|
|
</para>
|
|
|
|
<para>
|
|
An index is effectively a mapping from some data key values to
|
|
<firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
|
|
(tuples) in the index's parent table. A TID consists of a
|
|
block number and an item number within that block (see <xref
|
|
linkend="storage-page-layout">). This is sufficient
|
|
information to fetch a particular row version from the table.
|
|
Indexes are not directly aware that under MVCC, there might be multiple
|
|
extant versions of the same logical row; to an index, each tuple is
|
|
an independent object that needs its own index entry. Thus, an
|
|
update of a row always creates all-new index entries for the row, even if
|
|
the key values did not change. (HOT tuples are an exception to this
|
|
statement; but indexes do not deal with those, either.) Index entries for
|
|
dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
|
|
are reclaimed.
|
|
</para>
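<para>
The TID concept above can be sketched in standalone C. This is a minimal
illustration only: the names <literal>SketchTid</>, <literal>SketchIndexEntry</>,
<literal>sketch_tid_compare</> and <literal>sketch_count_entries</> are hypothetical
and are not the definitions used by the server. The sketch shows that an index entry
pairs a key with a (block, item) TID, and that two versions of the same logical row
appear as two independent entries even when the key is unchanged.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tuple identifier. */
typedef struct SketchTid
{
    uint32_t block;     /* block (page) number within the table */
    uint16_t item;      /* item number within that block */
} SketchTid;

/* Hypothetical index entry: a key value mapped to one row version. */
typedef struct SketchIndexEntry
{
    int       key;
    SketchTid tid;
} SketchIndexEntry;

/* Order TIDs by block first, then by item; returns <0, 0, or >0. */
int
sketch_tid_compare(SketchTid a, SketchTid b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.item != b.item)
        return (a.item < b.item) ? -1 : 1;
    return 0;
}

/* Count the entries carrying a given key; each row version has its own. */
size_t
sketch_count_entries(const SketchIndexEntry *entries, size_t n, int key)
{
    size_t count = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (entries[i].key == key)
            count++;
    return count;
}
```

<para>
For example, a row with key 42 stored at TID (7,1) and then updated in place to a
new version at TID (7,2) yields two entries for key 42 in this model, until the dead
version and its entry are reclaimed.
</para>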
|
|
|
|
<sect1 id="index-catalog">
|
|
<title>Catalog Entries for Indexes</title>
|
|
|
|
<para>
|
|
Each index access method is described by a row in the
|
|
<structname>pg_am</structname> system catalog (see
|
|
<xref linkend="catalog-pg-am">). The principal contents of a
|
|
<structname>pg_am</structname> row are references to
|
|
<link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
|
|
entries that identify the index access
|
|
functions supplied by the access method. The APIs for these functions
|
|
are defined later in this chapter. In addition, the
|
|
<structname>pg_am</structname> row specifies a few fixed properties of
|
|
the access method, such as whether it can support multicolumn indexes.
|
|
There is not currently any special support
|
|
for creating or deleting <structname>pg_am</structname> entries;
|
|
anyone able to write a new access method is expected to be competent
|
|
to insert an appropriate row for themselves.
|
|
</para>
|
|
|
|
<para>
|
|
To be useful, an index access method must also have one or more
|
|
<firstterm>operator families</> and
|
|
<firstterm>operator classes</> defined in
|
|
<link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
|
|
<link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
|
|
<link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
|
|
<link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
|
|
These entries allow the planner
|
|
to determine what kinds of query qualifications can be used with
|
|
indexes of this access method. Operator families and classes are described
|
|
in <xref linkend="xindex">, which is prerequisite material for reading
|
|
this chapter.
|
|
</para>
|
|
|
|
<para>
|
|
An individual index is defined by a
|
|
<link linkend="catalog-pg-class"><structname>pg_class</structname></link>
|
|
entry that describes it as a physical relation, plus a
|
|
<link linkend="catalog-pg-index"><structname>pg_index</structname></link>
|
|
entry that shows the logical content of the index — that is, the set
|
|
of index columns it has and the semantics of those columns, as captured by
|
|
the associated operator classes. The index columns (key values) can be
|
|
either simple columns of the underlying table or expressions over the table
|
|
rows. The index access method normally has no interest in where the index
|
|
key values come from (it is always handed precomputed key values) but it
|
|
will be very interested in the operator class information in
|
|
<structname>pg_index</structname>. Both of these catalog entries can be
|
|
accessed as part of the <structname>Relation</> data structure that is
|
|
passed to all operations on the index.
|
|
</para>
|
|
|
|
<para>
|
|
Some of the flag columns of <structname>pg_am</structname> have nonobvious
|
|
implications. The requirements of <structfield>amcanunique</structfield>
|
|
are discussed in <xref linkend="index-unique-checks">.
|
|
The <structfield>amcanmulticol</structfield> flag asserts that the
|
|
access method supports multicolumn indexes, while
|
|
<structfield>amoptionalkey</structfield> asserts that it allows scans
|
|
where no indexable restriction clause is given for the first index column.
|
|
When <structfield>amcanmulticol</structfield> is false,
|
|
<structfield>amoptionalkey</structfield> essentially says whether the
|
|
access method supports full-index scans without any restriction clause.
|
|
Access methods that support multiple index columns <emphasis>must</>
|
|
support scans that omit restrictions on any or all of the columns after
|
|
the first; however, they are permitted to require some restriction to
|
|
appear for the first index column, and this is signaled by setting
|
|
<structfield>amoptionalkey</structfield> false.
|
|
One reason that an index AM might set
|
|
<structfield>amoptionalkey</structfield> false is if it doesn't index
|
|
NULLs. Since most indexable operators are
|
|
strict and hence cannot return TRUE for NULL inputs,
|
|
it is at first sight attractive to not store index entries for null values:
|
|
they could never be returned by an index scan anyway. However, this
|
|
argument fails when an index scan has no restriction clause for a given
|
|
index column. In practice this means that
|
|
indexes that have <structfield>amoptionalkey</structfield> true must
|
|
index nulls, since the planner might decide to use such an index
|
|
with no scan keys at all. A related restriction is that an index
|
|
access method that supports multiple index columns <emphasis>must</>
|
|
support indexing null values in columns after the first, because the planner
|
|
will assume the index can be used for queries that do not restrict
|
|
these columns. For example, consider an index on (a,b) and a query with
|
|
<literal>WHERE a = 4</literal>. The system will assume the index can be
|
|
used to scan for rows with <literal>a = 4</literal>, which is wrong if the
|
|
index omits rows where <literal>b</> is null.
|
|
It is, however, OK to omit rows where the first indexed column is null.
|
|
An index access method that does index nulls may also set
|
|
<structfield>amsearchnulls</structfield>, indicating that it supports
|
|
<literal>IS NULL</> and <literal>IS NOT NULL</> clauses as search
|
|
conditions.
|
|
</para>
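<para>
The reasoning about nulls in non-leading columns can be checked with a toy model.
The following sketch is an assumption-laden simplification: the "index" is just a
filtered copy of the rows, and <literal>SketchRow</>, <literal>sketch_build_index</>
and <literal>sketch_scan_a_equals</> are hypothetical names. It shows that an index
on (a,b) which drops rows with a null <literal>b</> returns too few rows for
<literal>WHERE a = 4</literal>.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-column row; b may be null. */
typedef struct SketchRow
{
    int  a;
    int  b;
    bool b_is_null;
} SketchRow;

/*
 * Build a toy two-column "index" as a filtered copy of the rows.
 * With skip_null_b set, rows whose second column is null get no entry,
 * which is the behavior the text above says is not allowed.
 * Returns the number of entries written to out[].
 */
size_t
sketch_build_index(const SketchRow *rows, size_t nrows,
                   bool skip_null_b, SketchRow *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < nrows; i++)
    {
        if (skip_null_b && rows[i].b_is_null)
            continue;
        out[n++] = rows[i];
    }
    return n;
}

/* Scan with a single key on the first column only: a = target. */
size_t
sketch_scan_a_equals(const SketchRow *index, size_t nentries, int target)
{
    size_t matches = 0;

    for (size_t i = 0; i < nentries; i++)
        if (index[i].a == target)
            matches++;
    return matches;
}
```

<para>
With rows (4,1), (4,NULL) and (5,2), the complete index finds two rows with
<literal>a = 4</literal>, while the index that omits null <literal>b</> values finds
only one.
</para>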
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="index-functions">
|
|
<title>Index Access Method Functions</title>
|
|
|
|
<para>
|
|
The index construction and maintenance functions that an index access
|
|
method must provide are:
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
IndexBuildResult *
|
|
ambuild (Relation heapRelation,
|
|
Relation indexRelation,
|
|
IndexInfo *indexInfo);
|
|
</programlisting>
|
|
Build a new index. The index relation has been physically created,
|
|
but is empty. It must be filled in with whatever fixed data the
|
|
access method requires, plus entries for all tuples already existing
|
|
in the table. Ordinarily the <function>ambuild</> function will call
|
|
<function>IndexBuildHeapScan()</> to scan the table for existing tuples
|
|
and compute the keys that need to be inserted into the index.
|
|
The function must return a palloc'd struct containing statistics about
|
|
the new index.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
ambuildempty (Relation indexRelation);
|
|
</programlisting>
|
|
Build an empty index, and write it to the initialization fork (INIT_FORKNUM)
|
|
of the given relation. This method is called only for unlogged tables; the
|
|
empty index written to the initialization fork will be copied over the main
|
|
relation fork on each server restart.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
bool
|
|
aminsert (Relation indexRelation,
|
|
Datum *values,
|
|
bool *isnull,
|
|
ItemPointer heap_tid,
|
|
Relation heapRelation,
|
|
IndexUniqueCheck checkUnique);
|
|
</programlisting>
|
|
Insert a new tuple into an existing index. The <literal>values</> and
|
|
<literal>isnull</> arrays give the key values to be indexed, and
|
|
<literal>heap_tid</> is the TID to be indexed.
|
|
If the access method supports unique indexes (its
|
|
<structname>pg_am</>.<structfield>amcanunique</> flag is true) then
|
|
<literal>checkUnique</> indicates the type of uniqueness check to
|
|
perform. This varies depending on whether the unique constraint is
|
|
deferrable; see <xref linkend="index-unique-checks"> for details.
|
|
Normally the access method only needs the <literal>heapRelation</>
|
|
parameter when performing uniqueness checking (since then it will have to
|
|
look into the heap to verify tuple liveness).
|
|
</para>
|
|
|
|
<para>
|
|
The function's Boolean result value is significant only when
|
|
<literal>checkUnique</> is <literal>UNIQUE_CHECK_PARTIAL</>.
|
|
In this case a TRUE result means the new entry is known unique, whereas
|
|
FALSE means it might be non-unique (and a deferred uniqueness check must
|
|
be scheduled). For other cases a constant FALSE result is recommended.
|
|
</para>
|
|
|
|
<para>
|
|
Some indexes might not index all tuples. If the tuple is not to be
|
|
indexed, <function>aminsert</> should just return without doing anything.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
IndexBulkDeleteResult *
|
|
ambulkdelete (IndexVacuumInfo *info,
|
|
IndexBulkDeleteResult *stats,
|
|
IndexBulkDeleteCallback callback,
|
|
void *callback_state);
|
|
</programlisting>
|
|
Delete tuple(s) from the index. This is a <quote>bulk delete</> operation
|
|
that is intended to be implemented by scanning the whole index and checking
|
|
each entry to see if it should be deleted.
|
|
The passed-in <literal>callback</> function must be called, in the style
|
|
<literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
|
|
to determine whether any particular index entry, as identified by its
|
|
referenced TID, is to be deleted. The function must return either NULL or a palloc'd
|
|
struct containing statistics about the effects of the deletion operation.
|
|
It is OK to return NULL if no information needs to be passed on to
|
|
<function>amvacuumcleanup</>.
|
|
</para>
|
|
|
|
<para>
|
|
Because of limited <varname>maintenance_work_mem</>,
|
|
<function>ambulkdelete</> might need to be called more than once when many
|
|
tuples are to be deleted. The <literal>stats</> argument is the result
|
|
of the previous call for this index (it is NULL for the first call within a
|
|
<command>VACUUM</> operation). This allows the AM to accumulate statistics
|
|
across the whole operation. Typically, <function>ambulkdelete</> will
|
|
modify and return the same struct if the passed <literal>stats</> is not
|
|
null.
|
|
</para>
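<para>
The calling pattern for bulk deletion can be sketched outside the server as follows.
This is not the real interface: the index is a plain array, <function>calloc</> stands
in for palloc, and <literal>SketchTid</>, <literal>SketchBulkStats</>,
<literal>sketch_bulkdelete</> and <literal>sketch_dead_in_block</> are hypothetical
names. The sketch shows the two points made above: every entry is offered to the
callback, and the statistics struct from the previous call is reused and extended.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct SketchTid
{
    unsigned block;
    unsigned item;
} SketchTid;

/* Hypothetical statistics carried across calls within one vacuum. */
typedef struct SketchBulkStats
{
    double tuples_removed;      /* running total over all calls */
    double num_index_tuples;    /* entries remaining after the latest call */
} SketchBulkStats;

typedef bool (*SketchDeleteCallback) (const SketchTid *tid, void *state);

/*
 * Scan the whole entry array and drop every entry the callback reports dead.
 * *nentries is updated in place.  If stats is NULL (first call), a new struct
 * is allocated; otherwise the passed-in struct is modified and returned.
 */
SketchBulkStats *
sketch_bulkdelete(SketchTid *entries, size_t *nentries,
                  SketchBulkStats *stats,
                  SketchDeleteCallback callback, void *callback_state)
{
    size_t kept = 0;

    assert(entries != NULL && nentries != NULL && callback != NULL);
    if (stats == NULL)
    {
        stats = calloc(1, sizeof(SketchBulkStats));
        if (stats == NULL)
            return NULL;
    }
    for (size_t i = 0; i < *nentries; i++)
    {
        if (callback(&entries[i], callback_state))
            stats->tuples_removed += 1;
        else
            entries[kept++] = entries[i];
    }
    *nentries = kept;
    stats->num_index_tuples = (double) kept;
    return stats;
}

/* Example callback: the state is a block number whose tuples are all dead. */
bool
sketch_dead_in_block(const SketchTid *tid, void *state)
{
    return tid->block == *(const unsigned *) state;
}
```

<para>
A caller would pass NULL for <literal>stats</> on the first call and hand the returned
pointer back on each later call, so that the removal count reflects the whole
operation rather than only the last batch.
</para>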
|
|
|
|
<para>
|
|
<programlisting>
|
|
IndexBulkDeleteResult *
|
|
amvacuumcleanup (IndexVacuumInfo *info,
|
|
IndexBulkDeleteResult *stats);
|
|
</programlisting>
|
|
Clean up after a <command>VACUUM</command> operation (zero or more
|
|
<function>ambulkdelete</> calls). This does not have to do anything
|
|
beyond returning index statistics, but it might perform bulk cleanup
|
|
such as reclaiming empty index pages. <literal>stats</> is whatever the
|
|
last <function>ambulkdelete</> call returned, or NULL if
|
|
<function>ambulkdelete</> was not called because no tuples needed to be
|
|
deleted. If the result is not NULL it must be a palloc'd struct.
|
|
The statistics it contains will be used to update <structname>pg_class</>,
|
|
and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
|
|
It is OK to return NULL if the index was not changed at all during the
|
|
<command>VACUUM</command> operation, but otherwise correct stats should
|
|
be returned.
|
|
</para>
|
|
|
|
<para>
|
|
As of <productname>PostgreSQL</productname> 8.4,
|
|
<function>amvacuumcleanup</> will also be called at completion of an
|
|
<command>ANALYZE</> operation. In this case <literal>stats</> is always
|
|
NULL and any return value will be ignored. This case can be distinguished
|
|
by checking <literal>info->analyze_only</literal>. It is recommended
|
|
that the access method do nothing except post-insert cleanup in such a
|
|
call, and that only in an autovacuum worker process.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
amcostestimate (PlannerInfo *root,
|
|
IndexOptInfo *index,
|
|
List *indexQuals,
|
|
List *indexOrderBys,
|
|
RelOptInfo *outer_rel,
|
|
Cost *indexStartupCost,
|
|
Cost *indexTotalCost,
|
|
Selectivity *indexSelectivity,
|
|
double *indexCorrelation);
|
|
</programlisting>
|
|
Estimate the costs of an index scan. This function is described fully
|
|
in <xref linkend="index-cost-estimation">, below.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
bytea *
|
|
amoptions (ArrayType *reloptions,
|
|
bool validate);
|
|
</programlisting>
|
|
Parse and validate the reloptions array for an index. This is called only
|
|
when a non-null reloptions array exists for the index.
|
|
<parameter>reloptions</> is a <type>text</> array containing entries of the
|
|
form <replaceable>name</><literal>=</><replaceable>value</>.
|
|
The function should construct a <type>bytea</> value, which will be copied
|
|
into the <structfield>rd_options</> field of the index's relcache entry.
|
|
The data contents of the <type>bytea</> value are open for the access
|
|
method to define; most of the standard access methods use struct
|
|
<structname>StdRdOptions</>.
|
|
When <parameter>validate</> is true, the function should report a suitable
|
|
error message if any of the options are unrecognized or have invalid
|
|
values; when <parameter>validate</> is false, invalid entries should be
|
|
silently ignored. (<parameter>validate</> is false when loading options
|
|
already stored in <structname>pg_catalog</>; an invalid entry could only
|
|
be found if the access method has changed its rules for options, and in
|
|
that case ignoring obsolete entries is appropriate.)
|
|
It is OK to return NULL if default behavior is wanted.
|
|
</para>
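<para>
The effect of the <parameter>validate</> flag can be illustrated with a standalone
parser. This sketch makes several assumptions: options arrive as C strings rather than
a <type>text</> array, the result is a plain struct rather than a <type>bytea</>, a
<literal>false</> return stands in for reporting an error, and the single recognized
option, its default of 90, and its range of 10 to 100 are chosen for the example.
<literal>SketchOptions</> and <literal>sketch_parse_options</> are hypothetical names.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed-options struct with one setting. */
typedef struct SketchOptions
{
    int fillfactor;
} SketchOptions;

/*
 * Parse an array of "name=value" strings into *opts.
 * With validate set, an unknown name or an out-of-range value makes the
 * function return false (standing in for an error report); without it,
 * such entries are silently skipped.
 */
bool
sketch_parse_options(const char *const *reloptions, size_t n,
                     bool validate, SketchOptions *opts)
{
    assert(opts != NULL);
    opts->fillfactor = 90;      /* default assumed for this sketch */

    for (size_t i = 0; i < n; i++)
    {
        const char *eq = strchr(reloptions[i], '=');
        size_t      namelen = eq ? (size_t) (eq - reloptions[i]) : 0;
        bool        ok = false;

        if (eq != NULL && namelen == strlen("fillfactor") &&
            strncmp(reloptions[i], "fillfactor", namelen) == 0)
        {
            char *end;
            long  v = strtol(eq + 1, &end, 10);

            /* Accept only a complete number inside the assumed range. */
            if (end != eq + 1 && *end == '\0' && v >= 10 && v <= 100)
            {
                opts->fillfactor = (int) v;
                ok = true;
            }
        }
        if (!ok && validate)
            return false;
    }
    return true;
}
```

<para>
Given the entries <literal>bogus=1</> and <literal>fillfactor=70</>, the sketch fails
when validating but, when not validating, skips the unknown entry and still applies
the second one.
</para>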
|
|
|
|
<para>
|
|
The purpose of an index, of course, is to support scans for tuples matching
|
|
an indexable <literal>WHERE</> condition, often called a
|
|
<firstterm>qualifier</> or <firstterm>scan key</>. The semantics of
|
|
index scanning are described more fully in <xref linkend="index-scanning">,
|
|
below. An index access method can support <quote>plain</> index scans,
|
|
<quote>bitmap</> index scans, or both. The scan-related functions that an
|
|
index access method must or may provide are:
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
IndexScanDesc
|
|
ambeginscan (Relation indexRelation,
|
|
int nkeys,
|
|
int norderbys);
|
|
</programlisting>
|
|
Prepare for an index scan. The <literal>nkeys</> and <literal>norderbys</>
|
|
parameters indicate the number of quals and ordering operators that will be
|
|
used in the scan; these may be useful for space allocation purposes.
|
|
Note that the actual values of the scan keys aren't provided yet.
|
|
The result must be a palloc'd struct.
|
|
For implementation reasons the index access method
|
|
<emphasis>must</> create this struct by calling
|
|
<function>RelationGetIndexScan()</>. In most cases
|
|
<function>ambeginscan</> does little beyond making that call and perhaps
|
|
acquiring locks;
|
|
the interesting parts of index-scan startup are in <function>amrescan</>.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
amrescan (IndexScanDesc scan,
|
|
ScanKey keys,
|
|
int nkeys,
|
|
ScanKey orderbys,
|
|
int norderbys);
|
|
</programlisting>
|
|
Start or restart an indexscan, possibly with new scan keys. (To restart
|
|
using previously-passed keys, NULL is passed for <literal>keys</> and/or
|
|
<literal>orderbys</>.) Note that it is not allowed for
|
|
the number of keys or order-by operators to be larger than
|
|
what was passed to <function>ambeginscan</>. In practice the restart
|
|
feature is used when a new outer tuple is selected by a nested-loop join
|
|
and so a new key comparison value is needed, but the scan key structure
|
|
remains the same.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
bool
|
|
amgettuple (IndexScanDesc scan,
|
|
ScanDirection direction);
|
|
</programlisting>
|
|
Fetch the next tuple in the given scan, moving in the given
|
|
direction (forward or backward in the index). Returns TRUE if a tuple was
|
|
obtained, FALSE if no matching tuples remain. In the TRUE case the tuple
|
|
TID is stored into the <literal>scan</> structure. Note that
|
|
<quote>success</> means only that the index contains an entry that matches
|
|
the scan keys, not that the tuple necessarily still exists in the heap or
|
|
will pass the caller's snapshot test. On success, <function>amgettuple</>
|
|
must also set <literal>scan->xs_recheck</> to TRUE or FALSE.
|
|
FALSE means it is certain that the index entry matches the scan keys.
|
|
TRUE means this is not certain, and the conditions represented by the
|
|
scan keys must be rechecked against the heap tuple after fetching it.
|
|
This provision supports <quote>lossy</> index operators.
|
|
Note that rechecking will extend only to the scan conditions; a partial
|
|
index predicate (if any) is never rechecked by <function>amgettuple</>
|
|
callers.
|
|
</para>
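<para>
The recheck protocol can be sketched with a deliberately lossy toy index. Everything
here is an assumption made for illustration: the index stores only
<literal>value / 10</> for each non-negative heap value, the position in the array
serves as the TID, and <literal>SketchScan</>, <literal>sketch_gettuple</> and
<literal>sketch_count_matches</> are hypothetical names. The access method side
reports candidates with the recheck flag set; the caller side re-tests the scan key
against the heap value.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SketchScan
{
    const int *buckets;     /* lossy index: heap value / 10 per entry */
    size_t     nentries;
    size_t     next;        /* next entry to examine */
    int        key;         /* scan key: heap value = key */
    size_t     cur_tid;     /* output: position of the candidate heap row */
    bool       recheck;     /* output: must the caller re-test the key? */
} SketchScan;

/* Return true and fill cur_tid/recheck if another candidate exists. */
bool
sketch_gettuple(SketchScan *scan)
{
    assert(scan != NULL);
    while (scan->next < scan->nentries)
    {
        size_t i = scan->next++;

        if (scan->buckets[i] == scan->key / 10)
        {
            scan->cur_tid = i;
            scan->recheck = true;   /* a bucket match does not prove equality */
            return true;
        }
    }
    return false;
}

/* Caller side: fetch candidates, rechecking against the heap when asked to. */
size_t
sketch_count_matches(SketchScan *scan, const int *heap)
{
    size_t count = 0;

    while (sketch_gettuple(scan))
    {
        if (scan->recheck && heap[scan->cur_tid] != scan->key)
            continue;           /* false positive from the lossy index */
        count++;
    }
    return count;
}
```

<para>
An exact index would instead set the flag to false, and the caller would accept each
returned entry without consulting the scan key again.
</para>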
|
|
|
|
<para>
|
|
The <function>amgettuple</> function need only be provided if the access
|
|
method supports <quote>plain</> index scans. If it doesn't, the
|
|
<structfield>amgettuple</> field in its <structname>pg_am</> row must
|
|
be set to zero.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
int64
|
|
amgetbitmap (IndexScanDesc scan,
|
|
TIDBitmap *tbm);
|
|
</programlisting>
|
|
Fetch all tuples in the given scan and add them to the caller-supplied
|
|
<type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
|
|
in the bitmap). The number of tuples fetched is returned (this might be
|
|
just an approximate count; for instance, some AMs do not detect duplicates).
|
|
While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
|
|
indicate that rechecking of the scan conditions is required for specific
|
|
tuple IDs. This is analogous to the <literal>xs_recheck</> output parameter
|
|
of <function>amgettuple</>. Note: in the current implementation, support
|
|
for this feature is conflated with support for lossy storage of the bitmap
|
|
itself, and therefore callers recheck both the scan conditions and the
|
|
partial index predicate (if any) for recheckable tuples. That might not
|
|
always be true, however.
|
|
<function>amgetbitmap</> and
|
|
<function>amgettuple</> cannot be used in the same index scan; there
|
|
are other restrictions too when using <function>amgetbitmap</>, as explained
|
|
in <xref linkend="index-scanning">.
|
|
</para>
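<para>
The "OR into the existing set" behavior and the approximate return count can be
sketched with a fixed-size bit array. This is only a model: the real
<type>TIDBitmap</type> is a different structure with its own API, TIDs are reduced to
small integers here, and <literal>SketchBitmap</>, <literal>sketch_getbitmap</> and
<literal>sketch_bitmap_population</> are hypothetical names.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITMAP_WORDS 4   /* room for 256 toy TIDs */

typedef struct SketchBitmap
{
    uint64_t words[SKETCH_BITMAP_WORDS];
} SketchBitmap;

/*
 * OR every matching entry's TID into the caller's bitmap and return the
 * number of entries added.  Bits already set by the caller are preserved.
 * Duplicate TIDs are counted each time, so the result is only an upper
 * bound on the number of distinct tuples added.
 */
int64_t
sketch_getbitmap(const unsigned *keys, const unsigned *tids, size_t nentries,
                 unsigned scan_key, SketchBitmap *tbm)
{
    int64_t ntids = 0;

    assert(tbm != NULL);
    for (size_t i = 0; i < nentries; i++)
    {
        if (keys[i] != scan_key || tids[i] >= 64 * SKETCH_BITMAP_WORDS)
            continue;
        tbm->words[tids[i] / 64] |= (uint64_t) 1 << (tids[i] % 64);
        ntids++;
    }
    return ntids;
}

/* Count the distinct TIDs currently set in the bitmap. */
int
sketch_bitmap_population(const SketchBitmap *tbm)
{
    int n = 0;

    for (size_t w = 0; w < SKETCH_BITMAP_WORDS; w++)
        for (uint64_t bits = tbm->words[w]; bits != 0; bits &= bits - 1)
            n++;
    return n;
}
```

<para>
If the index holds two entries for the same TID, the function in this sketch reports
both, while the bitmap records the TID once; this is the kind of discrepancy the
"approximate count" wording allows for.
</para>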
|
|
|
|
<para>
|
|
The <function>amgetbitmap</> function need only be provided if the access
|
|
method supports <quote>bitmap</> index scans. If it doesn't, the
|
|
<structfield>amgetbitmap</> field in its <structname>pg_am</> row must
|
|
be set to zero.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
amendscan (IndexScanDesc scan);
|
|
</programlisting>
|
|
End a scan and release resources. The <literal>scan</> struct itself
|
|
should not be freed, but any locks or pins taken internally by the
|
|
access method must be released.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
ammarkpos (IndexScanDesc scan);
|
|
</programlisting>
|
|
Mark the current scan position. The access method need only support one
|
|
remembered scan position per scan.
|
|
</para>
|
|
|
|
<para>
|
|
<programlisting>
|
|
void
|
|
amrestrpos (IndexScanDesc scan);
|
|
</programlisting>
|
|
Restore the scan to the most recently marked position.
|
|
</para>
|
|
|
|
<para>
|
|
By convention, the <literal>pg_proc</literal> entry for an index
|
|
access method function should show the correct number of arguments,
|
|
but declare them all as type <type>internal</> (since most of the arguments
|
|
have types that are not known to SQL, and we don't want users calling
|
|
the functions directly anyway). The return type is declared as
|
|
<type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
|
|
The only exception is <function>amoptions</>, which should be correctly
|
|
declared as taking <type>text[]</> and <type>bool</> and returning
|
|
<type>bytea</>. This provision allows client code to execute
|
|
<function>amoptions</> to test validity of options settings.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="index-scanning">
|
|
<title>Index Scanning</title>
|
|
|
|
<para>
|
|
In an index scan, the index access method is responsible for regurgitating
|
|
the TIDs of all the tuples it has been told about that match the
|
|
<firstterm>scan keys</>. The access method is <emphasis>not</> involved in
|
|
actually fetching those tuples from the index's parent table, nor in
|
|
determining whether they pass the scan's time qualification test or other
|
|
conditions.
|
|
</para>
|
|
|
|
<para>
|
|
A scan key is the internal representation of a <literal>WHERE</> clause of
|
|
the form <replaceable>index_key</> <replaceable>operator</>
|
|
<replaceable>constant</>, where the index key is one of the columns of the
|
|
index and the operator is one of the members of the operator family
|
|
associated with that index column. An index scan has zero or more scan
|
|
keys, which are implicitly ANDed — the returned tuples are expected
|
|
to satisfy all the indicated conditions.
|
|
</para>
|
|
|
|
<para>
|
|
The access method can report that the index is <firstterm>lossy</>, or
|
|
requires rechecks, for a particular query. This implies that the index
|
|
scan will return all the entries that pass the scan key, plus possibly
|
|
additional entries that do not. The core system's index-scan machinery
|
|
will then apply the index conditions again to the heap tuple to verify
|
|
whether or not it really should be selected. If the recheck option is not
|
|
specified, the index scan must return exactly the set of matching entries.
|
|
</para>
|
|
|
|
<para>
|
|
Note that it is entirely up to the access method to ensure that it
|
|
correctly finds all and only the entries passing all the given scan keys.
|
|
Also, the core system will simply hand off all the <literal>WHERE</>
|
|
clauses that match the index keys and operator families, without any
|
|
semantic analysis to determine whether they are redundant or
|
|
contradictory. As an example, given
|
|
<literal>WHERE x > 4 AND x > 14</> where <literal>x</> is a b-tree
|
|
indexed column, it is left to the b-tree <function>amrescan</> function
|
|
to realize that the first scan key is redundant and can be discarded.
|
|
The extent of preprocessing needed during <function>amrescan</> will
|
|
depend on the extent to which the index access method needs to reduce
|
|
the scan keys to a <quote>normalized</> form.
|
|
</para>
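<para>
The kind of reduction described for <literal>WHERE x > 4 AND x > 14</> can be
sketched as follows. This is not the b-tree code: it handles only integer keys with
the <literal>></> and <literal><</> operators, and <literal>SketchKey</>,
<literal>SketchRange</> and <literal>sketch_normalize_keys</> are hypothetical names.
It keeps the tightest bound on each side and flags key sets that no integer can
satisfy.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SKETCH_GT,      /* x > value */
    SKETCH_LT       /* x < value */
} SketchStrategy;

typedef struct SketchKey
{
    SketchStrategy op;
    int            value;
} SketchKey;

/* Normalized form: at most one lower and one upper bound. */
typedef struct SketchRange
{
    bool has_lower;
    bool has_upper;
    int  lower;     /* x > lower, valid if has_lower */
    int  upper;     /* x < upper, valid if has_upper */
    bool empty;     /* the keys are contradictory */
} SketchRange;

SketchRange
sketch_normalize_keys(const SketchKey *keys, size_t nkeys)
{
    SketchRange r = {false, false, 0, 0, false};

    assert(keys != NULL || nkeys == 0);
    for (size_t i = 0; i < nkeys; i++)
    {
        if (keys[i].op == SKETCH_GT)
        {
            /* Keep only the largest lower bound; smaller ones are redundant. */
            if (!r.has_lower || keys[i].value > r.lower)
                r.lower = keys[i].value;
            r.has_lower = true;
        }
        else
        {
            /* Keep only the smallest upper bound. */
            if (!r.has_upper || keys[i].value < r.upper)
                r.upper = keys[i].value;
            r.has_upper = true;
        }
    }
    /* lower < x < upper has an integer solution only if upper - lower > 1. */
    if (r.has_lower && r.has_upper && (long long) r.upper - r.lower <= 1)
        r.empty = true;
    return r;
}
```

<para>
For the example in the text, the two keys collapse to the single bound
<literal>x > 14</>. A contradictory pair such as <literal>x > 14 AND x < 4</>
is marked empty, so a scan could return no rows without touching the index.
</para>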
|
|
|
|
<para>
|
|
Some access methods return index entries in a well-defined order; others
|
|
do not. There are actually two different ways that an access method can
|
|
support sorted output:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Access methods that always return entries in the natural ordering
|
|
of their data (such as btree) should set
|
|
<structname>pg_am</>.<structfield>amcanorder</> to true.
|
|
Currently, such access methods must use btree-compatible strategy
|
|
numbers for their equality and ordering operators.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Access methods that support ordering operators should set
|
|
<structname>pg_am</>.<structfield>amcanorderbyop</> to true.
|
|
This indicates that the index is capable of returning entries in
|
|
an order satisfying <literal>ORDER BY</> <replaceable>index_key</>
|
|
<replaceable>operator</> <replaceable>constant</>. Scan modifiers
|
|
of that form can be passed to <function>amrescan</> as described
|
|
previously.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>amgettuple</> function has a <literal>direction</> argument,
|
|
which can be either <literal>ForwardScanDirection</> (the normal case)
|
|
or <literal>BackwardScanDirection</>. If the first call after
|
|
<function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
|
|
set of matching index entries is to be scanned back-to-front rather than in
|
|
the normal front-to-back direction, so <function>amgettuple</> must return
|
|
the last matching tuple in the index, rather than the first one as it
|
|
normally would. (This will only occur for access
|
|
methods that set <structfield>amcanorder</> to true.) After the
|
|
first call, <function>amgettuple</> must be prepared to advance the scan in
|
|
either direction from the most recently returned entry. (But if
|
|
<structname>pg_am</>.<structfield>amcanbackward</> is false, all subsequent
|
|
calls will have the same direction as the first one.)
|
|
</para>
|
|
|
|
<para>
|
|
Access methods that support ordered scans must support <quote>marking</> a
|
|
position in a scan and later returning to the marked position. The same
|
|
position might be restored multiple times. However, only one position need
|
|
be remembered per scan; a new <function>ammarkpos</> call overrides the
|
|
previously marked position. An access method that does not support
|
|
ordered scans should still provide mark and restore functions in
|
|
<structname>pg_am</>, but it is sufficient to have them throw errors if
|
|
called.
|
|
</para>
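<para>
The direction and mark/restore rules above can be sketched over an in-memory array of
matching entries. This is a simplified model, not server code: concurrency is ignored,
and <literal>SketchOrderedScan</>, <literal>sketch_gettuple</>,
<literal>sketch_markpos</> and <literal>sketch_restrpos</> are hypothetical names. The
first call picks the starting end according to its direction, later calls may move
either way, and a single remembered position may be restored any number of times.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SKETCH_BACKWARD = -1,
    SKETCH_FORWARD = 1
} SketchDirection;

typedef struct SketchOrderedScan
{
    const int *entries;         /* matching entries, in index order */
    long       nentries;
    long       pos;             /* position of the last returned entry */
    bool       started;         /* has any entry been requested yet? */
    long       mark_pos;        /* the single remembered position */
    bool       mark_started;
} SketchOrderedScan;

/* Fetch the next entry in the given direction; false when none remain. */
bool
sketch_gettuple(SketchOrderedScan *scan, SketchDirection dir, int *out)
{
    assert(scan != NULL && out != NULL);
    if (!scan->started)
    {
        /* The first call decides which end of the match set to start from. */
        scan->pos = (dir == SKETCH_FORWARD) ? 0 : scan->nentries - 1;
        scan->started = true;
    }
    else
        scan->pos += dir;

    if (scan->pos < 0 || scan->pos >= scan->nentries)
    {
        /* Park just off the end so the scan can still reverse direction. */
        scan->pos = (scan->pos < 0) ? -1 : scan->nentries;
        return false;
    }
    *out = scan->entries[scan->pos];
    return true;
}

/* Remember the current position, replacing any earlier mark. */
void
sketch_markpos(SketchOrderedScan *scan)
{
    scan->mark_pos = scan->pos;
    scan->mark_started = scan->started;
}

/* Return to the most recently marked position. */
void
sketch_restrpos(SketchOrderedScan *scan)
{
    scan->pos = scan->mark_pos;
    scan->started = scan->mark_started;
}
```

<para>
With entries 10, 20, 30, a first backward call returns 30; marking after the next
call (20) and later restoring brings the scan back to 20 regardless of how far it
moved in between.
</para>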
|
|
|
|
<para>
|
|
Both the scan position and the mark position (if any) must be maintained
|
|
consistently in the face of concurrent insertions or deletions in the
|
|
index. It is OK if a freshly-inserted entry is not returned by a scan that
|
|
would have found the entry if it had existed when the scan started, or for
|
|
the scan to return such an entry upon rescanning or backing
|
|
up even though it had not been returned the first time through. Similarly,
|
|
a concurrent delete might or might not be reflected in the results of a scan.
|
|
What is important is that insertions or deletions not cause the scan to
|
|
miss or multiply return entries that were not themselves being inserted or
|
|
deleted.
|
|
</para>
|
|
|
|
<para>
|
|
Instead of using <function>amgettuple</>, an index scan can be done with
|
|
<function>amgetbitmap</> to fetch all tuples in one call. This can be
|
|
noticeably more efficient than <function>amgettuple</> because it allows
|
|
avoiding lock/unlock cycles within the access method. In principle
|
|
<function>amgetbitmap</> should have the same effects as repeated
|
|
<function>amgettuple</> calls, but we impose several restrictions to
|
|
simplify matters. First of all, <function>amgetbitmap</> returns all
|
|
tuples at once and marking or restoring scan positions isn't
|
|
supported. Secondly, the tuples are returned in a bitmap which doesn't
|
|
have any specific ordering, which is why <function>amgetbitmap</> doesn't
|
|
take a <literal>direction</> argument. (Ordering operators will never be
|
|
supplied for such a scan, either.) Finally, <function>amgetbitmap</>
|
|
does not guarantee any locking of the returned tuples, with implications
|
|
spelled out in <xref linkend="index-locking">.
|
|
</para>
|
|
|
|
<para>
|
|
Note that it is permitted for an access method to implement only
|
|
<function>amgetbitmap</> and not <function>amgettuple</>, or vice versa,
|
|
if its internal implementation is unsuited to one API or the other.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="index-locking">
|
|
<title>Index Locking Considerations</title>
|
|
|
|
<para>
|
|
Index access methods must handle concurrent updates
|
|
of the index by multiple processes.
|
|
The core <productname>PostgreSQL</productname> system obtains
|
|
<literal>AccessShareLock</> on the index during an index scan, and
|
|
<literal>RowExclusiveLock</> when updating the index (including plain
|
|
<command>VACUUM</>). Since these lock types do not conflict, the access
|
|
method is responsible for handling any fine-grained locking it might need.
|
|
An exclusive lock on the index as a whole will be taken only during index
|
|
creation, destruction, or <command>REINDEX</>.
|
|
</para>
|
|
|
|
<para>
|
|
Building an index type that supports concurrent updates usually requires
|
|
extensive and subtle analysis of the required behavior. For the b-tree
|
|
and hash index types, you can read about the design decisions involved in
|
|
<filename>src/backend/access/nbtree/README</> and
|
|
<filename>src/backend/access/hash/README</>.
|
|
</para>
|
|
|
|
<para>
|
|
Aside from the index's own internal consistency requirements, concurrent
|
|
updates create issues about consistency between the parent table (the
|
|
<firstterm>heap</>) and the index. Because
|
|
<productname>PostgreSQL</productname> separates accesses
|
|
and updates of the heap from those of the index, there are windows in
|
|
which the index might be inconsistent with the heap. We handle this problem
|
|
with the following rules:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
A new heap entry is made before making its index entries. (Therefore
|
|
a concurrent index scan is likely to fail to see the heap entry.
|
|
This is okay because the index reader would be uninterested in an
|
|
uncommitted row anyway. But see <xref linkend="index-unique-checks">.)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
When a heap entry is to be deleted (by <command>VACUUM</>), all its
|
|
index entries must be removed first.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
An index scan must maintain a pin
|
|
on the index page holding the item last returned by
|
|
<function>amgettuple</>, and <function>ambulkdelete</> cannot delete
|
|
entries from pages that are pinned by other backends. The need
|
|
for this rule is explained below.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>

Without the third rule, it is possible for an index reader to
see an index entry just before it is removed by <command>VACUUM</>, and
then to arrive at the corresponding heap entry after that entry was
removed by <command>VACUUM</>.
This creates no serious problems if that item
number is still unused when the reader reaches it, since an empty
item slot will be ignored by <function>heap_fetch()</>. But what if a
third backend has already re-used the item slot for something else?
When using an MVCC-compliant snapshot, there is no problem because
the new occupant of the slot is certain to be too new to pass the
snapshot test. However, with a non-MVCC-compliant snapshot (such as
<literal>SnapshotNow</>), it would be possible to accept and return
a row that does not in fact match the scan keys. We could defend
against this scenario by requiring the scan keys to be rechecked
against the heap row in all cases, but that is too expensive. Instead,
we use a pin on an index page as a proxy to indicate that the reader
might still be <quote>in flight</> from the index entry to the matching
heap entry. Making <function>ambulkdelete</> block on such a pin ensures
that <command>VACUUM</> cannot delete the heap entry before the reader
is done with it. This solution costs little in run time, and adds blocking
overhead only in the rare cases where there actually is a conflict.
</para>
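
<para>
The third rule can be illustrated with a toy model. The following
self-contained C sketch is not <productname>PostgreSQL</productname> code:
<literal>DemoIndexPage</>, <function>demo_pin</>, <function>demo_unpin</>,
and <function>demo_try_bulkdelete</> are hypothetical names invented for
this illustration. Where a real <function>ambulkdelete</> would block until
the pin is released, the sketch merely reports that it would have to wait.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one index page; not a PostgreSQL structure. */
typedef struct DemoIndexPage
{
    int pin_count;      /* pins held by index scans of other backends */
    int dead_entries;   /* entries whose heap rows VACUUM wants to remove */
} DemoIndexPage;

/* A scan pins the page holding the item it last returned. */
void
demo_pin(DemoIndexPage *page)
{
    page->pin_count++;
}

/* The scan drops the pin once it has moved on to another page. */
void
demo_unpin(DemoIndexPage *page)
{
    assert(page->pin_count > 0);
    page->pin_count--;
}

/*
 * Toy bulk delete.  Returns true if the dead entries were removed, or
 * false if some reader might still be "in flight" to the heap, in which
 * case a real implementation would wait rather than give up.
 */
bool
demo_try_bulkdelete(DemoIndexPage *page)
{
    if (page->pin_count > 0)
        return false;           /* a reader still holds a pin */
    page->dead_entries = 0;     /* safe: no reader can be in flight */
    return true;
}
```

<para>
In this model the heap entries may be removed only after
<function>demo_try_bulkdelete</> has returned true for every page, which
corresponds to the ordering required by the second rule.
</para>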

<para>
This solution requires that index scans be <quote>synchronous</>: we have
to fetch each heap tuple immediately after scanning the corresponding index
entry. This is expensive for a number of reasons. An
<quote>asynchronous</> scan in which we collect many TIDs from the index,
and only visit the heap tuples sometime later, requires much less index
locking overhead and can allow a more efficient heap access pattern.
Per the above analysis, we must use the synchronous approach for
non-MVCC-compliant snapshots, but an asynchronous scan is workable
for a query using an MVCC snapshot.
</para>
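
<para>
One way an asynchronous scan can obtain a more efficient heap access
pattern is to put the collected TIDs into physical order before visiting
the heap, so that each heap page is read only once. The following is a
minimal sketch of that idea, not the executor's actual implementation:
<literal>DemoTid</>, <function>demo_sort_tids</>, and
<function>demo_count_heap_pages</> are hypothetical names, and a TID is
modeled simply as a block number plus an item offset.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tuple identifier (heap block + item slot). */
typedef struct DemoTid
{
    unsigned int   block;
    unsigned short offset;
} DemoTid;

/* Order TIDs by heap block first, then by item offset within the block. */
static int
demo_tid_cmp(const void *a, const void *b)
{
    const DemoTid *x = a;
    const DemoTid *y = b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/* Sort the TIDs collected from the index into physical heap order. */
void
demo_sort_tids(DemoTid *tids, size_t n)
{
    assert(tids != NULL || n == 0);
    if (n > 0)
        qsort(tids, n, sizeof(DemoTid), demo_tid_cmp);
}

/* For a sorted TID array, count how many distinct heap pages get visited. */
size_t
demo_count_heap_pages(const DemoTid *tids, size_t n)
{
    size_t pages = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (i == 0 || tids[i].block != tids[i - 1].block)
            pages++;
    }
    return pages;
}
```

<para>
A synchronous scan returning the same TIDs in index order might instead
revisit a heap page every time the index order switches between blocks.
</para>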

<para>
In an <function>amgetbitmap</> index scan, the access method does not
keep an index pin on any of the returned tuples. Therefore
it is only safe to use such scans with MVCC-compliant snapshots.
</para>

</sect1>

<sect1 id="index-unique-checks">
<title>Index Uniqueness Checks</title>

<para>
<productname>PostgreSQL</productname> enforces SQL uniqueness constraints
using <firstterm>unique indexes</>, which are indexes that disallow
multiple entries with identical keys. An access method that supports this
feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
(At present, only b-tree supports it.)
</para>

<para>
Because of MVCC, it is always necessary to allow duplicate entries to
exist physically in an index: the entries might refer to successive
versions of a single logical row. The behavior we actually want to
enforce is that no MVCC snapshot could include two rows with equal
index keys. This breaks down into the following cases that must be
checked when inserting a new row into a unique index:

<itemizedlist>
<listitem>
<para>
If a conflicting valid row has been deleted by the current transaction,
it's okay. (In particular, since an <command>UPDATE</> always deletes the
old row version before inserting the new version, this will allow an
<command>UPDATE</> on a row without changing the key.)
</para>
</listitem>
<listitem>
<para>
If a conflicting row has been inserted by an as-yet-uncommitted
transaction, the would-be inserter must wait to see if that transaction
commits. If it rolls back then there is no conflict. If it commits
without deleting the conflicting row again, there is a uniqueness
violation. (In practice we just wait for the other transaction to
end and then redo the visibility check in toto.)
</para>
</listitem>
<listitem>
<para>
Similarly, if a conflicting valid row has been deleted by an
as-yet-uncommitted transaction, the would-be inserter must wait
for that transaction to commit or abort, and then repeat the test.
</para>
</listitem>
</itemizedlist>
</para>
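
<para>
The cases above can be read as a small decision table. The following
self-contained C sketch is only an illustration under simplifying
assumptions: the <literal>DemoRowState</> and
<literal>DemoUniqueAction</> types and both functions are hypothetical,
and the state of the conflicting row is assumed to have been determined
already. The two cases not listed above (a committed live row, and a row
that is already dead) are filled in from the stated goal that no MVCC
snapshot may include two rows with equal keys.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a row whose key equals the new row's key. */
typedef enum DemoRowState
{
    DEMO_ROW_LIVE,                /* committed and not deleted */
    DEMO_ROW_DELETED_BY_US,       /* deleted by the current transaction */
    DEMO_ROW_INSERT_IN_PROGRESS,  /* inserted by an uncommitted transaction */
    DEMO_ROW_DELETE_IN_PROGRESS,  /* deleted by an uncommitted transaction */
    DEMO_ROW_DEAD                 /* inserter aborted, or deleter committed */
} DemoRowState;

typedef enum DemoUniqueAction
{
    DEMO_NO_CONFLICT,
    DEMO_WAIT_AND_RETRY,
    DEMO_VIOLATION
} DemoUniqueAction;

/* Decide what to do about one conflicting row, per the cases listed above. */
DemoUniqueAction
demo_unique_action(DemoRowState state)
{
    switch (state)
    {
        case DEMO_ROW_DELETED_BY_US:
        case DEMO_ROW_DEAD:
            return DEMO_NO_CONFLICT;
        case DEMO_ROW_INSERT_IN_PROGRESS:
        case DEMO_ROW_DELETE_IN_PROGRESS:
            return DEMO_WAIT_AND_RETRY;
        case DEMO_ROW_LIVE:
            return DEMO_VIOLATION;
    }
    return DEMO_VIOLATION;      /* not reached for valid states */
}

/*
 * Model "wait for the other transaction to end, then redo the check":
 * observed[i] is the row state seen on the i'th attempt.
 */
DemoUniqueAction
demo_check_with_retries(const DemoRowState *observed, int n)
{
    assert(observed != NULL || n == 0);
    for (int i = 0; i < n; i++)
    {
        DemoUniqueAction action = demo_unique_action(observed[i]);

        if (action != DEMO_WAIT_AND_RETRY)
            return action;
        /* real code would sleep here until the other transaction ends */
    }
    return DEMO_WAIT_AND_RETRY;
}
```

<para>
For example, observing <literal>DEMO_ROW_INSERT_IN_PROGRESS</> followed by
<literal>DEMO_ROW_LIVE</> models an inserting transaction that committed,
and therefore yields <literal>DEMO_VIOLATION</>.
</para>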

<para>
Furthermore, immediately before reporting a uniqueness violation
according to the above rules, the access method must recheck the
liveness of the row being inserted. If it is committed dead then
no violation should be reported. (This case cannot occur during the
ordinary scenario of inserting a row that's just been created by
the current transaction. It can happen during
<command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
</para>

<para>
We require the index access method to apply these tests itself, which
means that it must reach into the heap to check the commit status of
any row that is shown to have a duplicate key according to the index
contents. This is without a doubt ugly and non-modular, but it saves
redundant work: if we did a separate probe then the index lookup for
a conflicting row would be essentially repeated while finding the place to
insert the new row's index entry. What's more, there is no obvious way
to avoid race conditions unless the conflict check is an integral part
of insertion of the new index entry.
</para>

<para>
If the unique constraint is deferrable, there is additional complexity:
we need to be able to insert an index entry for a new row, but defer any
uniqueness-violation error until end of statement or even later. To
avoid unnecessary repeat searches of the index, the index access method
should do a preliminary uniqueness check during the initial insertion.
If this shows that there is definitely no conflicting live tuple, we
are done. Otherwise, we schedule a recheck to occur when it is time to
enforce the constraint. If, at the time of the recheck, both the inserted
tuple and some other tuple with the same key are live, then the error
must be reported. (Note that for this purpose, <quote>live</> actually
means <quote>any tuple in the index entry's HOT chain is live</>.)
To implement this, the <function>aminsert</> function is passed a
<literal>checkUnique</> parameter having one of the following values:

<itemizedlist>
<listitem>
<para>
<literal>UNIQUE_CHECK_NO</> indicates that no uniqueness checking
should be done (this is not a unique index).
</para>
</listitem>
<listitem>
<para>
<literal>UNIQUE_CHECK_YES</> indicates that this is a non-deferrable
unique index, and the uniqueness check must be done immediately, as
described above.
</para>
</listitem>
<listitem>
<para>
<literal>UNIQUE_CHECK_PARTIAL</> indicates that the unique
constraint is deferrable. <productname>PostgreSQL</productname>
will use this mode to insert each row's index entry. The access
method must allow duplicate entries into the index, and report any
potential duplicates by returning FALSE from <function>aminsert</>.
For each row for which FALSE is returned, a deferred recheck will
be scheduled.
</para>

<para>
The access method must identify any rows which might violate the
unique constraint, but it is not an error for it to report false
positives. This allows the check to be done without waiting for other
transactions to finish; conflicts reported here are not treated as
errors and will be rechecked later, by which time they may no longer
be conflicts.
</para>
</listitem>
<listitem>
<para>
<literal>UNIQUE_CHECK_EXISTING</> indicates that this is a deferred
recheck of a row that was reported as a potential uniqueness violation.
Although this is implemented by calling <function>aminsert</>, the
access method must <emphasis>not</> insert a new index entry in this
case. The index entry is already present. Rather, the access method
must check to see if there is another live index entry. If so, and
if the target row is also still live, report an error.
</para>

<para>
It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</> call,
the access method further verify that the target row actually does
have an existing entry in the index, and report an error if not. This
is a good idea because the index tuple values passed to
<function>aminsert</> will have been recomputed. If the index
definition involves functions that are not really immutable, we
might be checking the wrong area of the index. Checking that the
target row is found in the recheck verifies that we are scanning
for the same tuple values as were used in the original insertion.
</para>
</listitem>
</itemizedlist>
</para>
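
<para>
To make the four modes concrete, the following self-contained C sketch
applies them to a toy in-memory index. It is an illustration only, not the
b-tree implementation: the <literal>DEMO_</>-prefixed enum values mirror
the real <literal>UNIQUE_CHECK_*</> names but are local stand-ins,
<literal>DemoIndex</> and <function>demo_aminsert</> are hypothetical, and
row liveness is reduced to a flag instead of a heap visibility check.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_ENTRIES 16

typedef enum DemoUniqueCheck
{
    DEMO_UNIQUE_CHECK_NO,
    DEMO_UNIQUE_CHECK_YES,
    DEMO_UNIQUE_CHECK_PARTIAL,
    DEMO_UNIQUE_CHECK_EXISTING
} DemoUniqueCheck;

typedef enum DemoInsertResult
{
    DEMO_INSERT_OK,         /* corresponds to aminsert returning TRUE */
    DEMO_INSERT_MAYBE_DUP,  /* corresponds to returning FALSE */
    DEMO_INSERT_ERROR       /* a uniqueness violation would be reported */
} DemoInsertResult;

typedef struct DemoEntry { int key; int row; bool live; } DemoEntry;
typedef struct DemoIndex { DemoEntry entries[DEMO_MAX_ENTRIES]; int n; } DemoIndex;

/* Find the entry for (key, row), or NULL if the index has none. */
static DemoEntry *
demo_find(DemoIndex *idx, int key, int row)
{
    for (int i = 0; i < idx->n; i++)
        if (idx->entries[i].key == key && idx->entries[i].row == row)
            return &idx->entries[i];
    return NULL;
}

/* Is there a live entry with this key belonging to some other row? */
static bool
demo_other_live(const DemoIndex *idx, int key, int row)
{
    for (int i = 0; i < idx->n; i++)
        if (idx->entries[i].key == key && idx->entries[i].row != row &&
            idx->entries[i].live)
            return true;
    return false;
}

static void
demo_add(DemoIndex *idx, int key, int row)
{
    assert(idx->n < DEMO_MAX_ENTRIES);
    idx->entries[idx->n++] = (DemoEntry) { key, row, true };
}

DemoInsertResult
demo_aminsert(DemoIndex *idx, int key, int row, DemoUniqueCheck mode)
{
    bool       conflict = demo_other_live(idx, key, row);
    DemoEntry *self;

    switch (mode)
    {
        case DEMO_UNIQUE_CHECK_NO:
            demo_add(idx, key, row);
            return DEMO_INSERT_OK;
        case DEMO_UNIQUE_CHECK_YES:
            if (conflict)
                return DEMO_INSERT_ERROR;   /* checked immediately */
            demo_add(idx, key, row);
            return DEMO_INSERT_OK;
        case DEMO_UNIQUE_CHECK_PARTIAL:
            demo_add(idx, key, row);        /* duplicates are allowed in */
            return conflict ? DEMO_INSERT_MAYBE_DUP : DEMO_INSERT_OK;
        case DEMO_UNIQUE_CHECK_EXISTING:
            self = demo_find(idx, key, row);
            if (self == NULL)
                return DEMO_INSERT_ERROR;   /* recommended cross-check */
            /* never insert here; only re-examine the existing entries */
            return (conflict && self->live) ? DEMO_INSERT_ERROR
                                            : DEMO_INSERT_OK;
    }
    return DEMO_INSERT_ERROR;
}
```

<para>
Note that only the <literal>DEMO_UNIQUE_CHECK_PARTIAL</> branch can return
<literal>DEMO_INSERT_MAYBE_DUP</>, and that the
<literal>DEMO_UNIQUE_CHECK_EXISTING</> branch never changes the index.
</para>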

</sect1>

<sect1 id="index-cost-estimation">
<title>Index Cost Estimation Functions</title>

<para>
The <function>amcostestimate</> function is given information describing
a possible index scan, including lists of WHERE and ORDER BY clauses that
have been determined to be usable with the index. It must return estimates
of the cost of accessing the index and the selectivity of the WHERE
clauses (that is, the fraction of parent-table rows that will be
retrieved during the index scan). For simple cases, nearly all the
work of the cost estimator can be done by calling standard routines
in the optimizer; the point of having an <function>amcostestimate</>
function is to allow index access methods to provide index-type-specific
knowledge, in case it is possible to improve on the standard estimates.
</para>

<para>
Each <function>amcostestimate</> function must have the signature:

<programlisting>
void
amcostestimate (PlannerInfo *root,
                IndexOptInfo *index,
                List *indexQuals,
                List *indexOrderBys,
                RelOptInfo *outer_rel,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>

The first five parameters are inputs:

<variablelist>
<varlistentry>
<term><parameter>root</></term>
<listitem>
<para>
The planner's information about the query being processed.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>index</></term>
<listitem>
<para>
The index being considered.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>indexQuals</></term>
<listitem>
<para>
List of index qual clauses (implicitly ANDed);
a <symbol>NIL</> list indicates no qualifiers are available.
Note that the list contains expression trees with RestrictInfo nodes
at the top, not ScanKeys.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>indexOrderBys</></term>
<listitem>
<para>
List of indexable ORDER BY operators, or <symbol>NIL</> if none.
Note that the list contains expression trees, not ScanKeys.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>outer_rel</></term>
<listitem>
<para>
If the index is being considered for use in a join inner indexscan,
the planner's information about the outer side of the join. Otherwise
<symbol>NULL</>. When non-<symbol>NULL</>, some of the qual clauses
will be join clauses with this rel rather than being simple restriction
clauses. Also, the cost estimator should expect that the index scan
will be repeated for each row of the outer rel.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>

<para>
The last four parameters are pass-by-reference outputs:

<variablelist>
<varlistentry>
<term><parameter>*indexStartupCost</></term>
<listitem>
<para>
Set to the cost of index start-up processing.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>*indexTotalCost</></term>
<listitem>
<para>
Set to the total cost of index processing.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>*indexSelectivity</></term>
<listitem>
<para>
Set to the index selectivity.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term><parameter>*indexCorrelation</></term>
<listitem>
<para>
Set to the correlation coefficient between the index scan order and
the underlying table's order.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>

<para>
Note that cost estimate functions must be written in C, not in SQL or
any available procedural language, because they must access internal
data structures of the planner/optimizer.
</para>

<para>
The index access costs should be computed using the parameters used by
<filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
has cost <varname>random_page_cost</>, and the cost of processing one index
row should usually be taken as <varname>cpu_index_tuple_cost</>. In
addition, an appropriate multiple of <varname>cpu_operator_cost</> should
be charged for any comparison operators invoked during index processing
(especially evaluation of the <literal>indexQuals</> themselves).
</para>

<para>
The access costs should include all disk and CPU costs associated with
scanning the index itself, but <emphasis>not</> the costs of retrieving or
processing the parent-table rows that are identified by the index.
</para>

<para>
The <quote>start-up cost</quote> is the part of the total scan cost that
must be expended before we can begin to fetch the first row. For most
indexes this can be taken as zero, but an index type with a high start-up
cost might want to set it nonzero.
</para>

<para>
The <parameter>indexSelectivity</> should be set to the estimated fraction
of the parent table rows that will be retrieved during the index scan.
In the case of a lossy query, this will typically be higher than the
fraction of rows that actually pass the given qual conditions.
</para>

<para>
The <parameter>indexCorrelation</> should be set to the correlation
(ranging between -1.0 and 1.0) between the index order and the table
order. This is used to adjust the estimate for the cost of fetching rows
from the parent table.
</para>
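
<para>
The adjustment itself is made by the planner, not by the access method.
As an illustration of how a correlation value can be applied, the sketch
below interpolates between a worst-case and a best-case heap I/O cost
using the square of the correlation. This reflects our understanding of
the approach taken in
<filename>src/backend/optimizer/path/costsize.c</filename>, but it should
be read as an assumption rather than a specification; the function and
parameter names are hypothetical.
</para>

```c
#include <assert.h>

/*
 * Interpolate the heap I/O cost between the uncorrelated (maximum) and
 * perfectly correlated (minimum) cases.  Squaring the correlation makes
 * the result depend only on its magnitude, so -1.0 and 1.0 both give
 * the minimum cost.
 */
double
demo_interpolate_io_cost(double min_io_cost, double max_io_cost,
                         double index_correlation)
{
    double csquared;

    assert(index_correlation >= -1.0 && index_correlation <= 1.0);
    csquared = index_correlation * index_correlation;
    return max_io_cost + csquared * (min_io_cost - max_io_cost);
}
```

<para>
With a minimum cost of 20 and a maximum of 100, a correlation of 0 yields
100, a correlation of 1.0 or -1.0 yields 20, and a correlation of 0.5
yields 80. This is why returning zero when the correlation is unknown is
the conservative choice.
</para>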

<para>
In the join case, the returned numbers should be averages expected for
any one scan of the index.
</para>

<procedure>
<title>Cost Estimation</title>
<para>
A typical cost estimator will proceed as follows:
</para>

<step>
<para>
Estimate and return the fraction of parent-table rows that will be visited
based on the given qual conditions. In the absence of any
index-type-specific knowledge, use the standard optimizer function
<function>clauselist_selectivity()</function>:

<programlisting>
*indexSelectivity = clauselist_selectivity(root, indexQuals,
                                           index->rel->relid,
                                           JOIN_INNER, NULL);
</programlisting>
</para>
</step>

<step>
<para>
Estimate the number of index rows that will be visited during the
scan. For many index types this is the same as
<parameter>indexSelectivity</> times the number of rows in the index,
but it might be more. (Note that the index's size in pages and rows is
available from the <structname>IndexOptInfo</> struct.)
</para>
</step>

<step>
<para>
Estimate the number of index pages that will be retrieved during the scan.
This might be just <parameter>indexSelectivity</> times the index's size
in pages.
</para>
</step>

<step>
<para>
Compute the index access cost. A generic estimator might do this:

<programlisting>
/*
 * Our generic assumption is that the index pages will be read
 * sequentially, so they cost seq_page_cost each, not random_page_cost.
 * Also, we charge for evaluation of the indexquals at each index row.
 * All the costs are assumed to be paid incrementally during the scan.
 */
cost_qual_eval(&amp;index_qual_cost, indexQuals, root);
*indexStartupCost = index_qual_cost.startup;
*indexTotalCost = seq_page_cost * numIndexPages +
    (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
</programlisting>

However, the above does not account for amortization of index reads
across repeated index scans in the join case.
</para>
</step>

<step>
<para>
Estimate the index correlation. For a simple ordered index on a single
field, this can be retrieved from <structname>pg_statistic</>. If the
correlation is not known, the conservative estimate is zero
(no correlation).
</para>
</step>
</procedure>
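
<para>
Steps 2 through 4 of the procedure can be tried out without a server by
using plain numbers in place of the planner's data structures. The
following self-contained C sketch makes several assumptions: the
<literal>Demo</>-prefixed types and <function>demo_generic_cost</> are
hypothetical, the two cost constants are hard-wired to the default values
of <varname>seq_page_cost</> and <varname>cpu_index_tuple_cost</>, and the
qual evaluation cost is passed in rather than obtained from
<function>cost_qual_eval()</>.
</para>

```c
#include <assert.h>

/* Stand-ins for planner cost parameters, set to their default values. */
static const double demo_seq_page_cost = 1.0;
static const double demo_cpu_index_tuple_cost = 0.005;

/* Hypothetical substitutes for the sizes found in IndexOptInfo. */
typedef struct DemoIndexSize { double pages; double tuples; } DemoIndexSize;

/* Hypothetical substitute for the result of cost_qual_eval(). */
typedef struct DemoQualCost { double startup; double per_tuple; } DemoQualCost;

typedef struct DemoIndexCost { double startup_cost; double total_cost; } DemoIndexCost;

DemoIndexCost
demo_generic_cost(DemoIndexSize size, double selectivity, DemoQualCost qual)
{
    DemoIndexCost result;
    double        numIndexTuples;
    double        numIndexPages;

    assert(selectivity >= 0.0 && selectivity <= 1.0);

    /* Step 2: index rows visited, taken as selectivity times index rows. */
    numIndexTuples = selectivity * size.tuples;

    /* Step 3: index pages fetched, taken as selectivity times index pages. */
    numIndexPages = selectivity * size.pages;

    /* Step 4: sequential page reads plus per-row CPU and qual costs. */
    result.startup_cost = qual.startup;
    result.total_cost = demo_seq_page_cost * numIndexPages +
        (demo_cpu_index_tuple_cost + qual.per_tuple) * numIndexTuples;
    return result;
}
```

<para>
For an index of 100 pages and 10000 rows, a selectivity of 0.1, and a
per-row qual cost of 0.0025 (the default <varname>cpu_operator_cost</>
charged once), the sketch yields a total cost of about 17.5: 10 for the
page fetches plus 7.5 for row processing.
</para>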

<para>
Examples of cost estimator functions can be found in
<filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
</sect1>
</chapter>