<!-- doc/src/sgml/indexam.sgml -->

<chapter id="indexam">
 <title>Index Access Method Interface Definition</title>

 <indexterm>
  <primary>Index Access Method</primary>
 </indexterm>
 <indexterm>
  <primary>indexam</primary>
  <secondary>Index Access Method</secondary>
 </indexterm>

  <para>
   This chapter defines the interface between the core
   <productname>PostgreSQL</productname> system and <firstterm>index access
   methods</firstterm>, which manage individual index types.  The core system
   knows nothing about indexes beyond what is specified here, so it is
   possible to develop entirely new index types by writing add-on code.
  </para>

  <para>
   All indexes in <productname>PostgreSQL</productname> are what are known
   technically as <firstterm>secondary indexes</firstterm>; that is, the index is
   physically separate from the table file that it describes.  Each index
   is stored as its own physical <firstterm>relation</firstterm> and so is described
   by an entry in the <structname>pg_class</structname> catalog.  The contents of an
   index are entirely under the control of its index access method.  In
   practice, all index access methods divide indexes into standard-size
   pages so that they can use the regular storage manager and buffer manager
   to access the index contents.  (All the existing index access methods
   furthermore use the standard page layout described in <xref
   linkend="storage-page-layout"/>, and most use the same format for index
   tuple headers; but these decisions are not forced on an access method.)
  </para>

  <para>
   An index is effectively a mapping from some data key values to
   <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
   (tuples) in the index's parent table.  A TID consists of a
   block number and an item number within that block (see <xref
   linkend="storage-page-layout"/>).  This is sufficient
   information to fetch a particular row version from the table.
   Indexes are not directly aware that under MVCC, there might be multiple
   extant versions of the same logical row; to an index, each tuple is
   an independent object that needs its own index entry.  Thus, an
   update of a row always creates all-new index entries for the row, even if
   the key values did not change.  (<link linkend="storage-hot">HOT
   tuples</link> are an exception to this
   statement; but indexes do not deal with those, either.)  Index entries for
   dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
   are reclaimed.
  </para>
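
  <para>
   As a rough illustration of this mapping, the following self-contained C
   sketch models a TID as a block number plus an item number and shows two
   versions of one logical row receiving separate index entries.  Every name
   in it (<structname>DemoTid</structname>,
   <structname>DemoIndexEntry</structname>,
   <function>demo_index_insert</function>, and so on) is hypothetical and
   exists only for this example; the server's own TID representation and
   index storage differ in detail.
<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a TID: block number plus item number. */
typedef struct DemoTid
{
    uint32_t    block;          /* block (page) number in the table */
    uint16_t    item;           /* item number within that block */
} DemoTid;

/* Hypothetical index entry: one key value mapped to one row version. */
typedef struct DemoIndexEntry
{
    int         key;
    DemoTid     tid;
} DemoIndexEntry;

/* Two TIDs name the same row version only if both parts match. */
static int
demo_tid_equal(DemoTid a, DemoTid b)
{
    return a.block == b.block && a.item == b.item;
}

/*
 * Append an entry for a row version and return the new entry count.
 * The index does not ask whether the key is already present: each
 * row version is an independent object with its own entry.
 */
static int
demo_index_insert(DemoIndexEntry *entries, int n, int capacity,
                  int key, DemoTid tid)
{
    assert(n < capacity);
    entries[n].key = key;
    entries[n].tid = tid;
    return n + 1;
}
]]></programlisting>
   Inserting key 42 first with TID (7,3) and then, after an update that
   leaves the key unchanged, with TID (7,4) leaves two entries carrying the
   same key but different TIDs.
  </para>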

 <sect1 id="index-api">
  <title>Basic API Structure for Indexes</title>

  <para>
   Each index access method is described by a row in the
   <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
   system catalog.  The <structname>pg_am</structname> entry
   specifies a name and a <firstterm>handler function</firstterm> for the index
   access method.  These entries can be created and deleted using the
   <xref linkend="sql-create-access-method"/> and
   <xref linkend="sql-drop-access-method"/> SQL commands.
  </para>

  <para>
   An index access method handler function must be declared to accept a
   single argument of type <type>internal</type> and to return the
   pseudo-type <type>index_am_handler</type>.  The argument is a dummy value that
   simply serves to prevent handler functions from being called directly from
   SQL commands.  The result of the function must be a palloc'd struct of
   type <structname>IndexAmRoutine</structname>, which contains everything
   that the core code needs to know to make use of the index access method.
   The <structname>IndexAmRoutine</structname> struct, also called the access
   method's <firstterm>API struct</firstterm>, includes fields specifying assorted
   fixed properties of the access method, such as whether it can support
   multicolumn indexes.  More importantly, it contains pointers to support
   functions for the access method, which do all of the real work to access
   indexes.  These support functions are plain C functions and are not
   visible or callable at the SQL level.  The support functions are described
   in <xref linkend="index-functions"/>.
  </para>

  <para>
   The structure <structname>IndexAmRoutine</structname> is defined thus:
<programlisting>
typedef struct IndexAmRoutine
{
    NodeTag     type;

    /*
     * Total number of strategies (operators) by which we can traverse/search
     * this AM.  Zero if AM does not have a fixed set of strategy assignments.
     */
    uint16      amstrategies;
    /* total number of support functions that this AM uses */
    uint16      amsupport;
    /* opclass options support function number or 0 */
    uint16      amoptsprocnum;
    /* does AM support ORDER BY indexed column's value? */
    bool        amcanorder;
    /* does AM support ORDER BY result of an operator on indexed column? */
    bool        amcanorderbyop;
    /* does AM support backward scanning? */
    bool        amcanbackward;
    /* does AM support UNIQUE indexes? */
    bool        amcanunique;
    /* does AM support multi-column indexes? */
    bool        amcanmulticol;
    /* does AM require scans to have a constraint on the first index column? */
    bool        amoptionalkey;
    /* does AM handle ScalarArrayOpExpr quals? */
    bool        amsearcharray;
    /* does AM handle IS NULL/IS NOT NULL quals? */
    bool        amsearchnulls;
    /* can index storage data type differ from column data type? */
    bool        amstorage;
    /* can an index of this type be clustered on? */
    bool        amclusterable;
    /* does AM handle predicate locks? */
    bool        ampredlocks;
    /* does AM support parallel scan? */
    bool        amcanparallel;
    /* does AM support parallel build? */
    bool        amcanbuildparallel;
    /* does AM support columns included with clause INCLUDE? */
    bool        amcaninclude;
    /* does AM use maintenance_work_mem? */
    bool        amusemaintenanceworkmem;
    /* does AM summarize tuples, with at least all tuples in the block
     * summarized in one summary */
    bool        amsummarizing;
    /* OR of parallel vacuum flags */
    uint8       amparallelvacuumoptions;
    /* type of data stored in index, or InvalidOid if variable */
    Oid         amkeytype;

    /* interface functions */
    ambuild_function ambuild;
    ambuildempty_function ambuildempty;
    aminsert_function aminsert;
    aminsertcleanup_function aminsertcleanup;
    ambulkdelete_function ambulkdelete;
    amvacuumcleanup_function amvacuumcleanup;
    amcanreturn_function amcanreturn;   /* can be NULL */
    amcostestimate_function amcostestimate;
    amoptions_function amoptions;
    amproperty_function amproperty;     /* can be NULL */
    ambuildphasename_function ambuildphasename;   /* can be NULL */
    amvalidate_function amvalidate;
    amadjustmembers_function amadjustmembers; /* can be NULL */
    ambeginscan_function ambeginscan;
    amrescan_function amrescan;
    amgettuple_function amgettuple;     /* can be NULL */
    amgetbitmap_function amgetbitmap;   /* can be NULL */
    amendscan_function amendscan;
    ammarkpos_function ammarkpos;       /* can be NULL */
    amrestrpos_function amrestrpos;     /* can be NULL */

    /* interface functions to support parallel index scans */
    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
    amparallelrescan_function amparallelrescan;    /* can be NULL */
} IndexAmRoutine;
</programlisting>
  </para>
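
  <para>
   The general pattern of a handler that returns a struct of fixed
   properties and function pointers can be sketched in plain C without any
   <productname>PostgreSQL</productname> headers.  The sketch below is only a
   reduced model under stated assumptions:
   <structname>DemoAmRoutine</structname>,
   <function>demo_am_handler</function>, and the callback signatures are
   hypothetical, and <function>calloc</function> stands in for the palloc
   allocation that a real handler must use.  It shows a handler filling in a
   flag, a required callback, and an optional callback left as
   <symbol>NULL</symbol>, plus a caller that tests the optional pointer
   before relying on it.
<programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback types; the real ones take server-specific arguments. */
typedef bool (*demo_insert_fn) (int key);
typedef bool (*demo_gettuple_fn) (int key);

/* Hypothetical, much-reduced analogue of IndexAmRoutine. */
typedef struct DemoAmRoutine
{
    bool        amcanmulticol;          /* fixed property of the method */
    demo_insert_fn aminsert;            /* required callback */
    demo_gettuple_fn amgettuple;        /* optional callback, can be NULL */
} DemoAmRoutine;

/* Pretend insertion succeeds only for non-negative keys. */
static bool
demo_insert(int key)
{
    return key >= 0;
}

/* Handler analogue: allocate the API struct and fill in every field. */
static DemoAmRoutine *
demo_am_handler(void)
{
    DemoAmRoutine *routine = calloc(1, sizeof(DemoAmRoutine));

    if (routine == NULL)
        return NULL;
    routine->amcanmulticol = false;
    routine->aminsert = demo_insert;
    routine->amgettuple = NULL;     /* this demo method has no tuple scan */
    return routine;
}

/* Caller analogue: an optional callback must be checked before use. */
static bool
demo_supports_gettuple(const DemoAmRoutine *routine)
{
    return routine->amgettuple != NULL;
}
]]></programlisting>
   A caller would obtain the struct once from the handler, read the flags to
   learn what the method can do, and invoke the callbacks through the
   pointers; in this model the caller also owns the allocation and releases
   it with <function>free</function>.
  </para>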

  <para>
   To be useful, an index access method must also have one or more
   <firstterm>operator families</firstterm> and
   <firstterm>operator classes</firstterm> defined in
   <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
   <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
   <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
   <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
   These entries allow the planner
   to determine what kinds of query qualifications can be used with
   indexes of this access method.  Operator families and classes are described
   in <xref linkend="xindex"/>, which is prerequisite material for reading
   this chapter.
  </para>

  <para>
   An individual index is defined by a
   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
   entry that describes it as a physical relation, plus a
   <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
   entry that shows the logical content of the index &mdash; that is, the set
   of index columns it has and the semantics of those columns, as captured by
   the associated operator classes.  The index columns (key values) can be
   either simple columns of the underlying table or expressions over the table
   rows.  The index access method normally has no interest in where the index
   key values come from (it is always handed precomputed key values) but it
   will be very interested in the operator class information in
   <structname>pg_index</structname>.  Both of these catalog entries can be
   accessed as part of the <structname>Relation</structname> data structure that is
   passed to all operations on the index.
  </para>

<para>
Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
implications. The requirements of <structfield>amcanunique</structfield>
are discussed in <xref linkend="index-unique-checks"/>.
The <structfield>amcanmulticol</structfield> flag asserts that the
access method supports multi-key-column indexes, while
<structfield>amoptionalkey</structfield> asserts that it allows scans
where no indexable restriction clause is given for the first index column.
When <structfield>amcanmulticol</structfield> is false,
<structfield>amoptionalkey</structfield> essentially says whether the
access method supports full-index scans without any restriction clause.
Access methods that support multiple index columns <emphasis>must</emphasis>
support scans that omit restrictions on any or all of the columns after
the first; however, they are permitted to require some restriction to
appear for the first index column, and this is signaled by setting
<structfield>amoptionalkey</structfield> false.
One reason that an index <acronym>AM</acronym> might set
<structfield>amoptionalkey</structfield> false is if it doesn't index
null values. Since most indexable operators are
strict and hence cannot return true for null inputs,
it is at first sight attractive to not store index entries for null values:
they could never be returned by an index scan anyway. However, this
argument fails when an index scan has no restriction clause for a given
index column. In practice this means that
indexes that have <structfield>amoptionalkey</structfield> true must
index nulls, since the planner might decide to use such an index
with no scan keys at all. A related restriction is that an index
access method that supports multiple index columns <emphasis>must</emphasis>
support indexing null values in columns after the first, because the planner
will assume the index can be used for queries that do not restrict
these columns. For example, consider an index on (a,b) and a query with
<literal>WHERE a = 4</literal>. The system will assume the index can be
used to scan for rows with <literal>a = 4</literal>, which is wrong if the
index omits rows where <literal>b</literal> is null.
It is, however, OK to omit rows where the first indexed column is null.
An index access method that does index nulls may also set
<structfield>amsearchnulls</structfield>, indicating that it supports
<literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
conditions.
</para>
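
<para>
The (a,b) example can be made concrete with a small standalone sketch.
It does not use any <productname>PostgreSQL</productname> API: the names
<literal>DemoRow</literal> and <literal>demo_scan_a_equals</literal> are
hypothetical, and the index is modeled as a plain array scan. The sketch only
shows that a scan for <literal>a = 4</literal> loses rows if entries with a
null <literal>b</literal> were never stored.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-column table row; b_isnull marks a SQL NULL in column b. */
typedef struct DemoRow
{
    int         a;
    int         b;
    bool        b_isnull;
} DemoRow;

/*
 * Count the rows that a scan for "a = key" finds through a toy index on (a,b).
 * When skip_null_b is true, the toy index never stored entries whose b is
 * NULL, which is the behavior the text above rules out.
 */
size_t
demo_scan_a_equals(const DemoRow *rows, size_t nrows, int key, bool skip_null_b)
{
    size_t      matches = 0;

    assert(rows != NULL || nrows == 0);

    for (size_t i = 0; i < nrows; i++)
    {
        if (skip_null_b && rows[i].b_isnull)
            continue;           /* never indexed, so the scan cannot see it */
        if (rows[i].a == key)
            matches++;
    }
    return matches;
}
```

<para>
With the rows (4, 1), (4, NULL) and (5, 2), the complete index finds both
rows having <literal>a = 4</literal>, while the null-skipping variant finds
only one of them.
</para>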

<para>
The <structfield>amcaninclude</structfield> flag indicates whether the
access method supports <quote>included</quote> columns, that is, it can
store (without processing) additional columns beyond the key column(s).
The requirements of the preceding paragraph apply only to the key
columns. In particular, the combination
of <structfield>amcanmulticol</structfield>=<literal>false</literal>
and <structfield>amcaninclude</structfield>=<literal>true</literal> is
sensible: it means that there can only be one key column, but there can
also be included column(s). Also, included columns must be allowed to be
null, independently of <structfield>amoptionalkey</structfield>.
</para>

<para>
The <structfield>amsummarizing</structfield> flag indicates whether the
access method summarizes the indexed tuples, with summarizing granularity
of at least per block.
Access methods that do not point to individual tuples, but to block ranges
(like <acronym>BRIN</acronym>), may allow the <acronym>HOT</acronym> optimization
to continue. This does not apply to attributes referenced in index
predicates; an update of such an attribute always disables <acronym>HOT</acronym>.
</para>

</sect1>

<sect1 id="index-functions">
<title>Index Access Method Functions</title>

<para>
The index construction and maintenance functions that an index access
method must provide in <structname>IndexAmRoutine</structname> are:
</para>

<para>
<programlisting>
IndexBuildResult *
ambuild (Relation heapRelation,
         Relation indexRelation,
         IndexInfo *indexInfo);
</programlisting>
Build a new index. The index relation has been physically created,
but is empty. It must be filled in with whatever fixed data the
access method requires, plus entries for all tuples already existing
in the table. Ordinarily the <function>ambuild</function> function will call
<function>table_index_build_scan()</function> to scan the table for existing tuples
and compute the keys that need to be inserted into the index.
The function must return a palloc'd struct containing statistics about
the new index.
The <structfield>amcanbuildparallel</structfield> flag indicates whether
the access method supports parallel index builds. When set to <literal>true</literal>,
the system will attempt to allocate parallel workers for the build.
Access methods supporting only non-parallel index builds should leave
this flag set to <literal>false</literal>.
</para>

<para>
<programlisting>
void
ambuildempty (Relation indexRelation);
</programlisting>
Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
of the given relation. This method is called only for unlogged indexes; the
empty index written to the initialization fork will be copied over the main
relation fork on each server restart.
</para>

<para>
<programlisting>
bool
aminsert (Relation indexRelation,
          Datum *values,
          bool *isnull,
          ItemPointer heap_tid,
          Relation heapRelation,
          IndexUniqueCheck checkUnique,
          bool indexUnchanged,
          IndexInfo *indexInfo);
</programlisting>
Insert a new tuple into an existing index. The <literal>values</literal> and
<literal>isnull</literal> arrays give the key values to be indexed, and
<literal>heap_tid</literal> is the TID to be indexed.
If the access method supports unique indexes (its
<structfield>amcanunique</structfield> flag is true) then
<literal>checkUnique</literal> indicates the type of uniqueness check to
perform. This varies depending on whether the unique constraint is
deferrable; see <xref linkend="index-unique-checks"/> for details.
Normally the access method only needs the <literal>heapRelation</literal>
parameter when performing uniqueness checking (since then it will have to
look into the heap to verify tuple liveness).
</para>

<para>
The <literal>indexUnchanged</literal> Boolean value gives a hint
about the nature of the tuple to be indexed. When it is true,
the tuple is a duplicate of some existing tuple in the index. The
new tuple is a logically unchanged successor MVCC tuple version. This
happens when an <command>UPDATE</command> takes place that does not
modify any columns covered by the index, but nevertheless requires a
new version in the index. The index AM may use this hint to decide
to apply bottom-up index deletion in parts of the index where many
versions of the same logical row accumulate. Note that updating a non-key
column or a column that only appears in a partial index predicate does not
affect the value of <literal>indexUnchanged</literal>. The core code
determines each tuple's <literal>indexUnchanged</literal> value using a
low-overhead approach that allows both false positives and false negatives.
Index AMs must not treat <literal>indexUnchanged</literal> as an
authoritative source of information about tuple visibility or versioning.
</para>

<para>
The function's Boolean result value is significant only when
<literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
In this case a true result means the new entry is known unique, whereas
false means it might be non-unique (and a deferred uniqueness check must
be scheduled). For other cases a constant false result is recommended.
</para>
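
<para>
The rule for the result value can be sketched as a small standalone function.
This is only an illustration under stated assumptions: the enum
<literal>DemoUniqueCheck</literal>, its two values, and
<literal>demo_aminsert_result</literal> are hypothetical names that collapse
every mode other than <literal>UNIQUE_CHECK_PARTIAL</literal> into a single
placeholder, and the flag <literal>maybe_duplicate</literal> stands in for
whatever the access method learned while inserting.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for IndexUniqueCheck, reduced to the two cases. */
typedef enum DemoUniqueCheck
{
    DEMO_CHECK_OTHER,           /* stands in for every non-partial mode */
    DEMO_CHECK_PARTIAL          /* stands in for UNIQUE_CHECK_PARTIAL */
} DemoUniqueCheck;

/*
 * Decide what an aminsert-like function should return.  maybe_duplicate is
 * true when a possibly conflicting entry was seen during the insertion.
 */
bool
demo_aminsert_result(DemoUniqueCheck checkUnique, bool maybe_duplicate)
{
    assert(checkUnique == DEMO_CHECK_OTHER ||
           checkUnique == DEMO_CHECK_PARTIAL);

    if (checkUnique == DEMO_CHECK_PARTIAL)
        return !maybe_duplicate;    /* true means "known unique" */

    return false;                   /* not significant; constant false */
}
```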

<para>
Some indexes might not index all tuples. If the tuple is not to be
indexed, <function>aminsert</function> should just return without doing anything.
</para>

<para>
If the index AM wishes to cache data across successive index insertions
within an SQL statement, it can allocate space
in <literal>indexInfo->ii_Context</literal> and store a pointer to the
data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL
initially). If resources other than memory have to be released after
index insertions, <function>aminsertcleanup</function> may be provided,
which will be called before the memory is released.
</para>
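
<para>
The caching pattern can be sketched without any server headers. In this
hypothetical model, <literal>DemoIndexInfo</literal> carries only an
<literal>ii_AmCache</literal> pointer, <function>calloc</function> stands in
for an allocation in <literal>indexInfo->ii_Context</literal>, and all
<literal>demo_</literal> names are invented for illustration. The point is
that setup work runs once, on the first insertion, and later insertions
reuse the cached state.
</para>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for IndexInfo; only the cache pointer is modeled. */
typedef struct DemoIndexInfo
{
    void       *ii_AmCache;     /* NULL until the first insertion */
} DemoIndexInfo;

/* Hypothetical per-statement state that the access method wants to keep. */
typedef struct DemoInsertState
{
    int         setup_runs;     /* how often the expensive setup ran */
    int         inserts;        /* insertions served from this state */
} DemoInsertState;

/* Return the cached state, building it on first use. */
DemoInsertState *
demo_get_insert_state(DemoIndexInfo *indexInfo)
{
    DemoInsertState *state = (DemoInsertState *) indexInfo->ii_AmCache;

    if (state == NULL)
    {
        /* calloc stands in for allocating in indexInfo->ii_Context */
        state = (DemoInsertState *) calloc(1, sizeof(DemoInsertState));
        assert(state != NULL);
        state->setup_runs++;    /* expensive setup happens only here */
        indexInfo->ii_AmCache = state;
    }
    return state;
}

/* Model of one insertion that relies on the cached state. */
void
demo_insert(DemoIndexInfo *indexInfo)
{
    DemoInsertState *state = demo_get_insert_state(indexInfo);

    state->inserts++;
}

/* Model of the cleanup step; a real AM would release non-memory resources. */
void
demo_insert_cleanup(DemoIndexInfo *indexInfo)
{
    free(indexInfo->ii_AmCache);
    indexInfo->ii_AmCache = NULL;
}
```

<para>
In the server the memory itself is released together with its context, so a
cleanup function there is only needed for resources other than memory; the
<function>free</function> call here exists only because the sketch uses the
C heap.
</para>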

<para>
<programlisting>
void
aminsertcleanup (Relation indexRelation,
                 IndexInfo *indexInfo);
</programlisting>
Clean up state that was maintained across successive inserts in
<literal>indexInfo->ii_AmCache</literal>. This is useful if the data
requires additional cleanup steps (e.g., releasing pinned buffers), and
simply releasing the memory is not sufficient.
</para>

<para>
<programlisting>
IndexBulkDeleteResult *
ambulkdelete (IndexVacuumInfo *info,
              IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback,
              void *callback_state);
</programlisting>
Delete tuple(s) from the index. This is a <quote>bulk delete</quote> operation
that is intended to be implemented by scanning the whole index and checking
each entry to see if it should be deleted.
The passed-in <literal>callback</literal> function must be called, in the style
<literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
to determine whether any particular index entry, as identified by its
referenced TID, is to be deleted. The function must return either NULL or a
palloc'd struct containing statistics about the effects of the deletion operation.
It is OK to return NULL if no information needs to be passed on to
<function>amvacuumcleanup</function>.
</para>

<para>
Because of limited <varname>maintenance_work_mem</varname>,
<function>ambulkdelete</function> might need to be called more than once when many
tuples are to be deleted. The <literal>stats</literal> argument is the result
of the previous call for this index (it is NULL for the first call within a
<command>VACUUM</command> operation). This allows the AM to accumulate statistics
across the whole operation. Typically, <function>ambulkdelete</function> will
modify and return the same struct if the passed <literal>stats</literal> is not
null.
</para>
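
<para>
The way statistics accumulate across repeated calls can be sketched with a
toy model. Everything here is hypothetical: the index is an array of integer
TIDs, <literal>DemoBulkDeleteResult</literal> has just two counters, and
<function>calloc</function> stands in for <function>palloc</function>. The
sketch follows the convention described above: allocate the struct when
<literal>stats</literal> is NULL, otherwise update and return the one passed in.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical statistics struct with two running counters. */
typedef struct DemoBulkDeleteResult
{
    double      tuples_removed;     /* total removed over all calls */
    double      num_index_tuples;   /* entries left after the latest call */
} DemoBulkDeleteResult;

/* Callback in the style callback(TID, callback_state) returns bool. */
typedef bool (*DemoBulkDeleteCallback) (int tid, void *callback_state);

/* Toy index: a mutable array of TIDs. */
typedef struct DemoIndex
{
    int        *tids;
    size_t      ntids;
} DemoIndex;

/* Example callback state: the batch of dead TIDs for one call. */
typedef struct DemoDeadTids
{
    const int  *tids;
    size_t      ntids;
} DemoDeadTids;

bool
demo_tid_is_dead(int tid, void *callback_state)
{
    const DemoDeadTids *dead = (const DemoDeadTids *) callback_state;

    for (size_t i = 0; i < dead->ntids; i++)
        if (dead->tids[i] == tid)
            return true;
    return false;
}

/* Scan the whole toy index, dropping entries the callback reports dead. */
DemoBulkDeleteResult *
demo_bulkdelete(DemoIndex *index, DemoBulkDeleteResult *stats,
                DemoBulkDeleteCallback callback, void *callback_state)
{
    size_t      kept = 0;

    if (stats == NULL)          /* first call within this operation */
        stats = (DemoBulkDeleteResult *) calloc(1, sizeof(*stats));
    assert(stats != NULL);

    for (size_t i = 0; i < index->ntids; i++)
    {
        if (callback(index->tids[i], callback_state))
            stats->tuples_removed += 1;
        else
            index->tids[kept++] = index->tids[i];
    }
    index->ntids = kept;
    stats->num_index_tuples = (double) kept;
    return stats;
}
```

<para>
Calling <function>demo_bulkdelete</function> twice with different batches of
dead TIDs, passing the first result into the second call, yields a single
struct whose removal count covers both passes.
</para>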

<para>
<programlisting>
IndexBulkDeleteResult *
amvacuumcleanup (IndexVacuumInfo *info,
                 IndexBulkDeleteResult *stats);
</programlisting>
Clean up after a <command>VACUUM</command> operation (zero or more
<function>ambulkdelete</function> calls). This does not have to do anything
beyond returning index statistics, but it might perform bulk cleanup
such as reclaiming empty index pages. <literal>stats</literal> is whatever the
last <function>ambulkdelete</function> call returned, or NULL if
<function>ambulkdelete</function> was not called because no tuples needed to be
deleted. If the result is not NULL it must be a palloc'd struct.
The statistics it contains will be used to update <structname>pg_class</structname>,
and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
It is OK to return NULL if the index was not changed at all during the
<command>VACUUM</command> operation, but otherwise correct stats should
be returned.
</para>

<para>
<function>amvacuumcleanup</function> will also be called at completion of an
<command>ANALYZE</command> operation. In this case <literal>stats</literal> is always
NULL and any return value will be ignored. This case can be distinguished
by checking <literal>info->analyze_only</literal>. It is recommended
that the access method do nothing except post-insert cleanup in such a
call, and that only in an autovacuum worker process.
</para>
|
|
|
|
|
2006-07-04 00:45:41 +02:00
|
|
|
<para>
|
|
|
|
<programlisting>
|
2011-12-18 21:49:00 +01:00
|
|
|
bool
|
2015-03-26 18:12:00 +01:00
|
|
|
amcanreturn (Relation indexRelation, int attno);
|
2011-12-18 21:49:00 +01:00
|
|
|
</programlisting>
|
2016-05-08 22:36:19 +02:00
|
|
|
Check whether the index can support <link
linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
the given column, by returning the column's original indexed value.
The attribute number is 1-based, i.e., the first column's attno is 1.
Returns true if supported, else false.
This function should always return true for included columns
(if those are supported), since there's little point in an included
column that can't be retrieved.
If the access method does not support index-only scans at all,
the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
struct can be set to NULL.
</para>
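
<para>
The rule for included columns described above might be sketched as follows.
This is a standalone illustration, not code from any real access method:
<literal>DemoIndexInfo</literal>, its fields, and <literal>demo_canreturn</literal>
are hypothetical stand-ins, and the <literal>keys_lossy</literal> flag is an
assumed way of recording whether key columns keep their original values.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for what a real AM would learn from the index relation. */
typedef struct DemoIndexInfo
{
    int  nkeyatts;      /* number of key columns */
    int  natts;         /* total columns: key plus included */
    bool keys_lossy;    /* assumption: key columns are stored in a lossy form */
} DemoIndexInfo;

/* Sketch of an amcanreturn-style check; attno is 1-based. */
bool
demo_canreturn(const DemoIndexInfo *index, int attno)
{
    assert(attno >= 1 && attno <= index->natts);

    /* Included columns are always reported as returnable. */
    if (attno > index->nkeyatts)
        return true;

    /* Key columns are returnable only if the original value is kept. */
    return !index->keys_lossy;
}
```
With two key columns stored lossily and one included column, this sketch
reports only column 3 as returnable.
</para>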

<para>
<programlisting>
void
amcostestimate (PlannerInfo *root,
                IndexPath *path,
                double loop_count,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation,
                double *indexPages);
</programlisting>
Estimate the costs of an index scan. This function is described fully
in <xref linkend="index-cost-estimation"/>, below.
</para>

<para>
<programlisting>
bytea *
amoptions (ArrayType *reloptions,
           bool validate);
</programlisting>
Parse and validate the reloptions array for an index. This is called only
when a non-null reloptions array exists for the index.
<parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
The function should construct a <type>bytea</type> value, which will be copied
into the <structfield>rd_options</structfield> field of the index's relcache entry.
The data contents of the <type>bytea</type> value are open for the access
method to define; most of the standard access methods use struct
<structname>StdRdOptions</structname>.
When <parameter>validate</parameter> is true, the function should report a suitable
error message if any of the options are unrecognized or have invalid
values; when <parameter>validate</parameter> is false, invalid entries should be
silently ignored. (<parameter>validate</parameter> is false when loading options
already stored in <structname>pg_catalog</structname>; an invalid entry could only
be found if the access method has changed its rules for options, and in
that case ignoring obsolete entries is appropriate.)
It is OK to return NULL if default behavior is wanted.
</para>
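
<para>
The effect of <parameter>validate</parameter> can be illustrated with a
standalone sketch. It does not use the real reloptions machinery: the
entries are plain C strings rather than a <type>text</type> array, the result
is a hypothetical <literal>DemoOptions</literal> struct rather than a
<type>bytea</type>, a <literal>false</literal> return stands in for reporting
an error, and the single <literal>fillfactor</literal> option with range
10..100 is an assumption made for the example.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed-options struct; a real AM would return a bytea. */
typedef struct DemoOptions
{
    int fillfactor;
} DemoOptions;

#define DEMO_DEFAULT_FILLFACTOR 90

/*
 * Parse "name=value" entries.  Returns false only when validate is true and
 * an entry is unrecognized or out of range (standing in for an error report).
 */
bool
demo_parse_options(const char *const *entries, int nentries,
                   bool validate, DemoOptions *out)
{
    assert(out != NULL);
    out->fillfactor = DEMO_DEFAULT_FILLFACTOR;

    for (int i = 0; i < nentries; i++)
    {
        const char *eq = strchr(entries[i], '=');
        bool        ok = false;

        if (eq != NULL &&
            (size_t) (eq - entries[i]) == strlen("fillfactor") &&
            strncmp(entries[i], "fillfactor", strlen("fillfactor")) == 0)
        {
            char *end;
            long  val = strtol(eq + 1, &end, 10);

            if (end != eq + 1 && *end == '\0' && val >= 10 && val <= 100)
            {
                out->fillfactor = (int) val;
                ok = true;
            }
        }

        if (!ok && validate)
            return false;       /* complain about the bad entry */
        /* with validate false, the bad entry is silently skipped */
    }
    return true;
}
```
Given the entries <literal>fillfactor=70</literal> and <literal>obsolete=1</literal>,
the sketch accepts both when <parameter>validate</parameter> is false, keeping only
the fill factor, and rejects the second when <parameter>validate</parameter> is true.
</para>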

<para>
<programlisting>
bool
amproperty (Oid index_oid, int attno,
            IndexAMProperty prop, const char *propname,
            bool *res, bool *isnull);
</programlisting>
The <function>amproperty</function> method allows index access methods to override
the default behavior of <function>pg_index_column_has_property</function>
and related functions.
If the access method does not have any special behavior for index property
inquiries, the <structfield>amproperty</structfield> field in
its <structname>IndexAmRoutine</structname> struct can be set to NULL.
Otherwise, the <function>amproperty</function> method will be called with
<parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
<function>pg_indexam_has_property</function> calls,
or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
<function>pg_index_has_property</function> calls,
or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
zero for <function>pg_index_column_has_property</function> calls.
<parameter>prop</parameter> is an enum value identifying the property being tested,
while <parameter>propname</parameter> is the original property name string.
If the core code does not recognize the property name
then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
Access methods can define custom property names by
checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
to match, for consistency with the core code); for names known to the core
code, it's better to inspect <parameter>prop</parameter>.
If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
it has determined the property test result: it must set <literal>*res</literal>
to the Boolean value to return, or set <literal>*isnull</literal>
to <literal>true</literal> to return a NULL. (Both of the referenced variables
are initialized to <literal>false</literal> before the call.)
If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
the core code will proceed with its normal logic for determining the
property test result.
</para>

<para>
Access methods that support ordering operators should
implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
core code does not know how to do that and will return NULL. It may
also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
if that can be done more cheaply than by opening the index and calling
<function>amcanreturn</function>, which is the core code's default behavior.
The default behavior should be satisfactory for all other standard
properties.
</para>

<para>
<programlisting>
char *
ambuildphasename (int64 phasenum);
</programlisting>
Return the textual name of the given build phase number.
The phase numbers are those reported during an index build via the
<function>pgstat_progress_update_param</function> interface.
The phase names are then exposed in the
<structname>pg_stat_progress_create_index</structname> view.
</para>
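
<para>
A phase-name lookup of this kind is typically a simple switch. The following
standalone sketch uses invented phase numbers and names, returns
<literal>const char *</literal> rather than <literal>char *</literal>, and
assumes that returning NULL is acceptable for a phase number the access
method does not recognize.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical phase numbers; a real AM defines its own. */
#define DEMO_PHASE_SCAN_TABLE   2
#define DEMO_PHASE_SORT         3
#define DEMO_PHASE_LOAD         4

/* Map a build phase number to a human-readable name. */
const char *
demo_buildphasename(int64_t phasenum)
{
    assert(phasenum >= 0);

    switch (phasenum)
    {
        case DEMO_PHASE_SCAN_TABLE:
            return "scanning table";
        case DEMO_PHASE_SORT:
            return "sorting tuples";
        case DEMO_PHASE_LOAD:
            return "loading tuples into index";
        default:
            return NULL;    /* assumption: unknown phases have no name */
    }
}
```
</para>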

<para>
<programlisting>
bool
amvalidate (Oid opclassoid);
</programlisting>
Validate the catalog entries for the specified operator class, so far as
the access method can reasonably do that. For example, this might include
testing that all required support functions are provided.
The <function>amvalidate</function> function must return false if the opclass is
invalid. Problems should be reported with <function>ereport</function>
messages, typically at <literal>INFO</literal> level.
</para>

<para>
<programlisting>
void
amadjustmembers (Oid opfamilyoid,
                 Oid opclassoid,
                 List *operators,
                 List *functions);
</programlisting>
Validate proposed new operator and function members of an operator family,
so far as the access method can reasonably do that, and set their
dependency types if the default is not satisfactory. This is called
during <command>CREATE OPERATOR CLASS</command> and during
<command>ALTER OPERATOR FAMILY ADD</command>; in the latter
case <parameter>opclassoid</parameter> is <literal>InvalidOid</literal>.
The <type>List</type> arguments are lists
of <structname>OpFamilyMember</structname> structs, as defined
in <filename>amapi.h</filename>.

Tests done by this function will typically be a subset of those
performed by <function>amvalidate</function>,
since <function>amadjustmembers</function> cannot assume that it is
seeing a complete set of members. For example, it would be reasonable
to check the signature of a support function, but not to check whether
all required support functions are provided. Any problems can be
reported by throwing an error.

The dependency-related fields of
the <structname>OpFamilyMember</structname> structs are initialized by
the core code to create hard dependencies on the opclass if this
is <command>CREATE OPERATOR CLASS</command>, or soft dependencies on the
opfamily if this is <command>ALTER OPERATOR FAMILY ADD</command>.
<function>amadjustmembers</function> can adjust these fields if some other
behavior is more appropriate. For example, GIN, GiST, and SP-GiST
always set operator members to have soft dependencies on the opfamily,
since the connection between an operator and an opclass is relatively
weak in these index types; so it is reasonable to allow operator members
to be added and removed freely. Optional support functions are typically
also given soft dependencies, so that they can be removed if necessary.
</para>


<para>
The purpose of an index, of course, is to support scans for tuples matching
an indexable <literal>WHERE</literal> condition, often called a
<firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>. The semantics of
index scanning are described more fully in <xref linkend="index-scanning"/>,
below. An index access method can support <quote>plain</quote> index scans,
<quote>bitmap</quote> index scans, or both. The scan-related functions that an
index access method must or may provide are:
</para>

<para>
<programlisting>
IndexScanDesc
ambeginscan (Relation indexRelation,
             int nkeys,
             int norderbys);
</programlisting>
Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal>
parameters indicate the number of quals and ordering operators that will be
used in the scan; these may be useful for space allocation purposes.
Note that the actual values of the scan keys aren't provided yet.
The result must be a palloc'd struct.
For implementation reasons the index access method
<emphasis>must</emphasis> create this struct by calling
<function>RelationGetIndexScan()</function>. In most cases
<function>ambeginscan</function> does little beyond making that call and perhaps
acquiring locks;
the interesting parts of index-scan startup are in <function>amrescan</function>.
</para>

<para>
<programlisting>
void
amrescan (IndexScanDesc scan,
          ScanKey keys,
          int nkeys,
          ScanKey orderbys,
          int norderbys);
</programlisting>
Start or restart an index scan, possibly with new scan keys. (To restart
using previously passed keys, NULL is passed for <literal>keys</literal> and/or
<literal>orderbys</literal>.) Note that the number of keys or order-by
operators must not be larger than
what was passed to <function>ambeginscan</function>. In practice the restart
feature is used when a new outer tuple is selected by a nested-loop join
and so a new key comparison value is needed, but the scan key structure
remains the same.
</para>
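
<para>
The restart convention might be sketched as follows. This is a standalone
illustration, not the real scan descriptor: <literal>DemoScan</literal>,
<literal>DemoScanKey</literal>, and the fixed-size key array are hypothetical,
and ordering operators are left out for brevity.
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_KEYS 4

/* Hypothetical, simplified scan key: "column attno equals value". */
typedef struct DemoScanKey
{
    int attno;
    int value;
} DemoScanKey;

/* Hypothetical scan state, standing in for the AM's private scan data. */
typedef struct DemoScan
{
    int         max_keys;               /* count announced at begin-scan time */
    int         nkeys;                  /* keys currently in effect */
    DemoScanKey keys[DEMO_MAX_KEYS];    /* private copy of the keys */
    int         position;               /* -1 means "not started" */
} DemoScan;

/* Begin-scan step: only the key count is known, not the key values. */
void
demo_beginscan(DemoScan *scan, int nkeys)
{
    assert(nkeys >= 0 && nkeys <= DEMO_MAX_KEYS);
    scan->max_keys = nkeys;
    scan->nkeys = 0;
    scan->position = -1;
}

/* Rescan step: install new keys, or keep the old ones if keys is NULL. */
void
demo_rescan(DemoScan *scan, const DemoScanKey *keys, int nkeys)
{
    /* Never more keys than were announced at begin-scan time. */
    assert(nkeys >= 0 && nkeys <= scan->max_keys);

    if (keys != NULL)
    {
        memcpy(scan->keys, keys, sizeof(DemoScanKey) * (size_t) nkeys);
        scan->nkeys = nkeys;
    }
    /* keys == NULL: the previously passed keys stay in effect */

    scan->position = -1;    /* in either case, the scan starts over */
}
```
Copying the keys into private storage is a design choice of this sketch; it
lets a later call with NULL reuse them without depending on the caller's
array.
</para>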

<para>
<programlisting>
bool
amgettuple (IndexScanDesc scan,
            ScanDirection direction);
</programlisting>
Fetch the next tuple in the given scan, moving in the given
direction (forward or backward in the index). Returns true if a tuple was
obtained, false if no matching tuples remain. In the true case the tuple
TID is stored into the <literal>scan</literal> structure. Note that
<quote>success</quote> means only that the index contains an entry that matches
the scan keys, not that the tuple necessarily still exists in the heap or
will pass the caller's snapshot test. On success, <function>amgettuple</function>
must also set <literal>scan->xs_recheck</literal> to true or false.
False means it is certain that the index entry matches the scan keys.
True means this is not certain, and the conditions represented by the
scan keys must be rechecked against the heap tuple after fetching it.
This provision supports <quote>lossy</quote> index operators.
Note that rechecking will extend only to the scan conditions; a partial
index predicate (if any) is never rechecked by <function>amgettuple</function>
callers.
</para>

<para>
If the index supports <link linkend="indexes-index-only-scans">index-only
scans</link> (i.e., <function>amcanreturn</function> returns true for any
of its columns),
then on success the AM must also check <literal>scan->xs_want_itup</literal>,
and if that is true it must return the originally indexed data for the
index entry. Columns for which <function>amcanreturn</function> returns
false can be returned as nulls.
The data can be returned in the form of an
<structname>IndexTuple</structname> pointer stored at <literal>scan->xs_itup</literal>,
with tuple descriptor <literal>scan->xs_itupdesc</literal>; or in the form of
a <structname>HeapTuple</structname> pointer stored at <literal>scan->xs_hitup</literal>,
with tuple descriptor <literal>scan->xs_hitupdesc</literal>. (The latter
format should be used when reconstructing data that might possibly not fit
into an <structname>IndexTuple</structname>.) In either case,
management of the data referenced by the pointer is the access method's
responsibility. The data must remain good at least until the next
<function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
call for the scan.
</para>

<para>
The <function>amgettuple</function> function need only be provided if the access
method supports <quote>plain</quote> index scans. If it doesn't, the
<structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
struct must be set to NULL.
</para>
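
<para>
The role of the recheck flag can be illustrated with a deliberately lossy toy
index. In this standalone sketch, which uses no real PostgreSQL structures,
each entry stores only the indexed value divided by 10, so a matching entry
does not prove that the heap value equals the scan key. All names
(<literal>DemoLossyEntry</literal>, <literal>DemoLossyScan</literal>,
<literal>demo_gettuple</literal>) are hypothetical, the
<literal>recheck</literal> field plays the role of
<literal>xs_recheck</literal>, and the direction is reduced to a step of
+1 or -1.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lossy index entry. */
typedef struct DemoLossyEntry
{
    int bucket;     /* indexed value divided by 10: a lossy summary */
    int tid;        /* stand-in for the heap tuple's TID */
} DemoLossyEntry;

/* Hypothetical scan state. */
typedef struct DemoLossyScan
{
    const DemoLossyEntry *entries;
    int   nentries;
    int   key;          /* scan key: heap value must equal this */
    bool  started;      /* has any entry been examined yet? */
    int   pos;          /* position of the last returned entry */
    int   cur_tid;      /* output: TID of the returned entry */
    bool  recheck;      /* output: plays the role of xs_recheck */
} DemoLossyScan;

/* Fetch the next possibly-matching entry; step is +1 (forward) or -1. */
bool
demo_gettuple(DemoLossyScan *scan, int step)
{
    int i;

    assert(step == 1 || step == -1);
    assert(scan->key >= 0);

    if (!scan->started)
    {
        i = (step > 0) ? 0 : scan->nentries - 1;
        scan->started = true;
    }
    else
        i = scan->pos + step;

    for (; i >= 0 && i < scan->nentries; i += step)
    {
        if (scan->entries[i].bucket == scan->key / 10)
        {
            scan->pos = i;
            scan->cur_tid = scan->entries[i].tid;
            /* The bucket matched, but the exact value is unknown. */
            scan->recheck = true;
            return true;
        }
    }
    return false;       /* no matching entries remain */
}
```
An index that stored exact values would instead set the flag to false,
sparing the caller the recheck against the heap tuple.
</para>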

<para>
<programlisting>
int64
amgetbitmap (IndexScanDesc scan,
             TIDBitmap *tbm);
</programlisting>
Fetch all tuples in the given scan and add them to the caller-supplied
<type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
in the bitmap). The number of tuples fetched is returned (this might be
only an approximate count; for instance, some AMs do not detect duplicates).
While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
indicate that rechecking of the scan conditions is required for specific
tuple IDs. This is analogous to the <literal>xs_recheck</literal> output parameter
of <function>amgettuple</function>. Note: in the current implementation, support
for this feature is conflated with support for lossy storage of the bitmap
itself, and therefore callers recheck both the scan conditions and the
partial index predicate (if any) for recheckable tuples. That might not
always be true, however.
<function>amgetbitmap</function> and
<function>amgettuple</function> cannot be used in the same index scan; there
are other restrictions too when using <function>amgetbitmap</function>, as explained
in <xref linkend="index-scanning"/>.
</para>

<para>
The <function>amgetbitmap</function> function need only be provided if the access
method supports <quote>bitmap</quote> index scans. If it doesn't, the
<structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
struct must be set to NULL.
</para>

<para>
<programlisting>
void
amendscan (IndexScanDesc scan);
</programlisting>
End a scan and release resources. The <literal>scan</literal> struct itself
should not be freed, but any locks or pins taken internally by the
access method must be released, as well as any other memory allocated
by <function>ambeginscan</function> and other scan-related functions.
</para>

<para>
<programlisting>
void
ammarkpos (IndexScanDesc scan);
</programlisting>
Mark current scan position. The access method need only support one
remembered scan position per scan.
</para>

<para>
The <function>ammarkpos</function> function need only be provided if the access
method supports ordered scans. If it doesn't,
the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
struct may be set to NULL.
</para>

<para>
<programlisting>
void
amrestrpos (IndexScanDesc scan);
</programlisting>
Restore the scan to the most recently marked position.
</para>
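
<para>
Since only one remembered position per scan is required, mark and restore
can be as simple as saving and reloading a single value. The following
standalone sketch uses a hypothetical <literal>DemoPosScan</literal> struct in
which the scan position is an integer; a real access method would save
whatever state identifies its current position.
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scan state with room for exactly one remembered position. */
typedef struct DemoPosScan
{
    int  pos;           /* current scan position */
    int  marked_pos;    /* the single remembered position */
    bool has_mark;      /* has a position been marked yet? */
} DemoPosScan;

/* Remember the current position, overwriting any earlier mark. */
void
demo_markpos(DemoPosScan *scan)
{
    scan->marked_pos = scan->pos;
    scan->has_mark = true;
}

/* Return to the most recently marked position. */
void
demo_restrpos(DemoPosScan *scan)
{
    assert(scan->has_mark);
    scan->pos = scan->marked_pos;
}
```
</para>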
|
|
|
|
|
|
|
|
<para>
The <function>amrestrpos</function> function need only be provided if the access
method supports ordered scans. If it doesn't,
the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
struct may be set to NULL.
</para>

<para>
In addition to supporting ordinary index scans, some types of index
may wish to support <firstterm>parallel index scans</firstterm>, which allow
multiple backends to cooperate in performing an index scan. The
index access method should arrange things so that each cooperating
process returns a subset of the tuples that would be returned by
an ordinary, non-parallel index scan, but in such a way that the
union of those subsets is equal to the set of tuples that would be
returned by an ordinary, non-parallel index scan. Furthermore, while
there need not be any global ordering of tuples returned by a parallel
scan, the ordering of that subset of tuples returned within each
cooperating backend must match the requested ordering. The following
functions may be implemented to support parallel index scans:
</para>

<para>
<programlisting>
Size
amestimateparallelscan (int nkeys,
                        int norderbys);
</programlisting>
Estimate and return the number of bytes of dynamic shared memory which
the access method will need to perform a parallel scan. (This number
is in addition to, not in lieu of, the amount of space needed for
AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
</para>

<para>
The <literal>nkeys</literal> and <literal>norderbys</literal>
parameters indicate the number of quals and ordering operators that will be
used in the scan; the same values will be passed to <function>amrescan</function>.
Note that the actual values of the scan keys aren't provided yet.
</para>

<para>
It is not necessary to implement this function for access methods which
do not support parallel scans or for which the number of additional bytes
of storage required is zero.
</para>

<para>
<programlisting>
void
aminitparallelscan (void *target);
</programlisting>
This function will be called to initialize dynamic shared memory at the
beginning of a parallel scan. <parameter>target</parameter> will point to at least
the number of bytes previously returned by
<function>amestimateparallelscan</function>, and this function may use that
amount of space to store whatever data it wishes.
</para>

<para>
It is not necessary to implement this function for access methods which
do not support parallel scans or in cases where the shared memory space
required needs no initialization.
</para>

<para>
<programlisting>
void
amparallelrescan (IndexScanDesc scan);
</programlisting>
This function, if implemented, will be called when a parallel index scan
must be restarted. It should reset any shared state set up by
<function>aminitparallelscan</function> such that the scan will be restarted from
the beginning.
</para>

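<para>
As an illustration only, the three parallel-scan callbacks can be modeled
by a small standalone C sketch. It does not use any
<productname>PostgreSQL</productname> headers: <literal>DemoParallelScan</literal>,
the <literal>demo_</literal> function names, and the block-claiming helper are
all hypothetical, <type>size_t</type> stands in for <type>Size</type>, and the
rescan function takes the shared area directly rather than an
<structname>IndexScanDesc</structname>. A real access method would also have to
protect the shared counter with a lock or an atomic operation.
<programlisting><![CDATA[
#include <stddef.h>

/* Hypothetical AM-specific shared state; not a PostgreSQL type. */
typedef struct DemoParallelScan
{
    unsigned int next_block;    /* next index block to hand to a worker */
    unsigned int nblocks;       /* total number of blocks to scan */
} DemoParallelScan;

/* Models amestimateparallelscan: report the AM-specific space needed. */
size_t
demo_estimateparallelscan(int nkeys, int norderbys)
{
    (void) nkeys;               /* this sketch needs no per-key space */
    (void) norderbys;
    return sizeof(DemoParallelScan);
}

/* Models aminitparallelscan: initialize the shared area once. */
void
demo_initparallelscan(void *target)
{
    DemoParallelScan *shared = (DemoParallelScan *) target;

    shared->next_block = 0;
    shared->nblocks = 0;
}

/* Models amparallelrescan: make the scan start over from the beginning. */
void
demo_parallelrescan(DemoParallelScan *shared)
{
    shared->next_block = 0;
}

/*
 * Hypothetical helper: a worker claims the next unclaimed block.
 * Returns 1 and sets *block, or 0 when no blocks remain.
 * (Unsynchronized here; real shared state needs locking.)
 */
int
demo_claim_block(DemoParallelScan *shared, unsigned int *block)
{
    if (shared->next_block >= shared->nblocks)
        return 0;
    *block = shared->next_block++;
    return 1;
}
]]></programlisting>
Because every worker draws from the same counter, each block is claimed by
exactly one worker, so the union of the workers' results covers the whole
scan without duplicates.
</para>
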
</sect1>

<sect1 id="index-scanning">
<title>Index Scanning</title>

<para>
In an index scan, the index access method is responsible for regurgitating
the TIDs of all the tuples it has been told about that match the
<firstterm>scan keys</firstterm>. The access method is <emphasis>not</emphasis> involved in
actually fetching those tuples from the index's parent table, nor in
determining whether they pass the scan's visibility test or other
conditions.
</para>

<para>
A scan key is the internal representation of a <literal>WHERE</literal> clause of
the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
<replaceable>constant</replaceable>, where the index key is one of the columns of the
index and the operator is one of the members of the operator family
associated with that index column. An index scan has zero or more scan
keys, which are implicitly ANDed; that is, the returned tuples are expected
to satisfy all the indicated conditions.
</para>

<para>
The access method can report that the index is <firstterm>lossy</firstterm>, or
requires rechecks, for a particular query. This implies that the index
scan will return all the entries that pass the scan key, plus possibly
additional entries that do not. The core system's index-scan machinery
will then apply the index conditions again to the heap tuple to verify
whether or not it really should be selected. If the recheck option is not
specified, the index scan must return exactly the set of matching entries.
</para>

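<para>
The division of labor in a lossy scan can be pictured with a toy
standalone C model. Everything in it is an assumption made for
illustration: the <quote>index</quote> stores only each value divided by 10,
the condition is simple equality, and the <literal>demo_</literal> functions
are hypothetical rather than part of any <productname>PostgreSQL</productname>
API. The index side returns a superset of candidates, and the recheck
against the full heap value removes the false positives.
<programlisting><![CDATA[
#include <stddef.h>

/* Index side: the lossy entry (value / 10) can only say "might match". */
int
demo_index_might_match(int bucket, int key)
{
    return bucket == key / 10;
}

/* Core side: re-apply the original condition to the full heap value. */
int
demo_recheck(int heap_value, int key)
{
    return heap_value == key;
}

/*
 * Scan a toy heap of non-negative values for "value == key".
 * Writes the surviving values to out[] and returns how many there are.
 */
size_t
demo_lossy_scan(const int *heap, size_t n, int key, int *out)
{
    size_t nout = 0;

    for (size_t i = 0; i < n; i++)
    {
        int bucket = heap[i] / 10;      /* what the lossy index stored */

        if (!demo_index_might_match(bucket, key))
            continue;                   /* filtered out by the index */
        if (demo_recheck(heap[i], key))
            out[nout++] = heap[i];      /* survived the recheck */
    }
    return nout;
}
]]></programlisting>
With heap values 12, 15, 27, 15 and key 15, the index side passes 12 and
both 15s as candidates, and the recheck discards 12.
</para>
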
<para>
Note that it is entirely up to the access method to ensure that it
correctly finds all and only the entries passing all the given scan keys.
Also, the core system will simply hand off all the <literal>WHERE</literal>
clauses that match the index keys and operator families, without any
semantic analysis to determine whether they are redundant or
contradictory. As an example, given
<literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
indexed column, it is left to the b-tree <function>amrescan</function> function
to realize that the first scan key is redundant and can be discarded.
The extent of preprocessing needed during <function>amrescan</function> will
depend on the extent to which the index access method needs to reduce
the scan keys to a <quote>normalized</quote> form.
</para>

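<para>
The redundancy in <literal>x &gt; 4 AND x &gt; 14</literal> can be removed
with very little code once all keys are known to be lower bounds on the
same column. The following standalone C sketch shows that one case under
those assumptions; <literal>DemoScanKey</literal> and
<literal>demo_preprocess_keys</literal> are hypothetical names, and this is
not how the b-tree code itself is written.
<programlisting><![CDATA[
#include <stddef.h>

/* Hypothetical scan key meaning "x > bound" on a single column. */
typedef struct DemoScanKey
{
    int bound;
} DemoScanKey;

/*
 * Collapse ANDed "x > bound" keys into the single strongest key.
 * Returns the number of keys written to *out (0 or 1).
 */
size_t
demo_preprocess_keys(const DemoScanKey *keys, size_t nkeys, DemoScanKey *out)
{
    if (nkeys == 0)
        return 0;

    *out = keys[0];
    for (size_t i = 1; i < nkeys; i++)
    {
        /* A larger lower bound implies every smaller one. */
        if (keys[i].bound > out->bound)
            *out = keys[i];
    }
    return 1;
}
]]></programlisting>
For the two keys in the example, only <literal>x &gt; 14</literal> survives,
so the scan can start from a single boundary.
</para>
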
<para>
Some access methods return index entries in a well-defined order, others
do not. There are actually two different ways that an access method can
support sorted output:

<itemizedlist>
<listitem>
<para>
Access methods that always return entries in the natural ordering
of their data (such as btree) should set
<structfield>amcanorder</structfield> to true.
Currently, such access methods must use btree-compatible strategy
numbers for their equality and ordering operators.
</para>
</listitem>
<listitem>
<para>
Access methods that support ordering operators should set
<structfield>amcanorderbyop</structfield> to true.
This indicates that the index is capable of returning entries in
an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
<replaceable>operator</replaceable> <replaceable>constant</replaceable>. Scan modifiers
of that form can be passed to <function>amrescan</function> as described
previously.
</para>
</listitem>
</itemizedlist>
</para>

<para>
The <function>amgettuple</function> function has a <literal>direction</literal> argument,
which can be either <literal>ForwardScanDirection</literal> (the normal case)
or <literal>BackwardScanDirection</literal>. If the first call after
<function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
set of matching index entries is to be scanned back-to-front rather than in
the normal front-to-back direction, so <function>amgettuple</function> must return
the last matching tuple in the index, rather than the first one as it
normally would. (This will only occur for access
methods that set <structfield>amcanorder</structfield> to true.) After the
first call, <function>amgettuple</function> must be prepared to advance the scan in
either direction from the most recently returned entry. (But if
<structfield>amcanbackward</structfield> is false, all subsequent
calls will have the same direction as the first one.)
</para>

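<para>
The direction rules can be sketched over a plain sorted array standing in
for the set of matching index entries. This is a simplified standalone
model: <literal>DemoScan</literal>, <literal>DemoDirection</literal>, and the
<literal>demo_</literal> functions are hypothetical, and the sketch does not
attempt to model what happens after the scan runs off either end.
<programlisting><![CDATA[
#include <stddef.h>

typedef enum
{
    DEMO_BACKWARD = -1,
    DEMO_FORWARD = 1
} DemoDirection;

/* Hypothetical scan state over an array of matching entries. */
typedef struct DemoScan
{
    const int *items;       /* matching entries, in index order */
    size_t     nitems;
    long       pos;         /* position of the most recently returned entry */
    int        started;     /* has anything been returned since rescan? */
} DemoScan;

/* Models amrescan: forget the current position. */
void
demo_rescan(DemoScan *scan)
{
    scan->started = 0;
    scan->pos = 0;
}

/* Models amgettuple: returns 1 and sets *item, or 0 if nothing is left. */
int
demo_gettuple(DemoScan *scan, DemoDirection dir, int *item)
{
    long next;

    if (!scan->started)
        /* First call: start at the front, or at the back if backward. */
        next = (dir == DEMO_FORWARD) ? 0 : (long) scan->nitems - 1;
    else
        /* Later calls: step from the most recently returned entry. */
        next = scan->pos + (long) dir;

    if (next < 0 || next >= (long) scan->nitems)
        return 0;

    scan->started = 1;
    scan->pos = next;
    *item = scan->items[next];
    return 1;
}
]]></programlisting>
With entries 10, 20, 30, a first backward call returns 30, a second
backward call returns 20, and a following forward call returns 30 again.
</para>
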
<para>
Access methods that support ordered scans must support <quote>marking</quote> a
position in a scan and later returning to the marked position. The same
position might be restored multiple times. However, only one position need
be remembered per scan; a new <function>ammarkpos</function> call overrides the
previously marked position. An access method that does not support ordered
scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
instead.
</para>

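<para>
The single-mark contract is small enough to model directly. In the
following standalone C sketch, which assumes a scan position that can be
represented as one integer, <literal>DemoMarkScan</literal> and the
<literal>demo_</literal> functions are hypothetical stand-ins for
<function>ammarkpos</function> and <function>amrestrpos</function>.
<programlisting><![CDATA[
/* Hypothetical scan state with room for exactly one mark. */
typedef struct DemoMarkScan
{
    long pos;           /* current scan position */
    long mark;          /* the single remembered position */
    int  has_mark;      /* has a position been marked yet? */
} DemoMarkScan;

/* Models ammarkpos: remember the current position, replacing any old mark. */
void
demo_markpos(DemoMarkScan *scan)
{
    scan->mark = scan->pos;
    scan->has_mark = 1;
}

/* Models amrestrpos: return to the mark; the mark itself stays valid. */
void
demo_restrpos(DemoMarkScan *scan)
{
    if (scan->has_mark)
        scan->pos = scan->mark;
}
]]></programlisting>
Restoring does not clear the mark, so the same position can be restored
repeatedly until a new mark replaces it.
</para>
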
<para>
Both the scan position and the mark position (if any) must be maintained
consistently in the face of concurrent insertions or deletions in the
index. It is OK if a freshly-inserted entry is not returned by a scan that
would have found the entry if it had existed when the scan started, or for
the scan to return such an entry upon rescanning or backing
up even though it had not been returned the first time through. Similarly,
a concurrent delete might or might not be reflected in the results of a scan.
What is important is that insertions or deletions not cause the scan to
miss or multiply return entries that were not themselves being inserted or
deleted.
</para>

<para>
If the index stores the original indexed data values (and not some lossy
representation of them), it is useful to
support <link linkend="indexes-index-only-scans">index-only scans</link>, in
which the index returns the actual data, not just the TID of the heap tuple.
This will only avoid I/O if the visibility map shows that the TID is on an
all-visible page; else the heap tuple must be visited anyway to check
MVCC visibility. But that is no concern of the access method's.
</para>

<para>
Instead of using <function>amgettuple</function>, an index scan can be done with
<function>amgetbitmap</function> to fetch all tuples in one call. This can be
noticeably more efficient than <function>amgettuple</function> because it allows
avoiding lock/unlock cycles within the access method. In principle
<function>amgetbitmap</function> should have the same effects as repeated
<function>amgettuple</function> calls, but we impose several restrictions to
simplify matters. First of all, <function>amgetbitmap</function> returns all
tuples at once and marking or restoring scan positions isn't
supported. Secondly, the tuples are returned in a bitmap which doesn't
have any specific ordering, which is why <function>amgetbitmap</function> doesn't
take a <literal>direction</literal> argument. (Ordering operators will never be
supplied for such a scan, either.)
Also, there is no provision for index-only scans with
<function>amgetbitmap</function>, since there is no way to return the contents of
index tuples.
Finally, <function>amgetbitmap</function>
does not guarantee any locking of the returned tuples, with implications
spelled out in <xref linkend="index-locking"/>.
</para>

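<para>
To make the contrast with tuple-at-a-time scanning concrete, the following
standalone C sketch collects every match in a single call. It is only a
model: a 64-bit integer stands in for the real bitmap structure, array
positions stand in for TIDs, and <literal>demo_getbitmap</literal> and
<literal>DEMO_MAX_TIDS</literal> are hypothetical names.
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TIDS 64        /* capacity of the toy bitmap */

/*
 * Set one bit per matching position and return the number of matches.
 * The bitmap carries no ordering and no position to mark or restore.
 */
int64_t
demo_getbitmap(const int *values, size_t n, int key, uint64_t *bitmap)
{
    int64_t ntids = 0;

    for (size_t i = 0; i < n && i < DEMO_MAX_TIDS; i++)
    {
        if (values[i] == key)
        {
            *bitmap |= (uint64_t) 1 << i;   /* record this "TID" */
            ntids++;
        }
    }
    return ntids;
}
]]></programlisting>
The caller gets a set of matches and a count, but no sequence, which is
why direction and mark/restore have no meaning for this kind of scan.
</para>
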
<para>
Note that it is permitted for an access method to implement only
<function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
if its internal implementation is unsuited to one API or the other.
</para>

</sect1>

<sect1 id="index-locking">
<title>Index Locking Considerations</title>

<para>
Index access methods must handle concurrent updates
of the index by multiple processes.
The core <productname>PostgreSQL</productname> system obtains
<literal>AccessShareLock</literal> on the index during an index scan, and
<literal>RowExclusiveLock</literal> when updating the index (including plain
<command>VACUUM</command>). Since these lock types do not conflict, the access
method is responsible for handling any fine-grained locking it might need.
An <literal>ACCESS EXCLUSIVE</literal> lock on the index as a whole will be
taken only during index creation, destruction, or <command>REINDEX</command>
(<literal>SHARE UPDATE EXCLUSIVE</literal> is taken instead with
<literal>CONCURRENTLY</literal>).
</para>

<para>
Building an index type that supports concurrent updates usually requires
extensive and subtle analysis of the required behavior. For the b-tree
and hash index types, you can read about the design decisions involved in
<filename>src/backend/access/nbtree/README</filename> and
<filename>src/backend/access/hash/README</filename>.
</para>

<para>
Aside from the index's own internal consistency requirements, concurrent
updates create issues about consistency between the parent table (the
<firstterm>heap</firstterm>) and the index. Because
<productname>PostgreSQL</productname> separates accesses
and updates of the heap from those of the index, there are windows in
which the index might be inconsistent with the heap. We handle this problem
with the following rules:

<itemizedlist>
<listitem>
<para>
A new heap entry is made before making its index entries. (Therefore
a concurrent index scan is likely to fail to see the heap entry.
This is okay because the index reader would be uninterested in an
uncommitted row anyway. But see <xref linkend="index-unique-checks"/>.)
</para>
</listitem>
<listitem>
<para>
When a heap entry is to be deleted (by <command>VACUUM</command>), all its
index entries must be removed first.
</para>
</listitem>
<listitem>
<para>
An index scan must maintain a pin
on the index page holding the item last returned by
<function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
entries from pages that are pinned by other backends. The need
for this rule is explained below.
</para>
</listitem>
</itemizedlist>

Without the third rule, it is possible for an index reader to
|
2017-10-09 03:44:17 +02:00
|
|
|
see an index entry just before it is removed by <command>VACUUM</command>, and
|
2005-02-13 04:04:15 +01:00
|
|
|
then to arrive at the corresponding heap entry after that was removed by
|
2017-10-09 03:44:17 +02:00
|
|
|
<command>VACUUM</command>.
|
2005-02-13 04:04:15 +01:00
|
|
|
This creates no serious problems if that item
|
|
|
|
number is still unused when the reader reaches it, since an empty
|
2017-10-09 03:44:17 +02:00
|
|
|
item slot will be ignored by <function>heap_fetch()</function>. But what if a
|
2005-02-13 04:04:15 +01:00
|
|
|
third backend has already re-used the item slot for something else?
|
|
|
|
When using an MVCC-compliant snapshot, there is no problem because
|
|
|
|
the new occupant of the slot is certain to be too new to pass the
|
|
|
|
snapshot test. However, with a non-MVCC-compliant snapshot (such as
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>SnapshotAny</literal>), it would be possible to accept and return
|
2005-02-13 04:04:15 +01:00
|
|
|
a row that does not in fact match the scan keys. We could defend
|
|
|
|
against this scenario by requiring the scan keys to be rechecked
|
|
|
|
against the heap row in all cases, but that is too expensive. Instead,
|
|
|
|
we use a pin on an index page as a proxy to indicate that the reader
|
2017-10-09 03:44:17 +02:00
|
|
|
might still be <quote>in flight</quote> from the index entry to the matching
|
|
|
|
heap entry. Making <function>ambulkdelete</function> block on such a pin ensures
|
|
|
|
that <command>VACUUM</command> cannot delete the heap entry before the reader
|
2005-11-05 00:14:02 +01:00
|
|
|
is done with it. This solution costs little in run time, and adds blocking
|
2005-02-13 04:04:15 +01:00
|
|
|
overhead only in the rare cases where there actually is a conflict.
|
|
|
|
</para>
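<para>
The pin-as-proxy rule can be pictured with a small stand-alone model.
This is only a sketch: <literal>ModelIndexPage</literal>,
<function>model_page_safe_to_clean</function> and
<function>model_bulkdelete</function> are hypothetical names invented for
illustration and are not part of the access method API. The model also
skips a pinned page where a real access method would block until the pin
is released.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of an index page; not a PostgreSQL struct. */
typedef struct ModelIndexPage
{
    int pin_count;      /* pins currently held by all backends */
    int own_pins;       /* pins held by the backend running VACUUM */
    int n_dead_entries; /* entries that VACUUM wants to remove */
} ModelIndexPage;

/* True when no backend other than the caller holds a pin on the page. */
static bool
model_page_safe_to_clean(const ModelIndexPage *page)
{
    return page->pin_count - page->own_pins == 0;
}

/*
 * Remove dead entries from every page that no other backend has pinned.
 * Returns the number of pages left untouched because of a foreign pin.
 * (Assumption of this sketch: pinned pages are skipped, not waited for.)
 */
static size_t
model_bulkdelete(ModelIndexPage *pages, size_t npages)
{
    size_t blocked = 0;

    for (size_t i = 0; i < npages; i++)
    {
        if (!model_page_safe_to_clean(&pages[i]))
        {
            /* A reader might still be "in flight" to the heap. */
            blocked++;
            continue;
        }
        pages[i].n_dead_entries = 0;
    }
    return blocked;
}
```

<para>
Because the heap entries are removed only after their index entries, a
page that stays untouched here also keeps the corresponding heap entries
in place until the reader has moved on.
</para>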
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
This solution requires that index scans be <quote>synchronous</quote>: we have
|
2005-02-13 04:04:15 +01:00
|
|
|
to fetch each heap tuple immediately after scanning the corresponding index
|
|
|
|
entry. This is expensive for a number of reasons. An
|
2017-10-09 03:44:17 +02:00
|
|
|
<quote>asynchronous</quote> scan in which we collect many TIDs from the index,
|
2005-02-13 04:04:15 +01:00
|
|
|
and only visit the heap tuples sometime later, requires much less index
|
2007-01-31 21:56:20 +01:00
|
|
|
locking overhead and can allow a more efficient heap access pattern.
|
2005-02-13 04:04:15 +01:00
|
|
|
Per the above analysis, we must use the synchronous approach for
|
2005-03-28 01:53:05 +02:00
|
|
|
non-MVCC-compliant snapshots, but an asynchronous scan is workable
|
|
|
|
for a query using an MVCC snapshot.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
In an <function>amgetbitmap</function> index scan, the access method does not
|
2008-04-11 00:25:26 +02:00
|
|
|
keep an index pin on any of the returned tuples. Therefore
|
2005-03-28 01:53:05 +02:00
|
|
|
it is only safe to use such scans with MVCC-compliant snapshots.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
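<para>
The rule relating scan style to snapshot type reduces to a one-line
predicate. The following is a minimal sketch; the type
<literal>ModelScanStyle</literal> and the function
<function>model_scan_style_is_safe</function> are hypothetical names used
only to restate the rule and do not exist in the server.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two scan styles discussed in the text. */
typedef enum ModelScanStyle
{
    MODEL_SCAN_SYNCHRONOUS,  /* fetch each heap tuple right after its index entry */
    MODEL_SCAN_ASYNCHRONOUS  /* collect many TIDs first, visit the heap later */
} ModelScanStyle;

/*
 * A synchronous scan keeps the index page pinned while it is "in flight",
 * so it is safe with any snapshot.  An asynchronous scan holds no such pin
 * and is therefore safe only with an MVCC-compliant snapshot.
 */
static bool
model_scan_style_is_safe(ModelScanStyle style, bool snapshot_is_mvcc)
{
    if (style == MODEL_SCAN_SYNCHRONOUS)
        return true;
    return snapshot_is_mvcc;
}
```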
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
|
2011-02-07 22:46:51 +01:00
|
|
|
index access method within a serializable transaction will acquire a
|
2013-04-19 05:35:19 +02:00
|
|
|
nonblocking predicate lock on the full index. This will generate a
|
2011-02-07 22:46:51 +01:00
|
|
|
read-write conflict with the insert of any tuple into that index by a
|
|
|
|
concurrent serializable transaction. If certain patterns of read-write
|
|
|
|
conflicts are detected among a set of concurrent serializable
|
2011-06-29 08:26:14 +02:00
|
|
|
transactions, one of those transactions may be canceled to protect data
|
2011-02-07 22:46:51 +01:00
|
|
|
integrity. When the flag is set, it indicates that the index access
|
|
|
|
method implements finer-grained predicate locking, which will tend to
|
|
|
|
reduce the frequency of such transaction cancellations.
|
|
|
|
</para>
|
|
|
|
|
2005-02-13 04:04:15 +01:00
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="index-unique-checks">
|
|
|
|
<title>Index Uniqueness Checks</title>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
<productname>PostgreSQL</productname> enforces SQL uniqueness constraints
|
2017-10-09 03:44:17 +02:00
|
|
|
using <firstterm>unique indexes</firstterm>, which are indexes that disallow
|
2005-02-13 04:04:15 +01:00
|
|
|
multiple entries with identical keys. An access method that supports this
|
2017-10-09 03:44:17 +02:00
|
|
|
feature sets <structfield>amcanunique</structfield> true.
|
2018-07-18 20:43:03 +02:00
|
|
|
(At present, only b-tree supports it.) Columns listed in the
|
|
|
|
<literal>INCLUDE</literal> clause are not considered when enforcing
|
|
|
|
uniqueness.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
Because of MVCC, it is always necessary to allow duplicate entries to
|
|
|
|
exist physically in an index: the entries might refer to successive
|
|
|
|
versions of a single logical row. The behavior we actually want to
|
|
|
|
enforce is that no MVCC snapshot could include two rows with equal
|
|
|
|
index keys. This breaks down into the following cases that must be
|
|
|
|
checked when inserting a new row into a unique index:
|
|
|
|
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
If a conflicting valid row has been deleted by the current transaction,
|
|
|
|
it's okay. (In particular, since an UPDATE always deletes the old row
|
|
|
|
version before inserting the new version, this will allow an UPDATE on
|
|
|
|
a row without changing the key.)
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
If a conflicting row has been inserted by an as-yet-uncommitted
|
|
|
|
transaction, the would-be inserter must wait to see if that transaction
|
|
|
|
commits. If it rolls back then there is no conflict. If it commits
|
|
|
|
without deleting the conflicting row again, there is a uniqueness
|
|
|
|
violation. (In practice we just wait for the other transaction to
|
|
|
|
end and then redo the visibility check in its entirety.)
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Similarly, if a conflicting valid row has been deleted by an
|
|
|
|
as-yet-uncommitted transaction, the would-be inserter must wait
|
|
|
|
for that transaction to commit or abort, and then repeat the test.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
</para>
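<para>
The cases above can be restated as a decision function over the state of
the transaction that inserted the conflicting row and the state of the
transaction (if any) that deleted it. This is a sketch under stated
assumptions: all type and function names are hypothetical, and the two
outcomes not spelled out in the list (an aborted inserter, a committed
deleter) are filled in as <quote>no conflict</quote> because such a row is
not visible to any later snapshot.
</para>

```c
#include <assert.h>

/* Hypothetical transaction states as seen by the would-be inserter. */
typedef enum ModelXactState
{
    MODEL_XACT_NONE,        /* no such transaction (row was never deleted) */
    MODEL_XACT_CURRENT,     /* the would-be inserter's own transaction */
    MODEL_XACT_IN_PROGRESS, /* some other, as-yet-uncommitted transaction */
    MODEL_XACT_COMMITTED,
    MODEL_XACT_ABORTED
} ModelXactState;

typedef enum ModelUniqueVerdict
{
    MODEL_NO_CONFLICT,
    MODEL_MUST_WAIT,   /* wait for the other transaction, then repeat the test */
    MODEL_VIOLATION
} ModelUniqueVerdict;

/*
 * Classify one existing row that has the same key as the row being inserted.
 * "inserter" is never MODEL_XACT_NONE for a row that exists in the heap.
 */
static ModelUniqueVerdict
model_check_conflicting_row(ModelXactState inserter, ModelXactState deleter)
{
    if (inserter == MODEL_XACT_ABORTED)
        return MODEL_NO_CONFLICT;   /* the row never became valid */
    if (inserter == MODEL_XACT_IN_PROGRESS)
        return MODEL_MUST_WAIT;     /* second case in the list */

    /* The row is valid; what matters now is whether it has been deleted. */
    switch (deleter)
    {
        case MODEL_XACT_CURRENT:    /* first case in the list */
        case MODEL_XACT_COMMITTED:  /* row is dead to everyone */
            return MODEL_NO_CONFLICT;
        case MODEL_XACT_IN_PROGRESS:
            return MODEL_MUST_WAIT; /* third case in the list */
        case MODEL_XACT_NONE:
        case MODEL_XACT_ABORTED:
            break;                  /* row is still live */
    }
    return MODEL_VIOLATION;
}
```

<para>
A caller that receives <literal>MODEL_MUST_WAIT</literal> would wait for the
other transaction to end and then call the function again with the updated
states.
</para>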
|
|
|
|
|
2006-08-25 06:06:58 +02:00
|
|
|
<para>
|
2009-07-29 22:56:21 +02:00
|
|
|
Furthermore, immediately before reporting a uniqueness violation
|
2006-08-25 06:06:58 +02:00
|
|
|
according to the above rules, the access method must recheck the
|
|
|
|
liveness of the row being inserted. If it is committed dead then
|
2009-07-29 22:56:21 +02:00
|
|
|
no violation should be reported. (This case cannot occur during the
|
2006-08-25 06:06:58 +02:00
|
|
|
ordinary scenario of inserting a row that's just been created by
|
|
|
|
the current transaction. It can happen during
|
2017-10-09 03:44:17 +02:00
|
|
|
<command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
|
2006-08-25 06:06:58 +02:00
|
|
|
</para>
|
|
|
|
|
2005-02-13 04:04:15 +01:00
|
|
|
<para>
|
|
|
|
We require the index access method to apply these tests itself, which
|
|
|
|
means that it must reach into the heap to check the commit status of
|
|
|
|
any row that is shown to have a duplicate key according to the index
|
|
|
|
contents. This is without a doubt ugly and non-modular, but it saves
|
|
|
|
redundant work: if we did a separate probe then the index lookup for
|
|
|
|
a conflicting row would be essentially repeated while finding the place to
|
|
|
|
insert the new row's index entry. What's more, there is no obvious way
|
|
|
|
to avoid race conditions unless the conflict check is an integral part
|
|
|
|
of insertion of the new index entry.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2009-07-29 22:56:21 +02:00
|
|
|
If the unique constraint is deferrable, there is additional complexity:
|
|
|
|
we need to be able to insert an index entry for a new row, but defer any
|
|
|
|
uniqueness-violation error until end of statement or even later. To
|
|
|
|
avoid unnecessary repeat searches of the index, the index access method
|
|
|
|
should do a preliminary uniqueness check during the initial insertion.
|
|
|
|
If this shows that there is definitely no conflicting live tuple, we
|
|
|
|
are done. Otherwise, we schedule a recheck to occur when it is time to
|
|
|
|
enforce the constraint. If, at the time of the recheck, both the inserted
|
|
|
|
tuple and some other tuple with the same key are live, then the error
|
2017-10-09 03:44:17 +02:00
|
|
|
must be reported. (Note that for this purpose, <quote>live</quote> actually
|
|
|
|
means <quote>any tuple in the index entry's HOT chain is live</quote>.)
|
|
|
|
To implement this, the <function>aminsert</function> function is passed a
|
|
|
|
<literal>checkUnique</literal> parameter having one of the following values:
|
2009-07-29 22:56:21 +02:00
|
|
|
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
|
2009-07-29 22:56:21 +02:00
|
|
|
should be done (this is not a unique index).
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
|
2009-07-29 22:56:21 +02:00
|
|
|
unique index, and the uniqueness check must be done immediately, as
|
|
|
|
described above.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
|
2009-07-29 22:56:21 +02:00
|
|
|
constraint is deferrable. <productname>PostgreSQL</productname>
|
|
|
|
will use this mode to insert each row's index entry. The access
|
|
|
|
method must allow duplicate entries into the index, and report any
|
2017-08-16 06:22:32 +02:00
|
|
|
potential duplicates by returning false from <function>aminsert</function>.
|
|
|
|
For each row for which false is returned, a deferred recheck will
|
2009-07-29 22:56:21 +02:00
|
|
|
be scheduled.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The access method must identify any rows which might violate the
|
|
|
|
unique constraint, but it is not an error for it to report false
|
|
|
|
positives. This allows the check to be done without waiting for other
|
|
|
|
transactions to finish; conflicts reported here are not treated as
|
|
|
|
errors and will be rechecked later, by which time they may no longer
|
|
|
|
be conflicts.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
|
2009-07-29 22:56:21 +02:00
|
|
|
recheck of a row that was reported as a potential uniqueness violation.
|
2017-10-09 03:44:17 +02:00
|
|
|
Although this is implemented by calling <function>aminsert</function>, the
|
|
|
|
access method must <emphasis>not</emphasis> insert a new index entry in this
|
2009-07-29 22:56:21 +02:00
|
|
|
case. The index entry is already present. Rather, the access method
|
|
|
|
must check to see if there is another live index entry. If so, and
|
|
|
|
if the target row is also still live, report an error.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
|
2009-07-29 22:56:21 +02:00
|
|
|
the access method further verify that the target row actually does
|
|
|
|
have an existing entry in the index, and report an error if not.  This
|
|
|
|
is a good idea because the index tuple values passed to
|
2017-10-09 03:44:17 +02:00
|
|
|
<function>aminsert</function> will have been recomputed. If the index
|
2009-07-29 22:56:21 +02:00
|
|
|
definition involves functions that are not really immutable, we
|
|
|
|
might be checking the wrong area of the index. Checking that the
|
|
|
|
target row is found in the recheck verifies that we are scanning
|
|
|
|
for the same tuple values as were used in the original insertion.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
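<para>
The four modes can be summarized by what an insertion routine does in each
of them. The following stand-alone sketch is not a real
<function>aminsert</function> implementation: the enum is redefined locally
so that the example compiles on its own, and
<literal>ModelInsertOutcome</literal> and
<function>model_aminsert</function> are hypothetical. Waiting for
in-progress transactions is deliberately left out; the two Boolean inputs
stand for the result of the liveness checks described above.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Local copy of the four modes named in the text. */
typedef enum ModelUniqueCheck
{
    UNIQUE_CHECK_NO,
    UNIQUE_CHECK_YES,
    UNIQUE_CHECK_PARTIAL,
    UNIQUE_CHECK_EXISTING
} ModelUniqueCheck;

/* Hypothetical summary of what one call did. */
typedef struct ModelInsertOutcome
{
    bool entry_inserted; /* a new index entry was added */
    bool error_reported; /* a uniqueness violation was raised */
    bool recheck_needed; /* PARTIAL mode only: potential duplicate reported */
} ModelInsertOutcome;

static ModelInsertOutcome
model_aminsert(ModelUniqueCheck checkUnique,
               bool other_live_entry, /* another live entry has the same key */
               bool target_row_live)  /* the row being inserted is still live */
{
    ModelInsertOutcome out = {false, false, false};
    bool violation = other_live_entry && target_row_live;

    switch (checkUnique)
    {
        case UNIQUE_CHECK_NO:
            out.entry_inserted = true;          /* no checking at all */
            break;
        case UNIQUE_CHECK_YES:
            out.error_reported = violation;     /* immediate enforcement */
            out.entry_inserted = !violation;
            break;
        case UNIQUE_CHECK_PARTIAL:
            out.entry_inserted = true;          /* duplicates are allowed in */
            out.recheck_needed = other_live_entry; /* false positives are fine */
            break;
        case UNIQUE_CHECK_EXISTING:
            out.error_reported = violation;     /* recheck only, never insert */
            break;
    }
    return out;
}
```

<para>
Note that <literal>UNIQUE_CHECK_EXISTING</literal> is the only mode in which
the sketch never adds an entry, matching the requirement that the deferred
recheck must not insert a second index entry.
</para>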
|
|
|
|
|
|
|
|
</sect1>
|
|
|
|
|
|
|
|
<sect1 id="index-cost-estimation">
|
|
|
|
<title>Index Cost Estimation Functions</title>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
The <function>amcostestimate</function> function is given information describing
|
2010-12-03 02:50:48 +01:00
|
|
|
a possible index scan, including lists of WHERE and ORDER BY clauses that
|
|
|
|
have been determined to be usable with the index. It must return estimates
|
2005-02-13 04:04:15 +01:00
|
|
|
of the cost of accessing the index and the selectivity of the WHERE
|
|
|
|
clauses (that is, the fraction of parent-table rows that will be
|
|
|
|
retrieved during the index scan). For simple cases, nearly all the
|
|
|
|
work of the cost estimator can be done by calling standard routines
|
2017-10-09 03:44:17 +02:00
|
|
|
in the optimizer; the point of having an <function>amcostestimate</function> function is
|
2005-02-13 04:04:15 +01:00
|
|
|
to allow index access methods to provide index-type-specific knowledge,
|
|
|
|
in case it is possible to improve on the standard estimates.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
Each <function>amcostestimate</function> function must have the signature:
|
2005-02-13 04:04:15 +01:00
|
|
|
|
|
|
|
<programlisting>
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
amcostestimate (PlannerInfo *root,
|
2011-12-25 01:03:21 +01:00
|
|
|
IndexPath *path,
|
2012-01-28 01:26:38 +01:00
|
|
|
double loop_count,
|
2005-02-13 04:04:15 +01:00
|
|
|
Cost *indexStartupCost,
|
|
|
|
Cost *indexTotalCost,
|
|
|
|
Selectivity *indexSelectivity,
|
2018-08-10 13:14:36 +02:00
|
|
|
double *indexCorrelation,
|
|
|
|
double *indexPages);
|
2005-02-13 04:04:15 +01:00
|
|
|
</programlisting>
|
|
|
|
|
2011-12-25 01:03:21 +01:00
|
|
|
The first three parameters are inputs:
|
2005-02-13 04:04:15 +01:00
|
|
|
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>root</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
2005-06-06 00:32:58 +02:00
|
|
|
The planner's information about the query being processed.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>path</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
2011-12-25 01:03:21 +01:00
|
|
|
The index access path being considered. All fields except cost and
|
|
|
|
selectivity values are valid.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2006-06-06 19:59:58 +02:00
|
|
|
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>loop_count</parameter></term>
|
2006-06-06 19:59:58 +02:00
|
|
|
<listitem>
|
|
|
|
<para>
|
2012-01-28 01:26:38 +01:00
|
|
|
The number of repetitions of the index scan that should be factored
|
|
|
|
into the cost estimates. This will typically be greater than one when
|
|
|
|
considering a parameterized scan for use in the inside of a nestloop
|
|
|
|
join. Note that the cost estimates should still be for just one scan;
|
2017-10-09 03:44:17 +02:00
|
|
|
a larger <parameter>loop_count</parameter> means that it may be appropriate
|
2012-01-28 01:26:38 +01:00
|
|
|
to allow for some caching effects across multiple scans.
|
2006-06-06 19:59:58 +02:00
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2005-02-13 04:04:15 +01:00
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2018-08-10 13:14:36 +02:00
|
|
|
The last five parameters are pass-by-reference outputs:
|
2005-02-13 04:04:15 +01:00
|
|
|
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>*indexStartupCost</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Set to the cost of index start-up processing.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>*indexTotalCost</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Set to the total cost of index processing.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>*indexSelectivity</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Set to the index selectivity.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
|
|
|
|
|
|
|
<varlistentry>
|
2017-10-09 03:44:17 +02:00
|
|
|
<term><parameter>*indexCorrelation</parameter></term>
|
2005-02-13 04:04:15 +01:00
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Set to the correlation coefficient between index scan order and
|
|
|
|
underlying table's order
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2018-08-10 13:14:36 +02:00
|
|
|
|
|
|
|
<varlistentry>
|
|
|
|
<term><parameter>*indexPages</parameter></term>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Set to the number of index leaf pages.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</varlistentry>
|
2005-02-13 04:04:15 +01:00
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
Note that cost estimate functions must be written in C, not in SQL or
|
|
|
|
any available procedural language, because they must access internal
|
|
|
|
data structures of the planner/optimizer.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2006-06-05 04:49:58 +02:00
|
|
|
The index access costs should be computed using the parameters used by
|
2005-03-28 01:53:05 +02:00
|
|
|
<filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
|
2017-10-09 03:44:17 +02:00
|
|
|
disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
|
|
|
|
has cost <varname>random_page_cost</varname>, and the cost of processing one index
|
|
|
|
row should usually be taken as <varname>cpu_index_tuple_cost</varname>. In
|
|
|
|
addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
|
2006-06-05 04:49:58 +02:00
|
|
|
be charged for any comparison operators invoked during index processing
|
2011-12-25 01:03:21 +01:00
|
|
|
(especially evaluation of the indexquals themselves).
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
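<para>
As a worked illustration of how these parameters combine, the following
stand-alone sketch charges page fetches and per-row CPU work separately.
<literal>ModelCostParams</literal> and
<function>model_index_access_cost</function> are hypothetical names; a real
estimator reads the planner's variables directly. The values in
<literal>model_default_params</literal> are the documented default settings
of the corresponding configuration parameters.
</para>

```c
#include <assert.h>

/* Hypothetical container for the planner cost parameters named above. */
typedef struct ModelCostParams
{
    double seq_page_cost;
    double random_page_cost;
    double cpu_index_tuple_cost;
    double cpu_operator_cost;
} ModelCostParams;

/* Documented default settings of the four parameters. */
static const ModelCostParams model_default_params = {1.0, 4.0, 0.005, 0.0025};

/*
 * Cost of scanning the index itself: sequential and nonsequential page
 * fetches, plus per-row processing and the comparison operators run on
 * each index row.  Parent-table costs are deliberately not included.
 */
static double
model_index_access_cost(const ModelCostParams *p,
                        double seq_pages,
                        double random_pages,
                        double index_tuples,
                        double operators_per_tuple)
{
    double io_cost = seq_pages * p->seq_page_cost +
                     random_pages * p->random_page_cost;
    double cpu_cost = index_tuples *
        (p->cpu_index_tuple_cost + operators_per_tuple * p->cpu_operator_cost);

    return io_cost + cpu_cost;
}
```

<para>
With the defaults, 10 sequential page fetches, 1 nonsequential fetch and
1000 index rows with one operator each come to
10 * 1.0 + 1 * 4.0 + 1000 * (0.005 + 0.0025) = 21.5 cost units.
</para>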
|
|
|
|
|
|
|
|
<para>
|
|
|
|
The access costs should include all disk and CPU costs associated with
|
2017-10-09 03:44:17 +02:00
|
|
|
scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
|
2005-03-28 01:53:05 +02:00
|
|
|
processing the parent-table rows that are identified by the index.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2006-06-05 04:49:58 +02:00
|
|
|
The <quote>start-up cost</quote> is the part of the total scan cost that
|
|
|
|
must be expended before we can begin to fetch the first row. For most
|
|
|
|
indexes this can be taken as zero, but an index type with a high start-up
|
|
|
|
cost might want to set it nonzero.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
|
2005-02-13 04:04:15 +01:00
|
|
|
table rows that will be retrieved during the index scan. In the case
|
2008-04-14 19:05:34 +02:00
|
|
|
of a lossy query, this will typically be higher than the fraction of
|
2005-02-13 04:04:15 +01:00
|
|
|
rows that actually pass the given qual conditions.
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
|
2005-02-13 04:04:15 +01:00
|
|
|
-1.0 and 1.0) between the index order and the table order. This is used
|
|
|
|
to adjust the estimate for the cost of fetching rows from the parent
|
|
|
|
table.
|
|
|
|
</para>
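<para>
To show why the correlation matters, the sketch below interpolates between
a best-case and a worst-case heap I/O cost using the square of the
correlation. To the best of our understanding this mirrors the
interpolation done in <filename>costsize.c</filename>, but the function
<function>model_heap_io_cost</function> and its parameter names are
hypothetical and the sketch should be read as an illustration, not as the
planner's exact code.
</para>

```c
#include <assert.h>

/*
 * Interpolate the cost of fetching parent-table rows.
 *   min_io_cost: cost if heap rows are visited in perfect physical order
 *   max_io_cost: cost if heap rows are visited in random order
 * A correlation of +1.0 or -1.0 yields min_io_cost; 0.0 yields max_io_cost.
 */
static double
model_heap_io_cost(double index_correlation,
                   double min_io_cost,
                   double max_io_cost)
{
    double csquared = index_correlation * index_correlation;

    return max_io_cost + csquared * (min_io_cost - max_io_cost);
}
```

<para>
For example, with a best case of 10 and a worst case of 110, a correlation
of 0.5 gives 110 + 0.25 * (10 - 110) = 85.
</para>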
|
|
|
|
|
2018-08-10 13:14:36 +02:00
|
|
|
<para>
|
|
|
|
The <parameter>indexPages</parameter> should be set to the number of leaf pages.
|
|
|
|
This is used to estimate the number of workers for a parallel index scan.
|
|
|
|
</para>
|
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
<para>
|
2017-10-09 03:44:17 +02:00
|
|
|
When <parameter>loop_count</parameter> is greater than one, the returned numbers
|
2012-01-28 01:26:38 +01:00
|
|
|
should be averages expected for any one scan of the index.
|
2006-06-06 19:59:58 +02:00
|
|
|
</para>
|
|
|
|
|
2005-02-13 04:04:15 +01:00
|
|
|
<procedure>
|
|
|
|
<title>Cost Estimation</title>
|
|
|
|
<para>
|
|
|
|
A typical cost estimator will proceed as follows:
|
|
|
|
</para>
|
|
|
|
|
|
|
|
<step>
|
|
|
|
<para>
|
|
|
|
Estimate and return the fraction of parent-table rows that will be visited
|
|
|
|
based on the given qual conditions. In the absence of any index-type-specific
|
|
|
|
knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
|
|
|
|
|
|
|
|
<programlisting>
|
2011-12-25 01:03:21 +01:00
|
|
|
*indexSelectivity = clauselist_selectivity(root, path->indexquals,
|
|
|
|
path->indexinfo->rel->relid,
|
2008-08-14 20:48:00 +02:00
|
|
|
JOIN_INNER, NULL);
|
2005-02-13 04:04:15 +01:00
|
|
|
</programlisting>
|
|
|
|
</para>
|
|
|
|
</step>
|
|
|
|
|
|
|
|
<step>
|
|
|
|
<para>
|
|
|
|
Estimate the number of index rows that will be visited during the
|
2017-10-09 03:44:17 +02:00
|
|
|
scan. For many index types this is the same as <parameter>indexSelectivity</parameter> times
|
2005-02-13 04:04:15 +01:00
|
|
|
the number of rows in the index, but it might be more. (Note that the
|
2011-12-25 01:03:21 +01:00
|
|
|
index's size in pages and rows is available from the
|
2017-10-09 03:44:17 +02:00
|
|
|
<literal>path->indexinfo</literal> struct.)
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
</step>
|
|
|
|
|
|
|
|
<step>
|
|
|
|
<para>
|
|
|
|
Estimate the number of index pages that will be retrieved during the scan.
|
2017-10-09 03:44:17 +02:00
|
|
|
This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
</step>
|
|
|
|
|
|
|
|
<step>
|
|
|
|
<para>
|
|
|
|
Compute the index access cost. A generic estimator might do this:
|
|
|
|
|
|
|
|
<programlisting>
|
2010-07-29 21:34:41 +02:00
|
|
|
/*
|
|
|
|
* Our generic assumption is that the index pages will be read
|
|
|
|
* sequentially, so they cost seq_page_cost each, not random_page_cost.
|
|
|
|
* Also, we charge for evaluation of the indexquals at each index row.
|
|
|
|
* All the costs are assumed to be paid incrementally during the scan.
|
|
|
|
*/
|
2011-12-25 01:03:21 +01:00
|
|
|
cost_qual_eval(&index_qual_cost, path->indexquals, root);
|
2010-07-29 21:34:41 +02:00
|
|
|
*indexStartupCost = index_qual_cost.startup;
|
|
|
|
*indexTotalCost = seq_page_cost * numIndexPages +
|
|
|
|
(cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
|
2005-02-13 04:04:15 +01:00
|
|
|
</programlisting>
|
2006-06-06 19:59:58 +02:00
|
|
|
|
|
|
|
However, the above does not account for amortization of index reads
|
2012-01-28 01:26:38 +01:00
|
|
|
across repeated index scans (that is, when <parameter>loop_count</parameter> is greater than one).
|
2005-02-13 04:04:15 +01:00
|
|
|
</para>
|
|
|
|
</step>
|
|
|
|
|
|
|
|
<step>
|
|
|
|
<para>
|
|
|
|
Estimate the index correlation. For a simple ordered index on a single
|
|
|
|
field, this can be retrieved from <structname>pg_statistic</structname>. If the correlation
|
|
|
|
is not known, the conservative estimate is zero (no correlation).
|
|
|
|
</para>
|
|
|
|
</step>
|
|
|
|
</procedure>
|
|
|
|
|
|
|
|
<para>
|
|
|
|
Examples of cost estimator functions can be found in
|
|
|
|
<filename>src/backend/utils/adt/selfuncs.c</filename>.
|
|
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|