Index Access Method Interface Definition

This chapter defines the interface between the core PostgreSQL system and index access methods, which manage individual index types. The core system knows nothing about indexes beyond what is specified here, so it is possible to develop entirely new index types by writing add-on code.

All indexes in PostgreSQL are what are known technically as secondary indexes; that is, the index is physically separate from the table file that it describes. Each index is stored as its own physical relation and so is described by an entry in the pg_class catalog. The contents of an index are entirely under the control of its index access method. In practice, all index access methods divide indexes into standard-size pages so that they can use the regular storage manager and buffer manager to access the index contents. (All the existing index access methods furthermore use the standard page layout described in , and they all use the same format for index tuple headers; but these decisions are not forced on an access method.)

An index is effectively a mapping from some data key values to tuple identifiers, or TIDs, of row versions (tuples) in the index's parent table. A TID consists of a block number and an item number within that block (see ). This is sufficient information to fetch a particular row version from the table. Indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical row; to an index, each tuple is an independent object that needs its own index entry. Thus, an update of a row always creates all-new index entries for the row, even if the key values did not change. Index entries for dead tuples are reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.

Catalog Entries for Indexes

Each index access method is described by a row in the pg_am system catalog (see ).
The principal contents of a pg_am row are references to pg_proc entries that identify the index access functions supplied by the access method. The APIs for these functions are defined later in this chapter. In addition, the pg_am row specifies a few fixed properties of the access method, such as whether it can support multicolumn indexes. There is not currently any special support for creating or deleting pg_am entries; anyone able to write a new access method is expected to be competent to insert an appropriate row for themselves.

To be useful, an index access method must also have one or more operator families and operator classes defined in pg_opfamily, pg_opclass, pg_amop, and pg_amproc. These entries allow the planner to determine what kinds of query qualifications can be used with indexes of this access method. Operator families and classes are described in , which is prerequisite material for reading this chapter.

An individual index is defined by a pg_class entry that describes it as a physical relation, plus a pg_index entry that shows the logical content of the index, that is, the set of index columns it has and the semantics of those columns, as captured by the associated operator classes. The index columns (key values) can be either simple columns of the underlying table or expressions over the table rows. The index access method normally has no interest in where the index key values come from (it is always handed precomputed key values) but it will be very interested in the operator class information in pg_index. Both of these catalog entries can be accessed as part of the Relation data structure that is passed to all operations on the index.

Some of the flag columns of pg_am have nonobvious implications. The requirements of amcanunique are discussed in Index Uniqueness Checks.
The amcanmulticol flag asserts that the access method supports multicolumn indexes, while amoptionalkey asserts that it allows scans where no indexable restriction clause is given for the first index column. When amcanmulticol is false, amoptionalkey essentially says whether the access method allows full-index scans without any restriction clause. Access methods that support multiple index columns must support scans that omit restrictions on any or all of the columns after the first; however they are permitted to require some restriction to appear for the first index column, and this is signaled by setting amoptionalkey false.

amindexnulls asserts that index entries are created for NULL key values. Since most indexable operators are strict and hence cannot return TRUE for NULL inputs, it is at first sight attractive to not store index entries for null values: they could never be returned by an index scan anyway. However, this argument fails when an index scan has no restriction clause for a given index column. In practice this means that indexes that have amoptionalkey true must index nulls, since the planner might decide to use such an index with no scan keys at all. A related restriction is that an index access method that supports multiple index columns must support indexing null values in columns after the first, because the planner will assume the index can be used for queries that do not restrict these columns. For example, consider an index on (a,b) and a query with WHERE a = 4. The system will assume the index can be used to scan for rows with a = 4, which is wrong if the index omits rows where b is null. It is, however, OK to omit rows where the first indexed column is null. Thus, amindexnulls should be set true only if the index access method indexes all rows, including arbitrary combinations of null values. An index access method that sets amindexnulls may also set amsearchnulls, indicating that it supports IS NULL clauses as search conditions.
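The (a,b) example above can be made concrete with a small stand-alone model. This is only an illustrative sketch: ToyRow, toy_index_stores and toy_scan_a_equals are hypothetical names that do not exist in PostgreSQL, and the "index" is simply a filter over an in-memory array. It shows that a scan for a = 4 loses a row when the index declines to store entries whose second column is null.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy row: two integer columns, each of which may be NULL. */
typedef struct ToyRow
{
    int  a;
    bool a_isnull;
    int  b;
    bool b_isnull;
} ToyRow;

/* Decide whether a toy (a,b) index stores an entry for this row. */
static bool
toy_index_stores(const ToyRow *row, bool index_trailing_nulls)
{
    if (row->a_isnull)
        return false;   /* omitting a null first column is permitted */
    if (row->b_isnull && !index_trailing_nulls)
        return false;   /* the mistake this sketch illustrates */
    return true;
}

/*
 * Count the rows a scan restricted only by "a = key" would return,
 * considering only rows that the toy index actually stores.
 */
static size_t
toy_scan_a_equals(const ToyRow *rows, size_t nrows, int key,
                  bool index_trailing_nulls)
{
    size_t hits = 0;
    size_t i;

    assert(rows != NULL || nrows == 0);
    for (i = 0; i < nrows; i++)
    {
        if (toy_index_stores(&rows[i], index_trailing_nulls) &&
            rows[i].a == key)
            hits++;
    }
    return hits;
}
```

With two rows having a = 4, one of which has a null b, the scan finds both rows only when trailing nulls are indexed; otherwise the planner's assumption that the index covers every row with a = 4 is violated.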
Index Access Method Functions

The index construction and maintenance functions that an index access method must provide are:

    IndexBuildResult *
    ambuild (Relation heapRelation,
             Relation indexRelation,
             IndexInfo *indexInfo);

Build a new index. The index relation has been physically created, but is empty. It must be filled in with whatever fixed data the access method requires, plus entries for all tuples already existing in the table. Ordinarily the ambuild function will call IndexBuildHeapScan() to scan the table for existing tuples and compute the keys that need to be inserted into the index. The function must return a palloc'd struct containing statistics about the new index.

    bool
    aminsert (Relation indexRelation,
              Datum *values,
              bool *isnull,
              ItemPointer heap_tid,
              Relation heapRelation,
              bool check_uniqueness);

Insert a new tuple into an existing index. The values and isnull arrays give the key values to be indexed, and heap_tid is the TID to be indexed. If the access method supports unique indexes (its pg_am.amcanunique flag is true) then check_uniqueness might be true, in which case the access method must verify that there is no conflicting row; this is the only situation in which the access method normally needs the heapRelation parameter. See Index Uniqueness Checks for details. The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE result does not denote an error condition, but is used for cases such as an index AM refusing to index a NULL.)

    IndexBulkDeleteResult *
    ambulkdelete (IndexVacuumInfo *info,
                  IndexBulkDeleteResult *stats,
                  IndexBulkDeleteCallback callback,
                  void *callback_state);

Delete tuple(s) from the index. This is a bulk delete operation that is intended to be implemented by scanning the whole index and checking each entry to see if it should be deleted.
The passed-in callback function must be called, in the style callback(TID, callback_state) returns bool, to determine whether any particular index entry, as identified by its referenced TID, is to be deleted. The function must return either NULL or a palloc'd struct containing statistics about the effects of the deletion operation. It is OK to return NULL if no information needs to be passed on to amvacuumcleanup.

Because of limited maintenance_work_mem, ambulkdelete might need to be called more than once when many tuples are to be deleted. The stats argument is the result of the previous call for this index (it is NULL for the first call within a VACUUM operation). This allows the AM to accumulate statistics across the whole operation. Typically, ambulkdelete will modify and return the same struct if the passed stats is not null.

    IndexBulkDeleteResult *
    amvacuumcleanup (IndexVacuumInfo *info,
                     IndexBulkDeleteResult *stats);

Clean up after a VACUUM operation (zero or more ambulkdelete calls). This does not have to do anything beyond returning index statistics, but it might perform bulk cleanup such as reclaiming empty index pages. stats is whatever the last ambulkdelete call returned, or NULL if ambulkdelete was not called because no tuples needed to be deleted. If the result is not NULL it must be a palloc'd struct. The statistics it contains will be used to update pg_class, and will be reported by VACUUM if VERBOSE is given. It is OK to return NULL if the index was not changed at all during the VACUUM operation, but otherwise correct stats should be returned.

    void
    amcostestimate (PlannerInfo *root,
                    IndexOptInfo *index,
                    List *indexQuals,
                    RelOptInfo *outer_rel,
                    Cost *indexStartupCost,
                    Cost *indexTotalCost,
                    Selectivity *indexSelectivity,
                    double *indexCorrelation);

Estimate the costs of an index scan. This function is described fully in Index Cost Estimation Functions, below.

    bytea *
    amoptions (ArrayType *reloptions,
               bool validate);

Parse and validate the reloptions array for an index.
This is called only when a non-null reloptions array exists for the index. reloptions is a text array containing entries of the form name=value. The function should construct a bytea value, which will be copied into the rd_options field of the index's relcache entry. The data contents of the bytea value are open for the access method to define, but the standard access methods currently all use struct StdRdOptions. When validate is true, the function should report a suitable error message if any of the options are unrecognized or have invalid values; when validate is false, invalid entries should be silently ignored. (validate is false when loading options already stored in pg_catalog; an invalid entry could only be found if the access method has changed its rules for options, and in that case ignoring obsolete entries is appropriate.) It is OK to return NULL if default behavior is wanted.

The purpose of an index, of course, is to support scans for tuples matching an indexable WHERE condition, often called a qualifier or scan key. The semantics of index scanning are described more fully in Index Scanning, below. The scan-related functions that an index access method must provide are:

    IndexScanDesc
    ambeginscan (Relation indexRelation,
                 int nkeys,
                 ScanKey key);

Begin a new scan. The key array (of length nkeys) describes the scan key(s) for the index scan. The result must be a palloc'd struct. For implementation reasons the index access method must create this struct by calling RelationGetIndexScan(). In most cases ambeginscan itself does little beyond making that call; the interesting parts of index-scan startup are in amrescan.

    boolean
    amgettuple (IndexScanDesc scan,
                ScanDirection direction);

Fetch the next tuple in the given scan, moving in the given direction (forward or backward in the index). Returns TRUE if a tuple was obtained, FALSE if no matching tuples remain. In the TRUE case the tuple TID is stored into the scan structure.
Note that success means only that the index contains an entry that matches the scan keys, not that the tuple necessarily still exists in the heap or will pass the caller's snapshot test.

    boolean
    amgetmulti (IndexScanDesc scan,
                ItemPointer tids,
                int32 max_tids,
                int32 *returned_tids);

Fetch multiple tuples in the given scan. Returns TRUE if the scan should continue, FALSE if no matching tuples remain. tids points to a caller-supplied array of max_tids ItemPointerData records, which the call fills with TIDs of matching tuples. *returned_tids is set to the number of TIDs actually returned. This can be less than max_tids, or even zero, even when the return value is TRUE. (This provision allows the access method to choose the most efficient stopping points in its scan, for example index page boundaries.) amgetmulti and amgettuple cannot be used in the same index scan; there are other restrictions too when using amgetmulti, as explained in Index Scanning.

    void
    amrescan (IndexScanDesc scan,
              ScanKey key);

Restart the given scan, possibly with new scan keys (to continue using the old keys, NULL is passed for key). Note that it is not possible for the number of keys to be changed. In practice the restart feature is used when a new outer tuple is selected by a nested-loop join and so a new key comparison value is needed, but the scan key structure remains the same. This function is also called by RelationGetIndexScan(), so it is used for initial setup of an index scan as well as rescanning.

    void
    amendscan (IndexScanDesc scan);

End a scan and release resources. The scan struct itself should not be freed, but any locks or pins taken internally by the access method must be released.

    void
    ammarkpos (IndexScanDesc scan);

Mark current scan position. The access method need only support one remembered scan position per scan.

    void
    amrestrpos (IndexScanDesc scan);

Restore the scan to the most recently marked position.
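The calling convention described for amgetmulti (a batch may be short or empty even when more results remain, and the final batch may arrive together with a FALSE result) is easy to get wrong in a caller. The following stand-alone sketch models it with hypothetical types: ToyTid, ToyScan, toy_getmulti and toy_collect_all are illustrative names, not PostgreSQL APIs, and the toy access method simply stops each batch at an assumed index page boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a TID: block number plus item number. */
typedef struct ToyTid
{
    uint32_t block;
    uint16_t offset;
} ToyTid;

/* Hypothetical scan state: the matching TIDs and a cursor into them. */
typedef struct ToyScan
{
    const ToyTid *matches;   /* all matching entries, in index order */
    int           nmatches;
    int           next;      /* next entry to hand out */
    int           page_size; /* entries per (toy) index page */
} ToyScan;

/*
 * Model of the amgetmulti contract: fill at most max_tids entries, stop
 * early at a page boundary, and return false once nothing remains.
 */
static bool
toy_getmulti(ToyScan *scan, ToyTid *tids, int32_t max_tids,
             int32_t *returned_tids)
{
    int32_t n = 0;

    assert(max_tids > 0 && scan->page_size > 0);
    while (scan->next < scan->nmatches && n < max_tids)
    {
        tids[n++] = scan->matches[scan->next++];
        if (scan->next % scan->page_size == 0)
            break;              /* convenient stopping point */
    }
    *returned_tids = n;
    return scan->next < scan->nmatches;
}

/* Caller loop: consume every batch, including the one returned with false. */
static int
toy_collect_all(ToyScan *scan, ToyTid *out, int out_cap)
{
    ToyTid batch[4];
    int    total = 0;

    for (;;)
    {
        int32_t ntids = 0;
        int32_t i;
        bool    more = toy_getmulti(scan, batch, 4, &ntids);

        for (i = 0; i < ntids && total < out_cap; i++)
            out[total++] = batch[i];
        if (!more)
            break;
    }
    return total;
}
```

The caller copies out whatever was returned before testing the result flag, so a final partial batch is not lost and an empty batch accompanied by TRUE merely causes another call.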
By convention, the pg_proc entry for an index access method function should show the correct number of arguments, but declare them all as type internal (since most of the arguments have types that are not known to SQL, and we don't want users calling the functions directly anyway). The return type is declared as void, internal, or boolean as appropriate. The only exception is amoptions, which should be correctly declared as taking text[] and bool and returning bytea. This provision allows client code to execute amoptions to test validity of options settings.

Index Scanning

In an index scan, the index access method is responsible for regurgitating the TIDs of all the tuples it has been told about that match the scan keys. The access method is not involved in actually fetching those tuples from the index's parent table, nor in determining whether they pass the scan's time qualification test or other conditions.

A scan key is the internal representation of a WHERE clause of the form index_key operator constant, where the index key is one of the columns of the index and the operator is one of the members of the operator family associated with that index column. An index scan has zero or more scan keys, which are implicitly ANDed: the returned tuples are expected to satisfy all the indicated conditions.

The operator family can indicate that the index is lossy for a particular operator; this implies that the index scan will return all the entries that pass the scan key, plus possibly additional entries that do not. The core system's index-scan machinery will then apply that operator again to the heap tuple to verify whether or not it really should be selected. For non-lossy operators, the index scan must return exactly the set of matching entries, as there is no recheck.

Note that it is entirely up to the access method to ensure that it correctly finds all and only the entries passing all the given scan keys.
Also, the core system will simply hand off all the WHERE clauses that match the index keys and operator families, without any semantic analysis to determine whether they are redundant or contradictory. As an example, given WHERE x > 4 AND x > 14 where x is a b-tree indexed column, it is left to the b-tree amrescan function to realize that the first scan key is redundant and can be discarded. The extent of preprocessing needed during amrescan will depend on the extent to which the index access method needs to reduce the scan keys to a normalized form.

Some access methods return index entries in a well-defined order, others do not. If entries are returned in sorted order, the access method should set pg_am.amcanorder true to indicate that it supports ordered scans. All such access methods must use btree-compatible strategy numbers for their equality and ordering operators.

The amgettuple function has a direction argument, which can be either ForwardScanDirection (the normal case) or BackwardScanDirection. If the first call after amrescan specifies BackwardScanDirection, then the set of matching index entries is to be scanned back-to-front rather than in the normal front-to-back direction, so amgettuple must return the last matching tuple in the index, rather than the first one as it normally would. (This will only occur for access methods that advertise they support ordered scans.) After the first call, amgettuple must be prepared to advance the scan in either direction from the most recently returned entry.

The access method must support marking a position in a scan and later returning to the marked position. The same position might be restored multiple times. However, only one position need be remembered per scan; a new ammarkpos call overrides the previously marked position.

Both the scan position and the mark position (if any) must be maintained consistently in the face of concurrent insertions or deletions in the index.
It is OK if a freshly-inserted entry is not returned by a scan that would have found the entry if it had existed when the scan started, or for the scan to return such an entry upon rescanning or backing up even though it had not been returned the first time through. Similarly, a concurrent delete might or might not be reflected in the results of a scan. What is important is that insertions or deletions not cause the scan to miss or multiply return entries that were not themselves being inserted or deleted.

Instead of using amgettuple, an index scan can be done with amgetmulti to fetch multiple tuples per call. This can be noticeably more efficient than amgettuple because it allows avoiding lock/unlock cycles within the access method. In principle amgetmulti should have the same effects as repeated amgettuple calls, but we impose several restrictions to simplify matters. In the first place, amgetmulti does not take a direction argument, and therefore it does not support backwards scan nor intrascan reversal of direction. The access method need not support marking or restoring scan positions during an amgetmulti scan, either. (These restrictions cost little since it would be difficult to use these features in an amgetmulti scan anyway: adjusting the caller's buffered list of TIDs would be complex.) Finally, amgetmulti does not guarantee any locking of the returned tuples, with implications spelled out in Index Locking Considerations.

Index Locking Considerations

Index access methods must handle concurrent updates of the index by multiple processes. The core PostgreSQL system obtains AccessShareLock on the index during an index scan, and RowExclusiveLock when updating the index (including plain VACUUM). Since these lock types do not conflict, the access method is responsible for handling any fine-grained locking it might need. An exclusive lock on the index as a whole will be taken only during index creation, destruction, REINDEX, or VACUUM FULL.
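The lock interaction just described can be pictured as a small conflict table. The sketch below is a deliberately reduced model with only three modes; the enum, the table and toy_locks_conflict are hypothetical names, and the third mode is an assumed stand-in for whatever whole-index exclusive lock the core system takes. It is not the lock manager's real data structure.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical set of index-level lock modes. */
typedef enum ToyLockMode
{
    TOY_ACCESS_SHARE = 0,          /* taken for index scans */
    TOY_ROW_EXCLUSIVE = 1,         /* taken for index updates, plain VACUUM */
    TOY_WHOLE_INDEX_EXCLUSIVE = 2  /* assumed: creation, destruction, REINDEX */
} ToyLockMode;

/* For each held mode, a bit mask of the requested modes it conflicts with. */
static const unsigned toy_conflicts[3] = {
    1u << TOY_WHOLE_INDEX_EXCLUSIVE,
    1u << TOY_WHOLE_INDEX_EXCLUSIVE,
    (1u << TOY_ACCESS_SHARE) | (1u << TOY_ROW_EXCLUSIVE) |
        (1u << TOY_WHOLE_INDEX_EXCLUSIVE)
};

/* Return true if a request must wait behind an already-held lock. */
static bool
toy_locks_conflict(ToyLockMode held, ToyLockMode requested)
{
    assert((int) held >= 0 && (int) held <= 2);
    assert((int) requested >= 0 && (int) requested <= 2);
    return (toy_conflicts[held] & (1u << requested)) != 0;
}
```

Because scans and updates never block one another at this level, any protection a reader needs against a concurrent writer on the same index page has to come from the access method's own finer-grained locking.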
Building an index type that supports concurrent updates usually requires extensive and subtle analysis of the required behavior. For the b-tree and hash index types, you can read about the design decisions involved in src/backend/access/nbtree/README and src/backend/access/hash/README.

Aside from the index's own internal consistency requirements, concurrent updates create issues about consistency between the parent table (the heap) and the index. Because PostgreSQL separates accesses and updates of the heap from those of the index, there are windows in which the index might be inconsistent with the heap. We handle this problem with the following rules:

1. A new heap entry is made before making its index entries. (Therefore a concurrent index scan is likely to fail to see the heap entry. This is okay because the index reader would be uninterested in an uncommitted row anyway. But see .)

2. When a heap entry is to be deleted (by VACUUM), all its index entries must be removed first.

3. An index scan must maintain a pin on the index page holding the item last returned by amgettuple, and ambulkdelete cannot delete entries from pages that are pinned by other backends. The need for this rule is explained below.

Without the third rule, it is possible for an index reader to see an index entry just before it is removed by VACUUM, and then to arrive at the corresponding heap entry after that was removed by VACUUM. This creates no serious problems if that item number is still unused when the reader reaches it, since an empty item slot will be ignored by heap_fetch(). But what if a third backend has already re-used the item slot for something else? When using an MVCC-compliant snapshot, there is no problem because the new occupant of the slot is certain to be too new to pass the snapshot test. However, with a non-MVCC-compliant snapshot (such as SnapshotNow), it would be possible to accept and return a row that does not in fact match the scan keys.
We could defend against this scenario by requiring the scan keys to be rechecked against the heap row in all cases, but that is too expensive. Instead, we use a pin on an index page as a proxy to indicate that the reader might still be in flight from the index entry to the matching heap entry. Making ambulkdelete block on such a pin ensures that VACUUM cannot delete the heap entry before the reader is done with it. This solution costs little in run time, and adds blocking overhead only in the rare cases where there actually is a conflict.

This solution requires that index scans be synchronous: we have to fetch each heap tuple immediately after scanning the corresponding index entry. This is expensive for a number of reasons. An asynchronous scan in which we collect many TIDs from the index, and only visit the heap tuples sometime later, requires much less index locking overhead and can allow a more efficient heap access pattern. Per the above analysis, we must use the synchronous approach for non-MVCC-compliant snapshots, but an asynchronous scan is workable for a query using an MVCC snapshot.

In an amgetmulti index scan, the access method need not guarantee to keep an index pin on any of the returned tuples. (It would be impractical to pin more than the last one anyway.) Therefore it is only safe to use such scans with MVCC-compliant snapshots.

Index Uniqueness Checks

PostgreSQL enforces SQL uniqueness constraints using unique indexes, which are indexes that disallow multiple entries with identical keys. An access method that supports this feature sets pg_am.amcanunique true. (At present, only b-tree supports it.)

Because of MVCC, it is always necessary to allow duplicate entries to exist physically in an index: the entries might refer to successive versions of a single logical row. The behavior we actually want to enforce is that no MVCC snapshot could include two rows with equal index keys.
This breaks down into the following cases that must be checked when inserting a new row into a unique index:

- If a conflicting valid row has been deleted by the current transaction, it's okay. (In particular, since an UPDATE always deletes the old row version before inserting the new version, this will allow an UPDATE on a row without changing the key.)

- If a conflicting row has been inserted by an as-yet-uncommitted transaction, the would-be inserter must wait to see if that transaction commits. If it rolls back then there is no conflict. If it commits without deleting the conflicting row again, there is a uniqueness violation. (In practice we just wait for the other transaction to end and then redo the visibility check in toto.)

- Similarly, if a conflicting valid row has been deleted by an as-yet-uncommitted transaction, the would-be inserter must wait for that transaction to commit or abort, and then repeat the test.

Furthermore, immediately before raising a uniqueness violation according to the above rules, the access method must recheck the liveness of the row being inserted. If it is committed dead then no error should be raised. (This case cannot occur during the ordinary scenario of inserting a row that's just been created by the current transaction. It can happen during CREATE UNIQUE INDEX CONCURRENTLY, however.)

We require the index access method to apply these tests itself, which means that it must reach into the heap to check the commit status of any row that is shown to have a duplicate key according to the index contents. This is without a doubt ugly and non-modular, but it saves redundant work: if we did a separate probe then the index lookup for a conflicting row would be essentially repeated while finding the place to insert the new row's index entry. What's more, there is no obvious way to avoid race conditions unless the conflict check is an integral part of insertion of the new index entry.
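The cases listed above can be read as a decision function over the state of one conflicting row version. The sketch below is a simplified model, not the b-tree code: all type and function names are hypothetical, transaction states are reduced to four values, and two outcomes that the text leaves implicit (an aborted inserter and a committed deleter both mean no conflict) are assumptions added to make the function total.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a transaction's state. */
typedef enum ToyXactState
{
    TOY_XACT_COMMITTED,
    TOY_XACT_ABORTED,
    TOY_XACT_IN_PROGRESS,   /* some other, as-yet-uncommitted transaction */
    TOY_XACT_CURRENT        /* the transaction doing the insert */
} ToyXactState;

/* Hypothetical summary of a row version that has an equal index key. */
typedef struct ToyRowVersion
{
    ToyXactState inserter;
    bool         has_deleter;
    ToyXactState deleter;   /* meaningful only if has_deleter */
} ToyRowVersion;

typedef enum ToyVerdict
{
    TOY_NO_CONFLICT,
    TOY_WAIT_AND_RETRY,
    TOY_UNIQUE_VIOLATION
} ToyVerdict;

static ToyVerdict
toy_check_unique(const ToyRowVersion *conflicting, bool new_row_committed_dead)
{
    assert(conflicting != 0);

    if (conflicting->inserter == TOY_XACT_ABORTED)
        return TOY_NO_CONFLICT;        /* assumption: row never became valid */
    if (conflicting->inserter == TOY_XACT_IN_PROGRESS)
        return TOY_WAIT_AND_RETRY;     /* wait to see whether it commits */

    if (conflicting->has_deleter)
    {
        if (conflicting->deleter == TOY_XACT_CURRENT)
            return TOY_NO_CONFLICT;    /* deleted by us, e.g. an UPDATE */
        if (conflicting->deleter == TOY_XACT_COMMITTED)
            return TOY_NO_CONFLICT;    /* assumption: row is already dead */
        if (conflicting->deleter == TOY_XACT_IN_PROGRESS)
            return TOY_WAIT_AND_RETRY; /* wait for commit or abort */
        /* an aborted deleter leaves the row valid; fall through */
    }

    /* Final recheck: a committed-dead new row must not raise an error. */
    if (new_row_committed_dead)
        return TOY_NO_CONFLICT;
    return TOY_UNIQUE_VIOLATION;
}
```

A caller receiving TOY_WAIT_AND_RETRY would wait for the other transaction to end and then evaluate the function again from scratch, which mirrors the "redo the visibility check in toto" remark.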
The main limitation of this scheme is that it has no convenient way to support deferred uniqueness checks.

Index Cost Estimation Functions

The amcostestimate function is given a list of WHERE clauses that have been determined to be usable with the index. It must return estimates of the cost of accessing the index and the selectivity of the WHERE clauses (that is, the fraction of parent-table rows that will be retrieved during the index scan). For simple cases, nearly all the work of the cost estimator can be done by calling standard routines in the optimizer; the point of having an amcostestimate function is to allow index access methods to provide index-type-specific knowledge, in case it is possible to improve on the standard estimates.

Each amcostestimate function must have the signature:

    void
    amcostestimate (PlannerInfo *root,
                    IndexOptInfo *index,
                    List *indexQuals,
                    RelOptInfo *outer_rel,
                    Cost *indexStartupCost,
                    Cost *indexTotalCost,
                    Selectivity *indexSelectivity,
                    double *indexCorrelation);

The first four parameters are inputs:

root
    The planner's information about the query being processed.

index
    The index being considered.

indexQuals
    List of index qual clauses (implicitly ANDed); a NIL list indicates no qualifiers are available. Note that the list contains expression trees, not ScanKeys.

outer_rel
    If the index is being considered for use in a join inner indexscan, the planner's information about the outer side of the join. Otherwise NULL. When non-NULL, some of the qual clauses will be join clauses with this rel rather than being simple restriction clauses. Also, the cost estimator should expect that the index scan will be repeated for each row of the outer rel.
The last four parameters are pass-by-reference outputs:

*indexStartupCost
    Set to cost of index start-up processing

*indexTotalCost
    Set to total cost of index processing

*indexSelectivity
    Set to index selectivity

*indexCorrelation
    Set to correlation coefficient between index scan order and underlying table's order

Note that cost estimate functions must be written in C, not in SQL or any available procedural language, because they must access internal data structures of the planner/optimizer.

The index access costs should be computed using the parameters used by src/backend/optimizer/path/costsize.c: a sequential disk block fetch has cost seq_page_cost, a nonsequential fetch has cost random_page_cost, and the cost of processing one index row should usually be taken as cpu_index_tuple_cost. In addition, an appropriate multiple of cpu_operator_cost should be charged for any comparison operators invoked during index processing (especially evaluation of the indexQuals themselves).

The access costs should include all disk and CPU costs associated with scanning the index itself, but not the costs of retrieving or processing the parent-table rows that are identified by the index.

The start-up cost is the part of the total scan cost that must be expended before we can begin to fetch the first row. For most indexes this can be taken as zero, but an index type with a high start-up cost might want to set it nonzero.

The indexSelectivity should be set to the estimated fraction of the parent table rows that will be retrieved during the index scan. In the case of a lossy index, this will typically be higher than the fraction of rows that actually pass the given qual conditions.

The indexCorrelation should be set to the correlation (ranging between -1.0 and 1.0) between the index order and the table order. This is used to adjust the estimate for the cost of fetching rows from the parent table.
In the join case, the returned numbers should be averages expected for any one scan of the index.

Cost Estimation

A typical cost estimator will proceed as follows:

1. Estimate and return the fraction of parent-table rows that will be visited based on the given qual conditions. In the absence of any index-type-specific knowledge, use the standard optimizer function clauselist_selectivity():

    *indexSelectivity = clauselist_selectivity(root, indexQuals,
                                               index->rel->relid,
                                               JOIN_INNER);

2. Estimate the number of index rows that will be visited during the scan. For many index types this is the same as indexSelectivity times the number of rows in the index, but it might be more. (Note that the index's size in pages and rows is available from the IndexOptInfo struct.)

3. Estimate the number of index pages that will be retrieved during the scan. This might be just indexSelectivity times the index's size in pages.

4. Compute the index access cost. A generic estimator might do this:

    /*
     * Our generic assumption is that the index pages will be read
     * sequentially, so they cost seq_page_cost each, not random_page_cost.
     * Also, we charge for evaluation of the indexquals at each index row.
     * All the costs are assumed to be paid incrementally during the scan.
     */
    cost_qual_eval(&index_qual_cost, indexQuals, root);
    *indexStartupCost = index_qual_cost.startup;
    *indexTotalCost = seq_page_cost * numIndexPages +
        (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;

   However, the above does not account for amortization of index reads across repeated index scans in the join case.

5. Estimate the index correlation. For a simple ordered index on a single field, this can be retrieved from pg_statistic. If the correlation is not known, the conservative estimate is zero (no correlation).

Examples of cost estimator functions can be found in src/backend/utils/adt/selfuncs.c.
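To see how the steps combine numerically, the generic calculation can be restated as a stand-alone function that takes plain numbers instead of planner structures. This is a sketch under stated assumptions: the Toy* structs and toy_generic_costestimate are hypothetical and stand in for IndexOptInfo, the cost parameters and the output pointers; the selectivity is supplied by the caller rather than computed by clauselist_selectivity(); and the correlation is simply set to the conservative value zero.

```c
#include <assert.h>

/* Hypothetical stand-in for the index size fields of IndexOptInfo. */
typedef struct ToyIndexStats
{
    double pages;
    double tuples;
} ToyIndexStats;

/* Hypothetical bundle of the planner cost parameters used here. */
typedef struct ToyCostParams
{
    double seq_page_cost;
    double cpu_index_tuple_cost;
} ToyCostParams;

/* Hypothetical stand-in for the result of cost_qual_eval(). */
typedef struct ToyQualCost
{
    double startup;
    double per_tuple;
} ToyQualCost;

/* The four outputs of a cost estimator, returned by value. */
typedef struct ToyIndexCost
{
    double startup_cost;
    double total_cost;
    double selectivity;
    double correlation;
} ToyIndexCost;

static ToyIndexCost
toy_generic_costestimate(const ToyIndexStats *stats,
                         const ToyCostParams *params,
                         const ToyQualCost *qual_cost,
                         double selectivity)
{
    ToyIndexCost result;
    double       numIndexTuples;
    double       numIndexPages;

    assert(selectivity >= 0.0 && selectivity <= 1.0);

    /* Steps 2 and 3: scale the index size by the selectivity. */
    numIndexTuples = selectivity * stats->tuples;
    numIndexPages = selectivity * stats->pages;

    /* Step 4: sequential page reads plus per-row CPU and qual costs. */
    result.startup_cost = qual_cost->startup;
    result.total_cost = params->seq_page_cost * numIndexPages +
        (params->cpu_index_tuple_cost + qual_cost->per_tuple) * numIndexTuples;

    /* Step 1 result is passed through; step 5 uses the conservative zero. */
    result.selectivity = selectivity;
    result.correlation = 0.0;
    return result;
}
```

For an index of 100 pages and 10000 rows, a selectivity of 0.25, seq_page_cost of 1.0, cpu_index_tuple_cost of 0.005 and a per-row qual cost of 0.0025, this gives 25 pages and 2500 rows visited, for a total cost of 25 + 0.0075 * 2500 = 43.75.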