/*-------------------------------------------------------------------------
 *
 * relscan.h
 *	  POSTGRES relation scan descriptor definitions.
 *
 *
 * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/access/relscan.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef RELSCAN_H
#define RELSCAN_H

#include "access/genam.h"
#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/itup.h"
#include "access/tupdesc.h"


typedef struct HeapScanDescData
{
	/* scan parameters */
	Relation	rs_rd;			/* heap relation descriptor */
	Snapshot	rs_snapshot;	/* snapshot to see */
	int			rs_nkeys;		/* number of scan keys */
	ScanKey		rs_key;			/* array of scan key descriptors */
	bool		rs_bitmapscan;	/* true if this is really a bitmap scan */
	bool		rs_samplescan;	/* true if this is really a sample scan */
	bool		rs_pageatatime; /* verify visibility page-at-a-time? */
	bool		rs_allow_strat; /* allow or disallow use of access strategy */
	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
	bool		rs_temp_snap;	/* unregister snapshot at scan end? */

	/* state set up at initscan time */
	BlockNumber rs_nblocks;		/* total number of blocks in rel */
	BlockNumber rs_startblock;	/* block # to start at */
	BlockNumber rs_initblock;	/* block # to consider initial of rel */
	BlockNumber rs_numblocks;	/* number of blocks to scan */
	BufferAccessStrategy rs_strategy;	/* access strategy for reads */
	bool		rs_syncscan;	/* report location to syncscan logic? */

	/* scan current state */
	bool		rs_inited;		/* false = scan not init'd yet */
	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
	BlockNumber rs_cblock;		/* current block # in scan, if any */
	Buffer		rs_cbuf;		/* current buffer in scan, if any */
	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */

	/* these fields only used in page-at-a-time mode and for bitmap scans */
	int			rs_cindex;		/* current tuple's index in vistuples */
	int			rs_ntuples;		/* number of visible tuples on page */
	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
}	HeapScanDescData;
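
/*
 * Usage sketch (illustrative only; example_heap_scan and the
 * RELSCAN_USAGE_EXAMPLES guard are hypothetical names, not part of this
 * header): a minimal forward sequential scan built on the heapam entry
 * points that produce and consume a HeapScanDesc.  The relation and
 * snapshot are assumed to be supplied by the caller.
 */
#ifdef RELSCAN_USAGE_EXAMPLES
static void
example_heap_scan(Relation rel, Snapshot snapshot)
{
	HeapScanDesc scan;
	HeapTuple	tuple;

	/* no scan keys: visit every tuple visible to the snapshot */
	scan = heap_beginscan(rel, snapshot, 0, NULL);
	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
	{
		/* process one tuple; scan->rs_cbuf holds a pin on its page here */
	}
	heap_endscan(scan);
}
#endif   /* RELSCAN_USAGE_EXAMPLES */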

/*
 * We use the same IndexScanDescData structure for both amgettuple-based
 * and amgetbitmap-based index scans.  Some fields are only relevant in
 * amgettuple-based scans.
 */
typedef struct IndexScanDescData
{
	/* scan parameters */
	Relation	heapRelation;	/* heap relation descriptor, or NULL */
	Relation	indexRelation;	/* index relation descriptor */
	Snapshot	xs_snapshot;	/* snapshot to see */
	int			numberOfKeys;	/* number of index qualifier conditions */
	int			numberOfOrderBys;		/* number of ordering operators */
	ScanKey		keyData;		/* array of index qualifier descriptors */
	ScanKey		orderByData;	/* array of ordering op descriptors */
	bool		xs_want_itup;	/* caller requests index tuples */

	/* signaling to index AM about killing index tuples */
	bool		kill_prior_tuple;		/* last-returned tuple is dead */
	bool		ignore_killed_tuples;	/* do not return killed entries */
	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
										 * tuples */

	/* index access method's private state */
	void	   *opaque;			/* access-method-specific info */

	/* in an index-only scan, this is valid after a successful amgettuple */
	IndexTuple	xs_itup;		/* index tuple returned by AM */
	TupleDesc	xs_itupdesc;	/* rowtype descriptor of xs_itup */

	/* xs_ctup/xs_cbuf/xs_recheck are valid after a successful index_getnext */
	HeapTupleData xs_ctup;		/* current heap tuple, if any */
	Buffer		xs_cbuf;		/* current heap buffer in scan, if any */
	/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
	bool		xs_recheck;		/* T means scan keys must be rechecked */

	/*
	 * When fetching with an ordering operator, the values of the ORDER BY
	 * expressions of the last returned tuple, according to the index.  If
	 * xs_recheckorderby is true, these need to be rechecked just like the
	 * scan keys, and the values returned here are a lower-bound on the
	 * actual values.
	 */
	Datum	   *xs_orderbyvals;
	bool	   *xs_orderbynulls;
	bool		xs_recheckorderby;		/* T means ORDER BY exprs must be
										 * rechecked */

	/* state data for traversing HOT chains in index_getnext */
	bool		xs_continue_hot;		/* T if must keep walking HOT chain */
}	IndexScanDescData;
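
/*
 * Usage sketch (illustrative only; example_index_scan and the
 * RELSCAN_USAGE_EXAMPLES guard are hypothetical names): a simple
 * amgettuple-style index scan through the index_beginscan/index_rescan/
 * index_getnext/index_endscan entry points declared in access/genam.h.
 * A single scan key is assumed to have been set up by the caller, e.g.
 * with ScanKeyInit().
 */
#ifdef RELSCAN_USAGE_EXAMPLES
static void
example_index_scan(Relation heapRel, Relation indexRel,
				   Snapshot snapshot, ScanKey skey)
{
	IndexScanDesc scan;
	HeapTuple	tuple;

	/* one qualifier condition, no ordering operators */
	scan = index_beginscan(heapRel, indexRel, snapshot, 1, 0);
	index_rescan(scan, skey, 1, NULL, 0);
	while ((tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
	{
		/* if scan->xs_recheck, the quals must be re-evaluated here */
	}
	index_endscan(scan);
}
#endif   /* RELSCAN_USAGE_EXAMPLES */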

/* Struct for heap-or-index scans of system tables */
typedef struct SysScanDescData
{
	Relation	heap_rel;		/* catalog being scanned */
	Relation	irel;			/* NULL if doing heap scan */
	HeapScanDesc scan;			/* only valid in heap-scan case */
	IndexScanDesc iscan;		/* only valid in index-scan case */
	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
}	SysScanDescData;
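
/*
 * Usage sketch (illustrative only; example_catalog_scan and the
 * RELSCAN_USAGE_EXAMPLES guard are hypothetical names): scanning a system
 * catalog through the systable_* wrappers from access/genam.h, which choose
 * between a heap scan and an index scan.  Passing a NULL snapshot is assumed
 * to let the wrapper pick an appropriate catalog snapshot.
 */
#ifdef RELSCAN_USAGE_EXAMPLES
static void
example_catalog_scan(Relation catalog, Oid indexId, ScanKey skey, int nkeys)
{
	SysScanDesc sscan;
	HeapTuple	tuple;

	/* indexOK = true: use the index identified by indexId if allowed */
	sscan = systable_beginscan(catalog, indexId, true, NULL, nkeys, skey);
	while ((tuple = systable_getnext(sscan)) != NULL)
	{
		/* examine one catalog tuple */
	}
	systable_endscan(sscan);
}
#endif   /* RELSCAN_USAGE_EXAMPLES */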

#endif   /* RELSCAN_H */