/*-------------------------------------------------------------------------
 *
 * nbtree.h
 *	  header file for postgres btree access method implementation.
 *
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/access/nbtree.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef NBTREE_H
#define NBTREE_H

#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
#include "access/xlogreader.h"
#include "catalog/pg_index.h"
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"

/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;

/*
 *	BTPageOpaqueData -- At the end of every page, we store a pointer
 *	to both siblings in the tree.  This is used to do forward/backward
 *	index scans.  The next-page link is also critical for recovery when
 *	a search has navigated to the wrong page due to concurrent page splits
 *	or deletions; see src/backend/access/nbtree/README for more info.
 *
 *	In addition, we store the page's btree level (counting upwards from
 *	zero at a leaf page) as well as some flag bits indicating the page type
 *	and status.  If the page is deleted, we replace the level with the
 *	next-transaction-ID value indicating when it is safe to reclaim the page.
 *
 *	We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *	processing the index, a nonzero value associated with the VACUUM run is
 *	stored into both halves of the split page.  (If VACUUM is not running,
 *	both pages receive zero cycleids.)	This allows VACUUM to detect whether
 *	a page was split since it started, with a small probability of false match
 *	if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *	ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *	(original) page, and set in the right page, but only if the next page
 *	to its right has a different cycleid.
 *
 *	NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *	instead.
 */

typedef struct BTPageOpaqueData
{
	BlockNumber btpo_prev;		/* left sibling, or P_NONE if leftmost */
	BlockNumber btpo_next;		/* right sibling, or P_NONE if rightmost */
	union
	{
		uint32		level;		/* tree level --- zero for leaf pages */
		TransactionId xact;		/* next transaction ID, if deleted */
	}			btpo;
	uint16		btpo_flags;		/* flag bits, see below */
	BTCycleId	btpo_cycleid;	/* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef BTPageOpaqueData *BTPageOpaque;

/* Bits defined in btpo_flags */
#define BTP_LEAF		(1 << 0)	/* leaf page, i.e. not internal page */
#define BTP_ROOT		(1 << 1)	/* root page (has no parent) */
#define BTP_DELETED		(1 << 2)	/* page has been deleted from tree */
#define BTP_META		(1 << 3)	/* meta-page */
#define BTP_HALF_DEAD	(1 << 4)	/* empty, but still in tree */
#define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */

/*
 * The max allowed value of a cycle ID is a bit less than 64K.  This is
 * for convenience of pg_filedump and similar utilities: we want to use
 * the last 2 bytes of special space as an index type indicator, and
 * restricting cycle ID lets btree use that space for vacuum cycle IDs
 * while still allowing index type to be identified.
 */
#define MAX_BT_CYCLE_ID		0xFF7F

/*
 * The Meta page is always the first page in the btree index.
 * Its primary purpose is to point to the location of the btree root page.
 * We also point to the "fast" root, which is the current effective root;
 * see README for discussion.
 */

typedef struct BTMetaPageData
{
	uint32		btm_magic;		/* should contain BTREE_MAGIC */
	uint32		btm_version;	/* should contain BTREE_VERSION */
	BlockNumber btm_root;		/* current root location */
	uint32		btm_level;		/* tree level of the root page */
	BlockNumber btm_fastroot;	/* current "fast" root location */
	uint32		btm_fastlevel;	/* tree level of the "fast" root page */
	/* following fields are available since page version 3 */
	TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
										 * pages */
	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
													 * during last cleanup */
} BTMetaPageData;

#define BTPageGetMeta(p) \
	((BTMetaPageData *) PageGetContents(p))

#define BTREE_METAPAGE	0		/* first page is meta */
#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
#define BTREE_VERSION	3		/* current version number */
#define BTREE_MIN_VERSION	2	/* minimal supported version number */

/*
 * Maximum size of a btree index entry, including its tuple header.
 *
 * We actually need to be able to fit three items on every page,
 * so restrict any one item to 1/3 the per-page available space.
 */
#define BTMaxItemSize(page) \
	MAXALIGN_DOWN((PageGetPageSize(page) - \
				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

/*
 * The leaf-page fillfactor defaults to 90% but is user-adjustable.
 * For pages above the leaf level, we use a fixed 70% fillfactor.
 * The fillfactor is applied during index build and when splitting
 * a rightmost page; when splitting non-rightmost pages we try to
 * divide the data equally.
 */
#define BTREE_MIN_FILLFACTOR		10
#define BTREE_DEFAULT_FILLFACTOR	90
#define BTREE_NONLEAF_FILLFACTOR	70

/*
 * In general, the btree code tries to localize its knowledge about
 * page layout to a couple of routines.  However, we need a special
 * value to indicate "no page number" in those places where we expect
 * page numbers.  We can use zero for this because we never need to
 * make a pointer to the metadata page.
 */

#define P_NONE			0

/*
 * Macros to test whether a page is leftmost or rightmost on its tree level,
 * as well as other state info kept in the opaque data.
 */
#define P_LEFTMOST(opaque)		((opaque)->btpo_prev == P_NONE)
#define P_RIGHTMOST(opaque)		((opaque)->btpo_next == P_NONE)
#define P_ISLEAF(opaque)		(((opaque)->btpo_flags & BTP_LEAF) != 0)
#define P_ISROOT(opaque)		(((opaque)->btpo_flags & BTP_ROOT) != 0)
#define P_ISDELETED(opaque)		(((opaque)->btpo_flags & BTP_DELETED) != 0)
#define P_ISMETA(opaque)		(((opaque)->btpo_flags & BTP_META) != 0)
#define P_ISHALFDEAD(opaque)	(((opaque)->btpo_flags & BTP_HALF_DEAD) != 0)
#define P_IGNORE(opaque)		(((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque)	(((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque)	(((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)

/*
 *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
 *	page.  The high key is not a data key, but gives info about what range of
 *	keys is supposed to be on this page.  The high key on a page is required
 *	to be greater than or equal to any data key that appears on the page.
 *	If we find ourselves trying to insert a key > high key, we know we need
 *	to move right (this should only happen if the page was split since we
 *	examined the parent page).
 *
 *	Our insertion algorithm guarantees that we can use the initial least key
 *	on our right sibling as the high key.  Once a page is created, its high
 *	key changes only if the page is split.
 *
 *	On a non-rightmost page, the high key lives in item 1 and data items
 *	start in item 2.  Rightmost pages have no high key, so we store data
 *	items beginning in item 1.
 */

#define P_HIKEY				((OffsetNumber) 1)
#define P_FIRSTKEY			((OffsetNumber) 2)
#define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)

/*
 * INCLUDE B-Tree indexes have non-key attributes.  These are extra
 * attributes that may be returned by index-only scans, but do not influence
 * the order of items in the index (formally, non-key attributes are not
 * considered to be part of the key space).  Non-key attributes are only
 * present in leaf index tuples whose item pointers actually point to heap
 * tuples.  All other types of index tuples (collectively, "pivot" tuples)
 * only have key attributes, since pivot tuples only ever need to represent
 * how the key space is separated.  In general, any B-Tree index that has
 * more than one level (i.e. any index that does not just consist of a
 * metapage and a single leaf root page) must have some number of pivot
 * tuples, since pivot tuples are used for traversing the tree.
 *
 * We store the number of attributes present inside pivot tuples by abusing
 * their item pointer offset field, since pivot tuples never need to store a
 * real offset (downlinks only need to store a block number).  The offset
 * field only stores the number of attributes when the INDEX_ALT_TID_MASK
 * bit is set (we never assume that pivot tuples must explicitly store the
 * number of attributes, and currently do not bother storing the number of
 * attributes unless indnkeyatts actually differs from indnatts).
 *
 * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
 * possible that it will be used within non-pivot tuples in the future.  Do
 * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
 * tuple.
 *
 * The 12 least significant offset bits are used to represent the number of
 * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
 * for future use (BT_RESERVED_OFFSET_MASK bits).  BT_N_KEYS_OFFSET_MASK should
 * be large enough to store any number <= INDEX_MAX_KEYS.
 */
#define INDEX_ALT_TID_MASK		INDEX_AM_RESERVED_BIT
#define BT_RESERVED_OFFSET_MASK	0xF000
#define BT_N_KEYS_OFFSET_MASK	0x0FFF

/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))

/*
 * Get/set leaf page highkey's link. During the second phase of deletion, the
 * target leaf page's high key may point to an ancestor page (at all other
 * times, the leaf level high key's link is not used). See the nbtree README
 * for full details.
 */
#define BTreeTupleGetTopParent(itup) \
	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeTupleSetTopParent(itup, blkno) \
	do { \
		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
		BTreeTupleSetNAtts((itup), 0); \
	} while(0)
|
|
|
|
|
Adjust INCLUDE index truncation comments and code.
Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected. Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature. A
similar check will be performed for the page highkey within
_bt_moveright() in most cases. Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.
Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer). This isn't strictly necessary,
but seems more future-proof than the old approach. It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation. Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.
In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX. Also pfree during a page split, just to be
consistent.
Refactor _bt_check_natts() to be more readable.
Author: Peter Geoghegan with some editorization by me
Reviewed by: Alexander Korotkov, Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
2018-04-19 07:45:58 +02:00
|
|
|
/*
|
|
|
|
* Get/set number of attributes within B-tree index tuple. Asserts should be
|
|
|
|
* removed when BT_RESERVED_OFFSET_MASK bits will be used.
|
|
|
|
*/
#define BTreeTupleGetNAtts(itup, rel)	\
	( \
		(itup)->t_info & INDEX_ALT_TID_MASK ? \
		( \
			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
		) \
		: \
		IndexRelationGetNumberOfAttributes(rel) \
	)

#define BTreeTupleSetNAtts(itup, n) \
	do { \
		(itup)->t_info |= INDEX_ALT_TID_MASK; \
		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
	} while (0)

/*
 * Operator strategy numbers for B-tree have been moved to access/stratnum.h,
 * because many places need to use them in ScanKeyInit() calls.
 *
 * The strategy numbers are chosen so that we can commute them by
 * subtraction, thus:
 */
#define BTCommuteStrategyNumber(strat)	(BTMaxStrategyNumber + 1 - (strat))

/*
 * When a new operator class is declared, we require that the user
 * supply us with an amproc procedure (BTORDER_PROC) for determining
 * whether, for two keys a and b, a < b, a = b, or a > b.  This routine
 * must return < 0, 0, > 0, respectively, in these three cases.
 *
 * To facilitate accelerated sorting, an operator class may choose to
 * offer a second procedure (BTSORTSUPPORT_PROC).  For full details, see
 * src/include/utils/sortsupport.h.
 *
 * To support window frames defined by "RANGE offset PRECEDING/FOLLOWING",
 * an operator class may choose to offer a third amproc procedure
 * (BTINRANGE_PROC), independently of whether it offers sortsupport.
 * For full details, see doc/src/sgml/btree.sgml.
 */

#define BTORDER_PROC		1
#define BTSORTSUPPORT_PROC	2
#define BTINRANGE_PROC		3
#define BTNProcs			3

/*
 * We need to be able to tell the difference between read and write
 * requests for pages, in order to do locking correctly.
 */

#define BT_READ			BUFFER_LOCK_SHARE
#define BT_WRITE		BUFFER_LOCK_EXCLUSIVE

/*
 * BTStackData -- As we descend a tree, we push the (location, downlink)
 * pairs from internal pages onto a private stack.  If we split a
 * leaf, we use this stack to walk back up the tree and insert data
 * into parent pages (and possibly to split them, too).  Lehman and
 * Yao's update algorithm guarantees that under no circumstances can
 * our private stack give us an irredeemably bad picture up the tree.
 * Again, see the paper for details.
 */

typedef struct BTStackData
{
	BlockNumber bts_blkno;
	OffsetNumber bts_offset;
	BlockNumber bts_btentry;
	struct BTStackData *bts_parent;
} BTStackData;

typedef BTStackData *BTStack;

/*
 * BTScanOpaqueData is the btree-private state needed for an indexscan.
 * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
 * details of the preprocessing), information about the current location
 * of the scan, and information about the marked location, if any.  (We use
 * BTScanPosData to represent the data needed for each of current and marked
 * locations.)  In addition we can remember some known-killed index entries
 * that must be marked before we can move off the current page.
 *
 * Index scans work a page at a time: we pin and read-lock the page, identify
 * all the matching items on the page and save them in BTScanPosData, then
 * release the read-lock while returning the items to the caller for
 * processing.  This approach minimizes lock/unlock traffic.  Note that we
 * keep the pin on the index page until the caller is done with all the items
 * (this is needed for VACUUM synchronization, see nbtree/README).  When we
 * are ready to step to the next page, if the caller has told us any of the
 * items were killed, we re-lock the page to mark them killed, then unlock.
 * Finally we drop the pin and step to the next page in the appropriate
 * direction.
 *
 * If we are doing an index-only scan, we save the entire IndexTuple for each
 * matched item, otherwise only its heap TID and offset.  The IndexTuples go
 * into a separate workspace array; each BTScanPosItem stores its tuple's
 * offset within that array.
 */

typedef struct BTScanPosItem	/* what we remember about each match */
{
	ItemPointerData heapTid;	/* TID of referenced heap item */
	OffsetNumber indexOffset;	/* index item's location within page */
	LocationIndex tupleOffset;	/* IndexTuple's offset in workspace, if any */
} BTScanPosItem;

typedef struct BTScanPosData
{
	Buffer		buf;			/* if valid, the buffer is pinned */

	XLogRecPtr	lsn;			/* pos in the WAL stream when page was read */
	BlockNumber currPage;		/* page referenced by items array */
	BlockNumber nextPage;		/* page's right link when we scanned it */

	/*
	 * moreLeft and moreRight track whether we think there may be matching
	 * index entries to the left and right of the current page, respectively.
	 * We can clear the appropriate one of these flags when _bt_checkkeys()
	 * returns continuescan = false.
	 */
	bool		moreLeft;
	bool		moreRight;

	/*
	 * If we are doing an index-only scan, nextTupleOffset is the first free
	 * location in the associated tuple storage workspace.
	 */
	int			nextTupleOffset;

	/*
	 * The items array is always ordered in index order (ie, increasing
	 * indexoffset).  When scanning backwards it is convenient to fill the
	 * array back-to-front, so we start at the last slot and fill downwards.
	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
	 * itemIndex is a cursor showing which entry was last returned to caller.
	 */
	int			firstItem;		/* first valid index in items[] */
	int			lastItem;		/* last valid index in items[] */
	int			itemIndex;		/* current index in items[] */

	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;

typedef BTScanPosData *BTScanPos;

#define BTScanPosIsPinned(scanpos) \
( \
	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
				!BufferIsValid((scanpos).buf)), \
	BufferIsValid((scanpos).buf) \
)
#define BTScanPosUnpin(scanpos) \
	do { \
		ReleaseBuffer((scanpos).buf); \
		(scanpos).buf = InvalidBuffer; \
	} while (0)
#define BTScanPosUnpinIfPinned(scanpos) \
	do { \
		if (BTScanPosIsPinned(scanpos)) \
			BTScanPosUnpin(scanpos); \
	} while (0)

#define BTScanPosIsValid(scanpos) \
( \
	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
				!BufferIsValid((scanpos).buf)), \
	BlockNumberIsValid((scanpos).currPage) \
)
#define BTScanPosInvalidate(scanpos) \
	do { \
		(scanpos).currPage = InvalidBlockNumber; \
		(scanpos).nextPage = InvalidBlockNumber; \
		(scanpos).buf = InvalidBuffer; \
		(scanpos).lsn = InvalidXLogRecPtr; \
		(scanpos).nextTupleOffset = 0; \
	} while (0)

/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
	int			scan_key;		/* index of associated key in arrayKeyData */
	int			cur_elem;		/* index of current element in elem_values */
	int			mark_elem;		/* index of marked element in elem_values */
	int			num_elems;		/* number of elems in current array value */
	Datum	   *elem_values;	/* array of num_elems Datums */
} BTArrayKeyInfo;

typedef struct BTScanOpaqueData
{
	/* these fields are set by _bt_preprocess_keys(): */
	bool		qual_ok;		/* false if qual can never be satisfied */
	int			numberOfKeys;	/* number of preprocessed scan keys */
	ScanKey		keyData;		/* array of preprocessed scan keys */

	/* workspace for SK_SEARCHARRAY support */
	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
	int			numArrayKeys;	/* number of equality-type array keys (-1 if
								 * there are any unsatisfiable array keys) */
	int			arrayKeyCount;	/* count indicating number of array scan keys
								 * processed */
	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
	MemoryContext arrayContext; /* scan-lifespan context for array data */

	/* info about killed items if any (killedItems is NULL if never used) */
	int		   *killedItems;	/* currPos.items indexes of killed items */
	int			numKilled;		/* number of currently stored items */

	/*
	 * If we are doing an index-only scan, these are the tuple storage
	 * workspaces for the currPos and markPos respectively.  Each is of size
	 * BLCKSZ, so it can hold as much as a full page's worth of tuples.
	 */
	char	   *currTuples;		/* tuple storage for currPos */
	char	   *markTuples;		/* tuple storage for markPos */

	/*
	 * If the marked position is on the same page as current position, we
	 * don't use markPos, but just keep the marked itemIndex in markItemIndex
	 * (all the rest of currPos is valid for the mark position).  Hence, to
	 * determine if there is a mark, first look at markItemIndex, then at
	 * markPos.
	 */
	int			markItemIndex;	/* itemIndex, or -1 if not valid */

	/* keep these last in struct for efficiency */
	BTScanPosData currPos;		/* current position data */
	BTScanPosData markPos;		/* marked position, if any */
} BTScanOpaqueData;

typedef BTScanOpaqueData *BTScanOpaque;

/*
 * We use some private sk_flags bits in preprocessed scan keys.  We're allowed
 * to use bits 16-31 (see skey.h).  The uppermost bits are copied from the
 * index's indoption[] array entry for the index attribute.
 */
#define SK_BT_REQFWD	0x00010000	/* required to continue forward scan */
#define SK_BT_REQBKWD	0x00020000	/* required to continue backward scan */
#define SK_BT_INDOPTION_SHIFT  24	/* must clear the above bits */
#define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)

/*
 * external entry points for btree, in nbtree.c
 */
extern void btbuildempty(Relation index);
extern bool btinsert(Relation rel, Datum *values, bool *isnull,
		 ItemPointer ht_ctid, Relation heapRel,
		 IndexUniqueCheck checkUnique,
		 struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
		 ScanKey orderbys, int norderbys);
extern void btparallelrescan(IndexScanDesc scan);
extern void btendscan(IndexScanDesc scan);
extern void btmarkpos(IndexScanDesc scan);
extern void btrestrpos(IndexScanDesc scan);
extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
			 IndexBulkDeleteResult *stats,
			 IndexBulkDeleteCallback callback,
			 void *callback_state);
extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
				IndexBulkDeleteResult *stats);
extern bool btcanreturn(Relation index, int attno);

/*
 * prototypes for internal functions in nbtree.c
 */
extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);

/*
 * prototypes for functions in nbtinsert.c
 */
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
			 IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
|
2016-04-08 20:52:13 +02:00
|
|
|
|
1996-08-27 23:50:29 +02:00
|
|
|
/*
|
|
|
|
* prototypes for functions in nbtpage.c
|
|
|
|
*/
|
2004-06-02 19:28:18 +02:00
|
|
|
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
extern void _bt_update_meta_cleanup_info(Relation rel,
							 TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
				 BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
					OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
					OffsetNumber *itemnos, int nitems,
					BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);

/*
 * prototypes for functions in nbtsearch.c
 */
extern BTStack _bt_search(Relation rel,
		   int keysz, ScanKey scankey, bool nextkey,
		   Buffer *bufP, int access, Snapshot snapshot);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
			  int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
			ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
			Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
				 Snapshot snapshot);

/*
 * prototypes for functions in nbtutils.c
 */
extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
extern ScanKey _bt_mkscankey_nodata(Relation rel);
extern void _bt_freeskey(ScanKey skey);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
			 Page page, OffsetNumber offnum,
			 ScanDirection dir, bool *continuescan);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
extern void _bt_end_vacuum(Relation rel);
extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
extern bool btproperty(Oid index_oid, int attno,
		   IndexAMProperty prop, const char *propname,
		   bool *res, bool *isnull);
extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);

/*
 * prototypes for functions in nbtvalidate.c
 */
extern bool btvalidate(Oid opclassoid);

/*
 * prototypes for functions in nbtsort.c
 */
extern IndexBuildResult *btbuild(Relation heap, Relation index,
		struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
#endif /* NBTREE_H */