/*-------------------------------------------------------------------------
 *
 * hash.h
 *	  header file for postgres hash access method implementation
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/access/hash.h
 *
 * NOTES
 *		modeled after Margo Seltzer's hash implementation for unix.
 *
 *-------------------------------------------------------------------------
 */
#ifndef HASH_H
#define HASH_H

#include "access/amapi.h"
#include "access/itup.h"
#include "access/sdir.h"
#include "catalog/pg_am_d.h"
#include "common/hashfn.h"
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
|
1996-08-27 23:50:29 +02:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2003-09-01 22:26:34 +02:00
|
|
|
* Mapping from hash bucket number to physical block number of bucket's
|
2003-09-02 04:18:38 +02:00
|
|
|
* starting page. Beware of multiple evaluations of argument!
|
1996-08-27 23:50:29 +02:00
|
|
|
*/
|
1997-09-08 04:41:22 +02:00
|
|
|
typedef uint32 Bucket;
|
1996-08-27 23:50:29 +02:00
|
|
|
|
Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency. Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.
In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel. There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.
This patch also removes the unworldly assumption that a split will
never be interrupted. With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion. While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.
Amit Kapila. I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's. Also
reviewed by Jesper Pedersen, Jeff Janes, and others.
Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 21:39:21 +01:00
|
|
|
#define InvalidBucket ((Bucket) 0xFFFFFFFF)
|
|
|
|
|
2003-09-02 04:18:38 +02:00
|
|
|
#define BUCKET_TO_BLKNO(metap,B) \
|
Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion). To mitigate this problem to some
degree, divide larger splitpoints into four equal phases. This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.
This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3. This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway. First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10. Second,
it will let us remove some backward-compatibility code added by commit
293e24e507838733aba4748b514536af2d39d7f2.
Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me. Regression
test outputs updated by me.
Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-04 05:46:33 +02:00
|
|
|
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)

/*
 * Special space for hash index pages.
 *
 * hasho_flag's LH_PAGE_TYPE bits tell us which type of page we're looking at.
 * Additional bits in the flag word are used for more transient purposes.
 *
 * To test a page's type, do (hasho_flag & LH_PAGE_TYPE) == LH_xxx_PAGE.
 * However, we ensure that each used page type has a distinct bit so that
 * we can OR together page types for uses such as the allowable-page-types
 * argument of _hash_checkpage().
 */
#define LH_UNUSED_PAGE			(0)
#define LH_OVERFLOW_PAGE		(1 << 0)
#define LH_BUCKET_PAGE			(1 << 1)
#define LH_BITMAP_PAGE			(1 << 2)
#define LH_META_PAGE			(1 << 3)
#define LH_BUCKET_BEING_POPULATED	(1 << 4)
#define LH_BUCKET_BEING_SPLIT	(1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP	(1 << 6)
#define LH_PAGE_HAS_DEAD_TUPLES	(1 << 7)

#define LH_PAGE_TYPE \
	(LH_OVERFLOW_PAGE | LH_BUCKET_PAGE | LH_BITMAP_PAGE | LH_META_PAGE)

/*
 * In an overflow page, hasho_prevblkno stores the block number of the previous
 * page in the bucket chain; in a bucket page, hasho_prevblkno stores the
 * hashm_maxbucket value as of the last time the bucket was split, or
 * else as of the time the bucket was created.  The latter convention is used
 * to determine whether a cached copy of the metapage is too stale to be used
 * without needing to lock or pin the metapage.
 *
 * hasho_nextblkno is always the block number of the next page in the
 * bucket chain, or InvalidBlockNumber if there are no more such pages.
 */
typedef struct HashPageOpaqueData
{
	BlockNumber hasho_prevblkno;	/* see above */
	BlockNumber hasho_nextblkno;	/* see above */
	Bucket		hasho_bucket;	/* bucket number this pg belongs to */
	uint16		hasho_flag;		/* page type code + flag bits, see above */
	uint16		hasho_page_id;	/* for identification of hash indexes */
} HashPageOpaqueData;

typedef HashPageOpaqueData *HashPageOpaque;

#define H_NEEDS_SPLIT_CLEANUP(opaque)	(((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP) != 0)
#define H_BUCKET_BEING_SPLIT(opaque)	(((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT) != 0)
#define H_BUCKET_BEING_POPULATED(opaque)	(((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED) != 0)
#define H_HAS_DEAD_TUPLES(opaque)	(((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) != 0)

/*
 * The page ID is for the convenience of pg_filedump and similar utilities,
 * which otherwise would have a hard time telling pages of different index
 * types apart.  It should be the last 2 bytes on the page.  This is more or
 * less "free" due to alignment considerations.
 */
#define HASHO_PAGE_ID		0xFF80

typedef struct HashScanPosItem	/* what we remember about each match */
{
	ItemPointerData heapTid;	/* TID of referenced heap item */
	OffsetNumber indexOffset;	/* index item's location within page */
} HashScanPosItem;

typedef struct HashScanPosData
{
	Buffer		buf;			/* if valid, the buffer is pinned */
	BlockNumber currPage;		/* current hash index page */
	BlockNumber nextPage;		/* next overflow page */
	BlockNumber prevPage;		/* prev overflow or bucket page */

	/*
	 * The items array is always ordered in index order (ie, increasing
	 * indexoffset).  When scanning backwards it is convenient to fill the
	 * array back-to-front, so we start at the last slot and fill downwards.
	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
	 * itemIndex is a cursor showing which entry was last returned to caller.
	 */
	int			firstItem;		/* first valid index in items[] */
	int			lastItem;		/* last valid index in items[] */
	int			itemIndex;		/* current index in items[] */

	HashScanPosItem items[MaxIndexTuplesPerPage];	/* MUST BE LAST */
} HashScanPosData;

#define HashScanPosIsPinned(scanpos) \
( \
	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
				!BufferIsValid((scanpos).buf)), \
	BufferIsValid((scanpos).buf) \
)

#define HashScanPosIsValid(scanpos) \
( \
	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
				!BufferIsValid((scanpos).buf)), \
	BlockNumberIsValid((scanpos).currPage) \
)

#define HashScanPosInvalidate(scanpos) \
	do { \
		(scanpos).buf = InvalidBuffer; \
		(scanpos).currPage = InvalidBlockNumber; \
		(scanpos).nextPage = InvalidBlockNumber; \
		(scanpos).prevPage = InvalidBlockNumber; \
		(scanpos).firstItem = 0; \
		(scanpos).lastItem = 0; \
		(scanpos).itemIndex = 0; \
	} while (0)

/*
 * HashScanOpaqueData is private state for a hash index scan.
 */
typedef struct HashScanOpaqueData
{
	/* Hash value of the scan key, ie, the hash key we seek */
	uint32		hashso_sk_hash;

	/* remember the buffer associated with primary bucket */
	Buffer		hashso_bucket_buf;

	/*
	 * remember the buffer associated with primary bucket page of the bucket
	 * being split; it is required when scanning the bucket that is being
	 * populated by a split operation.
	 */
	Buffer		hashso_split_bucket_buf;

	/* Whether scan starts on bucket being populated due to split */
	bool		hashso_buc_populated;

	/*
	 * Whether we are scanning the bucket being split.  This field is
	 * consulted only when hashso_buc_populated is true.
	 */
	bool		hashso_buc_split;

	/* info about killed items if any (killedItems is NULL if never used) */
	int		   *killedItems;	/* currPos.items indexes of killed items */
	int			numKilled;		/* number of currently stored items */

	/*
	 * Identify all the matching items on a page and save them in
	 * HashScanPosData
	 */
	HashScanPosData currPos;	/* current position data */
} HashScanOpaqueData;

typedef HashScanOpaqueData *HashScanOpaque;

/*
 * Definitions for metapage.
 */

#define HASH_METAPAGE	0		/* metapage is always block 0 */

#define HASH_MAGIC		0x6440640
#define HASH_VERSION	4

/*
 * spares[] holds the number of overflow pages currently allocated at or
 * before a certain splitpoint.  For example, if spares[3] = 7 then there are
 * 7 ovflpages before splitpoint 3 (compare BUCKET_TO_BLKNO macro).  The
 * value in spares[ovflpoint] increases as overflow pages are added at the
 * end of the index.  Once ovflpoint increases (ie, we have actually allocated
 * the bucket pages belonging to that splitpoint) the number of spares at the
 * prior splitpoint cannot change anymore.
 *
 * ovflpages that have been recycled for reuse can be found by looking at
 * bitmaps that are stored within ovflpages dedicated for the purpose.
 * The blknos of these bitmap pages are kept in mapp[]; nmaps is the
 * number of currently existing bitmaps.
 *
 * The limitation on the size of spares[] comes from the fact that there's
 * no point in having more than 2^32 buckets with only uint32 hashcodes.
 * (Note: HASH_MAX_SPLITPOINTS, which is the size of spares[], is adjusted
 * to accommodate the multi-phased allocation of buckets after
 * HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE.)
 *
 * There is no particular upper limit on the size of mapp[], other than
 * needing to fit into the metapage.  (With 8K block size, 1024 bitmaps
 * limit us to 256 GB of overflow space...)  For smaller block sizes we
 * cannot use 1024 bitmaps, as that would cause the metapage data to cross
 * the block size boundary; so we use BLCKSZ to determine the maximum
 * number of bitmaps.
 */
#define HASH_MAX_BITMAPS	Min(BLCKSZ / 8, 1024)
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion). To mitigate this problem to some
degree, divide larger splitpoints into four equal phases. This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.
This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3. This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway. First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10. Second,
it will let us remove some backward-compatibility code added by commit
293e24e507838733aba4748b514536af2d39d7f2.
Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me. Regression
test outputs updated by me.
Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-04 05:46:33 +02:00
|
|
|
#define HASH_SPLITPOINT_PHASE_BITS 2
|
|
|
|
#define HASH_SPLITPOINT_PHASES_PER_GRP (1 << HASH_SPLITPOINT_PHASE_BITS)
|
|
|
|
#define HASH_SPLITPOINT_PHASE_MASK (HASH_SPLITPOINT_PHASES_PER_GRP - 1)
|
|
|
|
#define HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE 10
|
|
|
|
|
2018-04-01 21:01:28 +02:00
|
|
|
/* defines max number of splitpoint phases a hash index can have */
|
#define HASH_MAX_SPLITPOINT_GROUP	32
#define HASH_MAX_SPLITPOINTS \
	(((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
	  HASH_SPLITPOINT_PHASES_PER_GRP) + \
	 HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
typedef struct HashMetaPageData
{
	uint32		hashm_magic;	/* magic no. for hash tables */
	uint32		hashm_version;	/* version ID */
	double		hashm_ntuples;	/* number of tuples stored in the table */
	uint16		hashm_ffactor;	/* target fill factor (tuples/bucket) */
	uint16		hashm_bsize;	/* index page size (bytes) */
	uint16		hashm_bmsize;	/* bitmap array size (bytes) - must be a
								 * power of 2 */
	uint16		hashm_bmshift;	/* log2(bitmap array size in BITS) */
	uint32		hashm_maxbucket;	/* ID of maximum bucket in use */
	uint32		hashm_highmask; /* mask to modulo into entire table */
	uint32		hashm_lowmask;	/* mask to modulo into lower half of table */
	uint32		hashm_ovflpoint;	/* splitpoint from which ovflpage being
									 * allocated */
	uint32		hashm_firstfree;	/* lowest-number free ovflpage (bit#) */
	uint32		hashm_nmaps;	/* number of bitmap pages */
	RegProcedure hashm_procid;	/* hash function id from pg_proc */
	uint32		hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before each
													 * splitpoint */
	BlockNumber hashm_mapp[HASH_MAX_BITMAPS];	/* blknos of ovfl bitmaps */
} HashMetaPageData;

typedef HashMetaPageData *HashMetaPage;
typedef struct HashOptions
{
	int32		varlena_header_;	/* varlena header (do not touch directly!) */
	int			fillfactor;		/* page fill factor in percent (0..100) */
} HashOptions;

#define HashGetFillFactor(relation) \
	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
				 relation->rd_rel->relam == HASH_AM_OID), \
	 (relation)->rd_options ? \
	 ((HashOptions *) (relation)->rd_options)->fillfactor : \
	 HASH_DEFAULT_FILLFACTOR)

#define HashGetTargetPageUsage(relation) \
	(BLCKSZ * HashGetFillFactor(relation) / 100)
/*
 * Maximum size of a hash index item (it's okay to have only one per page)
 */
#define HashMaxItemSize(page) \
	MAXALIGN_DOWN(PageGetPageSize(page) - \
				  SizeOfPageHeaderData - \
				  sizeof(ItemIdData) - \
				  MAXALIGN(sizeof(HashPageOpaqueData)))
#define INDEX_MOVED_BY_SPLIT_MASK	INDEX_AM_RESERVED_BIT

#define HASH_MIN_FILLFACTOR			10
#define HASH_DEFAULT_FILLFACTOR		75
/*
 * Constants
 */
#define BYTE_TO_BIT				3	/* 2^3 bits/byte */
#define ALL_SET					((uint32) ~0)
/*
 * Bitmap pages do not contain tuples.  They do contain the standard
 * page headers and trailers; however, everything in between is a
 * giant bit array.  The number of bits that fit on a page obviously
 * depends on the page size and the header/trailer overhead.  We require
 * the number of bits per page to be a power of 2.
 */
#define BMPGSZ_BYTE(metap)		((metap)->hashm_bmsize)
#define BMPGSZ_BIT(metap)		((metap)->hashm_bmsize << BYTE_TO_BIT)
#define BMPG_SHIFT(metap)		((metap)->hashm_bmshift)
#define BMPG_MASK(metap)		(BMPGSZ_BIT(metap) - 1)

#define HashPageGetBitmap(page) \
	((uint32 *) PageGetContents(page))

#define HashGetMaxBitmapSize(page) \
	(PageGetPageSize((Page) page) - \
	 (MAXALIGN(SizeOfPageHeaderData) + MAXALIGN(sizeof(HashPageOpaqueData))))

#define HashPageGetMeta(page) \
	((HashMetaPage) PageGetContents(page))
/*
 * The number of bits in an ovflpage bitmap word.
 */
#define BITS_PER_MAP	32		/* Number of bits in uint32 */

/* Given the address of the beginning of a bit map, clear/set the nth bit */
#define CLRBIT(A, N)	((A)[(N)/BITS_PER_MAP] &= ~(1<<((N)%BITS_PER_MAP)))
#define SETBIT(A, N)	((A)[(N)/BITS_PER_MAP] |= (1<<((N)%BITS_PER_MAP)))
#define ISSET(A, N)		((A)[(N)/BITS_PER_MAP] & (1<<((N)%BITS_PER_MAP)))
/*
 * page-level and high-level locking modes (see README)
 */
#define HASH_READ		BUFFER_LOCK_SHARE
#define HASH_WRITE		BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK		(-1)
/*
 * When a new operator class is declared, we require that the user supply
 * us with an amproc function for hashing a key of the new type, returning
 * a 32-bit hash value.  We call this the "standard" hash function.  We
 * also allow an optional "extended" hash function which accepts a salt and
 * returns a 64-bit hash value.  This is highly recommended but, for reasons
 * of backward compatibility, optional.
 *
 * When the salt is 0, the low 32 bits of the value returned by the extended
 * hash function should match the value that would have been returned by the
 * standard hash function.
 */
#define HASHSTANDARD_PROC	1
#define HASHEXTENDED_PROC	2
#define HASHOPTIONS_PROC	3
#define HASHNProcs			3
/* public routines */

extern IndexBuildResult *hashbuild(Relation heap, Relation index,
								   struct IndexInfo *indexInfo);
extern void hashbuildempty(Relation index);
extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
					   ItemPointer ht_ctid, Relation heapRel,
					   IndexUniqueCheck checkUnique,
					   struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
					   ScanKey orderbys, int norderbys);
extern void hashendscan(IndexScanDesc scan);
extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
											 IndexBulkDeleteResult *stats,
											 IndexBulkDeleteCallback callback,
											 void *callback_state);
extern IndexBulkDeleteResult *hashvacuumcleanup(IndexVacuumInfo *info,
												IndexBulkDeleteResult *stats);
extern bytea *hashoptions(Datum reloptions, bool validate);
extern bool hashvalidate(Oid opclassoid);
/* private routines */

/* hashinsert.c */
extern void _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
								   Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
								OffsetNumber *itup_offsets, uint16 nitups);
/* hashovfl.c */
extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf,
								bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf,
									  Buffer ovflbuf, Buffer wbuf,
									  IndexTuple *itups,
									  OffsetNumber *itup_offsets,
									  Size *tups_size, uint16 nitups,
									  BufferAccessStrategy bstrategy);
extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
extern void _hash_squeezebucket(Relation rel,
								Bucket bucket, BlockNumber bucket_blkno,
								Buffer bucket_buf,
								BufferAccessStrategy bstrategy);
extern uint32 _hash_ovflblkno_to_bitno(HashMetaPage metap,
									   BlockNumber ovflblkno);
/* hashpage.c */
|
2007-05-03 18:45:58 +02:00
|
|
|
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
|
2019-05-22 19:04:48 +02:00
|
|
|
int access, int flags);
|
Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency. Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.
In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel. There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.
This patch also removes the unworldly assumption that a split will
never be interrupted. With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion. While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.
Amit Kapila. I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's. Also
reviewed by Jesper Pedersen, Jeff Janes, and others.
Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 21:39:21 +01:00
|
|
|
extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
|
2019-05-22 19:04:48 +02:00
|
|
|
BlockNumber blkno, int flags);
|
Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and
contention when scanning hash indexes, because it's no longer
necessary to lock and pin the metapage for every scan. We do need
some way of figuring out when the cache is too stale to use any more,
so that when we lock the primary bucket page to which the cached
metapage points us, we can tell whether a split has occurred since we
cached the metapage data. To do that, we use the hash_prevblkno field
in the primary bucket page, which would otherwise always be set to
InvalidBuffer.
This patch contains code so that it will continue working (although
less efficiently) with hash indexes built before this change, but
perhaps we should consider bumping the hash version and ripping out
the compatibility code. That decision can be made later, though.
Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me.
Before committing, I made a number of cosmetic changes to the last
posted version of the patch, adjusted _hash_getcachedmetap to be more
careful about order of operation, and made some necessary updates to
the pageinspect documentation and regression tests.
2017-02-07 18:24:25 +01:00
|
|
|
extern HashMetaPage _hash_getcachedmetap(Relation rel, Buffer *metabuf,
|
2019-05-22 19:04:48 +02:00
|
|
|
bool force_refresh);
|
Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and
contention when scanning hash indexes, because it's no longer
necessary to lock and pin the metapage for every scan. We do need
some way of figuring out when the cache is too stale to use any more,
so that when we lock the primary bucket page to which the cached
metapage points us, we can tell whether a split has occurred since we
cached the metapage data. To do that, we use the hash_prevblkno field
in the primary bucket page, which would otherwise always be set to
InvalidBuffer.
This patch contains code so that it will continue working (although
less efficiently) with hash indexes built before this change, but
perhaps we should consider bumping the hash version and ripping out
the compatibility code. That decision can be made later, though.
Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me.
Before committing, I made a number of cosmetic changes to the last
posted version of the patch, adjusted _hash_getcachedmetap to be more
careful about order of operation, and made some necessary updates to
the pageinspect documentation and regression tests.
2017-02-07 18:24:25 +01:00
|
|
|
extern Buffer _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey,
											  int access,
											  HashMetaPage *cachedmetap);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern void _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket,
						  uint32 flag, bool initpage);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
							  ForkNumber forkNum);
extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
										 int access, int flags,
										 BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern uint32 _hash_init(Relation rel, double num_tuples,
						 ForkNumber forkNum);
extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
								  RegProcedure procid, uint16 ffactor,
								  bool initpage);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
							   Bucket obucket, uint32 maxbucket, uint32 highmask,
							   uint32 lowmask);

/* hashsearch.c */
extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);

/* hashsort.c */
typedef struct HSpool HSpool; /* opaque struct in hashsort.c */
extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
					 Datum *values, bool *isnull);
extern void _h_indexbuild(HSpool *hspool, Relation heapRel);

/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
extern uint32 _hash_datum2hashkey(Relation rel, Datum key);
extern uint32 _hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype);
extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
								   uint32 highmask, uint32 lowmask);
extern uint32 _hash_spareindex(uint32 num_bucket);
extern uint32 _hash_get_totalbuckets(uint32 splitpoint_phase);
extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
extern uint32 _hash_get_indextuple_hashkey(IndexTuple itup);
extern bool _hash_convert_tuple(Relation index,
								Datum *user_values, bool *user_isnull,
								Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket);
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
												 uint32 lowmask, uint32 maxbucket);
extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
							  Buffer bucket_buf, BlockNumber bucket_blkno,
							  BufferAccessStrategy bstrategy,
							  uint32 maxbucket, uint32 highmask, uint32 lowmask,
							  double *tuples_removed, double *num_index_tuples,
							  bool split_cleanup,
							  IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */