2003-09-05 00:06:27 +02:00
|
|
|
$Header: /cvsroot/pgsql/src/backend/access/hash/README,v 1.3 2003/09/04 22:06:27 tgl Exp $
|
2003-09-01 22:24:49 +02:00
|
|
|
|
|
|
|
This directory contains an implementation of hash indexing for Postgres.
|
|
|
|
|
|
|
|
A hash index consists of two or more "buckets", into which tuples are
|
|
|
|
placed whenever their hash key maps to the bucket number. The
|
|
|
|
key-to-bucket-number mapping is chosen so that the index can be
|
|
|
|
incrementally expanded. When a new bucket is to be added to the index,
|
|
|
|
exactly one existing bucket will need to be "split", with some of its
|
|
|
|
tuples being transferred to the new bucket according to the updated
|
|
|
|
key-to-bucket-number mapping. This is essentially the same hash table
|
|
|
|
management technique embodied in src/backend/utils/hash/dynahash.c for
|
|
|
|
in-memory hash tables.
|
|
|
|
|
|
|
|
Each bucket in the hash index comprises one or more index pages. The
|
|
|
|
bucket's first page is permanently assigned to it when the bucket is
|
|
|
|
created. Additional pages, called "overflow pages", are added if the
|
|
|
|
bucket receives too many tuples to fit in the primary bucket page.
|
|
|
|
The pages of a bucket are chained together in a doubly-linked list
|
|
|
|
using fields in the index page special space.
|
|
|
|
|
|
|
|
There is currently no provision to shrink a hash index, other than by
|
|
|
|
rebuilding it with REINDEX. Overflow pages can be recycled for reuse
|
|
|
|
in other buckets, but we never give them back to the operating system.
|
|
|
|
There is no provision for reducing the number of buckets, either.
|
|
|
|
|
|
|
|
|
|
|
|
Page addressing
|
|
|
|
---------------
|
|
|
|
|
|
|
|
There are four kinds of pages in a hash index: the meta page (page zero),
|
|
|
|
which contains statically allocated control information; primary bucket
|
|
|
|
pages; overflow pages; and bitmap pages, which keep track of overflow
|
|
|
|
pages that have been freed and are available for re-use. For addressing
|
|
|
|
purposes, bitmap pages are regarded as a subset of the overflow pages.
|
|
|
|
|
|
|
|
Primary bucket pages and overflow pages are allocated independently (since
|
|
|
|
any given index might need more or fewer overflow pages relative to its
|
|
|
|
number of buckets). The hash code uses an interesting set of addressing
|
|
|
|
rules to support a variable number of overflow pages while not having to
|
|
|
|
move primary bucket pages around after they are created.
|
|
|
|
|
|
|
|
Primary bucket pages (henceforth just "bucket pages") are allocated in
|
|
|
|
power-of-2 groups, called "split points" in the code. Buckets 0 and 1
|
|
|
|
are created when the index is initialized. At the first split, buckets 2
|
|
|
|
and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
|
|
|
|
when bucket 8 is needed, buckets 8-15 are allocated; etc. All the bucket
|
|
|
|
pages of a power-of-2 group appear consecutively in the index. This
|
|
|
|
addressing scheme allows the physical location of a bucket page to be
|
|
|
|
computed from the bucket number relatively easily, using only a small
|
|
|
|
amount of control information. We take the log2() of the bucket number
|
|
|
|
to determine which split point S the bucket belongs to, and then simply
|
|
|
|
add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
|
|
|
|
metapage) to compute the physical address. hashm_spares[S] can be
|
|
|
|
interpreted as the total number of overflow pages that have been allocated
|
|
|
|
before the bucket pages of splitpoint S. hashm_spares[0] is always 0,
|
|
|
|
so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
|
|
|
|
block numbers 1 and 2, just after the meta page. We always have
|
|
|
|
hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
|
|
|
|
former. The difference between the two represents the number of overflow
|
|
|
|
pages appearing between the bucket page groups of splitpoints N and N+1.
|
|
|
|
|
|
|
|
When S splitpoints exist altogether, the array entries hashm_spares[0]
|
|
|
|
through hashm_spares[S] are valid; hashm_spares[S] records the current
|
|
|
|
total number of overflow pages. New overflow pages are created as needed
|
|
|
|
at the end of the index, and recorded by incrementing hashm_spares[S].
|
|
|
|
When it is time to create a new splitpoint's worth of bucket pages, we
|
|
|
|
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
|
|
|
|
stored in the hashm_ovflpoint field of the meta page). This has the
|
|
|
|
effect of reserving the correct number of bucket pages at the end of the
|
|
|
|
index, and preparing to allocate additional overflow pages after those
|
|
|
|
bucket pages. hashm_spares[] entries before S cannot change anymore,
|
|
|
|
since that would require moving already-created bucket pages.
|
|
|
|
|
|
|
|
Since overflow pages may be recycled if enough tuples are deleted from
|
|
|
|
their bucket, we need a way to keep track of currently-free overflow
|
|
|
|
pages. The state of each overflow page (0 = available, 1 = not available)
|
|
|
|
is recorded in "bitmap" pages dedicated to this purpose. The entries in
|
|
|
|
the bitmap are indexed by "bit number", a zero-based count in which every
|
|
|
|
overflow page has a unique entry. We can convert between an overflow
|
|
|
|
page's physical block number and its bit number using the information in
|
|
|
|
hashm_spares[] (see hashovfl.c for details). The bit number sequence
|
|
|
|
includes the bitmap pages, which is the reason for saying that bitmap
|
|
|
|
pages are a subset of the overflow pages. It turns out in fact that each
|
|
|
|
bitmap page's first bit represents itself --- this is not an essential
|
|
|
|
property, but falls out of the fact that we only allocate another bitmap
|
|
|
|
page when we really need one. Bit number zero always corresponds to block
|
|
|
|
number 3, which is the first bitmap page and is allocated during index
|
|
|
|
creation.
|
|
|
|
|
|
|
|
|
|
|
|
Lock definitions
|
|
|
|
----------------
|
|
|
|
|
|
|
|
We use both lmgr locks ("heavyweight" locks) and buffer context locks
|
|
|
|
(LWLocks) to control access to a hash index. lmgr locks are needed for
|
|
|
|
long-term locking since there is a (small) risk of deadlock, which we must
|
|
|
|
be able to detect. Buffer context locks are used for short-term access
|
|
|
|
control to individual pages of the index.
|
|
|
|
|
|
|
|
We define the following lmgr locks for a hash index:
|
|
|
|
|
|
|
|
LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
|
|
|
|
mapping. A process attempting to enlarge the hash table by splitting a
|
|
|
|
bucket must exclusive-lock this lock before modifying the metapage data
|
|
|
|
representing the mapping. Processes intending to access a particular
|
|
|
|
bucket must share-lock this lock until they have acquired lock on the
|
|
|
|
correct target bucket.
|
|
|
|
|
|
|
|
LockPage(rel, page), where page is the page number of a hash bucket page,
|
|
|
|
represents the right to split or compact an individual bucket. A process
|
|
|
|
splitting a bucket must exclusive-lock both old and new halves of the
|
|
|
|
bucket until it is done. A process doing VACUUM must exclusive-lock the
|
|
|
|
bucket it is currently purging tuples from. Processes doing scans or
|
|
|
|
insertions must share-lock the bucket they are scanning or inserting into.
|
|
|
|
(It is okay to allow concurrent scans and insertions.)
|
|
|
|
|
|
|
|
The lmgr lock IDs corresponding to overflow pages are currently unused.
|
|
|
|
These are available for possible future refinements.
|
|
|
|
|
|
|
|
Note that these lock definitions are conceptually distinct from any sort
|
|
|
|
of lock on the pages whose numbers they share. A process must also obtain
|
|
|
|
read or write buffer lock on the metapage or bucket page before accessing
|
|
|
|
said page.
|
|
|
|
|
|
|
|
Processes performing hash index scans must hold share lock on the bucket
|
|
|
|
they are scanning throughout the scan. This seems to be essential, since
|
|
|
|
there is no reasonable way for a scan to cope with its bucket being split
|
|
|
|
underneath it. This creates a possibility of deadlock external to the
|
|
|
|
hash index code, since a process holding one of these locks could block
|
|
|
|
waiting for an unrelated lock held by another process. If that process
|
|
|
|
then does something that requires exclusive lock on the bucket, we have
|
|
|
|
deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
|
|
|
|
can be detected and recovered from. This also forces the page-zero lock
|
|
|
|
to be an lmgr lock, because as we'll see below it is held while attempting
|
|
|
|
to acquire a bucket lock, and so it could also participate in a deadlock.
|
|
|
|
|
|
|
|
Processes must obtain read (share) buffer context lock on any hash index
|
|
|
|
page while reading it, and write (exclusive) lock while modifying it.
|
|
|
|
To prevent deadlock we enforce these coding rules: no buffer lock may be
|
|
|
|
held long term (across index AM calls), nor may any buffer lock be held
|
|
|
|
while waiting for an lmgr lock, nor may more than one buffer lock
|
|
|
|
be held at a time by any one process. (The third restriction is probably
|
|
|
|
stronger than necessary, but it makes the proof of no deadlock obvious.)
|
|
|
|
|
|
|
|
|
|
|
|
Pseudocode algorithms
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
The operations we need to support are: readers scanning the index for
|
|
|
|
entries of a particular hash code (which by definition are all in the same
|
|
|
|
bucket); insertion of a new tuple into the correct bucket; enlarging the
|
|
|
|
hash table by splitting an existing bucket; and garbage collection
|
|
|
|
(deletion of dead tuples and compaction of buckets). Bucket splitting is
|
|
|
|
done at conclusion of any insertion that leaves the hash table more full
|
|
|
|
than the target load factor, but it is convenient to consider it as an
|
|
|
|
independent operation. Note that we do not have a bucket-merge operation
|
|
|
|
--- the number of buckets never shrinks. Insertion, splitting, and
|
|
|
|
garbage collection may all need access to freelist management, which keeps
|
|
|
|
track of available overflow pages.
|
|
|
|
|
|
|
|
The reader algorithm is:
|
|
|
|
|
|
|
|
share-lock page 0 (to prevent active split)
|
|
|
|
read/sharelock meta page
|
|
|
|
compute bucket number for target hash key
|
|
|
|
release meta page
|
|
|
|
share-lock bucket page (to prevent split/compact of this bucket)
|
|
|
|
release page 0 share-lock
|
|
|
|
-- then, per read request:
|
|
|
|
read/sharelock current page of bucket
|
|
|
|
step to next page if necessary (no chaining of locks)
|
|
|
|
get tuple
|
|
|
|
release current page
|
|
|
|
-- at scan shutdown:
|
|
|
|
release bucket share-lock
|
|
|
|
|
|
|
|
By holding the page-zero lock until lock on the target bucket is obtained,
|
|
|
|
the reader ensures that the target bucket calculation is valid (otherwise
|
|
|
|
the bucket might be split before the reader arrives at it, and the target
|
|
|
|
entries might go into the new bucket). Holding the bucket sharelock for
|
|
|
|
the remainder of the scan prevents the reader's current-tuple pointer from
|
|
|
|
being invalidated by other processes. Notice though that the reader need
|
|
|
|
not prevent other buckets from being split or compacted.
|
|
|
|
|
|
|
|
The insertion algorithm is rather similar:
|
|
|
|
|
|
|
|
share-lock page 0 (to prevent active split)
|
|
|
|
read/sharelock meta page
|
|
|
|
compute bucket number for target hash key
|
|
|
|
release meta page
|
2003-09-02 05:29:01 +02:00
|
|
|
share-lock bucket page (to prevent split/compact of this bucket)
|
2003-09-01 22:24:49 +02:00
|
|
|
release page 0 share-lock
|
|
|
|
-- (so far same as reader)
|
|
|
|
read/exclusive-lock current page of bucket
|
|
|
|
if full, release, read/exclusive-lock next page; repeat as needed
|
|
|
|
>> see below if no space in any page of bucket
|
|
|
|
insert tuple
|
|
|
|
write/release current page
|
|
|
|
release bucket share-lock
|
|
|
|
read/exclusive-lock meta page
|
|
|
|
increment tuple count, decide if split needed
|
|
|
|
write/release meta page
|
|
|
|
done if no split needed, else enter Split algorithm below
|
|
|
|
|
|
|
|
It is okay for an insertion to take place in a bucket that is being
|
|
|
|
actively scanned, because it does not change the position of any existing
|
|
|
|
item in the bucket, so scan states are not invalidated. We only need the
|
|
|
|
short-term buffer locks to ensure that readers do not see a
|
|
|
|
partially-updated page.
|
|
|
|
|
|
|
|
It is clearly impossible for readers and inserters to deadlock, and in
|
|
|
|
fact this algorithm allows them a very high degree of concurrency.
|
|
|
|
(The exclusive metapage lock taken to update the tuple count is stronger
|
|
|
|
than necessary, since readers do not care about the tuple count, but the
|
|
|
|
lock is held for such a short time that this is probably not an issue.)
|
|
|
|
|
|
|
|
When an inserter cannot find space in any existing page of a bucket, it
|
|
|
|
must obtain an overflow page and add that page to the bucket's chain.
|
|
|
|
Details of that part of the algorithm appear later.
|
|
|
|
|
|
|
|
The page split algorithm is entered whenever an inserter observes that the
|
|
|
|
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
|
|
|
|
The algorithm attempts, but does not necessarily succeed, to split one
|
|
|
|
existing bucket in two, thereby lowering the fill ratio:
|
|
|
|
|
|
|
|
exclusive-lock page 0 (assert the right to begin a split)
|
|
|
|
read/exclusive-lock meta page
|
|
|
|
check split still needed
|
|
|
|
if split not needed anymore, drop locks and exit
|
|
|
|
decide which bucket to split
|
|
|
|
Attempt to X-lock old bucket number (definitely could fail)
|
2003-09-05 00:06:27 +02:00
|
|
|
Attempt to X-lock new bucket number (shouldn't fail, but...)
|
2003-09-01 22:24:49 +02:00
|
|
|
if above fail, drop locks and exit
|
|
|
|
update meta page to reflect new number of buckets
|
|
|
|
write/release meta page
|
|
|
|
release X-lock on page 0
|
|
|
|
-- now, accesses to all other buckets can proceed.
|
|
|
|
Perform actual split of bucket, moving tuples as needed
|
|
|
|
>> see below about acquiring needed extra space
|
|
|
|
Release X-locks of old and new buckets
|
|
|
|
|
|
|
|
Note the page zero and metapage locks are not held while the actual tuple
|
|
|
|
rearrangement is performed, so accesses to other buckets can proceed in
|
|
|
|
parallel; in fact, it's possible for multiple bucket splits to proceed
|
|
|
|
in parallel.
|
|
|
|
|
|
|
|
Split's attempt to X-lock the old bucket number could fail if another
|
|
|
|
process holds S-lock on it. We do not want to wait if that happens, first
|
|
|
|
because we don't want to wait while holding the metapage exclusive-lock,
|
|
|
|
and second because it could very easily result in deadlock. (The other
|
|
|
|
process might be out of the hash AM altogether, and could do something
|
|
|
|
that blocks on another lock this process holds; so even if the hash
|
|
|
|
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
|
|
|
|
So, this is a conditional LockAcquire operation, and if it fails we just
|
|
|
|
abandon the attempt to split. This is all right since the index is
|
|
|
|
overfull but perfectly functional. Every subsequent inserter will try to
|
|
|
|
split, and eventually one will succeed. If multiple inserters failed to
|
|
|
|
split, the index might still be overfull, but eventually, the index will
|
|
|
|
not be overfull and split attempts will stop. (We could make a successful
|
|
|
|
splitter loop to see if the index is still overfull, but it seems better to
|
|
|
|
distribute the split overhead across successive insertions.)
|
|
|
|
|
|
|
|
A problem is that if a split fails partway through (eg due to insufficient
|
|
|
|
disk space) the index is left corrupt. The probability of that could be
|
|
|
|
made quite low if we grab a free page or two before we update the meta
|
|
|
|
page, but the only real solution is to treat a split as a WAL-loggable,
|
|
|
|
must-complete action. I'm not planning to teach hash about WAL in this
|
|
|
|
go-round.
|
|
|
|
|
|
|
|
The fourth operation is garbage collection (bulk deletion):
|
|
|
|
|
|
|
|
next bucket := 0
|
|
|
|
read/sharelock meta page
|
|
|
|
fetch current max bucket number
|
|
|
|
release meta page
|
|
|
|
while next bucket <= max bucket do
|
|
|
|
Acquire X lock on target bucket
|
|
|
|
Scan and remove tuples, compact free space as needed
|
|
|
|
Release X lock
|
|
|
|
next bucket ++
|
|
|
|
end loop
|
|
|
|
exclusive-lock meta page
|
|
|
|
check if number of buckets changed
|
|
|
|
if so, release lock and return to for-each-bucket loop
|
|
|
|
else update metapage tuple count
|
|
|
|
write/release meta page
|
|
|
|
|
|
|
|
Note that this is designed to allow concurrent splits. If a split occurs,
|
|
|
|
tuples relocated into the new bucket will be visited twice by the scan,
|
|
|
|
but that does no harm. (We must however be careful about the statistics
|
|
|
|
reported by the VACUUM operation. What we can do is count the number of
|
|
|
|
tuples scanned, and believe this in preference to the stored tuple count
|
|
|
|
if the stored tuple count and number of buckets did *not* change at any
|
|
|
|
time during the scan. This provides a way of correcting the stored tuple
|
|
|
|
count if it gets out of sync for some reason. But if a split or insertion
|
|
|
|
does occur concurrently, the scan count is untrustworthy; instead,
|
|
|
|
subtract the number of tuples deleted from the stored tuple count and
|
|
|
|
use that.)
|
|
|
|
|
|
|
|
The exclusive lock request could deadlock in some strange scenarios, but
|
|
|
|
we can just error out without any great harm being done.
|
|
|
|
|
|
|
|
|
|
|
|
Free space management
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
Free space management consists of two sub-algorithms, one for reserving
|
|
|
|
an overflow page to add to a bucket chain, and one for returning an empty
|
|
|
|
overflow page to the free pool.
|
|
|
|
|
|
|
|
Obtaining an overflow page:
|
|
|
|
|
|
|
|
read/exclusive-lock meta page
|
|
|
|
determine next bitmap page number; if none, exit loop
|
|
|
|
release meta page lock
|
|
|
|
read/exclusive-lock bitmap page
|
|
|
|
search for a free page (zero bit in bitmap)
|
|
|
|
if found:
|
|
|
|
set bit in bitmap
|
|
|
|
write/release bitmap page
|
|
|
|
read/exclusive-lock meta page
|
|
|
|
if first-free-bit value did not change,
|
|
|
|
update it and write meta page
|
|
|
|
release meta page
|
|
|
|
return page number
|
|
|
|
else (not found):
|
|
|
|
release bitmap page
|
|
|
|
loop back to try next bitmap page, if any
|
|
|
|
-- here when we have checked all bitmap pages; we hold meta excl. lock
|
2003-09-02 05:29:01 +02:00
|
|
|
extend index to add another overflow page; update meta information
|
2003-09-01 22:24:49 +02:00
|
|
|
write/release meta page
|
|
|
|
return page number
|
|
|
|
|
|
|
|
It is slightly annoying to release and reacquire the metapage lock
|
|
|
|
multiple times, but it seems best to do it that way to minimize loss of
|
|
|
|
concurrency against processes just entering the index. We don't want
|
|
|
|
to hold the metapage exclusive lock while reading in a bitmap page.
|
|
|
|
(We can at least avoid repeated buffer pin/unpin here.)
|
|
|
|
|
2003-09-02 05:29:01 +02:00
|
|
|
The normal path for extending the index does not require doing I/O while
|
|
|
|
holding the metapage lock. We do have to do I/O when the extension
|
|
|
|
requires adding a new bitmap page as well as the required overflow page
|
|
|
|
... but that is an infrequent case, so the loss of concurrency seems
|
|
|
|
acceptable.
|
|
|
|
|
2003-09-01 22:24:49 +02:00
|
|
|
The portion of tuple insertion that calls the above subroutine looks
|
|
|
|
like this:
|
|
|
|
|
|
|
|
-- having determined that no space is free in the target bucket:
|
|
|
|
remember last page of bucket, drop write lock on it
|
|
|
|
call free-page-acquire routine
|
|
|
|
re-write-lock last page of bucket
|
|
|
|
if it is not last anymore, step to the last page
|
|
|
|
update (former) last page to point to new page
|
|
|
|
write-lock and initialize new page, with back link to former last page
|
|
|
|
write and release former last page
|
|
|
|
insert tuple into new page
|
|
|
|
-- etc.
|
|
|
|
|
|
|
|
Notice this handles the case where two concurrent inserters try to extend
|
|
|
|
the same bucket. They will end up with a valid, though perhaps
|
|
|
|
space-inefficient, configuration: two overflow pages will be added to the
|
|
|
|
bucket, each containing one tuple.
|
|
|
|
|
|
|
|
The last part of this violates the rule about holding write lock on two
|
|
|
|
pages concurrently, but it should be okay to write-lock the previously
|
|
|
|
free page; there can be no other process holding lock on it.
|
|
|
|
|
|
|
|
Bucket splitting uses a similar algorithm if it has to extend the new
|
|
|
|
bucket, but it need not worry about concurrent extension since it has
|
|
|
|
exclusive lock on the new bucket.
|
|
|
|
|
|
|
|
Freeing an overflow page is done by garbage collection and by bucket
|
|
|
|
splitting (the old bucket may contain no-longer-needed overflow pages).
|
|
|
|
In both cases, the process holds exclusive lock on the containing bucket,
|
|
|
|
so need not worry about other accessors of pages in the bucket. The
|
|
|
|
algorithm is:
|
|
|
|
|
|
|
|
delink overflow page from bucket chain
|
|
|
|
(this requires read/update/write/release of fore and aft siblings)
|
|
|
|
read/share-lock meta page
|
|
|
|
determine which bitmap page contains the free space bit for page
|
|
|
|
release meta page
|
|
|
|
read/exclusive-lock bitmap page
|
|
|
|
update bitmap bit
|
|
|
|
write/release bitmap page
|
|
|
|
if page number is less than what we saw as first-free-bit in meta:
|
|
|
|
read/exclusive-lock meta page
|
|
|
|
if page number is still less than first-free-bit,
|
|
|
|
update first-free-bit field and write meta page
|
|
|
|
release meta page
|
|
|
|
|
|
|
|
We have to do it this way because we must clear the bitmap bit before
|
2003-09-02 05:29:01 +02:00
|
|
|
changing the first-free-bit field (hashm_firstfree). It is possible that
|
|
|
|
we set first-free-bit too small (because someone has already reused the
|
|
|
|
page we just freed), but that is okay; the only cost is the next overflow
|
|
|
|
page acquirer will scan more bitmap bits than he needs to. What must be
|
|
|
|
avoided is having first-free-bit greater than the actual first free bit,
|
|
|
|
because then that free page would never be found by searchers.
|
2003-09-01 22:24:49 +02:00
|
|
|
|
|
|
|
All the freespace operations should be called while holding no buffer
|
|
|
|
locks. Since they need no lmgr locks, deadlock is not possible.
|
|
|
|
|
|
|
|
|
|
|
|
Other notes
|
|
|
|
-----------
|
|
|
|
|
|
|
|
All the shenanigans with locking prevent a split occurring while *another*
|
|
|
|
process is stopped in a given bucket. They do not ensure that one of
|
|
|
|
our *own* backend's scans is not stopped in the bucket, because lmgr
|
|
|
|
doesn't consider a process's own locks to conflict. So the Split
|
|
|
|
algorithm must check for that case separately before deciding it can go
|
|
|
|
ahead with the split. VACUUM does not have this problem since nothing
|
|
|
|
else can be happening within the vacuuming backend.
|
|
|
|
|
|
|
|
Should we instead try to fix the state of any conflicting local scan?
|
|
|
|
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
|
|
|
|
of other messiness. For now, just punt and don't split.
|