src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991. (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number. The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded. When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping. This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.
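
The usual linear-hashing mapping that supports this kind of incremental
expansion looks roughly like the following standalone sketch (not a copy of
the PostgreSQL source; maxbucket, highmask and lowmask correspond
conceptually to fields kept in the metapage):

    #include <stdint.h>

    /*
     * Sketch of an incrementally expandable key-to-bucket mapping.  highmask
     * and lowmask are the masks for the next-larger and the previous
     * power-of-2 table sizes; maxbucket is the highest bucket number that
     * currently exists.
     */
    static uint32_t
    hashkey_to_bucket(uint32_t hashvalue, uint32_t maxbucket,
                      uint32_t highmask, uint32_t lowmask)
    {
        uint32_t bucket = hashvalue & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;      /* that bucket doesn't exist yet */

        return bucket;
    }

Splitting then amounts to advancing maxbucket (and, when a power of 2 is
crossed, the masks), so that exactly the tuples whose hash values now map to
the new bucket need to move out of the bucket being split.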

Each bucket in the hash index comprises one or more index pages. The
bucket's first page is permanently assigned to it when the bucket is
created. Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX. Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item. This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page. Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.
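
For illustration, the within-page lookup that this ordering enables is just a
lower-bound binary search over the page's hash codes; a minimal sketch (the
array-of-hash-codes view is a simplification of the real page format):

    #include <stdint.h>

    /*
     * Return the offset of the first entry whose hash code is >= target,
     * given nentries entries sorted by hash code.  Any entries matching
     * target start at the returned offset.
     */
    static int
    first_offset_for_hash(const uint32_t *hashcodes, int nentries,
                          uint32_t target)
    {
        int     lo = 0;
        int     hi = nentries;          /* search the half-open range [lo, hi) */

        while (lo < hi)
        {
            int     mid = lo + (hi - lo) / 2;

            if (hashcodes[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }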


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use. For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets). The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code. That means at every new
splitpoint we double the existing number of buckets. Allocating huge chunks of
bucket pages all at once wouldn't be optimal, since it would take ages to
consume them. To avoid this exponential growth of index size, we break the
allocation of buckets at a splitpoint into 4 equal phases. If (2 ^ x) buckets
in total are to be allocated at a splitpoint (from now on called a splitpoint
group), then we allocate one quarter (2 ^ (x - 2)) of them in each phase of
the splitpoint group. The next quarter is allocated only once the buckets of
the previous phase have been consumed. For the initial splitpoint groups < 10
we allocate all of their buckets in a single phase, since the number of
buckets allocated in those groups is small. For groups >= 10 the allocation is
distributed among four equal phases. At group 10 we allocate (2 ^ 9) buckets
in 4 different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where the numbers in curly
braces indicate the number of buckets allocated within each phase of
splitpoint group 10. For splitpoint groups 11 and 12 the allocation phases are
{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively. So
at each splitpoint group we double the total number of buckets relative to the
previous group, but incrementally, one phase at a time. The bucket pages
allocated within one phase of a splitpoint group appear consecutively in the
index. This addressing scheme allows the physical location of a bucket page to
be computed from the bucket number relatively easily, using only a small
amount of control information. In the function _hash_spareindex, for a given
bucket number we first compute the splitpoint group it belongs to and then the
phase within that group. Adding them gives the global splitpoint phase number
S to which the bucket belongs; we then simply add "hashm_spares[S] + 1" (where
hashm_spares[] is an array stored in the metapage) to the given bucket number
to compute its physical address. hashm_spares[S] can be interpreted as the
total number of overflow pages that have been allocated before the bucket
pages of splitpoint phase S. hashm_spares[0] is always 0, so that buckets 0
and 1 always appear at block numbers 1 and 2, just after the meta page. We
always have hashm_spares[N] <= hashm_spares[N+1], since the latter count
includes the former. The difference between the two represents the number of
overflow pages appearing between the bucket page groups of splitpoint phases N
and N+1. (Note: the above describes what happens when filling an initially
minimally sized hash index. In practice, we try to estimate the required index
size and allocate a suitable number of splitpoint phases immediately, to avoid
expensive re-splitting during the initial index build.)
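
The addressing arithmetic described above can be summarized in a standalone
sketch like the one below. It follows the description in this README rather
than the actual code (_hash_spareindex and the BUCKET_TO_BLKNO macro handle
the details somewhat differently), and the constant and helper names are
invented for the example:

    #include <stdint.h>

    #define SINGLE_PHASE_GROUPS 10  /* splitpoint groups below this are one phase */
    #define PHASES_PER_GROUP    4

    /* splitpoint group of a bucket: group 0 holds bucket 0, and group g >= 1
     * holds buckets 2^(g-1) .. 2^g - 1 (sketch assumes bucket < 2^31) */
    static uint32_t
    splitpoint_group(uint32_t bucket)
    {
        uint32_t    g = 0;

        while (((uint32_t) 1 << g) < bucket + 1)
            g++;
        return g;
    }

    /* global splitpoint phase number S of a bucket */
    static uint32_t
    splitpoint_phase(uint32_t bucket)
    {
        uint32_t    g = splitpoint_group(bucket);
        uint32_t    offset_in_group;
        uint32_t    phase_in_group;

        if (g < SINGLE_PHASE_GROUPS)
            return g;

        /* group g holds 2^(g-1) buckets, split into 4 phases of 2^(g-3) each */
        offset_in_group = bucket - ((uint32_t) 1 << (g - 1));
        phase_in_group = offset_in_group >> (g - 3);        /* 0 .. 3 */

        return SINGLE_PHASE_GROUPS +
               PHASES_PER_GROUP * (g - SINGLE_PHASE_GROUPS) +
               phase_in_group;
    }

    /* physical block of a bucket's primary page, per the formula above:
     * bucket number + overflow pages allocated before its phase + 1 for the
     * metapage; spares[] stands in for hashm_spares[] */
    static uint32_t
    bucket_to_blkno(const uint32_t *spares, uint32_t bucket)
    {
        return bucket + spares[splitpoint_phase(bucket)] + 1;
    }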

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages. New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint phase's worth of bucket pages, we
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page). This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages. hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.
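
Schematically, and using the metapage field names above, the two updates are
just the following (the real code paths of course also handle locking and
WAL):

    /* allocate one more overflow page at the current end of the index */
    hashm_spares[S]++;                      /* S == hashm_ovflpoint */

    /* start a new splitpoint phase: reserve its bucket pages at the end */
    hashm_spares[S + 1] = hashm_spares[S];  /* no overflow pages beyond them yet */
    hashm_ovflpoint = ++S;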

The last page nominally used by the index is always determinable from
hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage. Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
physically zero the last such page to force the EOF up, and the first such
page will be used immediately, but the intervening pages are not written
until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages. The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose. The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry. We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details). The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages. It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one. Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.
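
The bit-number bookkeeping is simple div/mod arithmetic; here is a sketch
under the assumption that each bitmap page usefully holds BITS_PER_MAP bits
(the real constant is derived from the page size, and the word layout below
is illustrative only):

    #include <stdbool.h>
    #include <stdint.h>

    #define BITS_PER_MAP    (4096 * 8)  /* assumed usable bits per bitmap page */

    /* which bitmap page (0-based ordinal) holds a given bit, and where on it */
    static uint32_t bit_to_map(uint32_t bitno)     { return bitno / BITS_PER_MAP; }
    static uint32_t bit_within_map(uint32_t bitno) { return bitno % BITS_PER_MAP; }

    /* test and set a bit within one bitmap page's array of 32-bit words;
     * a set bit means the overflow page is in use, a clear bit means free */
    static bool
    ovflpage_in_use(const uint32_t *mapwords, uint32_t mapbit)
    {
        return (mapwords[mapbit / 32] >> (mapbit % 32)) & 1;
    }

    static void
    mark_ovflpage_in_use(uint32_t *mapwords, uint32_t mapbit)
    {
        mapwords[mapbit / 32] |= (uint32_t) 1 << (mapbit % 32);
    }

Converting a bit number to the overflow page's physical block number
additionally consults hashm_spares[], since overflow pages are interleaved
with the bucket page groups as described above.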


Lock Definitions
----------------

Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer. For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket. Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning. Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages. VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples. It can also
remove tuples copied to a new bucket by any previous split operation, because
the cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress. After
cleaning each page individually, it attempts to take a cleanup lock on the
primary bucket page in order to "squeeze" the bucket down to the minimum
possible number of pages.

To avoid deadlocks, we must be consistent about the lock order in which we
lock the buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first. The metapage is
only ever locked after all bucket locks have been taken.
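
For instance, any operation that must hold content locks on two buckets at
once acquires them in ascending bucket-number order, schematically
(lock_bucket() is a placeholder, not a real function):

    /* sketch only: lock two buckets without risking deadlock */
    if (bucket_a < bucket_b)
    {
        lock_bucket(bucket_a);
        lock_bucket(bucket_b);
    }
    else
    {
        lock_bucket(bucket_b);
        lock_bucket(bucket_a);
    }
    /* ... do the work; the metapage, if needed, is locked only after this */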


Metapage Caching
----------------

Both scanning the index and inserting tuples require locating the bucket
where a given tuple ought to be located. To do this, we need the bucket
count, highmask, and lowmask from the metapage; however, it's undesirable
for performance reasons to have to lock and pin the metapage for
every such operation. Instead, we retain a cached copy of the metapage
in each backend's relcache entry. This will produce the correct
bucket mapping as long as the target bucket hasn't been split since the
last cache refresh.

To guard against the possibility that such a split has occurred, the
primary page of each bucket chain stores the number of buckets that
existed as of the time the bucket was last split, or if never split as
of the time it was created, in the space normally used for the
previous block number (that is, hasho_prevblkno). This doesn't cost
anything because the primary bucket page is always the first page in
the chain, and the previous block number is therefore always, in
reality, InvalidBlockNumber.

After computing the ostensibly-correct bucket number based on our cached
copy of the metapage, we lock the corresponding primary bucket page and
check whether the bucket count stored in hasho_prevblkno is greater than
the number of buckets stored in our cached copy of the metapage. If
so, the bucket has certainly been split, because the count must originally
have been less than the number of buckets that existed at that time and
can't have increased except due to a split. If not, the bucket can't have
been split, because a split would have created a new bucket with a higher
bucket number than any we'd seen previously. In the latter case, we've
locked the correct bucket and can proceed; in the former case, we must
release the lock on this bucket, lock the metapage, update our cache,
unlock the metapage, and retry.
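
Put together, the lookup with a cached metapage behaves like the loop below.
This is a schematic sketch: the struct, the helper functions, and even the
amount of state cached are placeholders for the real relcache and
buffer-manager machinery, and hashkey_to_bucket() is the mapping sketched
earlier.

    /* cached copy of the metapage fields needed for bucket mapping */
    struct metacache
    {
        uint32_t    nbuckets;       /* number of buckets as of the last refresh */
        uint32_t    maxbucket;
        uint32_t    highmask;
        uint32_t    lowmask;
    };

    static BucketPage *
    lock_correct_bucket(struct metacache *cache, uint32_t hashvalue)
    {
        for (;;)
        {
            uint32_t    bucket = hashkey_to_bucket(hashvalue, cache->maxbucket,
                                                   cache->highmask,
                                                   cache->lowmask);
            BucketPage *page = lock_primary_bucket_page(bucket);   /* placeholder */

            /*
             * A primary bucket page's hasho_prevblkno holds the bucket count
             * as of the bucket's creation or last split.
             */
            if (bucket_count_stored_on(page) <= cache->nbuckets)
                return page;        /* cache fresh enough: this is the right bucket */

            unlock_and_unpin(page);         /* stale cache: the bucket has split */
            refresh_metapage_cache(cache);  /* lock metapage, copy fields, unlock */
        }
    }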

Needing to retry occasionally might seem expensive, but the number of times
any given bucket can be split is limited to a few dozen no matter how
many times the hash index is accessed, because the total number of
buckets is limited to less than 2^32. On the other hand, the number of
times we access a bucket is unbounded and will be several orders of
magnitude larger even in unsympathetic cases.

(The metapage cache is new in v10. Older hash indexes had the primary
bucket page's hasho_prevblkno initialized to InvalidBlockNumber.)

Pseudocode Algorithms
---------------------

Various flags that are used in hash index operations are described below:

The bucket-being-split and bucket-being-populated flags indicate that a split
operation is in progress for a bucket. During a split operation, the
bucket-being-split flag is set on the old bucket and the
bucket-being-populated flag is set on the new bucket. These flags are cleared
once the split operation is finished.

The split-cleanup flag indicates that a bucket which has been recently split
still contains tuples that were also copied to the new bucket; it essentially
marks the split as incomplete. Once we're certain that no scans which
started before the new bucket was fully populated are still in progress, we
can remove the copies from the old bucket and clear the flag. We insist that
this flag must be clear before splitting a bucket; thus, a bucket can't be
split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that the tuple was moved from the
old to the new bucket. Concurrent scans will skip such tuples until the split
operation is finished. Once a tuple is marked as moved-by-split, it will
remain so forever, but that does no harm. We intentionally do not clear the
flag, since clearing it would generate additional, unnecessary I/O.

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets). Bucket splitting is
done at the conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation. Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks. Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    lock the primary bucket page of the target bucket
    if the target bucket is still being populated by a split:
        release the buffer content lock on current bucket page
        pin and acquire the buffer content lock on old bucket in shared mode
        release the buffer content lock on old bucket, but not pin
        retake the buffer content lock on new bucket
        arrange to scan the old bucket normally and the new bucket for
         tuples which are not moved-by-split
    -- then, per read request:
    reacquire content lock on current page
    step to next page if necessary (no chaining of content locks, but keep
     the pin on the primary bucket throughout the scan)
    save all the matching tuples from current index page into an items array
    release pin and content lock (but if it is primary bucket page retain
     its pin till the end of the scan)
    get tuple from an item array
    -- at scan shutdown:
    release all pins still held

Holding the buffer pin on the primary bucket page for the whole scan prevents
the reader's current-tuple pointer from being invalidated by splits or
compactions. (Of course, other buckets can still be split or compacted.)

To minimize lock/unlock traffic, a hash index scan always searches the entire
hash page to identify all the matching items at once, copying their heap tuple
IDs into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index, thereby allowing concurrent
insertions to happen on the same index page without any requirement of
re-finding the current scan position for the reader. We do continue to hold a
pin on the bucket page, to protect against concurrent deletions and bucket
splits.
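
Schematically, the per-page step looks like the sketch below (the page-access
helpers and types are placeholders; the only point is that the matches are
copied out while the content lock is held and consumed after it is dropped):

    /* sketch: copy every match on one index page into backend-local memory;
     * the caller holds the buffer content lock across this call and drops it
     * (keeping only the bucket pin) before handing items[] to the executor */
    static int
    collect_matches(IndexPage *page, uint32_t hashcode,
                    HeapTupleId *items, int max_items)
    {
        int     n = 0;
        int     off = first_offset_for_hash_on_page(page, hashcode); /* placeholder */

        while (off < entry_count(page) &&
               entry_hashcode(page, off) == hashcode &&
               n < max_items)
            items[n++] = entry_heap_tid(page, off++);

        return n;
    }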

To allow for scans during a bucket split, if, at the start of the scan, the
bucket is marked as bucket-being-populated, the scan visits all the tuples in
that bucket except for those that are marked as moved-by-split. Once it
finishes scanning all the tuples in the current bucket, it scans the old
bucket from which this bucket was formed by the split.

The insertion algorithm is rather similar:

    lock the primary bucket page of the target bucket
    -- (so far same as reader, except for acquisition of buffer content lock in
       exclusive mode on primary bucket page)
    if the bucket-being-split flag is set for a bucket and pin count on it is
     one, then finish the split
        release the buffer content lock on current bucket
        get the "new" bucket which was being populated by the split
        scan the new bucket and form the hash table of TIDs
        conditionally get the cleanup lock on old and new buckets
        if we get the lock on both the buckets
            finish the split using algorithm mentioned below for split
        release the pin on old bucket and restart the insert from beginning.
    if current page is full, first check if this page contains any dead tuples.
     if yes, remove dead tuples from the current page and again check for the
     availability of the space. If enough space found, insert the tuple else
     release lock but not pin, read/exclusive-lock
     next page; repeat as needed
    >> see below if no space in any page of bucket
    take buffer content lock in exclusive mode on metapage
    insert tuple at appropriate place in page
    mark current page dirty
    increment tuple count, decide if split needed
    mark meta page dirty
    write WAL for insertion of tuple
    release the buffer content lock on metapage
    release buffer content lock on current page
    if current page is not a bucket page, release the pin on bucket page
    if split is needed, enter Split algorithm below
    release the pin on metapage

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place. It is okay for an insertion to take place in a
bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take the locks in the order suggested in
Lock Definitions above. This allows a very high degree of
concurrency. (The exclusive metapage lock taken to update the tuple count
is stronger than necessary, since readers do not care about the tuple count,
but the lock is held for such a short time that this is probably not an
issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    try to take a cleanup lock on that bucket; if fail, give up
    if that bucket is still being split or has split-cleanup work:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
    mark the old and new buckets indicating split is in progress
    mark both old and new buckets as dirty
    write WAL for allocation of new page for split
    copy the tuples that belong to the new bucket from the old bucket, marking
     them as moved-by-split
    write WAL record for moving tuples to new page once the new page is full
     or all the pages of old bucket are finished
    release lock but not pin for primary bucket page of old bucket,
     read/shared-lock next page; repeat as needed
    clear the bucket-being-split and bucket-being-populated flags
    mark the old bucket indicating split-cleanup
    write WAL for changing the flags on both old and new buckets

The split operation's attempt to acquire cleanup-lock on the old bucket
could fail if another process holds any lock or pin on it. We do not want to
wait if that happens, because we don't want to wait while holding the metapage
exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
it fails we just abandon the attempt to split. This is all right since the
index is overfull but perfectly functional. Every subsequent inserter will
try to split, and eventually one will succeed. If multiple inserters failed
to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)

If a split fails partway through (e.g. due to insufficient disk space or an
interrupt), the index will not be corrupted. Instead, we'll retry the split
every time a tuple is inserted into the old bucket prior to inserting the new
tuple; eventually, we should succeed. The fact that a split is left
unfinished doesn't prevent subsequent buckets from being split, but we won't
try to split the bucket again until the prior split is finished. In other
words, a bucket can be in the middle of being split for some time, but it
can't be in the middle of two splits at the same time.

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        acquire cleanup lock on primary bucket page
        loop:
            scan and remove tuples
            mark the target page dirty
            write WAL for deleting tuples from target page
            if this is the last bucket page, break out of loop
            pin and x-lock next page
            release prior lock and pin (except keep pin on primary bucket page)
        if the page we have locked is not the primary bucket page:
            release lock and take exclusive lock on primary bucket page
        if there are no other pins on the primary bucket page:
            squeeze the bucket to remove free space
        release the pin on primary bucket page
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and write WAL for update of metapage
    release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans. If a split
occurs, tuples relocated into the new bucket will be visited twice by the
scan, but that does no harm. See also "Interlocking Between Scans and
VACUUM", below.

We must be careful about the statistics reported by the VACUUM operation.
What we can do is count the number of tuples scanned, and believe this in
preference to the stored tuple count if the stored tuple count and number of
buckets did *not* change at any time during the scan. This provides a way of
correcting the stored tuple count if it gets out of sync for some reason. But
if a split or insertion does occur concurrently, the scan count is
untrustworthy; instead, subtract the number of tuples deleted from the stored
tuple count and use that.
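
In other words, the reporting rule amounts to (variable names invented for
the sketch):

    /* sketch of how VACUUM settles on the tuple count it reports and stores */
    if (!buckets_changed_during_scan && !tuple_count_changed_during_scan)
        new_stored_count = tuples_seen_by_this_scan;    /* trust the fresh count */
    else
        new_stored_count = stored_tuple_count - tuples_removed_by_this_scan;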

Interlocking Between Scans and VACUUM
-------------------------------------

Since we release the lock on a bucket page during a cleanup scan of a bucket,
a concurrent scan could start in that bucket before we've finished vacuuming
it. If a scan gets ahead of cleanup, we could have the following problem: (1)
the scan sees heap TIDs that are about to be removed before they are processed
by VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3)
VACUUM completes, (4) one or more of the TIDs the scan decided were dead are
reused for an unrelated tuple, and finally (5) the scan wakes up and
erroneously kills the new tuple.

Note that this requires VACUUM and a scan to be active in the same bucket at
the same time. If VACUUM completes before the scan starts, the scan never has
a chance to see the dead tuples; if the scan completes before the VACUUM
starts, the heap TIDs can't have been reused meanwhile. Furthermore, VACUUM
can't start on a bucket that has an active scan, because the scan holds a pin
on the primary bucket page, and VACUUM must take a cleanup lock on that page
in order to begin cleanup. Therefore, the only way this problem can occur is
for a scan to start after VACUUM has released the cleanup lock on the bucket
but before it has processed the entire bucket and then overtake the cleanup
operation.

Currently, we prevent this using lock chaining: cleanup locks the next page
in the chain before releasing the lock and pin on the page just processed.

Free Space Management
---------------------

(Question: why is this so complicated? Why not just have a linked list
of free pages with the list head in the metapage? It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
    -- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index. We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock. We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    execute free-page-acquire (obtaining an overflow page) mechanism
     described above
    update (former) last page to point to the new page and mark buffer dirty
    write-lock and initialize new page, with back link to former last page
    write WAL for addition of overflow page
    release the locks on meta page and bitmap page acquired in
     free-page-acquire algorithm
    release the lock on former last page
    release the lock on new overflow page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket. They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
buffer content lock in exclusive mode on the new bucket.

Freeing an overflow page requires the process to hold buffer content lock in
exclusive mode on the containing bucket, so it need not worry about other
accessors of pages in the bucket. The algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    retake meta page buffer content lock in exclusive mode
    move (insert) tuples that belong to the overflow page being freed
    update bitmap bit
    mark bitmap page dirty
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    write WAL for delinking overflow page operation
    release buffer content lock and pin
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree). It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is that the next
overflow page acquirer will scan more bitmap bits than it needs to. What
must be avoided is having first-free-bit greater than the actual first free
bit, because then that free page would never be found by searchers.
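
That is why, in the freeing algorithm above, hashm_firstfree is only ever
moved downward, schematically:

    /* sketch: after clearing the bitmap bit for the page being freed */
    if (freed_bitno < hashm_firstfree)
        hashm_firstfree = freed_bitno;  /* never advanced past a possibly-free bit */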

The reason for moving tuples off the overflow page while delinking the latter
is to make that one atomic operation. Not doing so could lead to spurious
reads on a standby; basically, the user might see the same tuple twice.


WAL Considerations
------------------

Hash index operations such as create index, insert, delete, bucket split,
allocate overflow page, and squeeze do not in themselves guarantee hash index
consistency after a crash. To provide robustness, we write WAL for each of
these operations.

CREATE INDEX writes multiple WAL records. First, we write a record to cover
the initialization of the metapage, followed by one for each new bucket
created, followed by one for the initial bitmap page. It's not important for
index creation to appear atomic, because the index isn't yet visible to any
other transaction, and the creating transaction will roll back in the event of
a crash. It would be difficult to cover the whole operation with a single
write-ahead log record anyway, because we can log only a fixed number of
pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.

Ordinary item insertions (that don't force a page split or need a new overflow
page) are single WAL entries. They touch a single bucket page and the
metapage. The metapage is updated during replay just as it is updated during
the original operation.

If an insertion causes the addition of an overflow page, there will be one
WAL entry for the new overflow page and a second entry for the insert itself.

If an insertion causes a bucket split, there will be one WAL entry for the
insert itself, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to which the
tuples are moved from the old bucket, followed by a WAL entry to indicate that
the split is complete for both old and new buckets. A split operation which
requires overflow pages to complete the operation will need to write a WAL
record for each new allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the system
crashes while moving tuples from the old bucket's pages to the new bucket. In
such a case, after recovery, the old and new buckets will be marked with the
bucket-being-split and bucket-being-populated flags respectively, which
indicates that a split is in progress for those buckets. The reader algorithm
works correctly, as it will scan both the old and new buckets when the split
is in progress, as explained in the reader algorithm section above.

We finish the split at the next insert or split operation on the old bucket,
as explained in the insert and split algorithms above. It could be done during
searches, too, but it seems best not to put any extra updates in what would
otherwise be a read-only operation (updating is not possible in hot standby
mode anyway). It would seem natural to complete the split in VACUUM, but since
splitting a bucket might require allocating a new page, it might fail if you
run out of disk space. That would be bad during VACUUM: the reason for
running VACUUM in the first place might be that you ran out of disk space,
and now VACUUM won't finish because you're out of disk space. In contrast,
an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove dead
tuples, and to remove tuples that were moved by a bucket split. A WAL entry
is made for each bucket page from which tuples are removed, and then another
WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples
are removed, a separate WAL entry is made to update the metapage.

As deletion involves multiple atomic operations, it is quite possible that
the system crashes (a) after removing tuples from some of the bucket pages,
(b) before clearing the garbage flag, or (c) before updating the metapage. If
the system crashes before completing (b), it will again try to clean the
bucket during the next vacuum or insert after recovery, which can have some
performance impact, but it will work fine. If the system crashes before
completing (c), after recovery there could be some additional splits until the
next vacuum updates the metapage, but the other operations like insert, delete
and scan will work correctly. We could fix this problem by actually updating
the metapage based on the delete operation during replay, but it's not clear
whether it's worth the complication.

A squeeze operation moves tuples from pages later in the bucket chain to pages
earlier in the chain, and writes a WAL record whenever either the page to
which it is writing tuples fills up or the page from which it is removing
tuples becomes empty.

As a squeeze operation involves multiple atomic operations, it is quite
possible that the system crashes before the operation is complete for the
entire bucket. After recovery, the operations will work correctly, but the
index will remain bloated, and this can impact the performance of read and
insert operations until the next vacuum squeezes the bucket completely.


Other Notes
-----------

Cleanup locks prevent a split from occurring while *another* process is
stopped in a given bucket. They also ensure that one of our *own* backend's
scans is not stopped in the bucket.