$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
We have made the following changes in order to incorporate their algorithm
into Postgres:
+ The requirement that all btree keys be unique is too onerous,
but the algorithm won't work correctly without it. Fortunately, it is
only necessary that keys be unique on a single tree level, because L&Y
only use the assumption of key uniqueness when re-finding a key in a
parent node (to determine where to insert the key for a split page).
Therefore, we can use the link field to disambiguate multiple
occurrences of the same user key: only one entry in the parent level
will be pointing at the page we had split. (Indeed we need not look at
the real "key" at all, just at the link field.) We can distinguish
items at the leaf level in the same way, by examining their links to
heap tuples; we'd never have two items for the same heap tuple.
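The re-finding step can be sketched as follows. The structure and names
here are illustrative only, not the backend's actual data layout:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Illustrative flattened view of a parent-level page: each item pairs a
 * user key with a child-page link.  Keys may repeat across items, but
 * only one item can link to any given child page. */
typedef struct
{
    int         key;
    BlockNumber child;
} ParentItem;

/* Re-find the parent item for a just-split page by its link alone; the
 * (possibly non-unique) user key never needs to be consulted. */
static int
find_parent_item(const ParentItem *items, int nitems, BlockNumber child)
{
    for (int i = 0; i < nitems; i++)
        if (items[i].child == child)
            return i;
    return -1;                  /* not on this page; caller must move right */
}
```

Even with every key equal, exactly one item matches the child link, which is
the whole point of the disambiguation argument above.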
+ Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
node. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)
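The descend-left rule can be sketched with a hypothetical helper (not
backend code); stopping at the first bound that is >= v is exactly what
keeps an equality search from skipping the preceding child page:

```c
#include <assert.h>

/* Choose which downlink to follow for search value v, given the bounding
 * keys on an upper-level page.  Under the rule Ki <= v <= Ki+1, stopping
 * at the first bound that is >= v makes an exact match descend to the
 * LEFT of that bound, so equal keys on the preceding child page cannot
 * be missed.  (Illustrative sketch only.) */
static int
choose_downlink(const int *bounds, int nbounds, int v)
{
    int i = 0;

    while (i < nbounds && bounds[i] < v)
        i++;
    return i;                   /* index of the child subtree to enter */
}
```

With bounds {10, 20, 30}, a search for 20 enters subtree 1 (left of the
bounding key 20) rather than subtree 2.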
+ Lehman and Yao don't require read locks, but assume that in-memory
copies of tree nodes are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
nodes in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior. An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.
+ We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem; we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from.
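That re-finding loop can be sketched as follows; the page-header structure
and lookup helper are illustrative stand-ins, not the backend's buffer
machinery:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Illustrative page header: block number plus sibling links. */
typedef struct
{
    BlockNumber blkno;
    BlockNumber left;
    BlockNumber right;
} PageHead;

static const PageHead *
page_lookup(const PageHead *pages, int npages, BlockNumber blkno)
{
    for (int i = 0; i < npages; i++)
        if (pages[i].blkno == blkno)
            return &pages[i];
    return NULL;
}

/* Walking backward: after following our left-link to 'start', the left
 * sibling may itself have split, so move right until we reach the page
 * whose right-link points back at the page we came from. */
static const PageHead *
refind_left_sibling(const PageHead *pages, int npages,
                    BlockNumber start, BlockNumber came_from)
{
    const PageHead *p = page_lookup(pages, npages, start);

    while (p != NULL && p->right != came_from)
        p = page_lookup(pages, npages, p->right);
    return p;
}
```

For instance, if page 5's left-link names page 2, but 2 has since split off
a new page 8 between them, the loop steps from 2 to 8 and stops there.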
+ Read locks on a page are held for as long as a scan has a pointer
to the page. However, locks are always surrendered before the
sibling page lock is acquired (for readers), so we remain
deadlock-free. I will do a formal proof if I get bored anytime soon.
NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
on the current page of a scan before control leaves nbtree. When we
come back to resume the scan, we have to re-grab the read lock and
then move right if the current item moved (see _bt_restscan()).
+ Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on level 2 of the tree). The new root page is then
installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the metadata
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.
+ Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
each of the resulting pages. Note we must include the incoming item in
this calculation, otherwise it is possible to find that the incoming
item doesn't fit on the split page where it needs to go!
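A byte-balancing split point can be sketched like this (a simplified
stand-in for the real logic, which works on page items rather than a
plain size array):

```c
#include <assert.h>
#include <stdlib.h>

/* Pick a split point that balances bytes rather than item counts.
 * 'sizes' lists the byte size of each item that must end up in the two
 * halves, with the incoming (about-to-be-inserted) item already counted
 * at its sorted position -- leaving it out could yield a "balanced"
 * split whose target half still has no room for it.  Returns the number
 * of items to keep on the left page.  (Simplified sketch.) */
static int
choose_split_point(const int *sizes, int nitems)
{
    int total = 0, leftbytes = 0, best = 1, bestdelta = -1;

    for (int i = 0; i < nitems; i++)
        total += sizes[i];

    for (int nleft = 1; nleft < nitems; nleft++)
    {
        leftbytes += sizes[nleft - 1];
        int delta = abs(2 * leftbytes - total);   /* |left - right| */

        if (bestdelta < 0 || delta < bestdelta)
        {
            bestdelta = delta;
            best = nleft;
        }
    }
    return best;
}
```

With item sizes {100, 100, 500, 100}, counting bytes puts two items on the
left (200 vs. 600 is the best achievable), whereas a naive item-count split
would also pick two items only by coincidence; skewed sizes are exactly
where the byte-based rule matters.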
In addition, the following things are handy to know:
+ Page zero of every btree is a meta-data page. This page stores
the location of the root page, a pointer to a list of free
pages, and other stuff that's handy to know. (Currently, we
never shrink btree indexes so there are never any free pages.)
+ The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
to accept items larger than 1/3rd page size. Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.
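The size check amounts to simple arithmetic; the page size and overhead
figures below are made-up illustrations, not the backend's actual
constants:

```c
#include <assert.h>

#define MY_BLCKSZ 8192          /* assumed page size, for illustration */
#define MY_PAGE_OVERHEAD 64     /* assumed header/special-space overhead */

/* The three-items-per-page argument (high key plus two real items)
 * means an item is always safe only if it fits in a third of the usable
 * space. */
static int
item_always_fits(int itemsize)
{
    return itemsize <= (MY_BLCKSZ - MY_PAGE_OVERHEAD) / 3;
}
```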
+ This algorithm doesn't guarantee btree consistency after a kernel crash
or hardware failure. To do that, we'd need ordered writes, and UNIX
doesn't support ordered writes (short of fsync'ing every update, which
is too high a price). Rebuilding corrupted indexes during restart
seems more attractive.
+ On deletions, we need to adjust the position of active scans on
the index. The code in nbtscan.c handles this. We don't need to
do this for insertions or splits because _bt_restscan can find the
new position of the previously-found item. NOTE that nbtscan.c
only copes with deletions issued by the current backend. This
essentially means that concurrent deletions are not supported, but
that's true already in the Lehman and Yao algorithm. nbtscan.c
exists only to support VACUUM and allow it to delete items while
it's scanning the index.
Notes about data representation:
+ The right-sibling link required by L&Y is kept in the page "opaque
data" area, as are the left-sibling link and some flags.
+ We also keep a parent link in the opaque data, but this link is not
very trustworthy because it is not updated when the parent page splits.
Thus, it points to some page on the parent level, but possibly a page
well to the left of the page's actual current parent. In most cases
we do not need this link at all. Normally we return to a parent page
using a stack of entries that are made as we descend the tree, as in L&Y.
There is exactly one case where the stack will not help: concurrent
root splits. If an inserter process needs to split what had been the
root when it started its descent, but finds that that page is no longer
the root (because someone else split it meanwhile), then it uses the
parent link to move up to the next level. This is OK because we do fix
the parent link in a former root page when splitting it. This logic
will work even if the root is split multiple times (even up to creation
of multiple new levels) before an inserter returns to it. The same
could not be said of finding the new root via the metapage, since that
would work only for a single level of added root.
+ The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.
+ On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.
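The item-numbering rule reduces to a one-line helper (illustrative name,
not a backend function):

```c
#include <assert.h>

/* Where real data items begin on a page: item 1 holds the high key on a
 * non-rightmost page, so data starts at item 2 there; the rightmost
 * page on a level has no high key. */
static int
first_data_item(int is_rightmost)
{
    return is_rightmost ? 1 : 2;
}
```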
+ On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.
+ On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.
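The minus-infinity treatment can be sketched as a comparison helper; the
function and its parameters are illustrative, not the backend's actual
comparison routine:

```c
#include <assert.h>

/* Three-way compare of a search key against a non-leaf item.  The first
 * data item on a non-leaf page carries no lower bound -- it behaves as
 * minus infinity -- so every search key compares greater than it, no
 * matter what key bytes the item happens to store.  (Sketch only.) */
static int
compare_to_nonleaf_item(int searchkey, int itemno, int first_data_item,
                        int itemkey)
{
    if (itemno == first_data_item)
        return 1;               /* search key > minus infinity, always */
    return (searchkey > itemkey) - (searchkey < itemkey);
}
```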
Notes to operator class implementors:
+ With this implementation, we require the user to supply us with
a procedure for pg_amproc. This procedure should take two keys
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
respectively. See the contents of that relation for the btree
access method for some samples.
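A support procedure of the required shape looks like this in plain C;
the function name is illustrative, and the real registered procedures
additionally use the backend's function-call conventions:

```c
#include <assert.h>
#include <stdint.h>

/* Three-way key comparator of the kind described above: returns < 0,
 * 0, or > 0 according as A < B, A = B, or A > B.  (Illustrative; the
 * backend's integer comparator follows the same pattern.) */
static int32_t
sample_int_cmp(int32_t a, int32_t b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}
```

Note the convention is the sign of the result, not specific values, so
any consistent three-way comparison works.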