$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
We have made the following changes in order to incorporate their algorithm
into Postgres:
+ The requirement that all btree keys be unique is too onerous,
but the algorithm won't work correctly without it. Fortunately, it is
only necessary that keys be unique on a single tree level, because L&Y
only use the assumption of key uniqueness when re-finding a key in a
parent node (to determine where to insert the key for a split page).
Therefore, we can use the link field to disambiguate multiple
occurrences of the same user key: only one entry in the parent level
will be pointing at the page we had split. (Indeed we need not look at
the real "key" at all, just at the link field.) We can distinguish
items at the leaf level in the same way, by examining their links to
heap tuples; we'd never have two items for the same heap tuple.
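The re-finding step can be sketched as follows. The structure and names
here are illustrative only, not the backend's actual data layout:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Illustrative flattened view of a parent-level page: each item pairs a
 * user key with a child-page link.  Keys may repeat across items, but
 * only one item can link to any given child page. */
typedef struct
{
    int         key;
    BlockNumber child;
} ParentItem;

/* Re-find the parent item for a just-split page by its link alone; the
 * (possibly non-unique) user key never needs to be consulted. */
static int
find_parent_item(const ParentItem *items, int nitems, BlockNumber child)
{
    for (int i = 0; i < nitems; i++)
        if (items[i].child == child)
            return i;
    return -1;                  /* not on this page; caller must move right */
}
```

Even with every key equal, exactly one item matches the child link, which is
the whole point of the disambiguation argument above.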
+ Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
node. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)
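The descend-left rule can be sketched with a hypothetical helper (not
backend code); stopping at the first bound that is >= v is exactly what
keeps an equality search from skipping the preceding child page:

```c
#include <assert.h>

/* Choose which downlink to follow for search value v, given the bounding
 * keys on an upper-level page.  Under the rule Ki <= v <= Ki+1, stopping
 * at the first bound that is >= v makes an exact match descend to the
 * LEFT of that bound, so equal keys on the preceding child page cannot
 * be missed.  (Illustrative sketch only.) */
static int
choose_downlink(const int *bounds, int nbounds, int v)
{
    int i = 0;

    while (i < nbounds && bounds[i] < v)
        i++;
    return i;                   /* index of the child subtree to enter */
}
```

With bounds {10, 20, 30}, a search for 20 enters subtree 1 (left of the
bounding key 20) rather than subtree 2.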
+ Lehman and Yao don't require read locks, but assume that in-memory
copies of tree nodes are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
nodes in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior. An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.
+ We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem; we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from.
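That re-finding loop can be sketched as follows; the page-header structure
and lookup helper are illustrative stand-ins, not the backend's buffer
machinery:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Illustrative page header: block number plus sibling links. */
typedef struct
{
    BlockNumber blkno;
    BlockNumber left;
    BlockNumber right;
} PageHead;

static const PageHead *
page_lookup(const PageHead *pages, int npages, BlockNumber blkno)
{
    for (int i = 0; i < npages; i++)
        if (pages[i].blkno == blkno)
            return &pages[i];
    return NULL;
}

/* Walking backward: after following our left-link to 'start', the left
 * sibling may itself have split, so move right until we reach the page
 * whose right-link points back at the page we came from. */
static const PageHead *
refind_left_sibling(const PageHead *pages, int npages,
                    BlockNumber start, BlockNumber came_from)
{
    const PageHead *p = page_lookup(pages, npages, start);

    while (p != NULL && p->right != came_from)
        p = page_lookup(pages, npages, p->right);
    return p;
}
```

For instance, if page 5's left-link names page 2, but 2 has since split off
a new page 8 between them, the loop steps from 2 to 8 and stops there.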
+ Read locks on a page are held for as long as a scan has a pointer
to the page. However, locks are always surrendered before the
sibling page lock is acquired (for readers), so we remain
deadlock-free. I will do a formal proof if I get bored anytime soon.
NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
on the current page of a scan before control leaves nbtree. When we
come back to resume the scan, we have to re-grab the read lock and
then move right if the current item moved (see _bt_restscan()).
+ Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on level 2 of the tree). The new root page is then
installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the metadata
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.
+ Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
each of the resulting pages. Note we must include the incoming item in
this calculation, otherwise it is possible to find that the incoming
item doesn't fit on the split page where it needs to go!
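A byte-balancing split point can be sketched like this (a simplified
stand-in for the real logic, which works on page items rather than a
plain size array):

```c
#include <assert.h>
#include <stdlib.h>

/* Pick a split point that balances bytes rather than item counts.
 * 'sizes' lists the byte size of each item that must end up in the two
 * halves, with the incoming (about-to-be-inserted) item already counted
 * at its sorted position -- leaving it out could yield a "balanced"
 * split whose target half still has no room for it.  Returns the number
 * of items to keep on the left page.  (Simplified sketch.) */
static int
choose_split_point(const int *sizes, int nitems)
{
    int total = 0, leftbytes = 0, best = 1, bestdelta = -1;

    for (int i = 0; i < nitems; i++)
        total += sizes[i];

    for (int nleft = 1; nleft < nitems; nleft++)
    {
        leftbytes += sizes[nleft - 1];
        int delta = abs(2 * leftbytes - total);   /* |left - right| */

        if (bestdelta < 0 || delta < bestdelta)
        {
            bestdelta = delta;
            best = nleft;
        }
    }
    return best;
}
```

With item sizes {100, 100, 500, 100}, counting bytes puts two items on the
left (200 vs. 600 is the best achievable), whereas a naive item-count split
would also pick two items only by coincidence; skewed sizes are exactly
where the byte-based rule matters.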
In addition, the following things are handy to know:
+ Page zero of every btree is a meta-data page. This page stores
the location of the root page, a pointer to a list of free
pages, and other stuff that's handy to know. (Currently, we
never shrink btree indexes so there are never any free pages.)
+ The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
to accept items larger than 1/3rd page size. Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.
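The size check amounts to simple arithmetic; the page size and overhead
figures below are made-up illustrations, not the backend's actual
constants:

```c
#include <assert.h>

#define MY_BLCKSZ 8192          /* assumed page size, for illustration */
#define MY_PAGE_OVERHEAD 64     /* assumed header/special-space overhead */

/* The three-items-per-page argument (high key plus two real items)
 * means an item is always safe only if it fits in a third of the usable
 * space. */
static int
item_always_fits(int itemsize)
{
    return itemsize <= (MY_BLCKSZ - MY_PAGE_OVERHEAD) / 3;
}
```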
+ This algorithm doesn't guarantee btree consistency after a kernel crash
or hardware failure. To do that, we'd need ordered writes, and UNIX
doesn't support ordered writes (short of fsync'ing every update, which
is too high a price). Rebuilding corrupted indexes during restart
seems more attractive.
+ On deletions, we need to adjust the position of active scans on
the index. The code in nbtscan.c handles this. We don't need to
do this for insertions or splits because _bt_restscan can find the
new position of the previously-found item. NOTE that nbtscan.c
only copes with deletions issued by the current backend. This
essentially means that concurrent deletions are not supported, but
that's true already in the Lehman and Yao algorithm. nbtscan.c
exists only to support VACUUM and allow it to delete items while
it's scanning the index.
Notes about data representation:
+ The right-sibling link required by L&Y is kept in the page "opaque
data" area, as are the left-sibling link and some flags.
+ We also keep a parent link in the opaque data, but this link is not
very trustworthy because it is not updated when the parent page splits.
Thus, it points to some page on the parent level, but possibly a page
well to the left of the page's actual current parent. In most cases
we do not need this link at all. Normally we return to a parent page
using a stack of entries that are made as we descend the tree, as in L&Y.
There is exactly one case where the stack will not help: concurrent
root splits. If an inserter process needs to split what had been the
root when it started its descent, but finds that that page is no longer
the root (because someone else split it meanwhile), then it uses the
parent link to move up to the next level. This is OK because we do fix
the parent link in a former root page when splitting it. This logic
will work even if the root is split multiple times (even up to creation
of multiple new levels) before an inserter returns to it. The same
could not be said of finding the new root via the metapage, since that
would work only for a single level of added root.
+ The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.
+ On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.
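The item-numbering rule reduces to a one-line helper (illustrative name,
not a backend function):

```c
#include <assert.h>

/* Where real data items begin on a page: item 1 holds the high key on a
 * non-rightmost page, so data starts at item 2 there; the rightmost
 * page on a level has no high key. */
static int
first_data_item(int is_rightmost)
{
    return is_rightmost ? 1 : 2;
}
```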
+ On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.
+ On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.
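The minus-infinity treatment can be sketched as a comparison helper; the
function and its parameters are illustrative, not the backend's actual
comparison routine:

```c
#include <assert.h>

/* Three-way compare of a search key against a non-leaf item.  The first
 * data item on a non-leaf page carries no lower bound -- it behaves as
 * minus infinity -- so every search key compares greater than it, no
 * matter what key bytes the item happens to store.  (Sketch only.) */
static int
compare_to_nonleaf_item(int searchkey, int itemno, int first_data_item,
                        int itemkey)
{
    if (itemno == first_data_item)
        return 1;               /* search key > minus infinity, always */
    return (searchkey > itemkey) - (searchkey < itemkey);
}
```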
Notes to operator class implementors:
+ With this implementation, we require the user to supply us with
a procedure for pg_amproc. This procedure should take two keys
A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
respectively. See the contents of that relation for the btree
access method for some samples.
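A support procedure of the required shape looks like this in plain C;
the function name is illustrative, and the real registered procedures
additionally use the backend's function-call conventions:

```c
#include <assert.h>
#include <stdint.h>

/* Three-way key comparator of the kind described above: returns < 0,
 * 0, or > 0 according as A < B, A = B, or A > B.  (Illustrative; the
 * backend's integer comparator follows the same pattern.) */
static int32_t
sample_int_cmp(int32_t a, int32_t b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}
```

Note the convention is the sign of the result, not specific values, so
any consistent three-way comparison works.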