postgresql/src/backend/access/nbtree
Tom Lane bd272cace6 Mega-commit to make heap_open/heap_openr/heap_close take an
additional argument specifying the kind of lock to acquire/release (or
'NoLock' to do no lock processing).  Ensure that all relations are locked
with some appropriate lock level before being examined --- this ensures
that relevant shared-inval messages have been processed and should prevent
problems caused by concurrent VACUUM.  Fix several bugs having to do with
mismatched increment/decrement of relation ref count and mismatched
heap_open/close (which amounts to the same thing).  A bogus ref count on
a relation doesn't matter much *unless* a SI Inval message happens to
arrive at the wrong time, which is probably why we got away with this
sloppiness for so long.  Repair missing grab of AccessExclusiveLock in
DROP TABLE, ALTER/RENAME TABLE, etc, as noted by Hiroshi.
Recommend 'make clean all' after pulling this update; I modified the
Relation struct layout slightly.
Will post further discussion to pghackers list shortly.
1999-09-18 19:08:25 +00:00
..
Makefile Hi, 1998-04-06 00:32:26 +00:00
README Postgres95 1.01 Distribution - Virgin Sources 1996-07-09 06:22:35 +00:00
nbtcompare.c Move some system includes into c.h, and remove duplicates. 1999-07-17 20:18:55 +00:00
nbtinsert.c Re-use free space on index pages with duplicates. 1999-08-09 01:39:19 +00:00
nbtpage.c Fix nbtree's failure to clear BTScans list during xact abort. 1999-08-08 20:12:52 +00:00
nbtree.c Mega-commit to make heap_open/heap_openr/heap_close take an 1999-09-18 19:08:25 +00:00
nbtscan.c Fix nbtree's failure to clear BTScans list during xact abort. 1999-08-08 20:12:52 +00:00
nbtsearch.c Move some system includes into c.h, and remove duplicates. 1999-07-17 20:18:55 +00:00
nbtsort.c Install new alignment code to use MAXALIGN rather than DOUBLEALIGN where 1999-07-19 07:07:29 +00:00
nbtstrat.c Change #include's to use <> and "" as appropriate. 1999-07-15 23:04:24 +00:00
nbtutils.c Move some system includes into c.h, and remove duplicates. 1999-07-17 20:18:55 +00:00

README

$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $

This directory contains a correct implementation of Lehman and Yao's
btree management algorithm that supports concurrent access for Postgres.
We have made the following changes in order to incorporate their algorithm
into Postgres:

	+  The requirement that all btree keys be unique is too onerous,
	   but the algorithm won't work correctly without it.  As a result,
	   this implementation adds an OID (guaranteed to be unique) to
	   every key in the index.  This guarantees uniqueness within a set
	   of duplicates.  Space overhead is four bytes.

	   For this reason, when we're passed an index tuple to store by the
	   common access method code, we allocate a larger one and copy the
	   supplied tuple into it.  No Postgres code outside of the btree
	   access method knows about this xid or sequence number.

	+  Lehman and Yao don't require read locks, but assume that in-
	   memory copies of tree nodes are unshared.  Postgres shares
	   in-memory buffers among backends.  As a result, we do page-
	   level read locking on btree nodes in order to guarantee that
	   no record is modified while we are examining it.  This reduces
	   concurrency but guaranteees correct behavior.

	+  Read locks on a page are held for as long as a scan has a pointer
	   to the page.  However, locks are always surrendered before the
	   sibling page lock is acquired (for readers), so we remain deadlock-
	   free.  I will do a formal proof if I get bored anytime soon.

In addition, the following things are handy to know:

	+  Page zero of every btree is a meta-data page.  This page stores
	   the location of the root page, a pointer to a list of free
	   pages, and other stuff that's handy to know.

	+  This algorithm doesn't really work, since it requires ordered
	   writes, and UNIX doesn't support ordered writes.

	+  There's one other case where we may screw up in this
	   implementation.  When we start a scan, we descend the tree
	   to the key nearest the one in the qual, and once we get there,
	   position ourselves correctly for the qual type (eg, <, >=, etc).
	   If we happen to step off a page, decide we want to get back to
	   it, and fetch the page again, and if some bad person has split
	   the page and moved the last tuple we saw off of it, then the
	   code complains about botched concurrency in an elog(WARN, ...)
	   and gives up the ghost.  This is the ONLY violation of Lehman
	   and Yao's guarantee of correct behavior that I am aware of in
	   this code.

Notes to operator class implementors:

	With this implementation, we require the user to supply us with
	a procedure for pg_amproc.  This procedure should take two keys
	A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
	respectively.  See the contents of that relation for the btree
	access method for some samples.

Notes to mao for implementation document:

	On deletions, we need to adjust the position of active scans on
	the index.  The code in nbtscan.c handles this.  We don't need to
	do this for splits because of the way splits are handled; if they
	happen behind us, we'll automatically go to the next page, and if
	they happen in front of us, we're not affected by them.  For
	insertions, if we inserted a tuple behind the current scan location
	on the current scan page, we move one space ahead.