postgresql/src/backend/storage/buffer
Tom Lane a4e775a263 Make use of new error context stack mechanism to allow random errors
detected during buffer dump to be labeled with the buffer location.
For example, if a page LSN is clobbered, we now produce something like
ERROR:  XLogFlush: request 2C000000/8468EC8 is not satisfied --- flushed only
to 0/8468EF0
CONTEXT:  writing block 0 of relation 428946/566240
whereas before there was no convenient way to find out which page had
been trashed.
2003-05-10 19:04:30 +00:00
..
Makefile Move s_lock.c and spin.c into lmgr subdirectory, which seems a much 2001-09-27 19:10:02 +00:00
README Implement new 'lightweight lock manager' that's intermediate between 2001-09-29 04:02:27 +00:00
buf_init.c Remove ShutdownBufferPoolAccess exit callback, and do the work in 2002-09-25 20:31:40 +00:00
buf_table.c Update copyright to 2002. 2002-06-20 20:29:54 +00:00
bufmgr.c Make use of new error context stack mechanism to allow random errors 2003-05-10 19:04:30 +00:00
freelist.c Update copyright to 2002. 2002-06-20 20:29:54 +00:00
localbuf.c localbuf.c must be able to do blind writes. 2002-12-05 22:48:03 +00:00

README

$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.3 2001/09/29 04:02:22 tgl Exp $

Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer locks.  (Actually, there's
a third level of access control: one must hold the appropriate kind of
lock on a relation before one can legally access any page belonging to
the relation.  Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it.  An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe.  Typically a pin is acquired via ReadBuffer and
released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
It is OK and indeed common for a single backend to pin a page more than
once concurrently; the buffer manager handles this efficiently.  It is
considered OK to hold a pin for long intervals --- for example, sequential
scans hold a pin on the current page until done processing all the tuples
on the page, which could be quite a while if the scan is the outer scan of
a join.  Similarly, btree index scans hold a pin on the current index page.
This is OK because normal operations never wait for a page's pin count to
drop to zero.  (Anything that might need to do such a wait is instead
handled by waiting to obtain the relation-level lock, which is why you'd
better hold one first.)  Pins may not be held across transaction
boundaries, however.

Buffer locks: there are two kinds of buffer locks, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock.  (These can alternatively be called READ
and WRITE locks.)  These locks are intended to be short-term: they should not
be held for long.  Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer.  One must pin a buffer before trying to lock it.

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive lock.  To examine the commit status (XIDs and status bits) of
a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the buffer lock, yet continue to access
the tuple's data for as long as one holds the buffer pin.  This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer.  Therefore the
tuple cannot go away while the pin is held (see rule #5).  Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer.  This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later.  These four bits are only hints
(they cache the results of transaction status lookups in pg_clog), so no
great harm is done if they get reset to zero by conflicting updates.

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin).  If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again.  Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive lock.


As of 7.1, the only operation that removes tuples or compacts free space is
(oldstyle) VACUUM.  It does not have to implement rule #5 directly, because
it instead acquires exclusive lock at the relation level, which ensures
indirectly that no one else is accessing pages of the relation at all.

To implement concurrent VACUUM we will need to make it obey rule #5 fully.
To do this, we'll create a new buffer manager operation
LockBufferForCleanup() that gets an exclusive lock and then checks to see
if the shared pin count is currently 1.  If not, it releases the exclusive
lock (but not the caller's pin) and waits until signaled by another backend,
whereupon it tries again.  The signal will occur when UnpinBuffer
decrements the shared pin count to 1.  As indicated above, this operation
might have to wait a good while before it acquires lock, but that shouldn't
matter much for concurrent VACUUM.  The current implementation only
supports a single waiter for pin-count-1 on any particular shared buffer.
This is enough for VACUUM's use, since we don't allow multiple VACUUMs
concurrently on a single relation anyway.