src/backend/storage/buffer/README

Notes About Shared Buffer Access Rules
======================================

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer content locks. (Actually,
there's a third level of access control: one must hold the appropriate kind
of lock on a relation before one can legally access any page belonging to
the relation. Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it. An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe. Normally a pin is acquired via ReadBuffer and
released via ReleaseBuffer. It is OK and indeed common for a single
backend to pin a page more than once concurrently; the buffer manager
handles this efficiently. It is considered OK to hold a pin for long
intervals --- for example, sequential scans hold a pin on the current page
until done processing all the tuples on the page, which could be quite a
while if the scan is the outer scan of a join. Similarly, btree index
scans hold a pin on the current index page. This is OK because normal
operations never wait for a page's pin count to drop to zero. (Anything
that might need to do such a wait is instead handled by waiting to obtain
the relation-level lock, which is why you'd better hold one first.) Pins
may not be held across transaction boundaries, however.

Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock. (These can alternatively be called READ
and WRITE locks.) These locks are intended to be short-term: they should not
be held for long. Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer. One must pin a buffer before trying to lock it.

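For illustration, the canonical pin-then-lock calling pattern looks roughly
like this (a minimal sketch using the bufmgr.h API; "reln" and "blkno" stand
for the relation and block of interest, and real callers add error handling):

    Buffer      buf;
    Page        page;

    buf = ReadBuffer(reln, blkno);          /* acquire pin */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* then content lock */
    page = BufferGetPage(buf);
    /* ... inspect tuples on the page ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* drop the lock first ... */
    ReleaseBuffer(buf);                     /* ... then the pin */
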
Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock. To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the content lock, yet continue to access
the tuple's data for as long as one holds the buffer pin. This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer. Therefore the
tuple cannot go away while the pin is held (see rule #5). Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive content lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple while they are doing visibility checks.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer. This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later. These four bits are only hints
(they cache the results of transaction status lookups in pg_clog), so no
great harm is done if they get reset to zero by conflicting updates.
(A short sketch of such an update appears after rule 5.)

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin). If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again. Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive content lock.

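As an example of rule #4, a hint-bit update under shared lock amounts to
just this (a sketch of the pattern used by tuple-visibility code of this
vintage; SetBufferCommitInfoNeedsSave tells the buffer manager the page
changed in a way that need not be WAL-logged):

    /* holding a pin and at least shared content lock on the buffer */
    tuple->t_infomask |= HEAP_XMIN_COMMITTED;   /* hint only */
    SetBufferCommitInfoNeedsSave(buffer);
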
Rule #5 only affects VACUUM operations. Obtaining the
necessary lock is done by the bufmgr routine LockBufferForCleanup().
It first gets an exclusive lock and then checks to see if the shared pin
count is currently 1. If not, it releases the exclusive lock (but not the
caller's pin) and waits until signaled by another backend, whereupon it
tries again. The signal will occur when UnpinBuffer decrements the shared
pin count to 1. As indicated above, this operation might have to wait a
good while before it acquires lock, but that shouldn't matter much for
concurrent VACUUM. The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer. This is enough
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
single relation anyway.

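In outline, that wait loop looks like this (an illustrative sketch only;
the real routine checks the pin count under the buffer-header spinlock and
registers itself as the buffer's sole pin-count waiter, and PinCountIsOne()
here is a hypothetical stand-in for that check):

    for (;;)
    {
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        if (PinCountIsOne(buffer))      /* hypothetical refcount check */
            break;                      /* lock + sole pin = cleanup lock */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ProcWaitForSignal();            /* woken when pin count drops to 1 */
    }
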

Buffer Manager's Internal Locking
---------------------------------

Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
unsurprisingly proved to be a source of contention. The new locking scheme
avoids grabbing system-wide exclusive locks in common code paths. It works
like this:

* There is a system-wide LWLock, the BufMappingLock, that notionally
protects the mapping from buffer tags (page identifiers) to buffers.
(Physically, it can be thought of as protecting the hash table maintained
by buf_table.c.) To look up whether a buffer exists for a tag, it is
sufficient to obtain share lock on the BufMappingLock. Note that one
must pin the found buffer, if any, before releasing the BufMappingLock.
To alter the page assignment of any buffer, one must hold exclusive lock
on the BufMappingLock. This lock must be held across adjusting the buffer's
header fields and changing the buf_table hash table. The only common
operation that needs exclusive lock is reading in a page that was not
in shared buffers already, which will require at least a kernel call
and usually a wait for I/O, so it will be slow anyway.

* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
separate locks, each guarding a portion of the buffer tag space. This allows
further reduction of contention in the normal code paths. The partition
that a particular buffer tag belongs to is determined from the low-order
bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.

* A separate system-wide LWLock, the BufFreelistLock, provides mutual
exclusion for operations that access the buffer free list or select
buffers for replacement. This is always taken in exclusive mode since
there are no read-only operations on those data structures. The buffer
management policy is designed so that BufFreelistLock need not be taken
except in paths that will require I/O, and thus will be slow anyway.
(Details appear below.) It is never necessary to hold the BufMappingLock
and the BufFreelistLock at the same time.

* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
ReleaseBuffer to make local state changes without taking any system-wide
lock. We use a spinlock, not an LWLock, since there are no cases where
the lock needs to be held for more than a few instructions.

Note that a buffer header's spinlock does not control access to the data
held within the buffer. Each buffer header also contains an LWLock, the
"buffer content lock", that *does* represent the right to access the data
in the buffer. It is used per the rules above.

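Combining the first two points above, looking up a page follows roughly
this protocol (a simplified sketch of what BufferAlloc() in bufmgr.c does;
BufTableHashCode and BufTableLookup are the buf_table.c routines, and the
pin call is shown with a simplified signature):

    hash = BufTableHashCode(&tag);
    partitionLock = BufMappingPartitionLock(hash);

    LWLockAcquire(partitionLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);
    if (buf_id >= 0)
        PinBuffer(&BufferDescriptors[buf_id]);  /* pin before unlocking! */
    LWLockRelease(partitionLock);
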
There is yet another set of per-buffer LWLocks, the io_in_progress locks,
that are used to wait for I/O on a buffer to complete. The process doing
a read or write takes exclusive lock for the duration, and processes that
need to wait for completion try to take shared locks (which they release
immediately upon obtaining). XXX on systems where an LWLock represents
nontrivial resources, it's fairly annoying to need so many locks. Possibly
we could use per-backend LWLocks instead (a buffer header would then contain
a field to show which backend is doing its I/O).

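The protocol is then simply (a sketch; in bufmgr.c the waiter side is
WaitIO(), which also rechecks the BM_IO_IN_PROGRESS flag around the lock
acquisition):

    /* process performing the I/O */
    LWLockAcquire(buf->io_in_progress_lock, LW_EXCLUSIVE);
    /* ... read or write the page ... */
    LWLockRelease(buf->io_in_progress_lock);

    /* process waiting for that I/O to finish */
    LWLockAcquire(buf->io_in_progress_lock, LW_SHARED);
    LWLockRelease(buf->io_in_progress_lock);    /* I/O is now complete */
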

Normal Buffer Replacement Strategy
----------------------------------

There is a "free list" of buffers that are prime candidates for replacement.
|
|
In particular, buffers that are completely free (contain no valid page) are
|
|
always in this list. We could also throw buffers into this list if we
|
|
consider their pages unlikely to be needed soon; however, the current
|
|
algorithm never does that. The list is singly-linked using fields in the
|
|
buffer headers; we maintain head and tail pointers in global variables.
|
|
(Note: although the list links are in the buffer headers, they are
|
|
considered to be protected by the BufFreelistLock, not the buffer-header
|
|
spinlocks.) To choose a victim buffer to recycle when there are no free
|
|
buffers available, we use a simple clock-sweep algorithm, which avoids the
|
|
need to take system-wide locks during common operations. It works like
|
|
this:
|
|
|
|
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is unpinned. (This requires only the
buffer header spinlock, which would have to be taken anyway to decrement the
buffer reference count, so it's nearly free.)

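Concretely, the unpin-time bookkeeping is roughly (a sketch grounded in the
description above; BM_MAX_USAGE_COUNT is the small limit, defined in
buf_internals.h):

    LockBufHdr(buf);
    Assert(buf->refcount > 0);
    buf->refcount--;                            /* drop our pin */
    if (buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;                     /* remember recent use */
    UnlockBufHdr(buf);
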
The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
|
|
through all the available buffers. NextVictimBuffer is protected by the
|
|
BufFreelistLock.
|
|
|
|
The algorithm for a process that needs to obtain a victim buffer is:

1. Obtain BufFreelistLock.

2. If the buffer free list is nonempty, remove its head buffer. If the
buffer is pinned or has a nonzero usage count, it cannot be used; ignore
it and repeat step 2. Otherwise, pin the buffer, release BufFreelistLock,
and return the buffer.

3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
circularly advance NextVictimBuffer for next time.

4. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero) and return to step 3 to
examine the next buffer.

5. Pin the selected buffer, release BufFreelistLock, and return the buffer.

(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
have to give up and try another buffer. This however is not a concern
of the basic select-a-victim-buffer algorithm.)

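Steps 3-5 of the loop might be sketched as follows (illustrative only; the
real code is StrategyGetBuffer() in freelist.c, and pin_victim() is a
hypothetical stand-in for pinning the buffer before its header spinlock is
released):

    for (;;)
    {
        buf = &BufferDescriptors[NextVictimBuffer];
        NextVictimBuffer = (NextVictimBuffer + 1) % NBuffers;

        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return pin_victim(buf);     /* pin, then release header lock */
        if (buf->usage_count > 0)
            buf->usage_count--;         /* decay; try the next buffer */
        UnlockBufHdr(buf);
    }
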

Buffer Ring Replacement Strategy
--------------------------------

When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
again soon, so instead of running the normal clock sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
using the normal clock sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.

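The ring itself need be nothing more than a small array of buffer numbers
cycled round-robin (a hypothetical sketch; the real mechanism is the
BufferAccessStrategy object in freelist.c, added in 8.3):

    #define RING_SIZE   (256 * 1024 / BLCKSZ)   /* 32 buffers at 8KB */

    static Buffer ring[RING_SIZE];  /* InvalidBuffer until first filled */
    static int    current = 0;

    static Buffer
    ring_next(void)
    {
        current = (current + 1) % RING_SIZE;
        return ring[current];   /* InvalidBuffer => use the clock sweep */
    }
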
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient. Even less would often be enough, but the ring must be big enough
to accommodate all pages in the scan that are pinned concurrently. 256KB
should also be enough to leave a small cache trail for other backends to
join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
updated, we would normally have to write and flush WAL before we could
re-use the buffer; in this case we instead discard the buffer from the ring
and (later) choose a replacement using the normal clock-sweep algorithm.
Hence this strategy works best for scans that are read-only (or at worst
update hint bits). In a scan that modifies every page in the scan, like a
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
the ring strategy effectively degrades to the normal strategy.

VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
removed from the ring. Instead, WAL is flushed if needed to allow reuse of
the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's
buffers were sent to the freelist, which was effectively a buffer ring of 1
buffer, resulting in excessive WAL flushing. Allowing VACUUM to update
256KB between WAL flushes should be more efficient.

Bulk writes work similarly to VACUUM. Currently this applies only to
COPY IN and CREATE TABLE AS SELECT. (Might it be interesting to make
seqscan UPDATE and DELETE use the bulkwrite strategy?) For bulk writes
we use a ring size of 16MB (but not more than 1/8th of shared_buffers).
Smaller sizes have been shown to result in the COPY blocking too often
for WAL flushes. While it's okay for a background vacuum to be slowed by
doing its own WAL flushing, we'd prefer that COPY not be subject to that,
so we let it use up a bit more of the buffer arena.

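That sizing rule is just (a sketch of the computation described above;
compare GetAccessStrategy() in freelist.c):

    /* bulk writes: a 16MB ring, capped at 1/8th of shared buffers */
    ring_size = (16 * 1024 * 1024) / BLCKSZ;
    ring_size = Min(ring_size, NBuffers / 8);
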

Background Writer's Processing
------------------------------

The background writer is designed to write out pages that are likely to be
recycled soon, thereby offloading the writing work from active backends.
To do this, it scans forward circularly from the current position of
NextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.

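The per-buffer test is roughly (a sketch; compare SyncOneBuffer() in
bufmgr.c, which performs the pin and the shared-content-lock write
described below):

    LockBufHdr(buf);
    if (buf->refcount == 0 && buf->usage_count == 0 &&
        (buf->flags & BM_DIRTY))
    {
        /* pin it, write it out under shared content lock, unpin */
    }
    UnlockBufHdr(buf);
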
If we can assume that reading NextVictimBuffer is an atomic action, then
the writer doesn't even need to take the BufFreelistLock in order to look
for buffers to write; it needs only to spinlock each buffer header for long
enough to check the dirtybit. Even without that assumption, the writer
only needs to take the lock long enough to read the variable value, not
while scanning the buffers. (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)

During a checkpoint, the writer's strategy must be to write every dirty
buffer (pinned or not!). We may as well make it start this scan from
NextVictimBuffer, however, so that the first-to-be-written pages are the
ones that backends might otherwise have to write for themselves soon.

The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
We might miss a hint-bit update or two but that isn't a problem, for the same
reasons mentioned under buffer access rules.

As of 8.4, the background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that the checkpoints it
writes are technically restartpoints.