src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system. The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

        StartTransactionCommand
        CommitTransactionCommand
        AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE. The traffic cop
redirects these calls to the top-level routines

        BeginTransactionBlock
        EndTransactionBlock
        UserAbortTransactionBlock
        DefineSavepoint
        RollbackToSavepoint
        ReleaseSavepoint

respectively. Depending on the current state of the system, these functions
call low-level functions to activate the real transaction system:

        StartTransaction
        CommitTransaction
        AbortTransaction
        CleanupTransaction
        StartSubTransaction
        CommitSubTransaction
        AbortSubTransaction
        CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction. Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command. (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)


For example, consider the following sequence of user commands:

1) BEGIN
2) SELECT * FROM foo
3) INSERT INTO foo VALUES (...)
4) COMMIT

In the main processing loop, this results in the following function call
sequence:

     /  StartTransactionCommand;
    /       StartTransaction;
1) <    ProcessUtility;                 << BEGIN
    \       BeginTransactionBlock;
     \  CommitTransactionCommand;

    /   StartTransactionCommand;
2) /    ProcessQuery;                   << SELECT ...
   \        CommitTransactionCommand;
    \       CommandCounterIncrement;

    /   StartTransactionCommand;
3) /    ProcessQuery;                   << INSERT ...
   \        CommitTransactionCommand;
    \       CommandCounterIncrement;

     /  StartTransactionCommand;
    /   ProcessUtility;                 << COMMIT
4) <            EndTransactionBlock;
    \   CommitTransactionCommand;
     \      CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state-smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock, and outside these calls they need to do normal start,
commit or abort processing.
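
The state-smart behavior described above can be sketched as a toy state
machine. This is only an illustration under stated assumptions: the ToyXact
struct, the TOY_* state names and the toy_* functions are hypothetical and are
not the xact.c definitions; the counters merely record which low-level action
a real implementation would take at each point.

```c
#include <assert.h>

/* Toy block states; illustrative names, not the real xact.c enum. */
typedef enum
{
    TOY_DEFAULT,        /* idle, no transaction open */
    TOY_STARTED,        /* transaction started for a single command */
    TOY_BEGIN,          /* BEGIN seen, block not yet established */
    TOY_INPROGRESS,     /* inside a transaction block */
    TOY_END             /* COMMIT seen, real commit still pending */
} ToyBlockState;

typedef struct
{
    ToyBlockState state;
    int starts;         /* times the low-level start would run */
    int commits;        /* times the low-level commit would run */
    int cci;            /* times the command counter would be bumped */
} ToyXact;

/* Middle layer: start a low-level transaction only when none is open. */
static void toy_start_command(ToyXact *x)
{
    if (x->state == TOY_DEFAULT)
    {
        x->starts++;
        x->state = TOY_STARTED;
    }
}

/* Top layer: BEGIN and COMMIT only change the recorded state. */
static void toy_begin_block(ToyXact *x)
{
    if (x->state == TOY_STARTED)
        x->state = TOY_BEGIN;
}

static void toy_end_block(ToyXact *x)
{
    if (x->state == TOY_INPROGRESS)
        x->state = TOY_END;
}

/* Middle layer: what "commit" means depends on the current state. */
static void toy_commit_command(ToyXact *x)
{
    switch (x->state)
    {
        case TOY_STARTED:       /* standalone command: really commit */
        case TOY_END:           /* COMMIT was issued: really commit */
            x->commits++;
            x->state = TOY_DEFAULT;
            break;
        case TOY_BEGIN:         /* block is now established */
            x->state = TOY_INPROGRESS;
            break;
        case TOY_INPROGRESS:    /* inside a block: just bump the counter */
            x->cci++;
            break;
        case TOY_DEFAULT:
            break;
    }
}

/* Replays the BEGIN / SELECT / INSERT / COMMIT example from the text. */
ToyXact toy_run_example(void)
{
    ToyXact x = {TOY_DEFAULT, 0, 0, 0};

    toy_start_command(&x); toy_begin_block(&x); toy_commit_command(&x);
    assert(x.state == TOY_INPROGRESS);
    toy_start_command(&x); toy_commit_command(&x);      /* SELECT */
    toy_start_command(&x); toy_commit_command(&x);      /* INSERT */
    toy_start_command(&x); toy_end_block(&x); toy_commit_command(&x);
    return x;
}
```

Calling toy_run_example() from a test program reproduces the call sequence
shown above: one low-level start, two command-counter increments, and one
low-level commit, with the state back at idle afterwards.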

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state. In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1) the system aborts due to some internal error (syntax error, etc.)
2) the user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

        case 1                                  case 2
        ------                                  ------
1) user types BEGIN                     1) user types BEGIN
2) user does something                  2) user does something
3) user does not like what              3) system aborts for some reason
   she sees and types ABORT                (syntax error, etc.)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock. Both of them rely on AbortTransaction
to do all the real work. The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided into two phases:
* AbortTransaction executes as soon as we realize the transaction has
  failed. It should release all shared resources (locks, etc.) so that we do
  not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely. In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed. The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction. So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c consists of routines to support the creation
and finishing of transactions and subtransactions. For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.


Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct. When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting). In either case, PopTransaction is
called so the system returns to the parent transaction.
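
The push/pop discipline can be pictured with a small linked stack. The sketch
below is a simplification: ToyTransactionState and the toy_* functions are
hypothetical stand-ins, and a real TransactionState carries much more than a
nesting level and a parent pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a transaction state stack entry. */
typedef struct ToyTransactionState
{
    int nestingLevel;                       /* 1 = top-level transaction */
    struct ToyTransactionState *parent;     /* NULL for the top level */
} ToyTransactionState;

static ToyTransactionState toy_top = {1, NULL};
static ToyTransactionState *toy_current = &toy_top;

/* Open a nested level whose parent link points at the current level. */
int toy_push_transaction(void)
{
    ToyTransactionState *s = malloc(sizeof(*s));

    if (s == NULL)
        return -1;              /* out of memory: nothing was pushed */
    s->nestingLevel = toy_current->nestingLevel + 1;
    s->parent = toy_current;
    toy_current = s;
    return 0;
}

/* Close the current level and return to its parent. */
void toy_pop_transaction(void)
{
    ToyTransactionState *s = toy_current;

    assert(s->parent != NULL);  /* the top level is never popped here */
    toy_current = s->parent;
    free(s);
}

int toy_current_level(void)
{
    return toy_current->nestingLevel;
}
```

Two pushes followed by two pops take the nesting level from 1 to 3 and back
to 1, which is the shape of opening and closing two nested savepoints.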

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command. That's because savepoints
have names, and we allow releasing or rolling back to a savepoint by name; the
named savepoint is not necessarily the one that was last opened. Also a COMMIT
or ROLLBACK command must be able to close out the entire stack. We handle this
by having the utility command subroutine mark all the state stack entries as
commit-pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done. The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name. So it's a completely new
subtransaction as far as the internals are concerned.

Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubtransaction. This is to allow implementing
exception handling, e.g. in PL/pgSQL. ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said
subtransactions. The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubtransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.


Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent. This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks. For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction. VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory. To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start. All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures. Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction. The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up. (Zero is reserved for
InvalidSubTransactionId.) Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.
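
As a rough illustration of the two numbering schemes just described, the
sketch below models a VXID as a (backend slot, local counter) pair and a
SubTransactionId as a per-transaction counter. All names here (ToyVxid,
ToyBackend, toy_saved_local_xid and the toy_* functions) are hypothetical, and
a static array stands in for the shared-memory slot that preserves the counter
across backend exit.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_BACKENDS 4

typedef struct
{
    int      backendId;     /* which backend slot issued the VXID */
    uint32_t localXid;      /* backend-local counter value */
} ToyVxid;

typedef struct
{
    int      backendId;
    uint32_t lastLocalXid;  /* last local counter value handed out */
    uint32_t lastSubId;     /* last SubTransactionId handed out */
} ToyBackend;

/* Stand-in for the per-slot value kept in shared memory. */
static uint32_t toy_saved_local_xid[TOY_MAX_BACKENDS];

/* Backend start: resume from the value saved by the slot's previous user. */
ToyBackend toy_backend_start(int slot)
{
    ToyBackend b;

    assert(slot >= 0 && slot < TOY_MAX_BACKENDS);
    b.backendId = slot;
    b.lastLocalXid = toy_saved_local_xid[slot];
    b.lastSubId = 0;
    return b;
}

/* Backend exit: remember the counter so VXIDs are not re-used too soon. */
void toy_backend_exit(const ToyBackend *b)
{
    toy_saved_local_xid[b->backendId] = b->lastLocalXid;
}

/* A new top-level transaction gets a fresh VXID and SubTransactionId 1. */
ToyVxid toy_start_top_transaction(ToyBackend *b)
{
    ToyVxid v;

    v.backendId = b->backendId;
    v.localXid = ++b->lastLocalXid;
    b->lastSubId = 1;
    return v;
}

/* Subtransactions get IDs 2 and up; they share the parent's VXID. */
uint32_t toy_start_subtransaction(ToyBackend *b)
{
    assert(b->lastSubId >= 1);  /* a top-level transaction must be open */
    return ++b->lastSubId;
}
```

Note that handing out a VXID touches only backend-local state; the shared
array is read once at backend start and written once at backend exit.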


Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions --- one deleted by xact B and one inserted by xact A. Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken. (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.) The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyPgXact->xid at transaction end (either commit or abort).

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable. This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot. However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they affect neither anyone
else's snapshot nor latestCompletedXid.
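
To make the role of latestCompletedXid concrete, here is a deliberately
simplified, single-threaded sketch of building a snapshot and testing an XID
against it. It is not GetSnapshotData: ToyProcArray, ToySnapshot and the toy_*
functions are hypothetical, no locking is modeled, subtransactions are left
out, and XIDs are compared as plain integers with no wraparound handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RUNNING 8

typedef struct
{
    uint32_t running[TOY_MAX_RUNNING];  /* XIDs of running transactions */
    size_t   nrunning;
    uint32_t latestCompletedXid;
} ToyProcArray;

typedef struct
{
    uint32_t xmin;                      /* XIDs below this have finished */
    uint32_t xmax;                      /* XIDs at or above count as running */
    uint32_t xip[TOY_MAX_RUNNING];      /* running XIDs in [xmin, xmax) */
    size_t   xcnt;
} ToySnapshot;

/* Build a snapshot; imagine the shared lock held for the whole function. */
ToySnapshot toy_get_snapshot(const ToyProcArray *pa)
{
    ToySnapshot s;
    size_t      i;

    assert(pa->nrunning <= TOY_MAX_RUNNING);
    s.xmax = pa->latestCompletedXid + 1;
    s.xmin = s.xmax;
    s.xcnt = 0;
    for (i = 0; i < pa->nrunning; i++)
    {
        uint32_t xid = pa->running[i];

        if (xid >= s.xmax)
            continue;           /* already treated as running via xmax */
        s.xip[s.xcnt++] = xid;
        if (xid < s.xmin)
            s.xmin = xid;
    }
    return s;
}

/* Does this snapshot regard xid as still in progress? */
bool toy_xid_in_progress(const ToySnapshot *s, uint32_t xid)
{
    size_t i;

    if (xid >= s->xmax)
        return true;
    if (xid < s->xmin)
        return false;
    for (i = 0; i < s->xcnt; i++)
        if (s->xip[i] == xid)
            return true;
    return false;
}

/* End a transaction; imagine the exclusive lock held for the whole function. */
void toy_end_transaction(ToyProcArray *pa, uint32_t xid)
{
    size_t i;

    for (i = 0; i < pa->nrunning; i++)
    {
        if (pa->running[i] == xid)
        {
            pa->nrunning--;
            pa->running[i] = pa->running[pa->nrunning];
            break;
        }
    }
    if (xid > pa->latestCompletedXid)
        pa->latestCompletedXid = xid;
}
```

With XIDs 10 and 12 running and latestCompletedXid equal to 11, a snapshot
gets xmax 12 and treats both 10 and 12 as in progress while 11 is seen as
finished. A snapshot taken after 12 ends sees 12 as finished, but the earlier
snapshot keeps its original view.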

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start. But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore. (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into MyPgXact, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray. That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
subxid array) without taking ProcArrayLock. This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance. We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID. This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide. Each individual backend advertises the smallest
xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
live snapshots (e.g., if it's between transactions or hasn't yet set a
snapshot for a new transaction). GetOldestXmin takes the MIN() of the
valid xmin fields. It does this with only a shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns. We ensure that by including all the active
XIDs in the MIN() calculation, along with the valid xmins. The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate. If there is no active transaction at
all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
for the xmin that might be computed by concurrent or later GetSnapshotData
calls. (We know that no XID less than this could be about to appear in
the ProcArray, because of the XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive. Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin. This is because we allow XID-less
transactions to clear their MyPgXact->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not. This is OK since RecentGlobalXmin need only be a valid
lower bound. As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


pg_clog and pg_subtrans
-----------------------

pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
information. There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk. However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk. They also allow information to be permanent across server restarts.

pg_clog records the commit status for each transaction that has been assigned
an XID. A transaction can be in progress, committed, aborted, or
"sub-committed". This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet. It is not
necessary to update a subtransaction's transaction status to subcommit, so we
can just defer it until main transaction commit. The main role of marking
transactions as sub-committed is to provide an atomic commit protocol when
transaction status is spread across multiple clog pages. As a result, whenever
transaction status spreads across multiple pages we must use a two-phase commit
protocol: the first phase is to mark the subtransactions as sub-committed, then
we mark the top level transaction and all its subtransactions committed (in
that order). Thus, subtransactions that have not aborted appear as in-progress
even when they have already finished, and the subcommit status appears as a
very short transitory state during main transaction commit. Subtransaction
abort is always marked in clog as soon as it occurs. When the transaction
statuses all fit in a single clog page, we atomically mark them all as
committed without bothering with the intermediate sub-commit state.
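
The ordering of status updates at commit can be sketched as follows. This is
a toy model, not clog.c: the status array uses one byte per XID for
readability, TOY_XACTS_PER_PAGE is tiny on purpose, the TOY_* status names and
toy_* functions are hypothetical, and nothing here models locking or the
atomicity of a single-page update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy transaction status codes. */
enum
{
    TOY_IN_PROGRESS = 0,
    TOY_COMMITTED,
    TOY_ABORTED,
    TOY_SUB_COMMITTED
};

#define TOY_XACTS_PER_PAGE 8    /* deliberately small for the example */
#define TOY_NXACTS 32

static uint8_t toy_clog[TOY_NXACTS];    /* all start as TOY_IN_PROGRESS */

static size_t toy_page_of(uint32_t xid)
{
    return xid / TOY_XACTS_PER_PAGE;
}

/*
 * Record commit of a top-level transaction and its committed
 * subtransactions.  Returns the number of phases used (1 or 2).
 */
int toy_commit_tree(uint32_t top, const uint32_t *subs, size_t nsubs)
{
    int     multipage = 0;
    size_t  i;

    assert(top < TOY_NXACTS);
    for (i = 0; i < nsubs; i++)
    {
        assert(subs[i] < TOY_NXACTS);
        if (toy_page_of(subs[i]) != toy_page_of(top))
            multipage = 1;
    }

    /* Phase 1, only when several pages are involved: sub-commit the subs. */
    if (multipage)
        for (i = 0; i < nsubs; i++)
            toy_clog[subs[i]] = TOY_SUB_COMMITTED;

    /* Final phase: the top-level XID first, then its subtransactions. */
    toy_clog[top] = TOY_COMMITTED;
    for (i = 0; i < nsubs; i++)
        toy_clog[subs[i]] = TOY_COMMITTED;

    return multipage ? 2 : 1;
}
```

With eight XIDs per toy page, committing XID 4 with subtransactions 5 and 6
takes the single-page path, while committing XID 14 with subtransactions 15
and 17 crosses a page boundary and goes through the sub-committed state first.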

Savepoints are implemented using subtransactions. A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed. To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction. This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGXACT struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk. Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed. See storage/ipc/procarray.c for the gory details.
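
The parent mapping lends itself to a very small sketch: an array indexed by
XID whose entries default to zero, plus a loop that follows parent links until
it reaches a transaction with no parent. The names toy_parent,
toy_subtrans_set_parent and toy_topmost are hypothetical, the array stands in
for the on-disk pages, and XID wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_XID 0       /* "no parent", the default entry value */
#define TOY_NXIDS 16

/* Parent XID for each XID; zero-initialized, so top-level by default. */
static uint32_t toy_parent[TOY_NXIDS];

/* Record the parent when a subtransaction is assigned its XID. */
void toy_subtrans_set_parent(uint32_t xid, uint32_t parent)
{
    assert(xid < TOY_NXIDS);
    assert(parent != TOY_INVALID_XID);
    assert(parent < xid);       /* children have later XIDs than parents */
    toy_parent[xid] = parent;
}

/* Follow parent links up to the top-level transaction's XID. */
uint32_t toy_topmost(uint32_t xid)
{
    assert(xid < TOY_NXIDS);
    while (toy_parent[xid] != TOY_INVALID_XID)
        xid = toy_parent[xid];
    return xid;
}
```

If XID 6 is a child of 5 and XID 7 is a child of 6, then toy_topmost(7)
walks 7, 6, 5 and returns 5, while an XID with no recorded parent is its own
topmost transaction.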

slru.c is the supporting mechanism for both pg_clog and pg_subtrans. It
implements the LRU policy for in-memory buffer pages. The high-level routines
for pg_clog are implemented in transam.c, while the low-level functions are in
clog.c. pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery. It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping. Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe. This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions. To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN. This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).
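
The replay-time LSN test is what makes redo safe to repeat. The sketch below
shows the idea on a toy page holding a single integer. ToyPage, ToyRecord and
toy_redo are hypothetical, and the LSN is represented as a plain 64-bit
integer purely for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint64_t lsn;       /* LSN of the latest record applied to this page */
    int      value;     /* toy page contents */
} ToyPage;

typedef struct
{
    uint64_t lsn;       /* WAL location of this record */
    int      delta;     /* toy change: add delta to the page value */
} ToyRecord;

/* Apply rec to page unless the page already reflects it. */
bool toy_redo(ToyPage *page, const ToyRecord *rec)
{
    assert(page != NULL && rec != NULL);

    if (page->lsn >= rec->lsn)
        return false;           /* change is already applied: skip it */

    page->value += rec->delta;
    page->lsn = rec->lsn;       /* stamp the page with the record's LSN */
    return true;
}
```

Replaying the same record twice changes the page only once, because the
first replay advances the page LSN to the record's location.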

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages). This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state. Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages. The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update. (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.) We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
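
The "first change after checkpoint" test can be sketched as a comparison
between the page's LSN and the redo pointer. Everything below is a toy under
stated assumptions: ToyWal, ToyPage, toy_log_change and toy_checkpoint are
hypothetical, LSNs are plain 64-bit integers, the record sizes are arbitrary,
and treating an LSN equal to the redo pointer as "before the checkpoint" is a
choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t lsn;           /* LSN of the last record that touched the page */
} ToyPage;

typedef struct
{
    uint64_t redoRecPtr;    /* end of WAL as of the last checkpoint */
    uint64_t insertPos;     /* current end of WAL */
} ToyWal;

/*
 * Log one change to the page.  Returns true if a full page image was
 * logged, false if only an incremental record was needed.
 */
bool toy_log_change(ToyWal *wal, ToyPage *page)
{
    bool full_page;

    assert(wal->redoRecPtr <= wal->insertPos);

    /* Untouched since the checkpoint?  Then log the whole page. */
    full_page = (page->lsn <= wal->redoRecPtr);

    wal->insertPos += full_page ? 8192 : 64;    /* arbitrary toy sizes */
    page->lsn = wal->insertPos;
    return full_page;
}

/* A checkpoint moves the redo pointer up to the current end of WAL. */
void toy_checkpoint(ToyWal *wal)
{
    wal->redoRecPtr = wal->insertPos;
}
```

In this model the first change to a page after each checkpoint logs a full
image, later changes before the next checkpoint log only the increment, and a
new checkpoint starts the cycle again.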

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)

5. Build a WAL record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location. For instance,

        recptr = XLogInsert(rmgr_id, info, rdata);

        PageSetLSN(dp, recptr);
        PageSetTLI(dp, ThisTimeLineID);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

XLogInsert's "rdata" argument is an array of pointer/size items identifying
chunks of data to be written in the XLOG record, plus optional shared-buffer
IDs for chunks that are in shared buffers rather than temporary variables.
The "rdata" array must mention (at least once) each of the shared buffers
being modified, unless the action is such that the WAL replay routine can
reconstruct the entire page contents. XLogInsert includes the logic that
tests to see whether a shared buffer has been modified since the last
checkpoint. If not, the entire page contents are logged rather than just the
portion(s) pointed to by "rdata".

Because XLogInsert drops the rdata components associated with buffers it
chooses to log in full, the WAL replay routines normally need to test to see
which buffers were handled that way --- otherwise they may be misled about
what the XLOG record actually contains. XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit. An example of the trickiness
is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
page and BKP(2) is associated with the destination page --- but if these are
the same page, only BKP(1) would have been set.

For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
one or a few pages. The current XLOG infrastructure cannot handle WAL records
involving references to more than four shared buffers, anyway.

In the case where the WAL record contains enough information to re-generate
the entire contents of a page, do *not* show that page's buffer ID in the
rdata array, even if some of the rdata items point into the buffer. This is
because you don't want XLogInsert to log the whole page contents. The
standard replay-routine pattern for this case is

        buffer = XLogReadBuffer(rnode, blkno, true);
        Assert(BufferIsValid(buffer));
        page = (Page) BufferGetPage(buffer);

        ... initialize the page ...

        PageSetLSN(page, lsn);
        PageSetTLI(page, ThisTimeLineID);
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);

In the case where the WAL record provides only enough information to
incrementally update the page, the rdata array *must* mention the buffer
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is

        if (record->xl_info & XLR_BKP_BLOCK_n)
            << do nothing, page was rewritten from logged copy >>;

        buffer = XLogReadBuffer(rnode, blkno, false);
        if (!BufferIsValid(buffer))
            << do nothing, page has been deleted >>;
        page = (Page) BufferGetPage(buffer);

        if (XLByteLE(lsn, PageGetLSN(page)))
        {
            /* changes are already applied */
            UnlockReleaseBuffer(buffer);
            return;
        }

        ... apply the change ...

        PageSetLSN(page, lsn);
        PageSetTLI(page, ThisTimeLineID);
        MarkBufferDirty(buffer);
        UnlockReleaseBuffer(buffer);

As noted above, for a multi-page update you need to be able to determine
which XLR_BKP_BLOCK_n flag applies to each page. If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
pages don't count for the XLR_BKP_BLOCK_n numbering. (XLR_BKP_BLOCK_n is
associated with the n'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)

Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
The answer is that the WAL replay logic has to be able to fix things up.
In btree indexes, for example, a page split requires insertion of a new key in
the parent btree level, but for locking reasons this has to be reflected by
two separate WAL records. The replay code has to remember "unfinished" split
operations, and match them up to subsequent insertions in the parent level.
If no matching insert has been found by the time the WAL replay ends, the
replay code has to do the insertion on its own to restore the index to
consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.


Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions. Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition. In such cases we can just reclaim
the space for re-use.

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs. So cleaning up isn't really necessary.

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, i.e., we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.
|
|
|
|
In all of these cases, if WAL replay fails to redo the original action
|
|
we must panic and abort recovery. The DBA will have to manually clean up
|
|
(for instance, free up some disk space or fix directory permissions) and
|
|
then restart recovery. This is part of the reason for not writing a WAL
|
|
entry until we've successfully done the original action.
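
The ordering used in cases 2 and 3 above can be illustrated with a small
toy model. This is only a sketch in Python, not PostgreSQL code: the "WAL"
is a plain list, and the names create_relation_file and drop_relation_file
are hypothetical helpers invented for this example.

```python
import os
import tempfile
import warnings


def create_relation_file(path, wal):
    """Case 2: create the file first; write WAL only if that succeeded."""
    # O_EXCL makes creation fail on a name collision instead of reusing a file.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    os.close(fd)
    # A crash between these two steps would leave an unlogged "orphan" file.
    wal.append(("create", path))


def drop_relation_file(path, wal):
    """Case 3: the WAL record comes first; a failed unlink is only a warning."""
    wal.append(("drop", path))
    try:
        os.unlink(path)
    except OSError as exc:
        # The drop is already committed, so raising an error is not an option.
        warnings.warn(f"could not remove {path}: {exc}")


with tempfile.TemporaryDirectory() as tmp:
    wal = []
    path = os.path.join(tmp, "16384")
    create_relation_file(path, wal)
    drop_relation_file(path, wal)
    # A second drop hits a missing file: it warns but is still logged.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        drop_relation_file(path, wal)
```

If os.open() raises in create_relation_file, nothing has been appended to
the log, so replay never has to redo an action that did not succeed in the
first place.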


Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off. Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory. The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem. Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.
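
The commit-time decision described above can be sketched as follows. This
is a toy Python model under stated assumptions, not the backend's code:
ToyWal, record_commit and the flag names are hypothetical, and LSNs are
plain integers.

```python
from dataclasses import dataclass


@dataclass
class ToyWal:
    insert_lsn: int = 0        # end of the last record appended
    flushed_lsn: int = 0       # everything up to here is durable
    async_commit_lsn: int = 0  # latest async commit noted in "shared memory"

    def append(self, size):
        self.insert_lsn += size
        return self.insert_lsn

    def flush(self, upto):
        self.flushed_lsn = max(self.flushed_lsn, upto)


def record_commit(wal, synchronous_commit, deletes_relations=False,
                  forces_sync=False, record_size=32):
    """Append a commit record, then either flush it or just note its LSN."""
    lsn = wal.append(record_size)
    if synchronous_commit or deletes_relations or forces_sync:
        wal.flush(lsn)  # wait until the commit record is durable
    else:
        wal.async_commit_lsn = max(wal.async_commit_lsn, lsn)
    return lsn


wal = ToyWal()
first = record_commit(wal, synchronous_commit=False)
second = record_commit(wal, synchronous_commit=False, deletes_relations=True)
```

In this model the first commit only notes its LSN, while the second is
flushed even though synchronous_commit is off, because it deletes
relations.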

Every wal_writer_delay milliseconds, the walwriter process performs an
XLogBackgroundFlush(). This checks the location of the last completely
filled WAL page. If that has moved forwards, then we write all the changed
buffers up to that point, so that under full load we write only whole
buffers. If there has been a break in activity and the current WAL page is
the same as before, then we find out the LSN of the most recent
asynchronous commit, and flush up to that point, if required (i.e.,
if it's in the current WAL page). This arrangement in itself would
guarantee that an async commit record reaches disk during at worst the
second walwriter cycle after the transaction completes. However, we also
allow XLogFlush to flush full buffers "flexibly" (i.e., not wrapping around
at the end of the circular WAL buffer area), so as to minimize the number
of writes issued under high load when multiple WAL pages are filled per
walwriter cycle. This makes the worst-case delay three walwriter cycles.
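
A rough sketch of the choice made on each walwriter cycle, ignoring the
"flexible" write behaviour and the circular buffer entirely. The function
name, its parameters and the page size are assumptions made for this
illustration only.

```python
WAL_PAGE_SIZE = 8192  # illustrative WAL page size in bytes


def background_flush_target(insert_lsn, written_lsn, async_commit_lsn,
                            page_size=WAL_PAGE_SIZE):
    """Return the LSN to write up to this cycle, or None if nothing to do."""
    last_full_page_end = (insert_lsn // page_size) * page_size
    if last_full_page_end > written_lsn:
        # New pages were completely filled: write whole pages only.
        return last_full_page_end
    if async_commit_lsn > written_lsn:
        # Quiet period: an async commit sits in the current partial page.
        return async_commit_lsn
    return None


# An async commit at LSN 19000 lands in a partly filled page.
cycle1 = background_flush_target(20000, 8192, 19000)
cycle2 = background_flush_target(20000, cycle1, 19000)
cycle3 = background_flush_target(20000, cycle2, 19000)
```

Here the first cycle writes only the completed pages, and the commit record
itself is flushed on the second cycle, which matches the two-cycle bound
stated above for the simple arrangement.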

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages. Otherwise the record of the
commit might reach disk before the WAL record does. Again, abort records
need not factor into this consideration.

In fact, we store more than one LSN for each clog page. This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization. Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit. It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory. However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page. We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)? If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway. The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later. This argues for keeping N (the group size)
as small as possible. For the moment we are setting the group size to 32,
which makes the LSN cache space the same size as the actual clog buffer
space (independently of BLCKSZ).
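
The storage arithmetic and the worst case described above can be checked
with a short toy model. It assumes an 8K page and a 64-bit LSN; the
constant names and the ToyClogPage class are illustrative and are not
taken from the source code.

```python
BLCKSZ = 8192                 # assumed clog page size in bytes
XACTS_PER_BYTE = 4            # two status bits per transaction
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE
LSN_BYTES = 8                 # assumed 64-bit LSN
XACTS_PER_LSN_GROUP = 32      # the group size N from the text
LSNS_PER_PAGE = XACTS_PER_PAGE // XACTS_PER_LSN_GROUP

naive_lsn_bytes = XACTS_PER_PAGE * LSN_BYTES    # one LSN per transaction
grouped_lsn_bytes = LSNS_PER_PAGE * LSN_BYTES   # one LSN per group


class ToyClogPage:
    """Tracks, per group of XIDs, the highest unflushed async commit LSN."""

    def __init__(self):
        self.group_lsn = [0] * LSNS_PER_PAGE

    @staticmethod
    def group_of(xid):
        return (xid % XACTS_PER_PAGE) // XACTS_PER_LSN_GROUP

    def record_async_commit(self, xid, commit_lsn):
        g = self.group_of(xid)
        self.group_lsn[g] = max(self.group_lsn[g], commit_lsn)

    def may_set_hint(self, xid, flushed_lsn):
        # Defer the hint bit until WAL is flushed past the group's LSN.
        return self.group_lsn[self.group_of(xid)] <= flushed_lsn


page = ToyClogPage()
# Xact 100 committed synchronously and WAL is flushed up to LSN 1000.
# Xact 101, in the same group, then commits asynchronously at LSN 1040.
page.record_async_commit(101, 1040)
hint_before = page.may_set_hint(100, flushed_lsn=1000)
hint_after = page.may_set_hint(100, flushed_lsn=1040)
```

With these numbers the naive scheme costs 256K per page and the grouped
scheme 8K, and xact 100 cannot be hinted until the later async commit in
its group has been flushed.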

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious. Assume we have two transactions, T1 and T2. A Log
Sequence Number (LSN) identifies a position in the WAL stream; let LSN1
and LSN2 be the positions of the commit records of T1 and T2. If T2 can
see changes made by T1 then when T2 commits it must be true that LSN2
follows LSN1. Thus when T2 commits it is certain that all of the changes
made by T1 are also now recorded in the WAL. This is true whether T1 was
asynchronous or synchronous. As a result, it is safe for asynchronous
commits and synchronous commits to work concurrently without endangering
data written by synchronous commits. Sub-transactions are not important
here since the final write to disk only occurs at the commit of the top
level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks. Any attempt to write unsafe data to
disk will first trigger a WAL flush, which ensures the safety of all data
written by that and prior transactions. Data blocks and clog pages are
both protected by LSNs.
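
The flush-WAL-before-write rule can be modelled in a few lines. This is a
sketch only: FlushTracker and write_page_to_disk are hypothetical names,
and the "disk" is a dictionary.

```python
class FlushTracker:
    """Minimal stand-in for the WAL: remembers how far it has been flushed."""

    def __init__(self):
        self.flushed_lsn = 0
        self.flush_calls = 0

    def flush(self, upto):
        # Flushing is a no-op when WAL is already durable past this point.
        if upto > self.flushed_lsn:
            self.flushed_lsn = upto
            self.flush_calls += 1


def write_page_to_disk(page_id, page_lsn, wal, disk):
    """Never let a page reach disk ahead of the WAL that describes it."""
    wal.flush(page_lsn)
    disk[page_id] = page_lsn


wal = FlushTracker()
disk = {}
write_page_to_disk("heap:7", 500, wal, disk)  # forces WAL out to LSN 500
write_page_to_disk("clog:0", 300, wal, disk)  # already covered, no flush
```

Because flushing up to an LSN also covers every earlier record, forcing out
T2's commit at LSN2 necessarily makes T1's earlier commit at LSN1 durable
as well.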

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes made via any of the paths we have introduced to avoid WAL
overhead for bulk updates are also safe. In these cases it's entirely
possible for the data to reach disk before T1's commit, because T1 will
fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update. However, all these paths are designed to write data that
no other transaction can see until after T1 commits. The situation is thus
not different from ordinary WAL-logged updates.

Transaction Emulation during Recovery
-------------------------------------

During recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read-only backends can take MVCC snapshots. We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits. Further details are given in comments in
procarray.c.
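
As a rough illustration of the bookkeeping described above, the following
toy model tracks which XIDs are still in progress while records are
replayed. The record format and the ReplayXidTracker class are invented
for this sketch; the real data structure in procarray.c is considerably
more involved.

```python
class ReplayXidTracker:
    """Tracks XIDs seen in replayed WAL that have not yet finished."""

    def __init__(self):
        self.running = set()

    def replay(self, record):
        kind, xid = record
        if kind in ("commit", "abort"):
            self.running.discard(xid)
        else:
            # Any record carrying an XID means that xact wrote to the database.
            self.running.add(xid)

    def snapshot(self):
        # A read-only backend treats these XIDs as still in progress.
        return frozenset(self.running)


tracker = ReplayXidTracker()
for rec in [("insert", 10), ("insert", 11), ("commit", 10)]:
    tracker.replay(rec)
snap = tracker.snapshot()
```

Transactions that never wrote WAL never appear in the set, which is
consistent with treating them as if they had never occurred.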

Many actions write no WAL records at all, for example read-only
transactions. These have no effect on MVCC in recovery and we can pretend
they never occurred at all. Subtransaction commit does not write a WAL
record either and has very little effect, since lock waiters need to wait
for the parent transaction to complete.

Not all transactional behaviour is emulated; for example, we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory. Clog entries are made normally. Multitrans is not maintained
because its purpose is to record tuple-level locks that an application has
requested to prevent write locks. Since write locks cannot be obtained at all,
there is never any conflict and so there is no reason to update multitrans.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly. Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.

Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.