src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system. The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

    StartTransactionCommand
    CommitTransactionCommand
    AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE. The traffic cop
redirects these calls to the toplevel routines

    BeginTransactionBlock
    EndTransactionBlock
    UserAbortTransactionBlock
    DefineSavepoint
    RollbackToSavepoint
    ReleaseSavepoint

respectively. Depending on the current state of the system, these functions
call low level functions to activate the real transaction system:

    StartTransaction
    CommitTransaction
    AbortTransaction
    CleanupTransaction
    StartSubTransaction
    CommitSubTransaction
    AbortSubTransaction
    CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction. Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command. (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)
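
The visibility rule that the command counter provides can be sketched in a
few lines of C. This is an illustration only, not xact.c's actual code; the
Sketch-suffixed names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t CommandId;

/* Hypothetical per-backend command counter, reset at transaction start. */
static CommandId currentCommandId = 0;

/* Analogue of CommandCounterIncrement: later commands get a larger id. */
static void CommandCounterIncrementSketch(void)
{
    currentCommandId++;
}

/*
 * Within one transaction, a tuple inserted by the command with id "cmin"
 * becomes visible only once an increment has intervened, i.e. when
 * cmin < currentCommandId.
 */
static bool TupleVisibleToCurrentCommandSketch(CommandId cmin)
{
    return cmin < currentCommandId;
}
```

A command never sees its own in-progress changes under this rule, which is
exactly why the utility functions mentioned above call the increment
internally before re-reading the catalogs.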

For example, consider the following sequence of user commands:

1)      BEGIN
2)      SELECT * FROM foo
3)      INSERT INTO foo VALUES (...)
4)      COMMIT

In the main processing loop, this results in the following function call
sequence:

        /  StartTransactionCommand;
       /     StartTransaction;
1) <   ProcessUtility;                  << BEGIN
       \     BeginTransactionBlock;
        \  CommitTransactionCommand;

        /  StartTransactionCommand;
2)     /   ProcessQuery;                << SELECT ...
       \   CommitTransactionCommand;
        \    CommandCounterIncrement;

        /  StartTransactionCommand;
3)     /   ProcessQuery;                << INSERT ...
       \   CommitTransactionCommand;
        \    CommandCounterIncrement;

        /  StartTransactionCommand;
       /   ProcessUtility;              << COMMIT
4) <   EndTransactionBlock;
       \   CommitTransactionCommand;
        \    CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.
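
The required state-smartness can be caricatured as a tiny dispatch on the
block state. This is a deliberately simplified sketch (the real TBlockState
enum has many more members and the real functions do the work themselves
rather than returning an action code); all Sketch names are hypothetical:

```c
/* A greatly simplified subset of xact.c's block states. */
typedef enum
{
    TBLOCK_SKETCH_DEFAULT,      /* no explicit transaction block open */
    TBLOCK_SKETCH_INPROGRESS    /* inside BEGIN ... COMMIT */
} BlockStateSketch;

typedef enum
{
    ACTION_START_XACT,          /* start a fresh low-level transaction */
    ACTION_NOOP                 /* block already open: nothing to start */
} StartActionSketch;

typedef enum
{
    ACTION_COMMIT_XACT,         /* close the single-query transaction */
    ACTION_CCI                  /* inside a block: just increment the counter */
} CommitActionSketch;

/* StartTransactionCommand must behave differently depending on state. */
static StartActionSketch StartTransactionCommandSketch(BlockStateSketch s)
{
    return (s == TBLOCK_SKETCH_DEFAULT) ? ACTION_START_XACT : ACTION_NOOP;
}

/* CommitTransactionCommand likewise: commit outside a block, CCI inside. */
static CommitActionSketch CommitTransactionCommandSketch(BlockStateSketch s)
{
    return (s == TBLOCK_SKETCH_DEFAULT) ? ACTION_COMMIT_XACT : ACTION_CCI;
}
```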

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state. In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1) system dies from some internal cause (syntax error, etc)
2) user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

        case 1                          case 2
        ------                          ------
1) user types BEGIN                     1) user types BEGIN
2) user does something                  2) user does something
3) user does not like what              3) system aborts for some reason
   she sees and types ABORT                (syntax error, etc)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock. Both of them rely on AbortTransaction
to do all the real work. The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided in two phases:

* AbortTransaction executes as soon as we realize the transaction has
  failed. It should release all shared resources (locks etc) so that we do
  not delay other backends unnecessarily.

* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely. In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed. The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction. So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c consists of routines to support the creation
and finishing of transactions and subtransactions. For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.

Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct. When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting). In either case, PopTransaction is
called so the system returns to the parent transaction.
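
The stack discipline described above can be sketched as follows. This is a
minimal illustration, not the real TransactionState struct (which carries far
more fields); the Sketch names are hypothetical:

```c
#include <stdlib.h>

/* Minimal analogue of xact.c's TransactionState stack. */
typedef struct TransactionStateSketch
{
    int nestingLevel;                       /* 1 = top-level transaction */
    struct TransactionStateSketch *parent;  /* NULL for the top level */
} TransactionStateSketch;

static TransactionStateSketch topStateSketch = {1, NULL};
static TransactionStateSketch *CurrentStateSketch = &topStateSketch;

/* PushTransaction analogue: open a subtransaction one level deeper. */
static void PushTransactionSketch(void)
{
    TransactionStateSketch *s = malloc(sizeof(*s));
    s->nestingLevel = CurrentStateSketch->nestingLevel + 1;
    s->parent = CurrentStateSketch;
    CurrentStateSketch = s;
}

/* PopTransaction analogue: return to the parent after commit or cleanup. */
static void PopTransactionSketch(void)
{
    TransactionStateSketch *s = CurrentStateSketch;
    CurrentStateSketch = s->parent;
    free(s);
}
```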

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command. That's because savepoints
have names, and we allow the user to commit or roll back a savepoint by name,
which is not necessarily the one that was last opened. Also a COMMIT or
ROLLBACK command must be able to close out the entire stack. We handle this by
having the utility command subroutine mark all the state stack entries as
commit-pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done. The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name. So it's a completely new
subtransaction as far as the internals are concerned.
2004-08-01 22:57:59 +02:00
|
|
|
|
|
|
|
Other subsystems are allowed to start "internal" subtransactions, which are
|
|
|
|
handled by BeginInternalSubtransaction. This is to allow implementing
|
|
|
|
exception handling, e.g. in PL/pgSQL. ReleaseCurrentSubTransaction and
|
|
|
|
RollbackAndReleaseCurrentSubTransaction allows the subsystem to close said
|
|
|
|
subtransactions. The main difference between this and the savepoint/release
|
2004-09-16 18:58:44 +02:00
|
|
|
path is that we execute the complete state transition immediately in each
|
|
|
|
subroutine, rather than deferring some work until CommitTransactionCommand.
|
|
|
|
Another difference is that BeginInternalSubtransaction is allowed when no
|
|
|
|
explicit transaction block has been established, while DefineSavepoint is not.

Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent. This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks. For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction. VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory. To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start. All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
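
The two-field scheme can be sketched like this; note how assignment touches
only backend-local state, which is the whole point. This is an illustrative
sketch with hypothetical names, not the real VirtualTransactionId code:

```c
#include <stdbool.h>
#include <stdint.h>

/* A VXID is (backendID, backend-local counter); no shared state needed. */
typedef struct
{
    int      backendId;
    uint64_t localTransactionId;
} VirtualTransactionIdSketch;

/* Per-backend; restored from shared memory at backend start (not shown). */
static uint64_t localCounterSketch = 0;

static VirtualTransactionIdSketch AssignVXIDSketch(int backendId)
{
    VirtualTransactionIdSketch v;
    v.backendId = backendId;
    v.localTransactionId = ++localCounterSketch;  /* no lock: backend-local */
    return v;
}

static bool VXIDEqualsSketch(VirtualTransactionIdSketch a,
                             VirtualTransactionIdSketch b)
{
    return a.backendId == b.backendId &&
           a.localTransactionId == b.localTransactionId;
}
```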

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures. Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction. The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up. (Zero is reserved for
InvalidSubTransactionId.) Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.

Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A. Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken. (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.) The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyPgXact->xid at transaction end (either commit or abort).

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable. This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.
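
The way a snapshot then classifies an arbitrary XID can be sketched as
follows. This is a simplified illustration with hypothetical names: the real
test also distinguishes committed from aborted via the clog, handles
subtransactions, and copes with XID wraparound, all ignored here:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionIdSketch;

/* xmax is latestCompletedXid + 1; xip holds in-progress XIDs below xmax. */
typedef struct
{
    TransactionIdSketch xmax;
    TransactionIdSketch xip[16];
    int                 xcnt;
} SnapshotSketch;

/* Does this snapshot consider xid as completed (not still running)? */
static bool XidCompletedInSnapshotSketch(TransactionIdSketch xid,
                                         SnapshotSketch *snap)
{
    if (xid >= snap->xmax)
        return false;           /* not yet completed when snapshot was taken */
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return false;       /* was running when snapshot was taken */
    return true;                /* completed before the snapshot */
}
```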

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot. However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start. But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore. (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into MyPgXact, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray. That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
subxid array) without taking ProcArrayLock. This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance. We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID. This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)
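
The fetch-once idiom looks like this in a stripped-down sketch (hypothetical
types; the real readers work on PGXACT entries). Reading the field twice
would let a concurrent store change it between the reads, so all tests must
run against one local copy taken through a volatile-qualified pointer:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionIdSketch;

typedef struct
{
    TransactionIdSketch xid;    /* concurrently stored by the owning backend */
} PgXactSketch;

static bool XactIsMineSketch(PgXactSketch *proc, TransactionIdSketch myxid)
{
    volatile PgXactSketch *vproc = proc;
    TransactionIdSketch xid = vproc->xid;   /* single fetch into a local */

    /* Both tests use the same fetched value, never re-read proc->xid. */
    return xid != 0 && xid == myxid;
}
```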

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide. Each individual backend advertises the smallest
xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction). GetOldestXmin takes the MIN() of the
valid xmin fields. It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns. We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins. The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate. If there is no active transaction at
all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
for the xmin that might be computed by concurrent or later GetSnapshotData
calls. (We know that no XID less than this could be about to appear in
the ProcArray, because of the XidGenLock interlock discussed above.)
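
The MIN() calculation just described can be sketched as below. This is an
illustration with hypothetical names; the real code iterates the shared
ProcArray under shared ProcArrayLock and must also handle XID wraparound
comparisons, which this sketch ignores:

```c
#include <stdint.h>

typedef uint32_t TransactionIdSketch;

typedef struct
{
    TransactionIdSketch xid;    /* 0 if no XID assigned */
    TransactionIdSketch xmin;   /* 0 if no live snapshot */
} ProcEntrySketch;

/*
 * MIN() over all valid xmins *and* all active XIDs, falling back to
 * latestCompletedXid + 1 when nothing at all is running.
 */
static TransactionIdSketch
GetOldestXminSketch(ProcEntrySketch *procs, int nprocs,
                    TransactionIdSketch latestCompletedXid)
{
    TransactionIdSketch result = latestCompletedXid + 1;

    for (int i = 0; i < nprocs; i++)
    {
        if (procs[i].xid != 0 && procs[i].xid < result)
            result = procs[i].xid;
        if (procs[i].xmin != 0 && procs[i].xmin < result)
            result = procs[i].xmin;
    }
    return result;
}
```

Including the active XIDs is what makes the result safe against a backend
that has an XID but has not yet advertised an xmin.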

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive. Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin. This is because we allow XID-less
transactions to clear their MyPgXact->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not. This is OK since RecentGlobalXmin need only be a valid
lower bound. As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.

pg_clog and pg_subtrans
-----------------------

pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
information. There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk. However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk. They also allow the information to persist across server restarts.

pg_clog records the commit status for each transaction that has been assigned
an XID. A transaction can be in progress, committed, aborted, or
"sub-committed". This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet. It is not
necessary to update a subtransaction's transaction status to sub-committed, so
we can just defer it until main transaction commit. The main role of marking
transactions as sub-committed is to provide an atomic commit protocol when
transaction status is spread across multiple clog pages. As a result, whenever
transaction status spreads across multiple pages we must use a two-phase commit
protocol: the first phase is to mark the subtransactions as sub-committed, then
we mark the top level transaction and all its subtransactions committed (in
that order). Thus, subtransactions that have not aborted appear as in-progress
even when they have already finished, and the sub-commit status appears as a
very short transitory state during main transaction commit. Subtransaction
abort is always marked in clog as soon as it occurs. When the transaction
statuses all fit on a single CLOG page, we atomically mark them all as
committed without bothering with the intermediate sub-commit state.
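
The ordering of the multi-page protocol can be sketched as below. This is a
toy model with hypothetical names: a flat array stands in for the clog, and
the per-page atomic updates of the real code are not modeled. The invariant
it illustrates is that a reader never sees a committed child whose parent is
still marked in-progress:

```c
typedef enum
{
    XACT_SKETCH_IN_PROGRESS,
    XACT_SKETCH_SUB_COMMITTED,
    XACT_SKETCH_COMMITTED,
    XACT_SKETCH_ABORTED
} ClogStatusSketch;

#define SKETCH_NXACTS 8
static ClogStatusSketch clogSketch[SKETCH_NXACTS]; /* all IN_PROGRESS at start */

static void
TransactionIdCommitTreeSketch(int topXid, const int *subXids, int nsub)
{
    int i;

    /* Phase one: every subtransaction becomes SUB_COMMITTED. */
    for (i = 0; i < nsub; i++)
        clogSketch[subXids[i]] = XACT_SKETCH_SUB_COMMITTED;

    /* The top-level transaction's update is the atomic commit point. */
    clogSketch[topXid] = XACT_SKETCH_COMMITTED;

    /* Phase two: resolve the transitory sub-commit state. */
    for (i = 0; i < nsub; i++)
        clogSketch[subXids[i]] = XACT_SKETCH_COMMITTED;
}
```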

Savepoints are implemented using subtransactions. A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed. To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction. This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGXACT struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk. Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed. See storage/ipc/procarray.c for the gory details.
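
The parent-link lookup amounts to walking the chain upward until reaching a
transaction whose parent entry is zero. A toy sketch (hypothetical names; a
flat array stands in for the on-disk pg_subtrans pages):

```c
#include <stdint.h>

typedef uint32_t TransactionIdSketch;

#define SKETCH_MAX_XID 64
/* parentSketch[x] == 0 (InvalidTransactionId) means x is top-level. */
static TransactionIdSketch parentSketch[SKETCH_MAX_XID];

/* Follow parent links up to the top-level transaction. */
static TransactionIdSketch
GetTopmostTransactionSketch(TransactionIdSketch xid)
{
    while (parentSketch[xid] != 0)
        xid = parentSketch[xid];
    return xid;
}
```

Mapping any XID to its top-level ancestor this way is what lets a tuple
stamped with a subtransaction's XID be judged by the fate of the whole
transaction tree.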

slru.c is the supporting mechanism for both pg_clog and pg_subtrans. It
implements the LRU policy for in-memory buffer pages. The high-level routines
for pg_clog are implemented in transam.c, while the low-level functions are in
clog.c. pg_subtrans is contained completely in subtrans.c.

Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery. It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping. Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe. This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions. To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN. This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.
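
The flush-before-write rule can be sketched in a few lines. This is an
illustration with hypothetical names, not bufmgr code; the stand-in "flush"
merely advances a high-water mark where the real code fsyncs WAL:

```c
#include <stdint.h>

typedef uint64_t XLogRecPtrSketch;

static XLogRecPtrSketch flushedUpToSketch = 0;  /* how far WAL is on disk */

static void XLogFlushSketch(XLogRecPtrSketch upto)
{
    if (upto > flushedUpToSketch)
        flushedUpToSketch = upto;       /* stand-in for fsyncing WAL files */
}

/*
 * The bufmgr rule: before a dirty page may be written out, WAL must be
 * flushed at least up to that page's LSN.
 */
static void FlushBufferSketch(XLogRecPtrSketch pageLSN)
{
    XLogFlushSketch(pageLSN);   /* no-op if WAL is already far enough */
    /* ... now it is safe to write the data page itself ... */
}
```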

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages). This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state. Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages. The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update. (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.) We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
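
The first-change-after-checkpoint test reduces to a single LSN comparison,
sketched here with hypothetical names (the real code also honors settings
such as whether full-page writes are enabled at all, ignored in this sketch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtrSketch;

/*
 * If the page's old LSN precedes the redo pointer of the latest checkpoint,
 * this is the first WAL-logged change to the page since that checkpoint, so
 * the record must carry a full copy of the page rather than just the
 * incremental change.
 */
static bool NeedsFullPageImageSketch(XLogRecPtrSketch pageLSN,
                                     XLogRecPtrSketch redoRecPtr)
{
    return pageLSN <= redoRecPtr;
}
```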

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should happen if
and only if you write a WAL record; see Writing Hints below.
5. If the relation requires WAL-logging, build a WAL log record and pass it
to XLogInsert(); then update the page's LSN using the returned XLOG
location.  For instance,

		recptr = XLogInsert(rmgr_id, info, rdata);

		PageSetLSN(dp, recptr);
		// Note that we no longer do PageSetTLI() from 9.3 onwards
		// since that field on a page has now changed its meaning.

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

XLogInsert's "rdata" argument is an array of pointer/size items identifying
chunks of data to be written in the XLOG record, plus optional shared-buffer
IDs for chunks that are in shared buffers rather than temporary variables.
The "rdata" array must mention (at least once) each of the shared buffers
being modified, unless the action is such that the WAL replay routine can
reconstruct the entire page contents.  XLogInsert includes the logic that
tests to see whether a shared buffer has been modified since the last
checkpoint.  If not, the entire page contents are logged rather than just the
portion(s) pointed to by "rdata".

Because XLogInsert drops the rdata components associated with buffers it
chooses to log in full, the WAL replay routines normally need to test to see
which buffers were handled that way --- otherwise they may be misled about
what the XLOG record actually contains.  XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit.  An example of the trickiness
is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
page and BKP(1) is associated with the destination page --- but if these are
the same page, only BKP(0) would have been set.

For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
one or a few pages.  The current XLOG infrastructure cannot handle WAL records
involving references to more than four shared buffers, anyway.

In the case where the WAL record contains enough information to re-generate
the entire contents of a page, do *not* show that page's buffer ID in the
rdata array, even if some of the rdata items point into the buffer.  This is
because you don't want XLogInsert to log the whole page contents.  The
standard replay-routine pattern for this case is

	buffer = XLogReadBuffer(rnode, blkno, true);
	Assert(BufferIsValid(buffer));
	page = (Page) BufferGetPage(buffer);

	... initialize the page ...

	PageSetLSN(page, lsn);
	MarkBufferDirty(buffer);
	UnlockReleaseBuffer(buffer);

In the case where the WAL record provides only enough information to
incrementally update the page, the rdata array *must* mention the buffer
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is

	if (record->xl_info & XLR_BKP_BLOCK(N))
	{
		/* apply the change from the full-page image */
		(void) RestoreBackupBlock(lsn, record, N, false, false);
		return;
	}

	buffer = XLogReadBuffer(rnode, blkno, false);
	if (!BufferIsValid(buffer))
	{
		/* page has been deleted, so we need do nothing */
		return;
	}
	page = (Page) BufferGetPage(buffer);

	if (XLByteLE(lsn, PageGetLSN(page)))
	{
		/* changes are already applied */
		UnlockReleaseBuffer(buffer);
		return;
	}

	... apply the change ...

	PageSetLSN(page, lsn);
	MarkBufferDirty(buffer);
	UnlockReleaseBuffer(buffer);

As noted above, for a multi-page update you need to be able to determine
which XLR_BKP_BLOCK(N) flag applies to each page.  If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
pages don't count for the XLR_BKP_BLOCK(N) numbering.  (XLR_BKP_BLOCK(N) is
associated with the N'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)

When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state.  If this requires that two
or more buffer locks be held concurrently, the coding pattern shown above
is too simplistic, since it assumes the routine can exit as soon as it's
known the current page requires no modification.  Instead, you might have
something like:

	if (record->xl_info & XLR_BKP_BLOCK(0))
	{
		/* apply the change from the full-page image */
		buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
	}
	else
	{
		buffer0 = XLogReadBuffer(rnode, blkno, false);
		if (BufferIsValid(buffer0))
		{
			... apply the change if not already done ...
			MarkBufferDirty(buffer0);
		}
	}

	... similarly apply the changes for remaining pages ...

	/* and now we can release the lock on the first page */
	if (BufferIsValid(buffer0))
		UnlockReleaseBuffer(buffer0);

Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised.  Only the Startup process may modify data blocks during
recovery, so the Startup process may execute PageGetLSN() without fear of
serialisation problems.  All other processes must only call PageSet/GetLSN
when holding either an exclusive buffer lock or a shared lock plus buffer
header lock, or must be writing the data block directly rather than through
shared buffers while holding AccessExclusiveLock on the relation.

Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records.  What do you do if the intermediate states are not self-consistent?
The answer is that the WAL replay logic has to be able to fix things up.
In btree indexes, for example, a page split requires insertion of a new key
in the parent btree level, but for locking reasons this has to be reflected
by two separate WAL records.  The replay code has to remember "unfinished"
split operations, and match them up to subsequent insertions in the parent
level.  If no matching insert has been found by the time the WAL replay
ends, the replay code has to do the insertion on its own to restore the
index to consistency.  Such insertions occur after WAL is operational, so
they can and should write WAL records for the additional generated actions.

Writing Hints
-------------

In some cases, we write additional information to data blocks without
writing a preceding WAL record.  This should happen only if the data can be
reconstructed later following a crash and the action is simply a way of
optimising for performance.  When a hint is written we use
MarkBufferDirtyHint() to mark the block dirty.

If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
inserts an XLOG_HINT record to ensure that we take a full page image that
includes the hint.  We do this to avoid a partial page write when we write
out the dirtied page.  WAL is not written during recovery, so we simply
skip dirtying blocks because of hints when in recovery.

If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(), otherwise you
will expose the risk of partial page writes.

Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers.  For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change.  Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database.  We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable.  This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue.  Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all.  We extend a table by writing a page
of zeroes at its end.  We must actually do this write so that we are sure
the filesystem has allocated the space.  If the write fails we can just
error out normally.  Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions.  Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition.  In such cases we can just reclaim
the space for re-use.

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it.  If not successful, we can just throw an error.  Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk.  If we crash during this window, the file remains
on disk as an "orphan".  It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs.  So cleaning up isn't really necessary.

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives.  Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway.  (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space.  Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway.  There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery.  The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery.  This is part of the reason for not writing a WAL
entry until we've successfully done the original action.

Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off.  Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory.  The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem.  Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.

Every wal_writer_delay milliseconds, the walwriter process performs an
XLogBackgroundFlush().  This checks the location of the last completely
filled WAL page.  If that has moved forwards, then we write all the changed
buffers up to that point, so that under full load we write only whole
buffers.  If there has been a break in activity and the current WAL page is
the same as before, then we find out the LSN of the most recent
asynchronous commit, and flush up to that point, if required (i.e.,
if it's in the current WAL page).  This arrangement in itself would
guarantee that an async commit record reaches disk during at worst the
second walwriter cycle after the transaction completes.  However, we also
allow XLogFlush to flush full buffers "flexibly" (ie, not wrapping around
at the end of the circular WAL buffer area), so as to minimize the number
of writes issued under high load when multiple WAL pages are filled per
walwriter cycle.  This makes the worst-case delay three walwriter cycles.

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages.  Otherwise the record of the
commit might reach disk before the WAL record does.  Again, abort records
need not factor into this consideration.

In fact, we store more than one LSN for each clog page.  This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization.  Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit.  It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory.  However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page.  We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)?  If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway.  The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later.  This argues for keeping N (the group size)
as small as possible.  For the moment we are setting the group size to 32,
which makes the LSN cache space the same size as the actual clog buffer
space (independently of BLCKSZ).

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions.  If T2 can see changes made by T1 then when T2 commits it
must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL.  This
is true whether T1 was asynchronous or synchronous.  As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits.  Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks.  Any attempt to write unsafe data to
disk will trigger a write which ensures the safety of all data written by
that and prior transactions.  Data blocks and clog pages are both protected
by LSNs.

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes made via any of the paths we have introduced to avoid WAL
overhead for bulk updates are also safe.  In these cases it's entirely
possible for the data to reach disk before T1's commit, because T1 will
fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update.  However, all these paths are designed to write data that
no other transaction can see until after T1 commits.  The situation is thus
not different from ordinary WAL-logged updates.

Transaction Emulation during Recovery
-------------------------------------

During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots.  We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits.  Further details are given in comments in
procarray.c.

Many actions write no WAL records at all, for example read only
transactions.  These have no effect on MVCC in recovery and we can pretend
they never occurred at all.  Subtransaction commit does not write a WAL
record either and has very little effect, since lock waiters need to wait
for the parent transaction to complete.

Not all transactional behaviour is emulated; for example, we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory.  Clog entries are made normally.  Multixact is not
maintained because its purpose is to record tuple level locks that an
application has requested to prevent other tuple locks.  Since tuple locks
cannot be obtained at all during recovery, there is never any conflict and
so there is no reason to update multixact.

Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level
TransactionId directly.  Since commit is atomic this provides correct lock
wait behaviour yet simplifies emulation of subtransactions considerably.

Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.