src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system. The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.
The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

    StartTransactionCommand
    CommitTransactionCommand
    AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE. The traffic cop
redirects these calls to the toplevel routines

    BeginTransactionBlock
    EndTransactionBlock
    UserAbortTransactionBlock
    DefineSavepoint
    RollbackToSavepoint
    ReleaseSavepoint

respectively. Depending on the current state of the system, these functions
call low-level functions to activate the real transaction system:

    StartTransaction
    CommitTransaction
    AbortTransaction
    CleanupTransaction
    StartSubTransaction
    CommitSubTransaction
    AbortSubTransaction
    CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction. Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command. (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)
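
As an illustrative sketch of that DefineRelation pattern (simplified; the
real code in tablecmds.c does considerably more):

    /* create the relation; this inserts a new pg_class row */
    relationId = heap_create_with_catalog(...);

    /* make the new pg_class row visible to this command's later operations */
    CommandCounterIncrement();

    /* now the new relation can be opened and locked */
    rel = relation_open(relationId, AccessExclusiveLock);
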
For example, consider the following sequence of user commands:
1) BEGIN
2) SELECT * FROM foo
3) INSERT INTO foo VALUES (...)
4) COMMIT
In the main processing loop, this results in the following function call
sequence:
        /   StartTransactionCommand;
        /       StartTransaction;
    1) <    ProcessUtility;                 << BEGIN
        \       BeginTransactionBlock;
        \   CommitTransactionCommand;

        /   StartTransactionCommand;
    2) /    PortalRunSelect;                << SELECT ...
        \   CommitTransactionCommand;
        \       CommandCounterIncrement;

        /   StartTransactionCommand;
    3) /    ProcessQuery;                   << INSERT ...
        \   CommitTransactionCommand;
        \       CommandCounterIncrement;

        /   StartTransactionCommand;
        /   ProcessUtility;                 << COMMIT
    4) <        EndTransactionBlock;
        \   CommitTransactionCommand;
        \       CommitTransaction;
The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.
Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state. In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.
Transaction aborts can occur in two ways:
1) system dies from some internal cause (syntax error, etc)
2) user types ROLLBACK
The reason we have to distinguish them is illustrated by the following two
situations:
    case 1                                  case 2
    ------                                  ------
    1) user types BEGIN                     1) user types BEGIN
    2) user does something                  2) user does something
    3) user does not like what              3) system aborts for some reason
       she sees and types ABORT                (syntax error, etc)
In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.
Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock. Both of them rely on AbortTransaction
to do all the real work. The only difference is what state we enter after
AbortTransaction does its work:
* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END
Low-level transaction abort handling is divided in two phases:
* AbortTransaction executes as soon as we realize the transaction has
failed. It should release all shared resources (locks etc) so that we do
not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
or ROLLBACK command; it cleans things up and gets us out of the transaction
completely. In particular, we mustn't destroy TopTransactionContext until
this point.
Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed. The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction. So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).
The rest of the code in xact.c consists of routines to support the creation and
finishing of transactions and subtransactions. For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.

Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct. When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).
When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting). In either case, PopTransaction is
called so the system returns to the parent transaction.
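
A much-simplified view of the stack entries involved (the real
TransactionStateData in xact.c carries many more fields):

    typedef struct TransactionStateData
    {
        SubTransactionId subTransactionId;      /* my subxact ID */
        char       *name;                       /* savepoint name, if any */
        TBlockState blockState;                 /* high-level block state */
        struct TransactionStateData *parent;    /* back link to parent */
    } TransactionStateData;

    static TransactionState CurrentTransactionState;   /* stack top */
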
One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command. That's because savepoints
have names, and we allow a savepoint to be released or rolled back by name,
which is not necessarily the one that was most recently opened. Also, a COMMIT
or ROLLBACK
command must be able to close out the entire stack. We handle this by having
the utility command subroutine mark all the state stack entries as commit-
pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done. The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.
In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name. So it's a completely new
subtransaction as far as the internals are concerned.
Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubTransaction. This is to allow implementing
exception handling, e.g. in PL/pgSQL. ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allow the subsystem to close such
subtransactions. The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubTransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.
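
As a rough sketch, the exception-handling pattern looks like this
(PL/pgSQL's real version in pl_exec.c also saves and restores memory
contexts and resource owners around these calls):

    BeginInternalSubTransaction(NULL);

    PG_TRY();
    {
        /* ... run the code protected by the exception block ... */
        ReleaseCurrentSubTransaction();     /* success: merge into parent */
    }
    PG_CATCH();
    {
        /* ... save the error data for later use ... */
        RollbackAndReleaseCurrentSubTransaction();
        /* ... match the error against the exception conditions ... */
    }
    PG_END_TRY();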

Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent. This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.
The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PGPROC are done at the time it is assigned.
A transaction that has no XID still needs to be identified for various
purposes, notably holding locks. For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction. VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory. To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start. All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
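
In the code, a VXID is represented essentially as follows (cf.
VirtualTransactionId in storage/lock/lock.h; simplified here):

    typedef struct
    {
        BackendId   backendId;                  /* determined at backend startup */
        LocalTransactionId localTransactionId;  /* backend-local counter */
    } VirtualTransactionId;
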
Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures. Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction. The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up. (Zero is reserved for
InvalidSubTransactionId.) Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.

Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A. Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.
Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".
What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken. (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.) The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing the ProcGlobal->xids[] entry at transaction end (either
commit or abort). (To reduce context switching, when multiple transactions
commit nearly simultaneously, we have one backend take ProcArrayLock and
clear the XIDs of multiple processes at once.)
ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable. This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.
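
The consequence for visibility tests can be sketched as follows (along the
lines of XidInMVCCSnapshot(); simplified):

    /* any XID >= the snapshot's xmax is treated as still running */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;
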
In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot. However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.
Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start. But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore. (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into ProcGlobal->xids[], then it would be possible for
another backend to allocate and commit a later XID, causing latestCompletedXid
to pass the first backend's XID, before that value became visible in the
ProcArray. That would break ComputeXidHorizons, as discussed below.
We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
subxid array) without taking ProcArrayLock. This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance. We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID. This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)
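
A minimal sketch of that coding rule (the PGPROC pointer here is
illustrative):

    volatile PGPROC *proc = otherProc;  /* volatile-qualified pointer */
    TransactionId xid = proc->xid;      /* fetch the XID exactly once */

    /* all further tests use the local copy; never re-read proc->xid */
    if (TransactionIdIsValid(xid))
    {
        ...
    }
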
Another important activity that uses the shared ProcArray is
ComputeXidHorizons, which must determine a lower bound for the oldest xmin
of any active MVCC snapshot, system-wide. Each individual backend
advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero
if it currently has no live snapshots (eg, if it's between transactions or
hasn't yet set a snapshot for a new transaction). ComputeXidHorizons takes
the MIN() of the valid xmin fields. It does this with only a shared lock on
ProcArrayLock, which means there is a potential race condition against other
backends doing GetSnapshotData concurrently: we must be certain that a
concurrent backend that is about to set its xmin does not compute an xmin
less than what ComputeXidHorizons determines. We ensure that by including
all the active XIDs into the MIN() calculation, along with the valid xmins.
The rule that transactions can't exit without taking exclusive ProcArrayLock
ensures that concurrent holders of shared ProcArrayLock will compute the
same minimum of currently-active XIDs: no xact, in particular not the
oldest, can exit while we hold shared ProcArrayLock. So
ComputeXidHorizons's view of the minimum active XID will be the same as that
of any concurrent GetSnapshotData, and so it can't produce an overestimate.
If there is no active transaction at all, ComputeXidHorizons uses
latestCompletedXid + 1, which is a lower bound for the xmin that might
be computed by concurrent or later GetSnapshotData calls. (We know that no
XID less than this could be about to appear in the ProcArray, because of the
XidGenLock interlock discussed above.)
As GetSnapshotData is performance critical, it does not perform an accurate
oldest-xmin calculation (it used to, until v14). The contents of a snapshot
only depend on the xids of other backends, not their xmin. As a backend's
xmin changes much more often than its xid, having GetSnapshotData look at
xmins can lead to a lot of unnecessary cacheline ping-pong. Instead,
GetSnapshotData updates approximate thresholds (one guaranteeing that all
deleted rows older than it can be removed, another determining that deleted
rows newer than it cannot be removed). The GlobalVisTest* functions use those
thresholds to make visibility decisions, falling back to ComputeXidHorizons
if necessary.
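
A sketch of that two-threshold test (the real logic lives in
GlobalVisTestIsRemovableXid() in procarray.c, and the boundary names follow
that code, though the real fields are FullTransactionIds):

    if (TransactionIdPrecedes(xid, maybe_needed))
        return true;    /* older than both thresholds: removable */
    if (TransactionIdFollowsOrEquals(xid, definitely_needed))
        return false;   /* may still be visible to someone: keep */
    /* in between: recompute accurate horizons (ComputeXidHorizons) and
     * test again */
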
Note that while it is certain that two concurrent executions of
GetSnapshotData will compute the same xmin for their own snapshots, there is
no such guarantee for the horizons computed by ComputeXidHorizons. This is
because we allow XID-less transactions to clear their MyProc->xmin
asynchronously (without taking ProcArrayLock), so one execution might see
what had been the oldest xmin, and another not. This is OK since the
thresholds need only be a valid lower bound. As noted above, we are already
assuming that fetch/store of the xid fields is atomic, so assuming it for
xmin as well is no extra risk.

pg_xact and pg_subtrans
-----------------------

pg_xact and pg_subtrans are permanent (on-disk) storage of transaction-related
information. A limited number of pages of each is kept in memory, so in many
cases there is no need to actually read from disk. However, if there's a long
running transaction or a backend sitting idle with an open transaction, it may
be necessary to read and write this information from disk. They also allow
the information to be permanent across server restarts.
pg_xact records the commit status for each transaction that has been assigned
an XID. A transaction can be in progress, committed, aborted, or
"sub-committed". This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet. It is not
necessary to update a subtransaction's transaction status to subcommit, so we
can just defer it until main transaction commit. The main role of marking
transactions as sub-committed is to provide an atomic commit protocol when
transaction status is spread across multiple clog pages. As a result, whenever
transaction status spreads across multiple pages we must use a two-phase commit
protocol: the first phase is to mark the subtransactions as sub-committed, then
we mark the top-level transaction and all its subtransactions committed (in
that order). Thus, subtransactions that have not aborted appear as in-progress
even when they have already finished, and the subcommit status appears as a
very short transitory state during main transaction commit. Subtransaction
abort is always marked in clog as soon as it occurs. When all of a
transaction tree's statuses fit in a single CLOG page, we atomically mark
them all as committed without bothering with the intermediate sub-commit
state.
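
In pseudocode, the multi-page protocol looks like this (the real
implementation is TransactionIdSetTreeStatus() in clog.c, reached via
TransactionIdCommitTree() in transam.c; "SetStatus" stands in for its
internal per-page helper):

    /* Phase 1: flag subxids living on other clog pages as sub-committed */
    for each subxid on a different clog page than the top-level xid:
        SetStatus(subxid, TRANSACTION_STATUS_SUB_COMMITTED);

    /* Phase 2: the atomic commit point */
    SetStatus(xid, TRANSACTION_STATUS_COMMITTED);
    for each subxid:
        SetStatus(subxid, TRANSACTION_STATUS_COMMITTED);
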
Savepoints are implemented using subtransactions. A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed. To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.
The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction. This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).
pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
with a copy in PGPROC->xid, but since we allow arbitrary nesting of
subtransactions, we can't fit all Xids in shared memory, so we have to store
them on disk. Note, however, that for each transaction we keep a "cache" of
Xids that are known to be part of the transaction tree, so we can skip looking
at pg_subtrans unless we know the cache has been overflowed. See
storage/ipc/procarray.c for the gory details.
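
For example, finding the top-level ancestor of a subtransaction is a matter
of climbing the parent links, roughly as SubTransGetTopmostTransaction() in
subtrans.c does (the function name here is illustrative):

    TransactionId
    GetTopmostParent(TransactionId xid)
    {
        TransactionId parent;

        while (TransactionIdIsValid(parent = SubTransGetParent(xid)))
            xid = parent;       /* climb until no parent is recorded */
        return xid;             /* the top-level transaction's XID */
    }
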
slru.c is the supporting mechanism for both pg_xact and pg_subtrans. It
implements the LRU policy for in-memory buffer pages. The high-level routines
for pg_xact are implemented in transam.c, while the low-level functions are in
clog.c. pg_subtrans is contained completely in subtrans.c.

Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery. It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping. Here are some notes about
non-obvious aspects of its design.
A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe. This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions. To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN. This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.
During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).
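
A minimal sketch of that test as a redo routine might apply it (modern redo
code gets this behavior via XLogReadBufferForRedo()):

    if (PageGetLSN(page) < lsn) /* lsn = this record's WAL location */
    {
        /* change not applied yet: redo it */
        ... apply the change ...
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    /* else: change already applied, skip it */
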
Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages). This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state. Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages. The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update. (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.) We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
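
A sketch of that "first change since checkpoint" test (cf. the backup-block
decision made during WAL record assembly in xloginsert.c; simplified):

    /* if the page was last modified before the redo point, log a full copy */
    if (PageGetLSN(page) <= RedoRecPtr)
        needs_backup = true;
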
The general schema for executing a WAL-logged action is
1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.
2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)
3. Apply the required changes to the shared buffer(s).
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should happen if
and only if you write a WAL record; see Writing Hints below.
5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it. (See
"Constructing a WAL record" below). Then update the page's LSN using the
returned XLOG location. For instance,
        XLogBeginInsert();
        XLogRegisterBuffer(...);
        XLogRegisterData(...);
        recptr = XLogInsert(rmgr_id, info);
        PageSetLSN(dp, recptr);
6. END_CRIT_SECTION()
7. Unlock and unpin the buffer(s).
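
Putting the steps together, a rough skeleton for a hypothetical
single-buffer action (names such as xlrec, RM_FOO_ID and
XLOG_FOOBAR_DO_STUFF are illustrative):

    buffer = ReadBuffer(relation, blkno);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);          /* step 1 */

    /* check preconditions (e.g. free space) before going critical */
    START_CRIT_SECTION();                               /* step 2 */

    page = BufferGetPage(buffer);
    /* ... apply the changes to the page ... */         /* step 3 */

    MarkBufferDirty(buffer);                            /* step 4 */

    if (RelationNeedsWAL(relation))                     /* step 5 */
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        XLogRegisterData(&xlrec, sizeof(xlrec));
        recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
        PageSetLSN(page, recptr);
    }

    END_CRIT_SECTION();                                 /* step 6 */

    UnlockReleaseBuffer(buffer);                        /* step 7 */
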
Complex changes (such as a multilevel index insertion) normally need to be
described by a series of atomic-action WAL records. The intermediate states
must be self-consistent, so that if the replay is interrupted between any
two actions, the system is fully functional. In btree indexes, for example,
a page split requires a new page to be allocated, and an insertion of a new
key in the parent btree level, but for locking reasons this has to be
reflected by two separate WAL records. Replaying the first record, to
allocate the new page and move tuples to it, sets a flag on the page to
indicate that the key has not been inserted to the parent yet. Replaying the
second record clears the flag. This intermediate state is never seen by
other backends during normal operation, because the lock on the child page
is held across the two actions, but will be seen if the operation is
interrupted before writing the second WAL record. The search algorithm works
with the intermediate state as normal, but if an insertion encounters a page
with the incomplete-split flag set, it will finish the interrupted split by
inserting the key to the parent, before proceeding.
Constructing a WAL record
-------------------------
A WAL record consists of a header common to all WAL record types,
record-specific data, and information about the data blocks modified. Each
modified data block is identified by an ID number, and can optionally have
more record-specific data associated with the block. If XLogInsert decides
that a full-page image of a block needs to be taken, the data associated
with that block is not included.
The API for constructing a WAL record consists of five functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData,
and XLogInsert. First, call XLogBeginInsert(). Then register all the buffers
modified, and data needed to replay the changes, using XLogRegister*
functions. Finally, insert the constructed record to the WAL by calling
XLogInsert().
XLogBeginInsert();
/* register buffers modified as part of this WAL-logged action */
XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);
/* register data that is always included in the WAL record */
XLogRegisterData(&xlrec, SizeOfFictionalAction);
/*
* register data associated with a buffer. This will not be included
* in the record if a full-page image is taken.
*/
XLogRegisterBufData(0, tuple->data, tuple->len);
/* more data associated with the buffer */
XLogRegisterBufData(0, data2, len2);
/*
* Ok, all the data and buffers to include in the WAL record have
* been registered. Insert the record.
*/
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
Details of the API functions:
void XLogBeginInsert(void)
Must be called before XLogRegisterBuffer and XLogRegisterData.
void XLogResetInsertion(void)
Clear any currently registered data and buffers from the WAL record
construction workspace. This is only needed if you have already called
XLogBeginInsert(), but then decide not to insert the record after all.
void XLogEnsureRecordSpace(int max_block_id, int ndatas)
Normally, the WAL record construction buffers have the following limits:
* Highest block ID that can be used is 4 (allowing five block references)
* Max 20 chunks of registered data
These default limits are enough for most record types that change some
on-disk structures. For the odd case that requires more data, or needs to
modify more buffers, these limits can be raised by calling
XLogEnsureRecordSpace(). XLogEnsureRecordSpace() must be called before
XLogBeginInsert(), and outside a critical section.
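For illustration, a hedged sketch of raising the limits; nbuffers and
ndatas are hypothetical counts known to the caller, and block IDs are
zero-based, so the highest block ID needed is nbuffers - 1:

/* raise the limits before the critical section and XLogBeginInsert() */
XLogEnsureRecordSpace(nbuffers - 1, ndatas);

START_CRIT_SECTION();
XLogBeginInsert();
/* ... register the nbuffers buffers and ndatas data chunks as usual ... */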
void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags);
XLogRegisterBuffer adds information about a data block to the WAL record.
block_id is an arbitrary number used to identify this page reference in
the redo routine. The information needed to re-find the page at redo -
relfilelocator, fork, and block number - is included in the WAL record.
XLogInsert will automatically include a full copy of the page contents, if
this is the first modification of the buffer since the last checkpoint.
It is important to register every buffer modified by the action with
XLogRegisterBuffer, to avoid torn-page hazards.
The flags control when and how the buffer contents are included in the
WAL record. Normally, a full-page image is taken only if the page has not
been modified since the last checkpoint, and only if full_page_writes=on
or an online backup is in progress. The REGBUF_FORCE_IMAGE flag can be
used to force a full-page image to always be included; that is useful
e.g. for an operation that rewrites most of the page, so that tracking the
details is not worth it. For the rare case where it is not necessary to
protect against torn pages, the REGBUF_NO_IMAGE flag can be used to prevent
a full-page image from being taken. REGBUF_WILL_INIT also suppresses a full-
page image, but the redo routine must re-generate the page from scratch,
without looking at the old page contents. Re-initializing the page protects
against torn-page hazards just as a full-page image does.
The REGBUF_STANDARD flag can be specified together with the other flags to
indicate that the page follows the standard page layout. It causes the
area between pd_lower and pd_upper to be left out from the image, reducing
WAL volume.
If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with
XLogRegisterBufData() is included in the WAL record even if a full-page
image is taken.
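To illustrate the flags, a hedged sketch of logging a page that redo will
rebuild from scratch; buf, metadata, and the record name are hypothetical,
following the fictional rmgr used above:

XLogBeginInsert();
/* redo re-initializes the page, so no full-page image is needed */
XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT | REGBUF_STANDARD);
/* enough data to rebuild the page without its old contents */
XLogRegisterData(&metadata, sizeof(metadata));
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_INIT_PAGE);
PageSetLSN(BufferGetPage(buf), recptr);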
void XLogRegisterData(char *data, int len);
XLogRegisterData is used to include arbitrary data in the WAL record. If
XLogRegisterData() is called multiple times, the data are appended, and
will be made available to the redo routine as one contiguous chunk.
void XLogRegisterBufData(uint8 block_id, char *data, int len);
XLogRegisterBufData is used to include data associated with a particular
buffer that was registered earlier with XLogRegisterBuffer(). If
XLogRegisterBufData() is called multiple times with the same block ID, the
data are appended, and will be made available to the redo routine as one
contiguous chunk.
If a full-page image of the buffer is taken at insertion, the data is not
included in the WAL record, unless the REGBUF_KEEP_DATA flag is used.
Writing a REDO routine
----------------------
A REDO routine uses the data and page references included in the WAL record
to reconstruct the new state of the page. The record decoding functions
and macros in xlogreader.c/h can be used to extract the data from the record.
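For example, here is a minimal sketch of a redo routine for the fictional
XLOG_FOOBAR_DO_STUFF record constructed in the previous section; the
function and struct names are hypothetical:

static void
foobar_xlog_do_stuff(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	xl_foobar_do_stuff *xlrec = (xl_foobar_do_stuff *) XLogRecGetData(record);
	Buffer		buffer;

	/* restores a full-page image if one was included, else reads the page */
	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);
		Size		datalen;
		char	   *data = XLogRecGetBlockData(record, 0, &datalen);

		/* ... re-apply the change described by xlrec and data ... */

		PageSetLSN(page, lsn);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}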
When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state. If this requires that two
or more buffer locks be held concurrently, you must lock the pages in
appropriate order, and not release the locks until all the changes are done.
Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only the Startup process may modify data blocks during
recovery, so it may execute PageGetLSN() without fear of serialisation
problems. All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.
Writing Hints
-------------
In some cases, we write additional information to data blocks without
writing a preceding WAL record. This should happen only if the data can
be reconstructed later following a crash and the action is simply a way
of optimising for performance. When a hint is written we use
MarkBufferDirtyHint() to mark the block dirty.
If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
that includes the hint. We do this to avoid a partial page write when we
later write out the dirtied page. WAL is not written during recovery, so we
simply skip
dirtying blocks because of hints when in recovery.
If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
otherwise you will expose the risk of partial page writes.
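A minimal sketch of the pattern, assuming a hypothetical page-header bit
PD_SOME_HINT whose value can be recomputed from other durable state after a
crash:

page = BufferGetPage(buffer);
/* set hint information that is merely a cache of reconstructible state */
((PageHeader) page)->pd_flags |= PD_SOME_HINT;	/* hypothetical flag bit */

/*
 * No WAL record and no PageSetLSN(); MarkBufferDirtyHint() itself may emit
 * an XLOG_FPI_FOR_HINT record if checksums are enabled.
 */
MarkBufferDirtyHint(buffer, true);	/* true: standard page layout */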
The all-visible hint in a heap page (PD_ALL_VISIBLE) is a special
case, because it is treated like a durable change in some respects and
a hint in other respects. It must satisfy the invariant that, if a
heap page's associated visibilitymap (VM) bit is set, then
PD_ALL_VISIBLE is set on the heap page itself. Clearing of
PD_ALL_VISIBLE is always treated like a fully-durable change to
maintain this invariant. Additionally, if checksums or wal_log_hints
are enabled, setting PD_ALL_VISIBLE is also treated like a
fully-durable change to protect against torn pages.
But, if neither checksums nor wal_log_hints are enabled, torn pages
are of no consequence if the only change is to PD_ALL_VISIBLE; so no
full heap page image is taken, and the heap page's LSN is not
updated. NB: it would be incorrect to update the heap page's LSN when
applying this optimization, even though there is an associated WAL
record, because subsequent modifiers (e.g. an unrelated UPDATE) of the
page may falsely believe that a full page image is not required.
Write-Ahead Logging for Filesystem Actions
------------------------------------------
The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.
Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.
There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:
1. Adding a disk page to an existing table.
This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions. Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition. In such cases we can just reclaim
the space for re-use; a sketch of this check appears at the end of this
section.
2. Creating a new table, which requires a new file in the filesystem.
We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenumber
OIDs. So cleaning up isn't really necessary.
3. Deleting a table, which requires an unlink() that could fail.
Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.
These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.
In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
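As an illustration of case 1 above, any code that reads pages must treat an
all-zeroes page as free space rather than as corruption. A hedged sketch of
that check, under the assumption that the caller holds an exclusive lock on
the buffer and may simply re-initialize the page:

page = BufferGetPage(buffer);
if (PageIsNew(page))
{
	/*
	 * The file was extended, but we crashed before the page's first
	 * WAL-logged initialization reached disk.  Reclaim the space.
	 */
	PageInit(page, BufferGetPageSize(buffer), 0);
}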
Skipping WAL for New RelFileLocator
-----------------------------------
Under wal_level=minimal, if a change modifies a relfilenumber that ROLLBACK
would unlink, in-tree access methods write no WAL for that change. Code that
writes WAL without calling RelationNeedsWAL() must check for this case. This
skipping is mandatory. If a WAL-writing change preceded a WAL-skipping change
for the same block, REDO could overwrite the WAL-skipping change. If a
WAL-writing change followed a WAL-skipping change for the same block, a
related problem would arise. When a WAL record contains no full-page image,
REDO expects the page to match its contents from just before record insertion.
A WAL-skipping change may not reach disk at all, violating REDO's expectation
under full_page_writes=off. For any access method, CommitTransaction() writes
and fsyncs affected blocks before recording the commit.
Prefer to do the same in future access methods. However, two other approaches
can work. First, an access method can irreversibly transition a given fork
from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
smgrimmedsync(). Second, an access method can opt to write WAL
unconditionally for permanent relations. Under these approaches, the access
method callbacks must not call functions that react to RelationNeedsWAL().
This applies only to WAL records whose replay would modify bytes stored in the
new relfilenumber. It does not apply to other records about the relfilenumber,
such as XLOG_SMGR_CREATE. Because it operates at the level of individual
relfilenumbers, RelationNeedsWAL() can differ for tightly-coupled relations.
Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while
the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table
to skip WAL, but that won't affect its indexes.
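In WAL-writing code paths, the mandatory skip described above normally takes
the form of a guard around record construction. A hedged sketch, reusing the
fictional record from earlier sections:

if (RelationNeedsWAL(relation))
{
	XLogRecPtr	recptr;

	XLogBeginInsert();
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
	XLogRegisterData(&xlrec, SizeOfFictionalAction);
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
	PageSetLSN(BufferGetPage(buffer), recptr);
}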
Asynchronous Commit
-------------------
As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off. Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory. The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.
We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem. Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.
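Internally, such a command flags its transaction so that commit will flush
WAL regardless of synchronous_commit. A minimal sketch; ForceSyncCommit() is
the call used for this, e.g. by database creation and drop:

/* ensure the eventual commit record is flushed to disk synchronously */
ForceSyncCommit();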
The walwriter regularly wakes up (via wal_writer_delay) or is woken up
(via its latch, which is set by backends committing asynchronously) and
performs an XLogBackgroundFlush(). This checks the location of the last
completely filled WAL page. If that has moved forwards, then we write all
the changed buffers up to that point, so that under full load we write
only whole buffers. If there has been a break in activity and the current
WAL page is the same as before, then we find out the LSN of the most
recent asynchronous commit, and write up to that point, if required (i.e.
if it's in the current WAL page). If more than wal_writer_delay has
passed, or more than wal_writer_flush_after blocks have been written, since
the last flush, WAL is also flushed up to the current location. This
arrangement in itself would guarantee that an async commit record reaches
disk after at most two times wal_writer_delay after the transaction
completes. However, we also allow XLogFlush to write/flush full buffers
"flexibly" (ie, not wrapping around at the end of the circular WAL buffer
area), so as to minimize the number of writes issued under high load when
multiple WAL pages are filled per walwriter cycle. This makes the worst-case
delay three wal_writer_delay cycles.
There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages. Otherwise the record of the
commit might reach disk before the WAL record does. Again, abort records
need not factor into this consideration.
In fact, we store more than one LSN for each clog page. This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization. Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit. It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory. However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page. We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.
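A simplified sketch of the deferral test, loosely modeled on SetHintBits()
in heapam.c; here xid is known committed and infomask is the hint bit to
set:

XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);

if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN))
	return;				/* commit record not flushed yet; don't hint */

tuple->t_infomask |= infomask;
MarkBufferDirtyHint(buffer, true);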
How many transactions should share the same cached LSN (N)? If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway. The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later. This argues for keeping N (the group size)
as small as possible. For the moment we are setting the group size to 32,
which makes the LSN cache space the same size as the actual clog buffer
space (independently of BLCKSZ).
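To check that arithmetic: clog stores two bits per transaction, so an 8K
clog page covers 8192 * 4 = 32768 transactions; a group size of 32 gives
1024 cached LSNs per page, and at 8 bytes per LSN the cache occupies 8K, the
size of the clog page itself. In general (BLCKSZ * 4 / 32) * 8 = BLCKSZ, so
the equality holds for any BLCKSZ.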
It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious. Assume we have two transactions, T1 and T2. The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions. If T2 can see changes made by T1 then when T2 commits it
must be true that LSN2 follows LSN1. Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL. This
is true whether T1 was asynchronous or synchronous. As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits. Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top level transaction.
Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks. Any attempt to write unsafe data to
disk will trigger a write which ensures the safety of all data written by
that and prior transactions. Data blocks and clog pages are both protected
by LSNs.
Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.
Database writes that skip WAL for new relfilenumbers are also safe. In these
cases it's entirely possible for the data to reach disk before T1's commit,
because T1 will fsync it down to disk without any sort of interlock. However,
all these paths are designed to write data that no other transaction can see
until after T1 commits. The situation is thus not different from ordinary
WAL-logged updates.
Transaction Emulation during Recovery
-------------------------------------
During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots. We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits. Further details are given in comments in
procarray.c.
Many actions write no WAL records at all, for example read only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all. Subtransaction commit does not write a WAL record either
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.
Not all transactional behaviour is emulated, for example we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory. Clog, multixact and commit_ts entries are made normally.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly. Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.
Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.