postgresql/src/backend/access/heap/README.tuplock

140 lines
7.1 KiB
Plaintext
Raw Normal View History

Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 16:04:59 +01:00
Locking tuples
--------------
Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the locks objects in shared memory.
To work around this limitation, we use a two-level mechanism. The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple. When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below. This mechanism can accomodate
arbitrarily large numbers of tuples being locked simultaneously.
When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX. However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters. The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever. To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above. The
protocol for waiting for a tuple-level lock is really
LockTuple()
XactLockTableWait()
mark tuple as locked by me
UnlockTuple()
When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple(). However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table. Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers. However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.
We provide four levels of tuple locking strength: SELECT FOR KEY UPDATE is
super-exclusive locking (used to delete tuples and more generally to update
tuples modifying the values of the columns that make up the key of the tuple);
SELECT FOR UPDATE is a standards-compliant exclusive lock; SELECT FOR SHARE
implements shared locks; and finally SELECT FOR KEY SHARE is a super-weak mode
that does not conflict with exclusive mode, but conflicts with SELECT FOR KEY
UPDATE. This last mode implements a mode just strong enough to implement RI
checks, i.e. it ensures that tuples do not go away from under a check, without
blocking when some other transaction that want to update the tuple without
changing its key.
The conflict table is:
KEY UPDATE UPDATE SHARE KEY SHARE
KEY UPDATE conflict conflict conflict conflict
UPDATE conflict conflict conflict
SHARE conflict conflict
KEY SHARE conflict
When there is a single locker in a tuple, we can just store the locking info
in the tuple itself. We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength. There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)
MultiXacts
----------
A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits. Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId. Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flags bits for each one.
The flags are currently used to store the locking strength of each member
transaction. (The flags also distinguish a pure locker from an updater.)
In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions. This is no longer the case; a
MultiXact may contain an update or delete Xid. (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).
Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish. It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.
The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted. So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.
VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
This works in the same way that pg_clog segments are removed: we have a
pg_class column that stores the earliest multixact that could possibly be
stored in the table; the minimum of all such values is stored in a pg_database
column. VACUUM computes the minimum across all pg_database values, and
removes pg_multixact segments older than the minimum.
Infomask Bits
-------------
The following infomask bits are applicable:
- HEAP_XMAX_INVALID
Any tuple with this bit set does not have a valid value stored in XMAX.
- HEAP_XMAX_IS_MULTI
This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
regular TransactionId).
- HEAP_XMAX_LOCK_ONLY
This bit is set when the XMAX is a locker only; that is, if it's a
multixact, it does not contain an update among its members. It's set when
the XMAX is a plain Xid that locked the tuple, as well.
- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_EXCL_LOCK
These bits indicate the strength of the lock acquired; they are useful when
the XMAX is not a MultiXactId. If it's a multi, the info is to be found in
the member flags. If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
is set, then one of these *must* be set as well.
Note there is no infomask bit for a SELECT FOR SHARE lock. Also there is no
separate bit for a SELECT FOR KEY UPDATE lock; this is implemented by the
HEAP_KEYS_UPDATED bit.
- HEAP_KEYS_UPDATED
This bit lives in t_infomask2. If set, indicates that the XMAX updated
this tuple and changed the key values, or it deleted the tuple.
It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.
We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.