mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-09-27 22:02:03 +02:00
cae1c788b9
Transmit the leader's temp-namespace state to workers.

This is important because without it, the workers do not really have the
same search path as the leader. For example, there is no good reason (and
no extant code either) to prevent a worker from executing a temp function
that the leader created previously; but as things stood it would fail to
find the temp function, and then either fail or execute the wrong function
entirely.

We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.

Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session. While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway. We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables. (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)

Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open(). This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.

Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
Locking tuples
--------------

Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the lock objects in shared memory.
To work around this limitation, we use a two-level mechanism. The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple. When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below. This mechanism can accommodate
arbitrarily large numbers of tuples being locked simultaneously.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of the
tuple's XMAX. However, that mechanism will release all waiters concurrently,
so there would be a race condition as to which waiter gets the tuple,
potentially leading to indefinite starvation of some waiters. The possibility
of share-locking makes the problem much worse: a steady stream of
share-lockers can easily block an exclusive locker forever. To provide more
reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above. The
protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple(). However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table. Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.
However, if there is not any active conflict for a tuple, we don't incur any
extra overhead.

We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
exclusive lock which prevents any kind of modification of the tuple. This is
the lock level that is implicitly taken by DELETE operations, and also by
UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
and modifications which might alter the tuple's key. This is the lock that is
implicitly taken by UPDATE operations which leave all key fields unchanged.
SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
prevents tuple removal and modifications of key fields. This last mode is
just strong enough to implement RI (referential integrity) checks, i.e. it
ensures that tuples do not go away from under a check, without blocking other
transactions that want to update the tuple without changing its key.

The conflict table is:

                  UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
UPDATE           conflict        conflict      conflict      conflict
NO KEY UPDATE    conflict        conflict      conflict
SHARE            conflict        conflict
KEY SHARE        conflict

When there is a single locker on a tuple, we can just store the locking info
in the tuple itself. We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength. There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)
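
As an illustration only, the conflict table above can be transcribed into a
small lookup matrix. This is a minimal sketch, not the code PostgreSQL uses:
the enum, the matrix, and sk_locks_conflict() are hypothetical names invented
for this example, and the ordering of the strengths (weakest first) is an
assumption made for readability.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four locking strengths, weakest first. */
typedef enum
{
    SK_KEY_SHARE = 0,           /* SELECT FOR KEY SHARE */
    SK_SHARE,                   /* SELECT FOR SHARE */
    SK_NO_KEY_UPDATE,           /* SELECT FOR NO KEY UPDATE */
    SK_UPDATE,                  /* SELECT FOR UPDATE */
    SK_NUM_STRENGTHS
} SketchLockStrength;

/*
 * conflict_matrix[held][wanted], transcribed from the conflict table in the
 * text.  The relation is symmetric, so the argument order does not matter.
 */
static const bool conflict_matrix[SK_NUM_STRENGTHS][SK_NUM_STRENGTHS] = {
    /*                    KEY SHARE  SHARE  NO KEY UPD  UPDATE */
    /* KEY SHARE     */ { false,     false, false,      true },
    /* SHARE         */ { false,     false, true,       true },
    /* NO KEY UPDATE */ { false,     true,  true,       true },
    /* UPDATE        */ { true,      true,  true,       true },
};

/* Returns true if a lock of strength "wanted" conflicts with "held". */
static bool
sk_locks_conflict(SketchLockStrength held, SketchLockStrength wanted)
{
    assert((int) held >= 0 && held < SK_NUM_STRENGTHS);
    assert((int) wanted >= 0 && wanted < SK_NUM_STRENGTHS);
    return conflict_matrix[held][wanted];
}
```

For instance, sk_locks_conflict(SK_NO_KEY_UPDATE, SK_KEY_SHARE) is false,
which matches the point made above that a non-key UPDATE does not block an
RI check.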

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits. Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId. Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flag bits for each one.
The flags are currently used to store the locking strength of each member
transaction. (The flags also distinguish a pure locker from an updater.)

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions. This is no longer the case; a
MultiXact may contain an update or delete Xid. (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction.)

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish. It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted. So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
The lower bound used by vacuum (that is, the value below which all multixacts
are removed) is stored as pg_class.relminmxid for each table; the minimum of
all such values is stored in pg_database.datminmxid. The minimum across
all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
removes pg_multixact/ segments older than that value once the checkpoint
record has been flushed.

Infomask Bits
-------------

The following infomask bits are applicable:

- HEAP_XMAX_INVALID
  Any tuple with this bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
  regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
  This bit is set when the XMAX is a locker only; that is, if it's a
  multixact, it does not contain an update among its members. It's set when
  the XMAX is a plain Xid that locked the tuple, as well.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_SHR_LOCK
- HEAP_XMAX_EXCL_LOCK
  These bits indicate the strength of the lock acquired; they are useful when
  the XMAX is not a MultiXactId. If it's a multi, the info is to be found in
  the member flags. If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
  is set, then one of these *must* be set as well.

  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
  FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

- HEAP_KEYS_UPDATED
  This bit lives in t_infomask2. If set, indicates that the XMAX updated
  this tuple and changed the key values, or it deleted the tuple.
  It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.

We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.
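
The rules above for reading a tuple's XMAX can be sketched as a small decoder.
This is a hedged illustration, not the PostgreSQL implementation: every SK_
name below is hypothetical, the bit values are assumptions chosen for the
example (consult htup_details.h for the real definitions), and the share-lock
pattern is assumed to be the combination of the other two strength bits. The
member struct is likewise a simplified stand-in for a MultiXact's per-member
flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values (assumption; see htup_details.h for real ones). */
#define SK_XMAX_KEYSHR_LOCK 0x0010
#define SK_XMAX_EXCL_LOCK   0x0040
#define SK_XMAX_LOCK_ONLY   0x0080
#define SK_XMAX_SHR_LOCK    (SK_XMAX_EXCL_LOCK | SK_XMAX_KEYSHR_LOCK)
#define SK_LOCK_MASK        SK_XMAX_SHR_LOCK
#define SK_XMAX_INVALID     0x0800
#define SK_XMAX_IS_MULTI    0x1000
#define SK_KEYS_UPDATED     0x2000      /* lives in the second infomask */

typedef enum
{
    SK_STATE_NO_XMAX,                   /* XMAX holds no valid value */
    SK_STATE_SEE_MULTI,                 /* details are in the member flags */
    SK_STATE_LOCKED_KEY_SHARE,
    SK_STATE_LOCKED_SHARE,
    SK_STATE_LOCKED_NO_KEY_UPDATE,
    SK_STATE_LOCKED_UPDATE,
    SK_STATE_UPDATED_NON_KEY,           /* updated, key columns unchanged */
    SK_STATE_UPDATED_KEY_OR_DELETED
} SketchXmaxState;

/* Simplified MultiXact member: an Xid plus a locker/updater flag. */
typedef struct
{
    uint32_t    xid;
    bool        is_update;      /* false means a pure locker */
} SketchMultiMember;

/* Classify a tuple's XMAX from its two infomask words. */
static SketchXmaxState
sk_classify_xmax(uint16_t infomask, uint16_t infomask2)
{
    bool        keys_updated = (infomask2 & SK_KEYS_UPDATED) != 0;

    if (infomask & SK_XMAX_INVALID)
        return SK_STATE_NO_XMAX;
    if (infomask & SK_XMAX_IS_MULTI)
        return SK_STATE_SEE_MULTI;

    if (infomask & SK_XMAX_LOCK_ONLY)
    {
        uint16_t    strength = (uint16_t) (infomask & SK_LOCK_MASK);

        /* Plain Xid locker: one of the strength bits must be set. */
        assert(strength != 0);
        if (strength == SK_XMAX_SHR_LOCK)
            return SK_STATE_LOCKED_SHARE;
        if (strength == SK_XMAX_KEYSHR_LOCK)
            return SK_STATE_LOCKED_KEY_SHARE;
        /* EXCL alone: the keys-updated bit separates the two modes. */
        return keys_updated ? SK_STATE_LOCKED_UPDATE
                            : SK_STATE_LOCKED_NO_KEY_UPDATE;
    }

    /* Plain Xid that updated or deleted the tuple. */
    return keys_updated ? SK_STATE_UPDATED_KEY_OR_DELETED
                        : SK_STATE_UPDATED_NON_KEY;
}

/* A multi counts as "lock only" when none of its members is an updater. */
static bool
sk_multi_is_lock_only(const SketchMultiMember *members, int nmembers)
{
    for (int i = 0; i < nmembers; i++)
    {
        if (members[i].is_update)
            return false;
    }
    return true;
}
```

The decoder checks the invalid and multi bits first because, per the list
above, the strength bits are only meaningful when the XMAX is a plain Xid.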