postgresql/src/backend/access/heap
Andres Freund 578b229718 Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.

This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row.  Neither pg_dump nor COPY included the contents of the
oid column by default.

The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.

WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.

Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
  WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
  issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
  restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
  OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
  plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.

The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.

The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such.  This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.

The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.

Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).

The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.

While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.

Catversion bump, for obvious reasons.

Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-20 16:00:17 -08:00
..
Makefile Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
README.HOT Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY. 2012-11-28 21:26:01 -05:00
README.tuplock Fix grammar in README.tuplock 2018-07-27 10:56:30 -04:00
heapam.c Remove WITH OIDS support, change oid catalog column visibility. 2018-11-20 16:00:17 -08:00
hio.c Remove UpdateFreeSpaceMap(), use FreeSpaceMapVacuumRange() instead. 2018-03-29 12:22:44 -04:00
pruneheap.c Raise error when affecting tuple moved into different partition. 2018-04-07 13:24:27 -07:00
rewriteheap.c PANIC on fsync() failure. 2018-11-19 17:41:26 +13:00
syncscan.c Update copyright for 2018 2018-01-02 23:30:12 -05:00
tuptoaster.c Remove WITH OIDS support, change oid catalog column visibility. 2018-11-20 16:00:17 -08:00
visibilitymap.c Avoid using potentially-under-aligned page buffers. 2018-09-01 15:27:17 -04:00

README.tuplock

Locking tuples
--------------

Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the locks objects in shared memory.
To work around this limitation, we use a two-level mechanism.  The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple.  When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
arbitrarily large numbers of tuples being locked simultaneously.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX.  However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters.  The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever.  To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above.  The
protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().  However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table.  Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.  However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.

We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
exclusive lock which prevents any kind of modification of the tuple. This is
the lock level that is implicitly taken by DELETE operations, and also by
UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
and modifications which might alter the tuple's key. This is the lock that is
implicitly taken by UPDATE operations which leave all key fields unchanged.
SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
prevents tuple removal and modifications of key fields. This lock level is
just strong enough to implement RI checks, i.e. it ensures that tuples do not
go away from under a check, without blocking transactions that want to update
the tuple without changing its key.

The conflict table is:

                  UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
UPDATE           conflict        conflict      conflict      conflict
NO KEY UPDATE    conflict        conflict      conflict
SHARE            conflict        conflict
KEY SHARE        conflict

When there is a single locker in a tuple, we can just store the locking info
in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength.  There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits.  Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flags bits for each one.
The flags are currently used to store the locking strength of each member
transaction.  (The flags also distinguish a pure locker from an updater.)

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions.  This is no longer the case; a
MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish.  It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted.  So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
The lower bound used by vacuum (that is, the value below which all multixacts
are removed) is stored as pg_class.relminmxid for each table; the minimum of
all such values is stored in pg_database.datminmxid.  The minimum across
all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
removes pg_multixact/ segments older than that value once the checkpoint
record has been flushed.

Infomask Bits
-------------

The following infomask bits are applicable:

- HEAP_XMAX_INVALID
  Any tuple with this bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
  regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
  This bit is set when the XMAX is a locker only; that is, if it's a
  multixact, it does not contain an update among its members.  It's set when
  the XMAX is a plain Xid that locked the tuple, as well.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_SHR_LOCK
- HEAP_XMAX_EXCL_LOCK
  These bits indicate the strength of the lock acquired; they are useful when
  the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
  the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
  is set, then one of these *must* be set as well.

  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
  FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

- HEAP_KEYS_UPDATED
  This bit lives in t_infomask2.  If set, indicates that the XMAX updated
  this tuple and changed the key values, or it deleted the tuple.
  It's set regardless of whether the XMAX is a TransactionId or a MultiXactId.

We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.