Commit 0ac5ad5134 removed an optimization in multixact.c that skipped
fetching members of MultiXactId that were older than our
OldestVisibleMXactId value. The reason this was removed is that it is
possible for multixacts that contain updates to be older than that
value. However, if the caller is certain that the multi does not
contain an update (because the infomask bits say so), it can pass this
info down to GetMultiXactIdMembers, enabling it to use the old
optimization.
Pointed out by Andres Freund in 20131121200517.GM7240@alap2.anarazel.de
HeapTupleSatisfiesVacuum() didn't properly discern between
DELETE_IN_PROGRESS and INSERT_IN_PROGRESS for rows that have been
inserted in the current transaction and deleted in a aborted
subtransaction of the current backend. At the very least that caused
problems for CLUSTER and CREATE INDEX in transactions that had
aborting subtransactions producing rows, leading to warnings like:
WARNING: concurrent delete in progress within table "..."
possibly in an endless, uninterruptible, loop.
Instead of treating *InProgress xmins the same as *IsCurrent ones,
treat them as being distinct like the other visibility routines. As
implemented this separatation can cause a behaviour change for rows
that have been inserted and deleted in another, still running,
transaction. HTSV will now return INSERT_IN_PROGRESS instead of
DELETE_IN_PROGRESS for those. That's both, more in line with the other
visibility routines and arguably more correct. The latter because a
INSERT_IN_PROGRESS will make callers look at/wait for xmin, instead of
xmax.
The only current caller where that's possibly worse than the old
behaviour is heap_prune_chain() which now won't mark the page as
prunable if a row has concurrently been inserted and deleted. That's
harmless enough.
As a cautionary measure also insert a interrupt check before the gotos
in IndexBuildHeapScan() that lead to the uninterruptible loop. There
are other possible causes, like a row that several sessions try to
update and all fail, for repeated loops and the cost of doing so in
the retry case is low.
As this bug goes back all the way to the introduction of
subtransactions in 573a71a5da backpatch to all supported releases.
Reported-By: Sandro Santilli
HeapTupleHeaderGetCmax() asserts that it is only used if the tuple has
been updated by the current transaction. That check is correct and
sensible but requires allocating memory if xmax is a multixact. When
wal_level is set to logical cmax needs to be included in a wal record
, generated inside a critical section, which can trigger the assertion
added in 4a170ee9e.
Reported-By: Steve Singer
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
Instead of changing the tuple xmin to FrozenTransactionId, the combination
of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never
set together, is now defined as HEAP_XMIN_FROZEN. A variety of previous
proposals to freeze tuples opportunistically before vacuum_freeze_min_age
is reached have foundered on the objection that replacing xmin by
FrozenTransactionId might hinder debugging efforts when things in this
area go awry; this patch is intended to solve that problem by keeping
the XID around (but largely ignoring the value to which it is set).
Third-party code that checks for HEAP_XMIN_INVALID on tuples where
HEAP_XMIN_COMMITTED might be set will be broken by this change. To fix,
use the new accessor macros in htup_details.h rather than consulting the
bits directly. HeapTupleHeaderGetXmin has been modified to return
FrozenTransactionId when the infomask bits indicate that the tuple is
frozen; use HeapTupleHeaderGetRawXmin when you already know that the
tuple isn't marked commited or frozen, or want the raw value anyway.
We currently do this in routines that display the xmin for user consumption,
in tqual.c where it's known to be safe and important for the avoidance of
extra cycles, and in the function-caching code for various procedural
languages, which shouldn't invalidate the cache just because the tuple
gets frozen.
Robert Haas and Andres Freund
If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple. This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).
This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134 it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.
It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows are missing with
some referencing row remaining.
Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
HeapTupleSatisfiesUpdate can very easily "forget" tuple locks while
checking the contents of a multixact and finding it contains an aborted
update, by setting the HEAP_XMAX_INVALID bit. This would lead to
concurrent transactions not noticing any previous locks held by
transactions that might still be running, and thus being able to acquire
subsequent locks they wouldn't be normally able to acquire.
This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3,
like that commit.
This change reverts the change to the delete-abort-savept isolation test
in 1ce150b7bb, because that behavior change was caused by this bug.
Noticed by Andres Freund while investigating a different issue reported
by Noah Misch.
It is dangerous to do so, because some code expects to be able to see what's
the true Xmax even if it is aborted (particularly while traversing HOT
chains). So don't do it, and instead rely on the callers to verify for
abortedness, if necessary.
Several race conditions and bugs fixed in the process. One isolation test
changes the expected output due to these.
This also reverts commit c235a6a589, which is no longer necessary.
Backpatch to 9.3, where this function was introduced.
Andres Freund
If a tuple is locked but not updated by a concurrent transaction,
HeapTupleSatisfiesDirty would return that transaction's Xid in xmax,
causing callers to wait on it, when it is not necessary (in fact, if the
other transaction had used a multixact instead of a plain Xid to mark
the tuple, HeapTupleSatisfiesDirty would have behave differently and
*not* returned the Xmax).
This bug was introduced in commit 3f7fbf85dc, dated December 1998,
so it's almost 15 years old now. However, it's hard to see this
misbehave, because before we had NOWAIT the only consequence of this is
that transactions would wait for slightly more time than necessary; so
it's not surprising that this hasn't been reported yet.
Craig Ringer and Andres Freund
We now use MVCC catalog scans, and, per discussion, have eliminated
all other remaining uses of SnapshotNow, so that we can now get rid of
it. This will break third-party code which is still using it, which
is intentional, as we want such code to be updated to do things the
new way.
Previously, these functions took a HeapTupleHeader, but upcoming
patches for logical replication will introduce new a new snapshot
type under which the tuple's TID will be used to lookup (CMIN, CMAX)
for visibility determination purposes. This makes that information
available. Code churn is minimal since HeapTupleSatisfiesVisibility
took the HeapTuple anyway, and deferenced it before calling the
satisfies function.
Independently of logical replication, this allows t_tableOid and
t_self to be cross-checked via assertions in tqual.c. This seems
like a useful way to make sure that all callers are setting these
values properly, which has been previously put forward as
desirable.
Andres Freund, reviewed by Álvaro Herrera
By using only the macro that checks infomask bits
HEAP_XMAX_IS_LOCKED_ONLY to verify whether a multixact is not an
updater, and not the full HeapTupleHeaderIsOnlyLocked, it would come to
the wrong result in case of a multixact containing an aborted update;
therefore returning the wrong result code. This would cause predicate.c
to break completely (as in bug report #8273 from David Leverton), and
certain index builds would misbehave. As far as I can tell, other
callers of the bogus routine would make harmless mistakes or not be
affected by the difference at all; so this was a pretty narrow case.
Also, no other user of the HEAP_XMAX_IS_LOCKED_ONLY macro is as
careless; they all check specifically for the HEAP_XMAX_IS_MULTI case,
and they all verify whether the updater is InvalidXid before concluding
that it's a valid updater. So there doesn't seem to be any similar bug.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.
WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.
Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.
Checksum used is WAL CRC-32 truncated to 16-bits.
Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com1290721684-sup-3951@alvh.no-ip.org1294953201-sup-2099@alvh.no-ip.org1320343602-sup-2290@alvh.no-ip.org1339690386-sup-8927@alvh.no-ip.org4FE5FF020200002500048A3D@gw.wicourts.gov4FEAB90A0200002500048B7D@gw.wicourts.gov
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.
Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.
I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h. In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
When (re) loading the typcache comparison cache for an enum type's values,
use an up-to-date MVCC snapshot, not the transaction's existing snapshot.
This avoids problems if we encounter an enum OID that was created since our
transaction started. Per report from Andres Freund and diagnosis by Robert
Haas.
To ensure this is safe even if enum comparison manages to get invoked
before we've set a transaction snapshot, tweak GetLatestSnapshot to
redirect to GetTransactionSnapshot instead of throwing error when
FirstSnapshotSet is false. The existing uses of GetLatestSnapshot (in
ri_triggers.c) don't care since they couldn't be invoked except in a
transaction that's already done some work --- but it seems just conceivable
that this might not be true of enums, especially if we ever choose to use
enums in system catalogs.
Note that the comparable coding in enum_endpoint and enum_range_internal
remains GetTransactionSnapshot; this is perhaps debatable, but if we
changed it those functions would have to be marked volatile, which doesn't
seem attractive.
Back-patch to 9.1 where ALTER TYPE ADD VALUE was added.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.
Noah Misch
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate. So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.
Review by Tom Lane.
This has been wrong for a really long time. We don't use two-phase
locking to protect against serialization anomalies.
Per discussion on pgsql-hackers about 2011-03-07; original report
by Dan Ports.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order. These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.
Pavan Deolasee, Heikki Linnakangas, Robert Haas
We need not wait until the commit record is durably on disk, because
in the event of a crash the page we're updating with hint bits will
be gone anyway. Per off-list report from Heikki Linnakangas, this
can significantly degrade the performance of unlogged tables; I was
able to show a 2x speedup from this patch on a pgbench run with scale
factor 15. In practice, this will mostly help small, heavily updated
tables, because on larger tables you're unlikely to run into the same
row again before the commit record makes it out to disk.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT. The data does not
leave the server so there are not security issues. A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.
I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.
Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction. This is the root cause of bug #6218 from Yamamoto
Takashi. While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources. However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner. Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward. predicate.c's dependencies on TopTransactionResourceOwner
go away too.
I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
VACUUM was willing to remove a committed-dead tuple immediately if it was
deleted by the same transaction that inserted it. The idea is that such a
tuple could never have been visible to any other transaction, so we don't
need to keep it around to satisfy MVCC snapshots. However, there was
already an exception for tuples that are part of an update chain, and this
exception created a problem: we might remove TOAST tuples (which are never
part of an update chain) while their parent tuple stayed around (if it was
part of an update chain). This didn't pose a problem for most things,
since the parent tuple is indeed dead: no snapshot will ever consider it
visible. But MVCC-safe CLUSTER had a problem, since it will try to copy
RECENTLY_DEAD tuples to the new table. It then has to copy their TOAST
data too, and would fail if VACUUM had already removed the toast tuples.
Easiest fix is to get rid of the special case for xmin == xmax. This may
delay reclaiming dead space for a little bit in some cases, but it's by far
the most reliable way to fix the issue.
Per bug #5998 from Mark Reid. Back-patch to 8.3, which is the oldest
version with MVCC-safe CLUSTER.
With this patch, portals, SQL functions, and SPI all agree that there
should be only a CommandCounterIncrement between the queries that are
generated from a single SQL command by rule expansion. Fetching a whole
new snapshot now happens only between original queries. This is equivalent
to the existing behavior of EXPLAIN ANALYZE, and it was judged to be the
best choice since it eliminates one source of concurrency hazards for
rules. The patch should also make things marginally faster by reducing the
number of snapshot push/pop operations.
The patch removes pg_parse_and_rewrite(), which is no longer used anywhere.
There was considerable discussion about more aggressive refactoring of the
query-processing functions exported by postgres.c, but for the moment
nothing more has been done there.
I also took the opportunity to refactor snapmgr.c's API slightly: the
former PushUpdatedSnapshot() has been split into two functions.
Marko Tiikkaja, reviewed by Steve Singer and Tom Lane
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
transaction snapshots, i.e. a snapshot registered at the beginning of
a transaction. Change variable naming and comments to reflect this reality
in preparation for a future, truly serializable mode, e.g.
Serializable Snapshot Isolation (SSI).
For the moment transaction snapshots are still used to implement
SERIALIZABLE, but hopefully not for too much longer. Patch by Kevin
Grittner and Dan Ports with review and some minor wording changes by me.
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.
Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).
We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
Partially revert the previous patch I installed and replace it with a more
general fix: any time a snapshot is pushed as Active, we need to ensure that it
will not be modified in the future. This means that if the same snapshot is
used as CurrentSnapshot, it needs to be copied separately. This affects
serializable transactions only, because CurrentSnapshot has already been copied
by RegisterSnapshot and so PushActiveSnapshot does not think it needs another
copy. However, CommandCounterIncrement would modify CurrentSnapshot, whereas
ActiveSnapshots must not have their command counters incremented.
I say "partially" because the regression test I added for the previous bug
has been kept.
(This restores 8.3 behavior, because before snapmgr.c existed, any snapshot set
as Active was copied.)
Per bug report from Stuart Bishop in
6bc73d4c0910042358k3d1adff3qa36f8df75198ecea@mail.gmail.com
The old coding was using a regular snapshot, referenced elsewhere, that was
subject to having its command counter updated. Fix by creating a private copy
of the snapshot exclusively for the cursor.
Backpatch to 8.4, which is when the bug was introduced during the snapshot
management rewrite.
non-writable large objects need to have their snapshots registered on the
transaction resowner, not the current portal's, because it must persist until
the large object is closed (which the portal does not). Also, ensure that the
serializable snapshot is recorded by the transaction resource owner too, even
when a subtransaction has changed the current resource owner before
serializable is taken.
Per bug reports from Pavan Deolasee.
by hand. As an added bonus, the new code is smaller and more understandable,
and the ugly loops are gone.
This had been discussed all along but never implemented. It became clear that
it really needed to be fixed after a bug report by Pavan Deolasee.
GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not
depend on the latter being correctly set elsewhere, and while it is more
expensive, this code path is not performance-critical. This is a real
risk for autovacuum, because it can execute whole cycles without doing
a single vacuum, which would mean that RecentGlobalXmin would stay at its
initialization value, FirstNormalTransactionId, causing a bogus value to be
inserted in pg_database. This bug could explain some recent reports of
failure to truncate pg_clog.
At the same time, change the initialization of RecentGlobalXmin to
InvalidTransactionId, and ensure that it's set to something else whenever
it's going to be used. Using it as FirstNormalTransactionId in HOT page
pruning could incur in data loss. InitPostgres takes care of setting it
to a valid value, but the extra checks are there to prevent "special"
backends from behaving in unusual ways.
Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us
command id is the cmin, when it can in fact be a combo cid. That made rows
incorrectly invisible to a transaction where a tuple was deleted by multiple
aborted subtransactions.
Report and patch Karl Schnaitter. Back-patch to 8.3, where combo cids was
introduced.
CopySnapshot, per Neil Conway. Also add a comment about the assumption in
GetSnapshotData that the argument is statically allocated.
Also, fix some more typos in comments in snapmgr.c.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment. This is all done automatically now.
As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit. In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.
The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples. CommandCounter values stored into
snapshots are presumed not to be used for this purpose. This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.
Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages. It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
TransactionIdDidAbort, when handling the case that xmin is one of the current
transaction's XIDs and the tuple has been deleted. xmax must also be one of
the current transaction's XIDs, since no one else can see it yet, and it's
cheaper to look at local state than shared state to find out if xmax aborted.
Per an idea of Heikki's.
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort. Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide. Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time. Since XidGenLock is sometimes held across I/O this can be a
significant win. Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench. Concept by Florian Pflug, implementation
by Tom Lane.
that still thought they could set HEAP_XMAX_COMMITTED immediately after
seeing the other transaction commit. Make them use the same logic as
tqual.c does to determine if the hint bit can be set yet.
before reporting a transaction committed. Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions. Patch by Simon, some editorialization by Tom.
pointer" in every Snapshot struct. This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though). More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.
There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.
Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots. I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches. Might want to reconsider
doing that later.
keeping private state in each backend that has inserted and deleted the same
tuple during its current top-level transaction. This is sufficient since
there is no need to be able to determine the cmin/cmax from any other
transaction. This gets us back down to 23-byte headers, removing a penalty
paid in 8.0 to support subtransactions. Patch by Heikki Linnakangas, with
minor revisions by moi, following a design hashed out awhile back on the
pghackers list.
cases where we already hold the desired lock "indirectly", either via
membership in a MultiXact or because the lock was originally taken by a
different subtransaction of the current transaction. These cases must be
accounted for to avoid needless deadlocks and/or inappropriate replacement of
an exclusive lock with a shared lock. Per report from Clarence Gardner and
subsequent investigation.
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
PGPROC array into snapshots, and use this information to avoid visits
to pg_subtrans in HeapTupleSatisfiesSnapshot. This appears to solve
the pg_subtrans-related context swap storm problem that's been reported
by several people for 8.1. While at it, modify GetSnapshotData to not
take an exclusive lock on ProcArrayLock, as closer analysis shows that
shared lock is always sufficient.
Itagaki Takahiro and Tom Lane
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
insufficient paranoia in code that follows t_ctid links. (We must do both
because even with VACUUM doing it properly, the intermediate state with
a dangling t_ctid link is visible concurrently during lazy VACUUM, and
could be seen afterwards if either type of VACUUM crashes partway through.)
Also try to improve documentation about what's going on. Patch is a bit
bulky because passing the XMAX information around required changing the
APIs of some low-level heapam.c routines, but it's not conceptually very
complicated. Per trouble report from Teodor and subsequent analysis.
This needs to be back-patched, but I'll do that after 8.1 beta is out.
communication structure, and make it its own module with its own lock.
This should reduce contention at least a little, and it definitely makes
the code seem cleaner. Per my recent proposal.
before we check commit/abort status. Formerly this was done in some paths
but not all, with the result that a transaction might be considered
committed for some purposes before it became committed for others.
Per example found by Jan Wieck.
to eliminate unnecessary deadlocks. This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE. The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets. When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX. This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared. Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.
Alvaro Herrera and Tom Lane.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php.
This fix is intended to be permanent: it moves the responsibility for
calling SetBufferCommitInfoNeedsSave() into the tqual.c routines,
eliminating the requirement for callers to test whether t_infomask changed.
Also, tighten validity checking on buffer IDs in bufmgr.c --- several
routines were paranoid about out-of-range shared buffer numbers but not
about out-of-range local ones, which seems a tad pointless.
pg_subtrans --- what we need is the oldest xmin of any snapshot in use
in the current top transaction. Introduce a new variable TransactionXmin
to play this role. Fixes intermittent regression failure reported by
Neil Conway.
mode see a fresh snapshot for each command in the function, rather than
using the latest interactive command's snapshot. Also, suppress fresh
snapshots as well as CommandCounterIncrement inside STABLE and IMMUTABLE
functions, instead using the snapshot taken for the most closely nested
regular query. (This behavior is only sane for read-only functions, so
the patch also enforces that such functions contain only SELECT commands.)
As per my proposal of 6-Sep-2004; I note that I floated essentially the
same proposal on 19-Jun-2002, but that discussion tailed off without any
action. Since 8.0 seems like the right place to be taking possibly
nontrivial backwards compatibility hits, let's get it done now.
RecentXmin (== MyProc->xmin). This ensures that it will be safe to
truncate pg_subtrans at RecentGlobalXmin, which should largely eliminate
any fear of bloat. Along the way, eliminate SubTransXidsHaveCommonAncestor,
which isn't really needed and could not give a trustworthy result anyway
under the lookback restriction.
In an unrelated but nearby change, #ifdef out GetUndoRecPtr, which has
been dead code since 2001 and seems unlikely to ever be resurrected.
password/group files. Also allow read-only subtransactions of a read-write
parent, but not vice versa. These are the reasonably noncontroversial
parts of Alvaro's recent mop-up patch, plus further work on large objects
to minimize use of the TopTransactionResourceOwner.
performance front, but with feature freeze upon us I think it's time to
drive a stake in the ground and say that this will be in 7.5.
Alvaro Herrera, with some help from Tom Lane.
discussion on pgsql-hackers: in READ COMMITTED mode we just have to force
a QuerySnapshot update in the trigger, but in SERIALIZABLE mode we have
to run the scan under a current snapshot and then complain if any rows
would be updated/deleted that are not visible in the transaction snapshot.
to allow es_snapshot to be set to SnapshotNow rather than a query snapshot.
This solves a bug reported by Wade Klaver, wherein triggers fired as a
result of RI cascade updates could misbehave.
difference between INSERT_IN_PROGRESS and DELETE_IN_PROGRESS for
tuples inserted and then deleted by a concurrent transaction.
Example of bug:
regression=# create table foo (f1 int);
CREATE TABLE
regression=# begin;
BEGIN
regression=# insert into foo values(1);
INSERT 195531 1
regression=# delete from foo;
DELETE 1
regression=# insert into foo values(1);
INSERT 195532 1
regression=# create unique index fooi on foo(f1);
ERROR: could not create unique index
DETAIL: Table contains duplicated values.
>> actually having updated the tuple, [...] can we simply
>> set the HEAP_XMAX_INVALID hint bit of the tuple?
>
>AFAICS this is a reasonable thing to do.
Thanks for the confirmation. Here's a patch which also contains some
more noncritical changes to tqual.c:
. make code more readable by introducing local variables for xvac
. no longer two separate branches for aborted and crashed.
The actions were the same in all cases.
Manfred Koizar
free'd for every transaction or statement, respectively. This patch
puts these data structures into static memory, thus saving a few CPU
cycles and two malloc calls per transaction or (in isolation level
READ COMMITTED) per query.
Manfred Koizar
into postgres.c; make sure it happens for all cases that seem to need it.
Perhaps it would be better to explicitly exclude just a few utility
statement types from setting a snapshot?
(overlaying low byte of page size) and add HEAP_HASOID bit to t_infomask,
per earlier discussion. Simplify scheme for overlaying fields in tuple
header (no need for cmax to live in more than one place). Don't try to
clear infomask status bits in tqual.c --- not safe to do it there. Don't
try to force output table of a SELECT INTO to have OIDs, either. Get rid
of unnecessarily complex three-state scheme for TupleDesc.tdhasoids, which
has already caused one recent failure. Improve documentation.
tuple header. The fix is based on the thought that HEAP_MOVED_IN is
not needed any more as soon as HEAP_XMIN_COMMITTED has been set. So
in tqual.c and vacuum.c the HEAP_MOVED bits are cleared when
HEAP_XMIN_COMMITTED is set.
Vacuum robustness is enhanced by rearranging ifs, so that we have a
chance to elog(ERROR, ...) before an assertion fails.
A new regression test is included.
Manfred Koizar
HeapTupleHeaderData in setter and getter macros called
HeapTupleHeaderGetXmin, HeapTupleHeaderSetXmin etc.
It also introduces a "virtual" field xvac by defining
HeapTupleHeaderGetXvac and HeapTupleHeaderSetXvac. Xvac is used by
VACUUM, in fact it is stored in t_cmin.
Manfred Koizar
transaction, so as to avoid returning them out of the index AM. Saves
repeated heap_fetch operations on frequently-updated rows. Also detect
queries on unique keys (equality to all columns of a unique index), and
don't bother continuing scan once we have found first match.
Killing is implemented in the btree and hash AMs, but not yet in rtree
or gist, because there isn't an equally convenient place to do it in
those AMs (the outer amgetnext routine can't do it without re-pinning
the index page).
Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
index_insert to make this a little easier.
in snapshots, per my proposal of a few days ago. Also, tweak heapam.c
routines (heap_insert, heap_update, heap_delete, heap_mark4update) to
be passed the command ID to use, instead of doing GetCurrentCommandID.
For catalog updates they'll still get passed current command ID, but
for updates generated from the main executor they'll get passed the
command ID saved in the snapshot the query is using. This should fix
some corner cases associated with functions and triggers that advance
current command ID while an outer query is still in progress.
mess up after an aborted VACUUM FULL, per today's pghackers discussion.
Add a suitable HeapTupleSatisfiesToast routine. Remove useless special-
case test in HeapTupleSatisfiesVisibility macro for xmax =
BootstrapTransactionId; perhaps that was needed at one time, but it's
a waste of cycles now, not to mention actively wrong for SnapshotAny.
Along the way, add some much-needed comments to tqual.c, and simplify
toast_fetch_datum, which no longer needs to assume it may see chunks
out-of-order.
as either HEAP_XMAX_COMMITTED or HEAP_XMAX_INVALID once the updating
transaction is gone. Otherwise some other transaction may come along
and try to test the commit status of t_xmax later --- which could be
after VACUUM has recycled the CLOG status for that xact. Bug introduced
in post-beta4 bug fix.
buffer manager with 'pg_clog', a specialized access method modeled
on pg_xlog. This simplifies startup (don't need to play games to
open pg_log; among other things, OverrideTransactionSystem goes away),
should improve performance a little, and opens the door to recycling
commit log space by removing no-longer-needed segments of the commit
log. Actual recycling is not there yet, but I felt I should commit
this part separately since it'd still be useful if we chose not to
do transaction ID wraparound.
in GetSnapshotData, GetNewTransactionId, CommitTransaction, AbortTransaction,
etc. Correct race condition in transaction status testing in
HeapTupleSatisfiesVacuum --- this wasn't important for old VACUUM with
exclusive lock on its table, but it sure is important now. All per
pghackers discussion 7/11/01 and 7/12/01.
validity checking rules for VACUUM. Make some other rearrangements of the
VACUUM code to allow more code to be shared between full and lazy VACUUM.
Minor code cleanups and added comments for TransactionId manipulations.