/*-------------------------------------------------------------------------
*
* snapmgr.c
* PostgreSQL snapshot manager
*
* We keep track of snapshots in two ways: those "registered" by resowner.c,
* and the "active snapshot" stack. All snapshots in either of them live in
* persistent memory. When a snapshot is no longer in any of these lists
* (tracked by separate refcounts on each snapshot), its memory can be freed.
*
* The FirstXactSnapshot, if any, is treated a bit specially: we increment its
* regd_count and list it in RegisteredSnapshots, but this reference is not
* tracked by a resource owner. We used to use the TopTransactionResourceOwner
* to track this snapshot reference, but that introduces logical circularity
* and thus makes it impossible to clean up in a sane fashion. It's better to
* handle this reference as an internally-tracked registration, so that this
* module is entirely lower-level than ResourceOwners.
*
* Likewise, any snapshots that have been exported by pg_export_snapshot
* have regd_count = 1 and are listed in RegisteredSnapshots, but are not
* tracked by any resource owner.
*
* Likewise, the CatalogSnapshot is listed in RegisteredSnapshots when it
* is valid, but is not tracked by any resource owner.
*
* The same is true for historic snapshots used during logical decoding;
* their lifetime is managed separately (as they live longer than one xact.c
* transaction).
*
* These arrangements let us reset MyProc->xmin when there are no snapshots
* referenced by this transaction, and advance it when the one with oldest
* Xmin is no longer referenced.  For simplicity, however, only registered
* snapshots, not active snapshots, participate in tracking which one is oldest;
* we don't try to change MyProc->xmin except when the active-snapshot
* stack is empty.
*
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* src/backend/utils/time/snapmgr.c
*
*-------------------------------------------------------------------------
*/
#include "postgres.h"
#include <sys/stat.h>
#include <unistd.h>
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "datatype/timestamp.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
#include "port/pg_lfind.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinval.h"
#include "storage/sinvaladt.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/resowner.h"
#include "utils/snapmgr.h"
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00

#include "utils/syscache.h"
#include "utils/timestamp.h"

/*
 * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
 * mode, and to the latest one taken in a read-committed transaction.
 * SecondarySnapshot is a snapshot that's always up-to-date as of the current
 * instant, even in transaction-snapshot mode.  It should only be used for
 * special-purpose code (say, RI checking.)  CatalogSnapshot points to an
Account for catalog snapshot in PGXACT->xmin updates.
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
2016-11-15 21:55:35 +01:00
 * MVCC snapshot intended to be used for catalog scans; we must invalidate it
 * whenever a system catalog change occurs.
 *
 * These SnapshotData structs are static to simplify memory allocation
 * (see the hack in GetSnapshotData to avoid repeated malloc/free).
 */
static SnapshotData CurrentSnapshotData = {SNAPSHOT_MVCC};
static SnapshotData SecondarySnapshotData = {SNAPSHOT_MVCC};
SnapshotData CatalogSnapshotData = {SNAPSHOT_MVCC};
SnapshotData SnapshotSelfData = {SNAPSHOT_SELF};
SnapshotData SnapshotAnyData = {SNAPSHOT_ANY};

/* Pointers to valid snapshots */
static Snapshot CurrentSnapshot = NULL;
static Snapshot SecondarySnapshot = NULL;
static Snapshot CatalogSnapshot = NULL;
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the affected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Geoghegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
static Snapshot HistoricSnapshot = NULL;

/*
 * These are updated by GetSnapshotData.  We initialize them this way
 * for the convenience of TransactionIdIsInProgress: even in bootstrap
 * mode, we don't want it to say that BootstrapTransactionId is in progress.
 */
TransactionId TransactionXmin = FirstNormalTransactionId;
TransactionId RecentXmin = FirstNormalTransactionId;

/* (table, ctid) => (cmin, cmax) mapping during timetravel */
static HTAB *tuplecid_data = NULL;

/*
 * Elements of the active snapshot stack.
 *
 * Each element here accounts for exactly one active_count on SnapshotData.
 *
 * NB: the code assumes that elements in this list are in non-increasing
 * order of as_level; also, the list must be NULL-terminated.
 */
typedef struct ActiveSnapshotElt
{
	Snapshot	as_snap;
	int			as_level;
	struct ActiveSnapshotElt *as_next;
} ActiveSnapshotElt;

/* Top of the stack of active snapshots */
static ActiveSnapshotElt *ActiveSnapshot = NULL;

/* Bottom of the stack of active snapshots */
static ActiveSnapshotElt *OldestActiveSnapshot = NULL;

/*
 * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
 * quickly find the one with lowest xmin, to advance our MyProc->xmin.
 */
static int	xmin_cmp(const pairingheap_node *a, const pairingheap_node *b,
					 void *arg);

static pairingheap RegisteredSnapshots = {&xmin_cmp, NULL, NULL};

/* first GetTransactionSnapshot call in a transaction? */
bool		FirstSnapshotSet = false;

/*
 * Remember the serializable transaction snapshot, if any.  We cannot trust
 * FirstSnapshotSet in combination with IsolationUsesXactSnapshot(), because
 * GUC may be reset before us, changing the value of IsolationUsesXactSnapshot.
 */
static Snapshot FirstXactSnapshot = NULL;

/* Define pathname of exported-snapshot files */
#define SNAPSHOT_EXPORT_DIR "pg_snapshots"

/* Structure holding info about exported snapshot. */
typedef struct ExportedSnapshot
{
	char	   *snapfile;
	Snapshot	snapshot;
} ExportedSnapshot;

/* Current xact's exported snapshots (a list of ExportedSnapshot structs) */
static List *exportedSnapshots = NIL;

/* Prototypes for local functions */
static Snapshot CopySnapshot(Snapshot snapshot);
Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.
The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.
All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.
The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.
Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget. There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.
Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure if they could reach a codepath that needs
remembering extra resources.
There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer io triggered a PANIC. Now
everything prints a WARNING, including those cases.
Add tests in src/test/modules/test_resowner.
Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 12:30:50 +01:00
static void UnregisterSnapshotNoOwner(Snapshot snapshot);
static void FreeSnapshot(Snapshot snapshot);
static void SnapshotResetXmin(void);

/* ResourceOwner callbacks to track snapshot references */
static void ResOwnerReleaseSnapshot(Datum res);

static const ResourceOwnerDesc snapshot_resowner_desc =
{
	.name = "snapshot reference",
	.release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
	.release_priority = RELEASE_PRIO_SNAPSHOT_REFS,
	.ReleaseResource = ResOwnerReleaseSnapshot,
	.DebugPrint = NULL			/* the default message is fine */
};

/* Convenience wrappers over ResourceOwnerRemember/Forget */
static inline void
ResourceOwnerRememberSnapshot(ResourceOwner owner, Snapshot snap)
{
	ResourceOwnerRemember(owner, PointerGetDatum(snap), &snapshot_resowner_desc);
}

static inline void
ResourceOwnerForgetSnapshot(ResourceOwner owner, Snapshot snap)
{
	ResourceOwnerForget(owner, PointerGetDatum(snap), &snapshot_resowner_desc);
}
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 21:02:14 +02:00

/*
 * Snapshot fields to be serialized.
 *
 * Only these fields need to be sent to the cooperating backend; the
 * remaining ones can (and must) be set by the receiver upon restore.
 */
typedef struct SerializedSnapshotData
{
	TransactionId xmin;
	TransactionId xmax;
	uint32		xcnt;
	int32		subxcnt;
	bool		suboverflowed;
	bool		takenDuringRecovery;
	CommandId	curcid;
	TimestampTz whenTaken;
	XLogRecPtr	lsn;
} SerializedSnapshotData;

/*
 * GetTransactionSnapshot
 *		Get the appropriate snapshot for a new query in a transaction.
 *
 * Note that the return value may point at static storage that will be modified
 * by future calls and by CommandCounterIncrement().  Callers should call
 * RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
 * used very long.
 */
Snapshot
GetTransactionSnapshot(void)
{
	/*
	 * Return historic snapshot if doing logical decoding. We'll never need a
	 * non-historic transaction snapshot in this (sub-)transaction, so there's
	 * no need to be careful to set one up for later calls to
	 * GetTransactionSnapshot().
	 */
	if (HistoricSnapshotActive())
	{
		Assert(!FirstSnapshotSet);
		return HistoricSnapshot;
	}

	/* First call in transaction? */
	if (!FirstSnapshotSet)
	{
|
|
|
/*
|
|
|
|
* Don't allow catalog snapshot to be older than xact snapshot. Must
|
|
|
|
* do this first to allow the empty-heap Assert to succeed.
|
|
|
|
*/
|
|
|
|
InvalidateCatalogSnapshot();
|
|
|
|
|
2015-01-17 00:14:32 +01:00
|
|
|
Assert(pairingheap_is_empty(&RegisteredSnapshots));
|
2011-09-27 04:25:28 +02:00
|
|
|
Assert(FirstXactSnapshot == NULL);
|
2008-11-25 21:28:29 +01:00
|
|
|
|
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 21:02:14 +02:00
|
|
|
if (IsInParallelMode())
|
|
|
|
elog(ERROR,
|
|
|
|
"cannot take query snapshot during a parallel operation");
|
|
|
|
|
        /*
         * In transaction-snapshot mode, the first snapshot must live until
         * end of xact regardless of what the caller does with it, so we must
         * make a copy of it rather than returning CurrentSnapshotData
         * directly.  Furthermore, if we're running in serializable mode,
         * predicate.c needs to wrap the snapshot fetch in its own processing.
         */
        if (IsolationUsesXactSnapshot())
        {
            /* First, create the snapshot in CurrentSnapshotData */
            if (IsolationIsSerializable())
                CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
            else
                CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

            /* Make a saved copy */
            CurrentSnapshot = CopySnapshot(CurrentSnapshot);
            FirstXactSnapshot = CurrentSnapshot;
            /* Mark it as "registered" in FirstXactSnapshot */
            FirstXactSnapshot->regd_count++;
            pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
        }
        else
            CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

        FirstSnapshotSet = true;
        return CurrentSnapshot;
    }

    if (IsolationUsesXactSnapshot())
        return CurrentSnapshot;

    /* Don't allow catalog snapshot to be older than xact snapshot. */
    InvalidateCatalogSnapshot();

    CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

    return CurrentSnapshot;
}

/*
 * GetLatestSnapshot
 *      Get a snapshot that is up-to-date as of the current instant,
 *      even if we are executing in transaction-snapshot mode.
 */
Snapshot
GetLatestSnapshot(void)
{
    /*
     * We might be able to relax this, but nothing that could otherwise work
     * needs it.
     */
    if (IsInParallelMode())
        elog(ERROR,
             "cannot update SecondarySnapshot during a parallel operation");

    /*
     * So far there are no cases requiring support for GetLatestSnapshot()
     * during logical decoding, but it wouldn't be hard to add if required.
     */
    Assert(!HistoricSnapshotActive());

    /* If first call in transaction, go ahead and set the xact snapshot */
    if (!FirstSnapshotSet)
        return GetTransactionSnapshot();

    SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);

    return SecondarySnapshot;
}

/*
 * GetOldestSnapshot
 *
 *      Get the transaction's oldest known snapshot, as judged by the LSN.
 *      Will return NULL if there are no active or registered snapshots.
 */
Snapshot
GetOldestSnapshot(void)
{
    Snapshot    OldestRegisteredSnapshot = NULL;
    XLogRecPtr  RegisteredLSN = InvalidXLogRecPtr;

    if (!pairingheap_is_empty(&RegisteredSnapshots))
    {
        OldestRegisteredSnapshot = pairingheap_container(SnapshotData, ph_node,
                                        pairingheap_first(&RegisteredSnapshots));
        RegisteredLSN = OldestRegisteredSnapshot->lsn;
    }

    if (OldestActiveSnapshot != NULL)
    {
        XLogRecPtr  ActiveLSN = OldestActiveSnapshot->as_snap->lsn;

        if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN)
            return OldestActiveSnapshot->as_snap;
    }

    return OldestRegisteredSnapshot;
}

/*
 * GetCatalogSnapshot
 *      Get a snapshot that is sufficiently up-to-date for scan of the
 *      system catalog with the specified OID.
 */
Snapshot
GetCatalogSnapshot(Oid relid)
{
    /*
     * Return historic snapshot while we're doing logical decoding, so we can
     * see the appropriate state of the catalog.
     *
     * This is the primary reason for needing to reset the system caches
     * after finishing decoding.
     */
    if (HistoricSnapshotActive())
        return HistoricSnapshot;

    return GetNonHistoricCatalogSnapshot(relid);
}

/*
 * GetNonHistoricCatalogSnapshot
 *      Get a snapshot that is sufficiently up-to-date for scan of the system
 *      catalog with the specified OID, even while historic snapshots are set
 *      up.
 */
Snapshot
GetNonHistoricCatalogSnapshot(Oid relid)
{
    /*
     * If the caller is trying to scan a relation that has no syscache, no
     * catcache invalidations will be sent when it is updated.  For a few key
     * relations, snapshot invalidations are sent instead.  If we're trying to
     * scan a relation for which neither catcache nor snapshot invalidations
     * are sent, we must refresh the snapshot every time.
     */
    if (CatalogSnapshot &&
        !RelationInvalidatesSnapshotsOnly(relid) &&
        !RelationHasSysCache(relid))
        InvalidateCatalogSnapshot();

    if (CatalogSnapshot == NULL)
    {
        /* Get new snapshot. */
        CatalogSnapshot = GetSnapshotData(&CatalogSnapshotData);

        /*
         * Make sure the catalog snapshot will be accounted for in decisions
         * about advancing PGPROC->xmin. We could apply RegisterSnapshot, but
         * that would result in making a physical copy, which is overkill; and
         * it would also create a dependency on some resource owner, which we
         * do not want for reasons explained at the head of this file. Instead
         * just shove the CatalogSnapshot into the pairing heap manually. This
         * has to be reversed in InvalidateCatalogSnapshot, of course.
         *
         * NB: it had better be impossible for this to throw error, since the
         * CatalogSnapshot pointer is already valid.
         */
        pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
    }

    return CatalogSnapshot;
}

/*
 * InvalidateCatalogSnapshot
 *		Mark the current catalog snapshot, if any, as invalid
 *
 * We could change this API to allow the caller to provide more fine-grained
 * invalidation details, so that a change to relation A wouldn't prevent us
 * from using our cached snapshot to scan relation B, but so far there's no
 * evidence that the CPU cycles we spent tracking such fine details would be
 * well-spent.
 */
void
InvalidateCatalogSnapshot(void)
{
    if (CatalogSnapshot)
    {
        pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
        CatalogSnapshot = NULL;
        SnapshotResetXmin();
    }
}

/*
 * InvalidateCatalogSnapshotConditionally
 *		Drop catalog snapshot if it's the only one we have
 *
 * This is called when we are about to wait for client input, so we don't
 * want to continue holding the catalog snapshot if it might mean that the
 * global xmin horizon can't advance. However, if there are other snapshots
 * still active or registered, the catalog snapshot isn't likely to be the
 * oldest one, so we might as well keep it.
 */
void
InvalidateCatalogSnapshotConditionally(void)
{
    if (CatalogSnapshot &&
        ActiveSnapshot == NULL &&
        pairingheap_is_singular(&RegisteredSnapshots))
        InvalidateCatalogSnapshot();
}

/*
 * SnapshotSetCommandId
 *		Propagate CommandCounterIncrement into the static snapshots, if set
 */
void
SnapshotSetCommandId(CommandId curcid)
{
    if (!FirstSnapshotSet)
        return;

    if (CurrentSnapshot)
        CurrentSnapshot->curcid = curcid;
    if (SecondarySnapshot)
        SecondarySnapshot->curcid = curcid;
    /* Should we do the same with CatalogSnapshot? */
}

/*
 * SetTransactionSnapshot
 *		Set the transaction's snapshot from an imported MVCC snapshot.
 *
 * Note that this is very closely tied to GetTransactionSnapshot --- it
 * must take care of all the same considerations as the first-snapshot case
 * in GetTransactionSnapshot.
 */
static void
SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
                       int sourcepid, PGPROC *sourceproc)
{
    /* Caller should have checked this already */
    Assert(!FirstSnapshotSet);

    /* Better do this to ensure following Assert succeeds. */
    InvalidateCatalogSnapshot();

    Assert(pairingheap_is_empty(&RegisteredSnapshots));
    Assert(FirstXactSnapshot == NULL);
    Assert(!HistoricSnapshotActive());
|
2011-10-23 00:22:45 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Even though we are not going to use the snapshot it computes, we must
|
|
|
|
* call GetSnapshotData, for two reasons: (1) to be sure that
|
|
|
|
* CurrentSnapshotData's XID arrays have been allocated, and (2) to update
|
snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.
Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.
The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.
The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions. These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.
When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession. As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.
To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.
This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.
Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 01:03:49 +02:00
|
|
|
* the state for GlobalVis*.
|
2011-10-23 00:22:45 +02:00
|
|
|
*/
|
|
|
|
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now copy appropriate fields from the source snapshot.
|
|
|
|
*/
|
|
|
|
CurrentSnapshot->xmin = sourcesnap->xmin;
|
|
|
|
CurrentSnapshot->xmax = sourcesnap->xmax;
|
|
|
|
CurrentSnapshot->xcnt = sourcesnap->xcnt;
|
|
|
|
Assert(sourcesnap->xcnt <= GetMaxSnapshotXidCount());
|
2022-03-04 00:13:24 +01:00
|
|
|
if (sourcesnap->xcnt > 0)
|
|
|
|
memcpy(CurrentSnapshot->xip, sourcesnap->xip,
|
|
|
|
sourcesnap->xcnt * sizeof(TransactionId));
|
2011-10-23 00:22:45 +02:00
|
|
|
CurrentSnapshot->subxcnt = sourcesnap->subxcnt;
|
|
|
|
Assert(sourcesnap->subxcnt <= GetMaxSnapshotSubxidCount());
|
2022-03-04 00:13:24 +01:00
|
|
|
if (sourcesnap->subxcnt > 0)
|
|
|
|
memcpy(CurrentSnapshot->subxip, sourcesnap->subxip,
|
|
|
|
sourcesnap->subxcnt * sizeof(TransactionId));
|
2011-10-23 00:22:45 +02:00
|
|
|
CurrentSnapshot->suboverflowed = sourcesnap->suboverflowed;
|
|
|
|
CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
|
|
|
|
/* NB: curcid should NOT be copied, it's a local matter */
|
|
|
|
|
2020-08-18 06:07:10 +02:00
|
|
|
CurrentSnapshot->snapXactCompletionCount = 0;
|
|
|
|
|
2011-10-23 00:22:45 +02:00
|
|
|
/*
|
2020-08-14 01:25:21 +02:00
|
|
|
* Now we have to fix what GetSnapshotData did with MyProc->xmin and
|
2011-10-23 00:22:45 +02:00
|
|
|
* TransactionXmin. There is a race condition: to make sure we are not
|
|
|
|
* causing the global xmin to go backwards, we have to test that the
|
|
|
|
* source transaction is still running, and that has to be done
|
|
|
|
* atomically. So let procarray.c do it.
|
|
|
|
*
|
|
|
|
* Note: in serializable mode, predicate.c will do this a second time. It
|
|
|
|
* doesn't seem worth contorting the logic here to avoid two calls,
|
|
|
|
* especially since it's not clear that predicate.c *must* do this.
|
|
|
|
*/
|
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 21:02:14 +02:00
|
|
|
if (sourceproc != NULL)
|
|
|
|
{
|
|
|
|
if (!ProcArrayInstallRestoredXmin(CurrentSnapshot->xmin, sourceproc))
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
|
|
|
|
errmsg("could not import the requested snapshot"),
|
|
|
|
errdetail("The source transaction is not running anymore.")));
|
|
|
|
}
|
2017-06-14 20:57:21 +02:00
|
|
|
else if (!ProcArrayInstallImportedXmin(CurrentSnapshot->xmin, sourcevxid))
|
2011-10-23 00:22:45 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
|
|
|
|
errmsg("could not import the requested snapshot"),
|
2017-09-11 17:20:47 +02:00
|
|
|
errdetail("The source process with PID %d is not running anymore.",
|
2017-06-14 20:57:21 +02:00
|
|
|
sourcepid)));
|
2011-10-23 00:22:45 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In transaction-snapshot mode, the first snapshot must live until end of
|
|
|
|
* xact, so we must make a copy of it. Furthermore, if we're running in
|
|
|
|
* serializable mode, predicate.c needs to do its own processing.
|
|
|
|
*/
|
|
|
|
if (IsolationUsesXactSnapshot())
|
|
|
|
{
|
|
|
|
if (IsolationIsSerializable())
|
2017-06-14 20:57:21 +02:00
|
|
|
SetSerializableTransactionSnapshot(CurrentSnapshot, sourcevxid,
|
|
|
|
sourcepid);
|
2011-10-23 00:22:45 +02:00
|
|
|
/* Make a saved copy */
|
|
|
|
CurrentSnapshot = CopySnapshot(CurrentSnapshot);
|
|
|
|
FirstXactSnapshot = CurrentSnapshot;
|
|
|
|
/* Mark it as "registered" in FirstXactSnapshot */
|
|
|
|
FirstXactSnapshot->regd_count++;
|
2015-01-17 00:14:32 +01:00
|
|
|
pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
|
2011-10-23 00:22:45 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
FirstSnapshotSet = true;
|
|
|
|
}

/*
 * CopySnapshot
 *		Copy the given snapshot.
 *
 * The copy is palloc'd in TopTransactionContext and has initial refcounts set
 * to 0.  The returned snapshot has the copied flag set.
 */
static Snapshot
CopySnapshot(Snapshot snapshot)
{
	Snapshot	newsnap;
	Size		subxipoff;
	Size		size;

	Assert(snapshot != InvalidSnapshot);

	/* We allocate any XID arrays needed in the same palloc block. */
	size = subxipoff = sizeof(SnapshotData) +
		snapshot->xcnt * sizeof(TransactionId);
	if (snapshot->subxcnt > 0)
		size += snapshot->subxcnt * sizeof(TransactionId);

	newsnap = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
	memcpy(newsnap, snapshot, sizeof(SnapshotData));

	newsnap->regd_count = 0;
	newsnap->active_count = 0;
	newsnap->copied = true;
	newsnap->snapXactCompletionCount = 0;

	/* setup XID array */
	if (snapshot->xcnt > 0)
	{
		newsnap->xip = (TransactionId *) (newsnap + 1);
		memcpy(newsnap->xip, snapshot->xip,
			   snapshot->xcnt * sizeof(TransactionId));
	}
	else
		newsnap->xip = NULL;

	/*
	 * Setup subXID array.  Don't bother to copy it if it had overflowed,
	 * though, because it's not used anywhere in that case.  Except if it's a
	 * snapshot taken during recovery; all the top-level XIDs are in subxip as
	 * well in that case, so we mustn't lose them.
	 */
	if (snapshot->subxcnt > 0 &&
		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
	{
		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
		memcpy(newsnap->subxip, snapshot->subxip,
			   snapshot->subxcnt * sizeof(TransactionId));
	}
	else
		newsnap->subxip = NULL;

	return newsnap;
}

/*
 * FreeSnapshot
 *		Free the memory associated with a snapshot.
 */
static void
FreeSnapshot(Snapshot snapshot)
{
	Assert(snapshot->regd_count == 0);
	Assert(snapshot->active_count == 0);
	Assert(snapshot->copied);

	pfree(snapshot);
}

/*
 * PushActiveSnapshot
 *		Set the given snapshot as the current active snapshot
 *
 * If the passed snapshot is a statically-allocated one, or it is possibly
 * subject to a future command counter update, create a new long-lived copy
 * with active refcount=1.  Otherwise, only increment the refcount.
 */
void
PushActiveSnapshot(Snapshot snapshot)
{
	PushActiveSnapshotWithLevel(snapshot, GetCurrentTransactionNestLevel());
}

/*
 * PushActiveSnapshotWithLevel
 *		Set the given snapshot as the current active snapshot
 *
 * Same as PushActiveSnapshot except that caller can specify the
 * transaction nesting level that "owns" the snapshot.  This level
 * must not be deeper than the current top of the snapshot stack.
 */
void
PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level)
{
	ActiveSnapshotElt *newactive;

	Assert(snapshot != InvalidSnapshot);
	Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);

	newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));

	/*
	 * Checking SecondarySnapshot is probably useless here, but it seems
	 * better to be sure.
	 */
	if (snapshot == CurrentSnapshot || snapshot == SecondarySnapshot ||
		!snapshot->copied)
		newactive->as_snap = CopySnapshot(snapshot);
	else
		newactive->as_snap = snapshot;

	newactive->as_next = ActiveSnapshot;
	newactive->as_level = snap_level;

	newactive->as_snap->active_count++;

	ActiveSnapshot = newactive;
	if (OldestActiveSnapshot == NULL)
		OldestActiveSnapshot = ActiveSnapshot;
}

/*
 * PushCopiedSnapshot
 *		As above, except forcibly copy the presented snapshot.
 *
 * This should be used when the ActiveSnapshot has to be modifiable, for
 * example if the caller intends to call UpdateActiveSnapshotCommandId.
 * The new snapshot will be released when popped from the stack.
 */
void
PushCopiedSnapshot(Snapshot snapshot)
{
	PushActiveSnapshot(CopySnapshot(snapshot));
}

/*
 * UpdateActiveSnapshotCommandId
 *
 * Update the current CID of the active snapshot.  This can only be applied
 * to a snapshot that is not referenced elsewhere.
 */
void
UpdateActiveSnapshotCommandId(void)
{
	CommandId	save_curcid,
				curcid;

	Assert(ActiveSnapshot != NULL);
	Assert(ActiveSnapshot->as_snap->active_count == 1);
	Assert(ActiveSnapshot->as_snap->regd_count == 0);

	/*
	 * Don't allow modification of the active snapshot during parallel
	 * operation.  We share the snapshot to worker backends at the beginning
	 * of parallel operation, so any change to the snapshot can lead to
	 * inconsistencies.  We have other defenses against
	 * CommandCounterIncrement, but there are a few places that call this
	 * directly, so we put an additional guard here.
	 */
	save_curcid = ActiveSnapshot->as_snap->curcid;
	curcid = GetCurrentCommandId(false);
	if (IsInParallelMode() && save_curcid != curcid)
		elog(ERROR, "cannot modify commandid in active snapshot during a parallel operation");
	ActiveSnapshot->as_snap->curcid = curcid;
}

/*
 * PopActiveSnapshot
 *
 * Remove the topmost snapshot from the active snapshot stack, decrementing the
 * reference count, and free it if this was the last reference.
 */
void
PopActiveSnapshot(void)
{
	ActiveSnapshotElt *newstack;

	newstack = ActiveSnapshot->as_next;

	Assert(ActiveSnapshot->as_snap->active_count > 0);

	ActiveSnapshot->as_snap->active_count--;

	if (ActiveSnapshot->as_snap->active_count == 0 &&
		ActiveSnapshot->as_snap->regd_count == 0)
		FreeSnapshot(ActiveSnapshot->as_snap);

	pfree(ActiveSnapshot);
	ActiveSnapshot = newstack;
	if (ActiveSnapshot == NULL)
		OldestActiveSnapshot = NULL;

	SnapshotResetXmin();
}

/*
 * GetActiveSnapshot
 *		Return the topmost snapshot in the Active stack.
 */
Snapshot
GetActiveSnapshot(void)
{
	Assert(ActiveSnapshot != NULL);

	return ActiveSnapshot->as_snap;
}

/*
 * ActiveSnapshotSet
 *		Return whether there is at least one snapshot in the Active stack
 */
bool
ActiveSnapshotSet(void)
{
	return ActiveSnapshot != NULL;
}

/*
 * RegisterSnapshot
 *		Register a snapshot as being in use by the current resource owner
 *
 * If InvalidSnapshot is passed, it is not registered.
 */
Snapshot
RegisterSnapshot(Snapshot snapshot)
{
	if (snapshot == InvalidSnapshot)
		return InvalidSnapshot;

	return RegisterSnapshotOnOwner(snapshot, CurrentResourceOwner);
}

/*
 * RegisterSnapshotOnOwner
 *		As above, but use the specified resource owner
 */
Snapshot
RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
{
	Snapshot	snap;

	if (snapshot == InvalidSnapshot)
		return InvalidSnapshot;

	/* Static snapshot?  Create a persistent copy */
	snap = snapshot->copied ? snapshot : CopySnapshot(snapshot);

	/* and tell resowner.c about it */
	ResourceOwnerEnlarge(owner);
	snap->regd_count++;
	ResourceOwnerRememberSnapshot(owner, snap);

	if (snap->regd_count == 1)
		pairingheap_add(&RegisteredSnapshots, &snap->ph_node);

	return snap;
}

/*
 * UnregisterSnapshot
 *
 * Decrement the reference count of a snapshot, remove the corresponding
 * reference from CurrentResourceOwner, and free the snapshot if no more
 * references remain.
 */
void
UnregisterSnapshot(Snapshot snapshot)
{
	if (snapshot == NULL)
		return;

	UnregisterSnapshotFromOwner(snapshot, CurrentResourceOwner);
}

/*
 * UnregisterSnapshotFromOwner
 *		As above, but use the specified resource owner
 */
void
UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner)
{
	if (snapshot == NULL)
		return;

	ResourceOwnerForgetSnapshot(owner, snapshot);
	UnregisterSnapshotNoOwner(snapshot);
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
UnregisterSnapshotNoOwner(Snapshot snapshot)
|
|
|
|
{
|
2008-11-25 21:28:29 +01:00
|
|
|
Assert(snapshot->regd_count > 0);
|
2015-01-17 00:14:32 +01:00
|
|
|
Assert(!pairingheap_is_empty(&RegisteredSnapshots));
|
2008-05-12 22:02:02 +02:00
|
|
|
|
2015-01-17 00:14:32 +01:00
|
|
|
snapshot->regd_count--;
|
|
|
|
if (snapshot->regd_count == 0)
|
|
|
|
pairingheap_remove(&RegisteredSnapshots, &snapshot->ph_node);
|
|
|
|
|
|
|
|
if (snapshot->regd_count == 0 && snapshot->active_count == 0)
|
2008-11-25 21:28:29 +01:00
|
|
|
{
|
|
|
|
FreeSnapshot(snapshot);
|
|
|
|
SnapshotResetXmin();
|
2008-05-12 22:02:02 +02:00
|
|
|
}
|
|
|
|
}
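A minimal sketch of the invariant the function above maintains: a snapshot may be freed only once it is neither registered nor on the active stack, i.e. both reference counts have reached zero. The helper name is invented for illustration.

```c
#include <assert.h>

/* Invented helper modeling UnregisterSnapshotNoOwner's free condition */
static int
snap_can_free(int regd_count, int active_count)
{
	return regd_count == 0 && active_count == 0;
}
```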

/*
 * Comparison function for RegisteredSnapshots heap.  Snapshots are ordered
 * by xmin, so that the snapshot with smallest xmin is at the top.
 */
static int
xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
	const SnapshotData *asnap = pairingheap_const_container(SnapshotData, ph_node, a);
	const SnapshotData *bsnap = pairingheap_const_container(SnapshotData, ph_node, b);

	if (TransactionIdPrecedes(asnap->xmin, bsnap->xmin))
		return 1;
	else if (TransactionIdFollows(asnap->xmin, bsnap->xmin))
		return -1;
	else
		return 0;
}
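The sign convention of xmin_cmp can be checked in isolation. This sketch substitutes plain unsigned integers for TransactionIds (so XID wraparound, which TransactionIdPrecedes handles, is ignored here): the snapshot whose xmin is smaller compares as "greater", which is what keeps it at the top of the pairing heap.

```c
#include <assert.h>

/* Simplified xmin_cmp: no wraparound-aware comparison (an assumption) */
static int
xmin_cmp_demo(unsigned a_xmin, unsigned b_xmin)
{
	if (a_xmin < b_xmin)
		return 1;				/* a is older: sort toward heap top */
	else if (a_xmin > b_xmin)
		return -1;
	else
		return 0;
}
```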

/*
 * SnapshotResetXmin
 *
 * If there are no more snapshots, we can reset our PGPROC->xmin to
 * InvalidTransactionId. Note we can do this without locking because we assume
 * that storing an Xid is atomic.
 *
 * Even if there are some remaining snapshots, we may be able to advance our
 * PGPROC->xmin to some degree.  This typically happens when a portal is
 * dropped.  For efficiency, we only consider recomputing PGPROC->xmin when
 * the active snapshot stack is empty; this allows us not to need to track
 * which active snapshot is oldest.
 *
 * Note: it's tempting to use GetOldestSnapshot() here so that we can include
 * active snapshots in the calculation.  However, that compares by LSN not
 * xmin so it's not entirely clear that it's the same thing.  Also, we'd be
 * critically dependent on the assumption that the bottommost active snapshot
 * stack entry has the oldest xmin.  (Current uses of GetOldestSnapshot() are
 * not actually critical, but this would be.)
 */
static void
SnapshotResetXmin(void)
{
	Snapshot	minSnapshot;

	if (ActiveSnapshot != NULL)
		return;

	if (pairingheap_is_empty(&RegisteredSnapshots))
	{
		MyProc->xmin = InvalidTransactionId;
		return;
	}

	minSnapshot = pairingheap_container(SnapshotData, ph_node,
										pairingheap_first(&RegisteredSnapshots));

	if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin))
		MyProc->xmin = minSnapshot->xmin;
}
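The effect of SnapshotResetXmin can be modeled with a small helper (names invented; plain unsigned comparison stands in for TransactionIdPrecedes, ignoring wraparound): with no registered snapshots, xmin becomes invalid (0 here), and otherwise it may only move forward, to the minimum registered xmin.

```c
#include <assert.h>

/* Toy model of the xmin-advance rule; 0 plays InvalidTransactionId */
static unsigned
advance_xmin(unsigned cur_xmin, const unsigned *registered, int n)
{
	unsigned	min_xmin;
	int			i;

	if (n == 0)
		return 0;				/* no snapshots left: clear xmin */

	min_xmin = registered[0];
	for (i = 1; i < n; i++)
		if (registered[i] < min_xmin)
			min_xmin = registered[i];

	/* advance only; never move xmin backward */
	return (cur_xmin < min_xmin) ? min_xmin : cur_xmin;
}
```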

/*
 * AtSubCommit_Snapshot
 */
void
AtSubCommit_Snapshot(int level)
{
	ActiveSnapshotElt *active;

	/*
	 * Relabel the active snapshots set in this subtransaction as though they
	 * are owned by the parent subxact.
	 */
	for (active = ActiveSnapshot; active != NULL; active = active->as_next)
	{
		if (active->as_level < level)
			break;
		active->as_level = level - 1;
	}
}

/*
 * AtSubAbort_Snapshot
 *		Clean up snapshots after a subtransaction abort
 */
void
AtSubAbort_Snapshot(int level)
{
	/* Forget the active snapshots set by this subtransaction */
	while (ActiveSnapshot && ActiveSnapshot->as_level >= level)
	{
		ActiveSnapshotElt *next;

		next = ActiveSnapshot->as_next;

		/*
		 * Decrement the snapshot's active count.  If it's still registered
		 * or marked as active by an outer subtransaction, we can't free it
		 * yet.
		 */
		Assert(ActiveSnapshot->as_snap->active_count >= 1);
		ActiveSnapshot->as_snap->active_count -= 1;

		if (ActiveSnapshot->as_snap->active_count == 0 &&
			ActiveSnapshot->as_snap->regd_count == 0)
			FreeSnapshot(ActiveSnapshot->as_snap);

		/* and free the stack element */
		pfree(ActiveSnapshot);

		ActiveSnapshot = next;
		if (ActiveSnapshot == NULL)
			OldestActiveSnapshot = NULL;
	}

	SnapshotResetXmin();
}

/*
 * AtEOXact_Snapshot
 *		Snapshot manager's cleanup function for end of transaction
 */
void
AtEOXact_Snapshot(bool isCommit, bool resetXmin)
{
	/*
	 * In transaction-snapshot mode we must release our privately-managed
	 * reference to the transaction snapshot.  We must remove it from
	 * RegisteredSnapshots to keep the check below happy.  But we don't bother
	 * to do FreeSnapshot, for two reasons: the memory will go away with
	 * TopTransactionContext anyway, and if someone has left the snapshot
	 * stacked as active, we don't want the code below to be chasing through a
	 * dangling pointer.
	 */
	if (FirstXactSnapshot != NULL)
	{
		Assert(FirstXactSnapshot->regd_count > 0);
		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
		pairingheap_remove(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
	}
	FirstXactSnapshot = NULL;

	/*
	 * If we exported any snapshots, clean them up.
	 */
	if (exportedSnapshots != NIL)
	{
		ListCell   *lc;

		/*
		 * Get rid of the files.  Unlink failure is only a WARNING because
		 * (1) it's too late to abort the transaction, and (2) leaving a
		 * leaked file around has little real consequence anyway.
		 *
		 * We also need to remove the snapshots from RegisteredSnapshots to
		 * prevent a warning below.
		 *
		 * As with the FirstXactSnapshot, we don't need to free resources of
		 * the snapshot itself as it will go away with the memory context.
		 */
		foreach(lc, exportedSnapshots)
		{
			ExportedSnapshot *esnap = (ExportedSnapshot *) lfirst(lc);

			if (unlink(esnap->snapfile))
				elog(WARNING, "could not unlink file \"%s\": %m",
					 esnap->snapfile);

			pairingheap_remove(&RegisteredSnapshots,
							   &esnap->snapshot->ph_node);
		}

		exportedSnapshots = NIL;
	}

	/* Drop catalog snapshot if any */
	InvalidateCatalogSnapshot();

	/* On commit, complain about leftover snapshots */
	if (isCommit)
	{
		ActiveSnapshotElt *active;

		if (!pairingheap_is_empty(&RegisteredSnapshots))
			elog(WARNING, "registered snapshots seem to remain after cleanup");

		/* complain about unpopped active snapshots */
		for (active = ActiveSnapshot; active != NULL; active = active->as_next)
			elog(WARNING, "snapshot %p still active", active);
	}

	/*
	 * And reset our state.  We don't need to free the memory explicitly --
	 * it'll go away with TopTransactionContext.
	 */
	ActiveSnapshot = NULL;
	OldestActiveSnapshot = NULL;
	pairingheap_reset(&RegisteredSnapshots);

	CurrentSnapshot = NULL;
	SecondarySnapshot = NULL;

	FirstSnapshotSet = false;

	/*
	 * During normal commit processing, we call ProcArrayEndTransaction() to
	 * reset the MyProc->xmin.  That call happens prior to the call to
	 * AtEOXact_Snapshot(), so we need not touch xmin here at all.
	 */
	if (resetXmin)
		SnapshotResetXmin();

	Assert(resetXmin || MyProc->xmin == 0);
}

/*
 * ExportSnapshot
 *		Export the snapshot to a file so that other backends can import it.
 *		Returns the token (the file name) that can be used to import this
 *		snapshot.
 */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the affected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Geoghegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
char *
|
2011-10-23 00:22:45 +02:00
|
|
|
ExportSnapshot(Snapshot snapshot)
|
|
|
|
{
|
|
|
|
TransactionId topXid;
|
|
|
|
TransactionId *children;
|
2017-06-14 20:57:21 +02:00
|
|
|
ExportedSnapshot *esnap;
|
2011-10-23 00:22:45 +02:00
|
|
|
int nchildren;
|
|
|
|
int addTopXid;
|
|
|
|
StringInfoData buf;
|
|
|
|
FILE *f;
|
|
|
|
int i;
|
|
|
|
MemoryContext oldcxt;
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
char pathtmp[MAXPGPATH];
|
|
|
|
|
|
|
|
	/*
	 * It's tempting to call RequireTransactionBlock here, since it's not very
	 * useful to export a snapshot that will disappear immediately afterwards.
	 * However, we haven't got enough information to do that, since we don't
	 * know if we're at top level or not.  For example, we could be inside a
	 * plpgsql function that is going to fire off other transactions via
	 * dblink.  Rather than disallow perfectly legitimate usages, don't make a
	 * check.
	 *
	 * Also note that we don't make any restriction on the transaction's
	 * isolation level; however, importers must check the level if they are
	 * serializable.
	 */

	/*
	 * Get our transaction ID if there is one, to include in the snapshot.
	 */
	topXid = GetTopTransactionIdIfAny();

	/*
	 * We cannot export a snapshot from a subtransaction because there's no
	 * easy way for importers to verify that the same subtransaction is still
	 * running.
	 */
	if (IsSubTransaction())
		ereport(ERROR,
				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
				 errmsg("cannot export a snapshot from a subtransaction")));

	/*
	 * We do however allow previous committed subtransactions to exist.
	 * Importers of the snapshot must see them as still running, so get their
	 * XIDs to add them to the snapshot.
	 */
	nchildren = xactGetCommittedChildren(&children);

	/*
	 * Generate file path for the snapshot.  We start numbering of snapshots
	 * inside the transaction from 1.
	 */
	snprintf(path, sizeof(path), SNAPSHOT_EXPORT_DIR "/%08X-%08X-%d",
			 MyProc->backendId, MyProc->lxid, list_length(exportedSnapshots) + 1);

	/*
	 * Copy the snapshot into TopTransactionContext, add it to the
	 * exportedSnapshots list, and mark it pseudo-registered.  We do this to
	 * ensure that the snapshot's xmin is honored for the rest of the
	 * transaction.
	 */
	snapshot = CopySnapshot(snapshot);

	oldcxt = MemoryContextSwitchTo(TopTransactionContext);
	esnap = (ExportedSnapshot *) palloc(sizeof(ExportedSnapshot));
	esnap->snapfile = pstrdup(path);
	esnap->snapshot = snapshot;
	exportedSnapshots = lappend(exportedSnapshots, esnap);
	MemoryContextSwitchTo(oldcxt);

	snapshot->regd_count++;
	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);

	/*
	 * Fill buf with a text serialization of the snapshot, plus identification
	 * data about this transaction.  The format expected by ImportSnapshot is
	 * pretty rigid: each line must be fieldname:value.
	 */
	initStringInfo(&buf);

	appendStringInfo(&buf, "vxid:%d/%u\n", MyProc->backendId, MyProc->lxid);
	appendStringInfo(&buf, "pid:%d\n", MyProcPid);
	appendStringInfo(&buf, "dbid:%u\n", MyDatabaseId);
	appendStringInfo(&buf, "iso:%d\n", XactIsoLevel);
	appendStringInfo(&buf, "ro:%d\n", XactReadOnly);

	appendStringInfo(&buf, "xmin:%u\n", snapshot->xmin);
	appendStringInfo(&buf, "xmax:%u\n", snapshot->xmax);

	/*
	 * We must include our own top transaction ID in the top-xid data, since
	 * by definition we will still be running when the importing transaction
	 * adopts the snapshot, but GetSnapshotData never includes our own XID in
	 * the snapshot.  (There must, therefore, be enough room to add it.)
	 *
	 * However, it could be that our topXid is after the xmax, in which case
	 * we shouldn't include it because xip[] members are expected to be before
	 * xmax.  (We need not make the same check for subxip[] members, see
	 * snapshot.h.)
	 */
	addTopXid = (TransactionIdIsValid(topXid) &&
				 TransactionIdPrecedes(topXid, snapshot->xmax)) ? 1 : 0;
	appendStringInfo(&buf, "xcnt:%d\n", snapshot->xcnt + addTopXid);
	for (i = 0; i < snapshot->xcnt; i++)
		appendStringInfo(&buf, "xip:%u\n", snapshot->xip[i]);
	if (addTopXid)
		appendStringInfo(&buf, "xip:%u\n", topXid);

	/*
	 * Similarly, we add our subcommitted child XIDs to the subxid data.
	 * Here, we have to cope with possible overflow.
	 */
	if (snapshot->suboverflowed ||
		snapshot->subxcnt + nchildren > GetMaxSnapshotSubxidCount())
		appendStringInfoString(&buf, "sof:1\n");
	else
	{
		appendStringInfoString(&buf, "sof:0\n");
		appendStringInfo(&buf, "sxcnt:%d\n", snapshot->subxcnt + nchildren);
		for (i = 0; i < snapshot->subxcnt; i++)
			appendStringInfo(&buf, "sxp:%u\n", snapshot->subxip[i]);
		for (i = 0; i < nchildren; i++)
			appendStringInfo(&buf, "sxp:%u\n", children[i]);
	}
	appendStringInfo(&buf, "rec:%u\n", snapshot->takenDuringRecovery);

	/*
	 * Now write the text representation into a file.  We first write to a
	 * ".tmp" filename, and rename to final filename if no error.  This
	 * ensures that no other backend can read an incomplete file
	 * (ImportSnapshot won't allow it because of its valid-characters check).
	 */
	snprintf(pathtmp, sizeof(pathtmp), "%s.tmp", path);
	if (!(f = AllocateFile(pathtmp, PG_BINARY_W)))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", pathtmp)));

	if (fwrite(buf.data, buf.len, 1, f) != 1)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", pathtmp)));

	/* no fsync() since file need not survive a system crash */

	if (FreeFile(f))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not write to file \"%s\": %m", pathtmp)));

	/*
	 * Now that we have written everything into a .tmp file, rename the file
	 * to remove the .tmp suffix.
	 */
	if (rename(pathtmp, path) < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not rename file \"%s\" to \"%s\": %m",
						pathtmp, path)));

	/*
	 * The basename of the file is what we return from pg_export_snapshot().
	 * It's already in path in a textual format and we know that the path
	 * starts with SNAPSHOT_EXPORT_DIR.  Skip over the prefix and the slash
	 * and pstrdup it so as not to return the address of a local variable.
	 */
	return pstrdup(path + strlen(SNAPSHOT_EXPORT_DIR) + 1);
}
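The fieldname:value format that ExportSnapshot writes (and ImportSnapshot later parses) can be exercised with a small round trip. This sketch serializes just the xmin and xmax fields the same way the code above formats them, then scans them back; the helper name is invented.

```c
#include <assert.h>
#include <stdio.h>

/* Round-trip two of the snapshot fields through the text format */
static int
roundtrip_xmin_xmax(unsigned xmin_in, unsigned xmax_in,
					unsigned *xmin_out, unsigned *xmax_out)
{
	char		buf[64];

	snprintf(buf, sizeof(buf), "xmin:%u\nxmax:%u\n", xmin_in, xmax_in);
	return sscanf(buf, "xmin:%u\nxmax:%u", xmin_out, xmax_out) == 2;
}
```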

/*
 * pg_export_snapshot
 *		SQL-callable wrapper for ExportSnapshot.
 */
Datum
pg_export_snapshot(PG_FUNCTION_ARGS)
{
	char	   *snapshotName;

	snapshotName = ExportSnapshot(GetActiveSnapshot());
	PG_RETURN_TEXT_P(cstring_to_text(snapshotName));
}

/*
 * Parsing subroutines for ImportSnapshot: parse a line with the given
 * prefix followed by a value, and advance *s to the next line.  The
 * filename is provided for use in error messages.
 */
static int
parseIntFromText(const char *prefix, char **s, const char *filename)
{
	char	   *ptr = *s;
	int			prefixlen = strlen(prefix);
	int			val;

	if (strncmp(ptr, prefix, prefixlen) != 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr += prefixlen;
	if (sscanf(ptr, "%d", &val) != 1)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr = strchr(ptr, '\n');
	if (!ptr)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	*s = ptr + 1;
	return val;
}
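The parsing rule this function enforces — exact prefix match, an integer value, and a terminating newline — can be run standalone if ereport() is replaced by a plain error return (an adaptation made only for this sketch):

```c
#include <stdio.h>
#include <string.h>

/* parseIntFromText with ereport() swapped for a -1 return (sketch only) */
static int
parse_int_line(const char *prefix, const char **s, int *val)
{
	const char *ptr = *s;
	size_t		prefixlen = strlen(prefix);

	if (strncmp(ptr, prefix, prefixlen) != 0)
		return -1;
	ptr += prefixlen;
	if (sscanf(ptr, "%d", val) != 1)
		return -1;
	ptr = strchr(ptr, '\n');
	if (ptr == NULL)
		return -1;
	*s = ptr + 1;				/* advance caller's cursor past the line */
	return 0;
}
```

Successive calls walk the cursor through the file line by line, exactly as ImportSnapshot does with the real subroutines.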

static TransactionId
parseXidFromText(const char *prefix, char **s, const char *filename)
{
	char	   *ptr = *s;
	int			prefixlen = strlen(prefix);
	TransactionId val;

	if (strncmp(ptr, prefix, prefixlen) != 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr += prefixlen;
	if (sscanf(ptr, "%u", &val) != 1)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr = strchr(ptr, '\n');
	if (!ptr)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	*s = ptr + 1;
	return val;
}

static void
parseVxidFromText(const char *prefix, char **s, const char *filename,
				  VirtualTransactionId *vxid)
{
	char	   *ptr = *s;
	int			prefixlen = strlen(prefix);

	if (strncmp(ptr, prefix, prefixlen) != 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr += prefixlen;
	if (sscanf(ptr, "%d/%u", &vxid->backendId, &vxid->localTransactionId) != 2)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	ptr = strchr(ptr, '\n');
	if (!ptr)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", filename)));
	*s = ptr + 1;
}

/*
 * ImportSnapshot
 *		Import a previously exported snapshot.  The argument should be a
 *		filename in SNAPSHOT_EXPORT_DIR.  Load the snapshot from that file.
 *		This is called by "SET TRANSACTION SNAPSHOT 'foo'".
 */
void
ImportSnapshot(const char *idstr)
{
	char		path[MAXPGPATH];
	FILE	   *f;
	struct stat stat_buf;
	char	   *filebuf;
	int			xcnt;
	int			i;
	VirtualTransactionId src_vxid;
	int			src_pid;
	Oid			src_dbid;
	int			src_isolevel;
	bool		src_readonly;
	SnapshotData snapshot;

	/*
	 * Must be at top level of a fresh transaction.  Note in particular that
	 * we check we haven't acquired an XID --- if we have, it's conceivable
	 * that the snapshot would show it as not running, making for very screwy
	 * behavior.
	 */
	if (FirstSnapshotSet ||
		GetTopTransactionIdIfAny() != InvalidTransactionId ||
		IsSubTransaction())
		ereport(ERROR,
				(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
				 errmsg("SET TRANSACTION SNAPSHOT must be called before any query")));

	/*
	 * If we are in read committed mode then the next query would execute with
	 * a new snapshot thus making this function call quite useless.
	 */
	if (!IsolationUsesXactSnapshot())
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("a snapshot-importing transaction must have isolation level SERIALIZABLE or REPEATABLE READ")));

	/*
	 * Verify the identifier: only 0-9, A-F and hyphens are allowed.  We do
	 * this mainly to prevent reading arbitrary files.
	 */
	if (strspn(idstr, "0123456789ABCDEF-") != strlen(idstr))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("invalid snapshot identifier: \"%s\"", idstr)));

	/* OK, read the file */
	snprintf(path, MAXPGPATH, SNAPSHOT_EXPORT_DIR "/%s", idstr);

	f = AllocateFile(path, PG_BINARY_R);
	if (!f)
	{
		/*
		 * If the file is missing while the identifier has a correct format,
		 * avoid raising a low-level system error.
		 */
		if (errno == ENOENT)
			ereport(ERROR,
					(errcode(ERRCODE_UNDEFINED_OBJECT),
					 errmsg("snapshot \"%s\" does not exist", idstr)));
		else
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not open file \"%s\" for reading: %m",
							path)));
	}

	/* get the size of the file so that we know how much memory we need */
	if (fstat(fileno(f), &stat_buf))
		elog(ERROR, "could not stat file \"%s\": %m", path);

	/* and read the file into a palloc'd string */
	filebuf = (char *) palloc(stat_buf.st_size + 1);
	if (fread(filebuf, stat_buf.st_size, 1, f) != 1)
		elog(ERROR, "could not read file \"%s\": %m", path);

	filebuf[stat_buf.st_size] = '\0';

	FreeFile(f);

	/*
	 * Construct a snapshot struct by parsing the file content.
	 */
	memset(&snapshot, 0, sizeof(snapshot));

	parseVxidFromText("vxid:", &filebuf, path, &src_vxid);
	src_pid = parseIntFromText("pid:", &filebuf, path);
	/* we abuse parseXidFromText a bit here ... */
	src_dbid = parseXidFromText("dbid:", &filebuf, path);
	src_isolevel = parseIntFromText("iso:", &filebuf, path);
	src_readonly = parseIntFromText("ro:", &filebuf, path);

	snapshot.snapshot_type = SNAPSHOT_MVCC;

	snapshot.xmin = parseXidFromText("xmin:", &filebuf, path);
	snapshot.xmax = parseXidFromText("xmax:", &filebuf, path);

	snapshot.xcnt = xcnt = parseIntFromText("xcnt:", &filebuf, path);

	/* sanity-check the xid count before palloc */
	if (xcnt < 0 || xcnt > GetMaxSnapshotXidCount())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid snapshot data in file \"%s\"", path)));

	snapshot.xip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
	for (i = 0; i < xcnt; i++)
		snapshot.xip[i] = parseXidFromText("xip:", &filebuf, path);

	snapshot.suboverflowed = parseIntFromText("sof:", &filebuf, path);

	if (!snapshot.suboverflowed)
	{
		snapshot.subxcnt = xcnt = parseIntFromText("sxcnt:", &filebuf, path);

		/* sanity-check the xid count before palloc */
		if (xcnt < 0 || xcnt > GetMaxSnapshotSubxidCount())
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
					 errmsg("invalid snapshot data in file \"%s\"", path)));

		snapshot.subxip = (TransactionId *) palloc(xcnt * sizeof(TransactionId));
		for (i = 0; i < xcnt; i++)
			snapshot.subxip[i] = parseXidFromText("sxp:", &filebuf, path);
	}
	else
	{
		snapshot.subxcnt = 0;
		snapshot.subxip = NULL;
	}

	snapshot.takenDuringRecovery = parseIntFromText("rec:", &filebuf, path);

	/*
	 * Do some additional sanity checking, just to protect ourselves.  We
	 * don't trouble to check the array elements, just the most critical
	 * fields.
	 */
	if (!VirtualTransactionIdIsValid(src_vxid) ||
		!OidIsValid(src_dbid) ||
|
|
|
!TransactionIdIsNormal(snapshot.xmin) ||
|
|
|
|
!TransactionIdIsNormal(snapshot.xmax))
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
|
|
|
|
errmsg("invalid snapshot data in file \"%s\"", path)));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're serializable, the source transaction must be too, otherwise
|
|
|
|
* predicate.c has problems (SxactGlobalXmin could go backwards). Also, a
|
|
|
|
* non-read-only transaction can't adopt a snapshot from a read-only
|
|
|
|
* transaction, as predicate.c handles the cases very differently.
|
|
|
|
*/
|
|
|
|
if (IsolationIsSerializable())
|
|
|
|
{
|
|
|
|
if (src_isolevel != XACT_SERIALIZABLE)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("a serializable transaction cannot import a snapshot from a non-serializable transaction")));
|
|
|
|
if (src_readonly && !XactReadOnly)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("a non-read-only serializable transaction cannot import a snapshot from a read-only transaction")));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot import a snapshot that was taken in a different database,
|
|
|
|
* because vacuum calculates OldestXmin on a per-database basis; so the
|
|
|
|
* source transaction's xmin doesn't protect us from data loss. This
|
|
|
|
* restriction could be removed if the source transaction were to mark its
|
|
|
|
* xmin as being globally applicable. But that would require some
|
|
|
|
* additional syntax, since that has to be known when the snapshot is
|
|
|
|
* initially taken. (See pgsql-hackers discussion of 2011-10-21.)
|
|
|
|
*/
|
|
|
|
if (src_dbid != MyDatabaseId)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
|
|
|
|
errmsg("cannot import a snapshot from a different database")));
|
|
|
|
|
|
|
|
/* OK, install the snapshot */
|
2017-06-14 20:57:21 +02:00
|
|
|
SetTransactionSnapshot(&snapshot, &src_vxid, src_pid, NULL);
|
2011-10-23 00:22:45 +02:00
|
|
|
}

/*
 * XactHasExportedSnapshots
 *		Test whether current transaction has exported any snapshots.
 */
bool
XactHasExportedSnapshots(void)
{
	return (exportedSnapshots != NIL);
}

/*
 * DeleteAllExportedSnapshotFiles
 *		Clean up any files that have been left behind by a crashed backend
 *		that had exported snapshots before it died.
 *
 * This should be called during database startup or crash recovery.
 */
void
DeleteAllExportedSnapshotFiles(void)
{
	char		buf[MAXPGPATH + sizeof(SNAPSHOT_EXPORT_DIR)];
	DIR		   *s_dir;
	struct dirent *s_de;

	/*
	 * Problems in reading the directory, or unlinking files, are reported at
	 * LOG level.  Since we're running in the startup process, ERROR level
	 * would prevent database start, and it's not important enough for that.
	 */
	s_dir = AllocateDir(SNAPSHOT_EXPORT_DIR);

	while ((s_de = ReadDirExtended(s_dir, SNAPSHOT_EXPORT_DIR, LOG)) != NULL)
	{
		if (strcmp(s_de->d_name, ".") == 0 ||
			strcmp(s_de->d_name, "..") == 0)
			continue;

		snprintf(buf, sizeof(buf), SNAPSHOT_EXPORT_DIR "/%s", s_de->d_name);

		if (unlink(buf) != 0)
			ereport(LOG,
					(errcode_for_file_access(),
					 errmsg("could not remove file \"%s\": %m", buf)));
	}

	FreeDir(s_dir);
}

/*
 * ThereAreNoPriorRegisteredSnapshots
 *		Is the registered snapshot count less than or equal to one?
 *
 * Don't use this to settle important decisions.  While zero registrations and
 * no ActiveSnapshot would confirm a certain idleness, the system makes no
 * guarantees about the significance of one registered snapshot.
 */
bool
ThereAreNoPriorRegisteredSnapshots(void)
{
	if (pairingheap_is_empty(&RegisteredSnapshots) ||
		pairingheap_is_singular(&RegisteredSnapshots))
		return true;

	return false;
}

/*
 * HaveRegisteredOrActiveSnapshot
 *		Is there any registered or active snapshot?
 *
 * NB: Unless pushed or active, the cached catalog snapshot will not cause
 * this function to return true.  That allows this function to be used in
 * checks enforcing a longer-lived snapshot.
 */
bool
HaveRegisteredOrActiveSnapshot(void)
{
	if (ActiveSnapshot != NULL)
		return true;

	/*
	 * The catalog snapshot is in RegisteredSnapshots when valid, but can be
	 * removed at any time due to invalidation processing.  If a snapshot has
	 * been explicitly registered, more than one entry has to be in
	 * RegisteredSnapshots.
	 */
	if (CatalogSnapshot != NULL &&
		pairingheap_is_singular(&RegisteredSnapshots))
		return false;

	return !pairingheap_is_empty(&RegisteredSnapshots);
}

/*
 * Set up a snapshot that replaces normal catalog snapshots, allowing catalog
 * access to behave just as it did at a certain point in the past.
 *
 * Needed for logical decoding.
 */
void
SetupHistoricSnapshot(Snapshot historic_snapshot, HTAB *tuplecids)
{
	Assert(historic_snapshot != NULL);

	/* setup the timetravel snapshot */
	HistoricSnapshot = historic_snapshot;

	/* setup (cmin, cmax) lookup hash */
	tuplecid_data = tuplecids;
}

/*
 * Make catalog snapshots behave normally again.
 */
void
TeardownHistoricSnapshot(bool is_error)
{
	HistoricSnapshot = NULL;
	tuplecid_data = NULL;
}

bool
HistoricSnapshotActive(void)
{
	return HistoricSnapshot != NULL;
}

HTAB *
HistoricSnapshotGetTupleCids(void)
{
	Assert(HistoricSnapshotActive());
	return tuplecid_data;
}

/*
 * EstimateSnapshotSpace
 *		Returns the size needed to store the given snapshot.
 *
 * We are exporting only required fields from the Snapshot, stored in
 * SerializedSnapshotData.
 */
Size
EstimateSnapshotSpace(Snapshot snapshot)
{
	Size		size;

	Assert(snapshot != InvalidSnapshot);
	Assert(snapshot->snapshot_type == SNAPSHOT_MVCC);

	/* We allocate any XID arrays needed in the same palloc block. */
	size = add_size(sizeof(SerializedSnapshotData),
					mul_size(snapshot->xcnt, sizeof(TransactionId)));
	if (snapshot->subxcnt > 0 &&
		(!snapshot->suboverflowed || snapshot->takenDuringRecovery))
		size = add_size(size,
						mul_size(snapshot->subxcnt, sizeof(TransactionId)));

	return size;
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* SerializeSnapshot
|
|
|
|
* Dumps the serialized snapshot (extracted from given snapshot) onto the
|
|
|
|
* memory location at start_address.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
SerializeSnapshot(Snapshot snapshot, char *start_address)
|
|
|
|
{
|
2017-03-02 06:03:27 +01:00
|
|
|
SerializedSnapshotData serialized_snapshot;
|
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 21:02:14 +02:00
|
|
|
|
|
|
|
Assert(snapshot->subxcnt >= 0);
|
|
|
|
|
|
|
|
/* Copy all required fields */
|
2017-03-02 06:03:27 +01:00
|
|
|
serialized_snapshot.xmin = snapshot->xmin;
|
|
|
|
serialized_snapshot.xmax = snapshot->xmax;
|
|
|
|
serialized_snapshot.xcnt = snapshot->xcnt;
|
|
|
|
serialized_snapshot.subxcnt = snapshot->subxcnt;
|
|
|
|
serialized_snapshot.suboverflowed = snapshot->suboverflowed;
|
|
|
|
serialized_snapshot.takenDuringRecovery = snapshot->takenDuringRecovery;
|
|
|
|
serialized_snapshot.curcid = snapshot->curcid;
|
|
|
|
serialized_snapshot.whenTaken = snapshot->whenTaken;
|
|
|
|
serialized_snapshot.lsn = snapshot->lsn;
|
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 21:02:14 +02:00
|
|
|
|
2017-06-24 14:03:55 +02:00
|
|
|
/*
|
|
|
|
* Ignore the SubXID array if it has overflowed, unless the snapshot was
|
2017-06-24 14:51:26 +02:00
|
|
|
* taken during recovery - in that case, top-level XIDs are in subxip as
|
2017-06-24 14:03:55 +02:00
|
|
|
* well, and we mustn't lose them.
|
|
|
|
*/
|
|
|
|
if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery)
|
|
|
|
serialized_snapshot.subxcnt = 0;
|
|
|
|
|
2017-03-02 06:03:27 +01:00
|
|
|
/* Copy struct to possibly-unaligned buffer */
|
|
|
|
memcpy(start_address,
|
|
|
|
&serialized_snapshot, sizeof(SerializedSnapshotData));
|
Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
	/* Copy XID array */
	if (snapshot->xcnt > 0)
		memcpy((TransactionId *) (start_address +
								  sizeof(SerializedSnapshotData)),
			   snapshot->xip, snapshot->xcnt * sizeof(TransactionId));

	/*
	 * Copy SubXID array. Don't bother to copy it if it had overflowed,
	 * though, because it's not used anywhere in that case. Except if it's a
	 * snapshot taken during recovery; all the top-level XIDs are in subxip as
	 * well in that case, so we mustn't lose them.
	 */
	if (serialized_snapshot.subxcnt > 0)
	{
		Size		subxipoff = sizeof(SerializedSnapshotData) +
		snapshot->xcnt * sizeof(TransactionId);

		memcpy((TransactionId *) (start_address + subxipoff),
			   snapshot->subxip, snapshot->subxcnt * sizeof(TransactionId));
	}
}
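The flat layout written above (fixed-size header, then the xip array, then the subxip array, back to back) can be sketched in standalone C. The `ToyXid`, `ToyHeader`, and `toy_serialize` names below are hypothetical stand-ins for illustration, not the real backend definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins, just enough to model the flat serialized layout. */
typedef uint32_t ToyXid;

typedef struct ToyHeader
{
	ToyXid		xmin;
	ToyXid		xmax;
	uint32_t	xcnt;
	int32_t		subxcnt;
} ToyHeader;

/* Write header, then xip[], then subxip[], contiguously; return total bytes. */
static size_t
toy_serialize(char *start_address, const ToyHeader *hdr,
			  const ToyXid *xip, const ToyXid *subxip)
{
	size_t		subxipoff = sizeof(ToyHeader) + hdr->xcnt * sizeof(ToyXid);

	memcpy(start_address, hdr, sizeof(ToyHeader));
	if (hdr->xcnt > 0)
		memcpy(start_address + sizeof(ToyHeader), xip,
			   hdr->xcnt * sizeof(ToyXid));
	if (hdr->subxcnt > 0)
		memcpy(start_address + subxipoff, subxip,
			   (size_t) hdr->subxcnt * sizeof(ToyXid));
	return subxipoff + (size_t) hdr->subxcnt * sizeof(ToyXid);
}
```

Because the offsets are fully determined by `xcnt` and `subxcnt` in the header, a reader that knows only the start address can recover both arrays, which is exactly what RestoreSnapshot relies on.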

/*
 * RestoreSnapshot
 *		Restore a serialized snapshot from the specified address.
 *
 * The copy is palloc'd in TopTransactionContext and has initial refcounts set
 * to 0.  The returned snapshot has the copied flag set.
 */
Snapshot
RestoreSnapshot(char *start_address)
{
	SerializedSnapshotData serialized_snapshot;
	Size		size;
	Snapshot	snapshot;
	TransactionId *serialized_xids;

	memcpy(&serialized_snapshot, start_address,
		   sizeof(SerializedSnapshotData));
	serialized_xids = (TransactionId *)
		(start_address + sizeof(SerializedSnapshotData));

	/* We allocate any XID arrays needed in the same palloc block. */
	size = sizeof(SnapshotData)
		+ serialized_snapshot.xcnt * sizeof(TransactionId)
		+ serialized_snapshot.subxcnt * sizeof(TransactionId);

	/* Copy all required fields */
	snapshot = (Snapshot) MemoryContextAlloc(TopTransactionContext, size);
	snapshot->snapshot_type = SNAPSHOT_MVCC;
	snapshot->xmin = serialized_snapshot.xmin;
	snapshot->xmax = serialized_snapshot.xmax;
	snapshot->xip = NULL;
	snapshot->xcnt = serialized_snapshot.xcnt;
	snapshot->subxip = NULL;
	snapshot->subxcnt = serialized_snapshot.subxcnt;
	snapshot->suboverflowed = serialized_snapshot.suboverflowed;
	snapshot->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
	snapshot->curcid = serialized_snapshot.curcid;
	snapshot->whenTaken = serialized_snapshot.whenTaken;
	snapshot->lsn = serialized_snapshot.lsn;
	snapshot->snapXactCompletionCount = 0;

	/* Copy XIDs, if present. */
	if (serialized_snapshot.xcnt > 0)
	{
		snapshot->xip = (TransactionId *) (snapshot + 1);
		memcpy(snapshot->xip, serialized_xids,
			   serialized_snapshot.xcnt * sizeof(TransactionId));
	}

	/* Copy SubXIDs, if present. */
	if (serialized_snapshot.subxcnt > 0)
	{
		snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
			serialized_snapshot.xcnt;
		memcpy(snapshot->subxip, serialized_xids + serialized_snapshot.xcnt,
			   serialized_snapshot.subxcnt * sizeof(TransactionId));
	}

	/* Set the copied flag so that the caller will set refcounts correctly. */
	snapshot->regd_count = 0;
	snapshot->active_count = 0;
	snapshot->copied = true;

	return snapshot;
}
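The single-allocation trick used above, where `(snapshot + 1)` points at the first byte past the struct so the XID arrays live in the same palloc chunk, can be sketched with plain malloc. `ToySnap` and `toy_restore` are hypothetical miniatures for illustration, not the real `SnapshotData`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t ToyXid;

/* Miniature snapshot: fixed fields plus a pointer aimed into the same
 * allocation, mirroring how RestoreSnapshot sets snapshot->xip. */
typedef struct ToySnap
{
	ToyXid		xmin;
	ToyXid		xmax;
	uint32_t	xcnt;
	ToyXid	   *xip;
} ToySnap;

/* One malloc covers the struct and its trailing XID array; (snap + 1)
 * is the first byte past the struct, so the array lands right there. */
static ToySnap *
toy_restore(const ToyXid *xids, uint32_t xcnt)
{
	ToySnap    *snap = malloc(sizeof(ToySnap) + xcnt * sizeof(ToyXid));

	snap->xmin = xids[0];
	snap->xmax = xids[xcnt - 1] + 1;
	snap->xcnt = xcnt;
	snap->xip = (ToyXid *) (snap + 1);
	memcpy(snap->xip, xids, xcnt * sizeof(ToyXid));
	return snap;
}
```

One free (or here, one `free()`; in the backend, resetting TopTransactionContext) releases both the struct and its arrays, which is why the code bothers to compute `size` up front.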

/*
 * Install a restored snapshot as the transaction snapshot.
 *
 * The second argument is of type void * so that snapmgr.h need not include
 * the declaration for PGPROC.
 */
void
RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
{
	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
}

/*
 * XidInMVCCSnapshot
 *		Is the given XID still-in-progress according to the snapshot?
 *
 * Note: GetSnapshotData never stores either top xid or subxids of our own
 * backend into a snapshot, so these xids will not be reported as "running"
 * by this function.  This is OK for current uses, because we always check
 * TransactionIdIsCurrentTransactionId first, except when it's known the
 * XID could not be ours anyway.
 */
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	/*
	 * Make a quick range check to eliminate most XIDs without looking at the
	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
	 * its parent below, because a subxact with XID < xmin has surely also got
	 * a parent with XID < xmin, while one with XID >= xmax must belong to a
	 * parent that was not yet committed at the time of this snapshot.
	 */

	/* Any xid < xmin is not in-progress */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* Any xid >= xmax is in-progress */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;

	/*
	 * Snapshot information is stored slightly differently in snapshots taken
	 * during recovery.
	 */
	if (!snapshot->takenDuringRecovery)
	{
		/*
		 * If the snapshot contains full subxact data, the fastest way to
		 * check things is just to compare the given XID against both subxact
		 * XIDs and top-level XIDs.  If the snapshot overflowed, we have to
		 * use pg_subtrans to convert a subxact XID to its parent XID, but
		 * then we need only look at top-level XIDs not subxacts.
		 */
		if (!snapshot->suboverflowed)
		{
			/* we have full data, so search subxip */
			if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
				return true;

			/* not there, fall through to search xip[] */
		}
		else
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is
			 * safe because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}

		if (pg_lfind32(xid, snapshot->xip, snapshot->xcnt))
			return true;
	}
	else
	{
		/*
		 * In recovery we store all xids in the subxip array because it is by
		 * far the bigger array, and we mostly don't know which xids are
		 * top-level and which are subxacts.  The xip array is empty.
		 *
		 * We start by searching subtrans, if we overflowed.
		 */
		if (snapshot->suboverflowed)
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is
			 * safe because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}

		/*
		 * We now have either a top-level xid higher than xmin or an
		 * indeterminate xid.  We don't know whether it's top level or
		 * subxact but it doesn't matter.  If it's present, the xid is
		 * visible.
		 */
		if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
			return true;
	}

	return false;
}
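The non-recovery, non-overflowed path of the check above reduces to: range test first, then search both arrays. A simplified sketch (the `toy_` names are hypothetical; plain `<` and `>=` stand in for the wraparound-aware TransactionIdPrecedes/Follows comparisons, and a linear scan stands in for pg_lfind32):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Linear search, standing in for pg_lfind32(). */
static bool
toy_lfind(ToyXid key, const ToyXid *arr, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		if (arr[i] == key)
			return true;
	return false;
}

/* Simplified visibility check: the cheap range test eliminates most
 * XIDs before any array is touched, exactly as in XidInMVCCSnapshot. */
static bool
toy_xid_in_snapshot(ToyXid xid, ToyXid xmin, ToyXid xmax,
					const ToyXid *xip, uint32_t xcnt,
					const ToyXid *subxip, uint32_t subxcnt)
{
	if (xid < xmin)
		return false;			/* definitely finished before the snapshot */
	if (xid >= xmax)
		return true;			/* started after the snapshot was taken */
	if (toy_lfind(xid, subxip, subxcnt))
		return true;			/* in-progress subtransaction */
	return toy_lfind(xid, xip, xcnt);	/* in-progress top-level xact */
}
```

The overflowed and recovery branches differ only in where they look: they first map a possible subxact XID to its top-level parent via pg_subtrans, or search subxip alone, but the xmin/xmax fast path is identical.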

/* ResourceOwner callbacks */

static void
ResOwnerReleaseSnapshot(Datum res)
{
	UnregisterSnapshotNoOwner((Snapshot) DatumGetPointer(res));
}