/*-------------------------------------------------------------------------
 *
 * twophase.c
 *		Two-phase commit support functions.
 *
 * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *		src/backend/access/transam/twophase.c
 *
 * NOTES
 *		Each global transaction is associated with a global transaction
 *		identifier (GID). The client assigns a GID to a postgres
 *		transaction with the PREPARE TRANSACTION command.
 *
 *		We keep all active global transactions in a shared memory array.
 *		When the PREPARE TRANSACTION command is issued, the GID is
 *		reserved for the transaction in the array. This is done before
 *		a WAL entry is made, because the reservation checks for duplicate
 *		GIDs and aborts the transaction if there already is a global
 *		transaction in prepared state with the same GID.
 *
 *		A global transaction (gxact) also has dummy PGXACT and PGPROC; this is
 *		what keeps the XID considered running by TransactionIdIsInProgress.
 *		It is also convenient as a PGPROC to hook the gxact's locks to.
 *
 *		In order to survive crashes and shutdowns, all prepared
 *		transactions must be stored in permanent storage. This includes
 *		locking information, pending notifications etc. All that state
 *		information is written to the per-transaction state file in
 *		the pg_twophase directory.
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#include "access/htup_details.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/pg_type.h"
#include "catalog/storage.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"


/*
 * Directory where Two-phase commit files reside within PGDATA
 */
#define TWOPHASE_DIR "pg_twophase"

/* GUC variable, can't be changed after startup */
int			max_prepared_xacts = 0;

/*
 * This struct describes one global transaction that is in prepared state
 * or attempting to become prepared.
 *
 * The lifecycle of a global transaction is:
 *
 * 1. After checking that the requested GID is not in use, set up an entry in
 * the TwoPhaseState->prepXacts array with the correct GID and valid = false,
 * and mark it as locked by my backend.
 *
 * 2. After successfully completing prepare, set valid = true and enter the
 * referenced PGPROC into the global ProcArray.
 *
 * 3. To begin COMMIT PREPARED or ROLLBACK PREPARED, check that the entry is
 * valid and not locked, then mark the entry as locked by storing my current
 * backend ID into locking_backend. This prevents concurrent attempts to
 * commit or roll back the same prepared xact.
 *
 * 4. On completion of COMMIT PREPARED or ROLLBACK PREPARED, remove the entry
 * from the ProcArray and the TwoPhaseState->prepXacts array and return it to
 * the freelist.
 *
 * Note that if the preparing transaction fails between steps 1 and 2, the
 * entry must be removed so that the GID and the GlobalTransaction struct
 * can be reused. See AtAbort_Twophase().
 *
 * typedef struct GlobalTransactionData *GlobalTransaction appears in
 * twophase.h
 */
#define GIDSIZE 200

typedef struct GlobalTransactionData
{
	GlobalTransaction next;		/* list link for free list */
	int			pgprocno;		/* ID of associated dummy PGPROC */
	BackendId	dummyBackendId; /* similar to backend id for backends */
	TimestampTz prepared_at;	/* time of preparation */
	XLogRecPtr	prepare_lsn;	/* XLOG offset of prepare record */
	Oid			owner;			/* ID of user that executed the xact */
	BackendId	locking_backend; /* backend currently working on the xact */
	bool		valid;			/* TRUE if PGPROC entry is in proc array */
	char		gid[GIDSIZE];	/* The GID assigned to the prepared xact */
} GlobalTransactionData;
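
/*
 * Illustration only, not part of twophase.c: a minimal standalone sketch of
 * the four lifecycle steps described above, under simplifying assumptions.
 * It uses a plain static array in place of shared memory, takes no
 * TwoPhaseStateLock, and has no ProcArray. Every Sketch and sketch_ name is
 * hypothetical and exists only for this example.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_GIDSIZE 200			/* mirrors GIDSIZE */
#define SKETCH_MAX_XACTS 4			/* hypothetical stand-in for max_prepared_xacts */
#define SKETCH_INVALID_BACKEND (-1)	/* hypothetical stand-in for InvalidBackendId */

typedef struct SketchGxact
{
	bool		in_use;				/* slot is off the (implicit) free list */
	bool		valid;				/* prepare has completed */
	int			locking_backend;	/* backend working on the entry, if any */
	char		gid[SKETCH_GIDSIZE];
} SketchGxact;

static SketchGxact sketch_xacts[SKETCH_MAX_XACTS];

/* Step 1: reject duplicate GIDs, then reserve a slot locked by "backend". */
static SketchGxact *
sketch_reserve(const char *gid, int backend)
{
	SketchGxact *slot = NULL;
	int			i;

	if (strlen(gid) >= SKETCH_GIDSIZE)
		return NULL;
	for (i = 0; i < SKETCH_MAX_XACTS; i++)
	{
		if (sketch_xacts[i].in_use)
		{
			if (strcmp(sketch_xacts[i].gid, gid) == 0)
				return NULL;		/* GID already reserved */
		}
		else if (slot == NULL)
			slot = &sketch_xacts[i];
	}
	if (slot == NULL)
		return NULL;				/* no free slot */
	slot->in_use = true;
	slot->valid = false;
	slot->locking_backend = backend;
	strcpy(slot->gid, gid);
	return slot;
}

/* Step 2: prepare succeeded, so mark the entry valid and unlock it. */
static void
sketch_finish_prepare(SketchGxact *gxact)
{
	assert(gxact->in_use && !gxact->valid);
	gxact->valid = true;
	gxact->locking_backend = SKETCH_INVALID_BACKEND;
}

/* Step 3: lock a valid, unlocked entry before finishing the transaction. */
static bool
sketch_lock_for_finish(SketchGxact *gxact, int backend)
{
	if (!gxact->in_use || !gxact->valid ||
		gxact->locking_backend != SKETCH_INVALID_BACKEND)
		return false;
	gxact->locking_backend = backend;
	return true;
}

/* Step 4: release the slot so that the GID and the struct can be reused. */
static void
sketch_remove(SketchGxact *gxact)
{
	gxact->in_use = false;
	gxact->valid = false;
	gxact->locking_backend = SKETCH_INVALID_BACKEND;
	gxact->gid[0] = '\0';
}

/* Abort: drop an entry that never became valid, otherwise just unlock it. */
static void
sketch_abort(SketchGxact *gxact)
{
	if (!gxact->valid)
		sketch_remove(gxact);
	else
		gxact->locking_backend = SKETCH_INVALID_BACKEND;
}
```

/*
 * In this sketch a second sketch_reserve() with the same GID returns NULL
 * until the first entry has been removed, and sketch_lock_for_finish()
 * refuses an entry that is not yet valid or is already locked.
 */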

/*
 * Two Phase Commit shared state. Access to this struct is protected
 * by TwoPhaseStateLock.
 */
typedef struct TwoPhaseStateData
{
	/* Head of linked list of free GlobalTransactionData structs */
	GlobalTransaction freeGXacts;

	/* Number of valid prepXacts entries. */
	int			numPrepXacts;

	/*
	 * There are max_prepared_xacts items in this array, but C wants a
	 * fixed-size array.
	 */
	GlobalTransaction prepXacts[1];		/* VARIABLE LENGTH ARRAY */
} TwoPhaseStateData;			/* VARIABLE LENGTH STRUCT */

static TwoPhaseStateData *TwoPhaseState;

/*
 * Global transaction entry currently locked by us, if any.
 */
static GlobalTransaction MyLockedGxact = NULL;

static bool twophaseExitRegistered = false;

static void RecordTransactionCommitPrepared(TransactionId xid,
								int nchildren,
								TransactionId *children,
								int nrels,
								RelFileNode *rels,
								int ninvalmsgs,
								SharedInvalidationMessage *invalmsgs,
								bool initfileinval);
static void RecordTransactionAbortPrepared(TransactionId xid,
							   int nchildren,
							   TransactionId *children,
							   int nrels,
							   RelFileNode *rels);
static void ProcessRecords(char *bufptr, TransactionId xid,
			   const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);


/*
 * Initialization of shared memory
 */
Size
TwoPhaseShmemSize(void)
{
	Size		size;

	/* Need the fixed struct, the array of pointers, and the GTD structs */
	size = offsetof(TwoPhaseStateData, prepXacts);
	size = add_size(size, mul_size(max_prepared_xacts,
								   sizeof(GlobalTransaction)));
	size = MAXALIGN(size);
	size = add_size(size, mul_size(max_prepared_xacts,
								   sizeof(GlobalTransactionData)));

	return size;
}
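
/*
 * Illustration only, not part of twophase.c: a standalone sketch of the
 * three-part layout sized above (fixed header, array of pointers, then the
 * aligned array of entries). The Sketch types, the 8-byte alignment, and the
 * assert-based overflow checks are assumptions made for this example; the
 * real add_size() and mul_size() report an error on overflow rather than
 * asserting.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALIGNOF 8			/* assumed maximum alignment */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))

typedef struct SketchEntry
{
	struct SketchEntry *next;
	int			pgprocno;
	char		gid[200];
} SketchEntry;

typedef struct SketchState
{
	SketchEntry *freeList;
	int			numUsed;
	SketchEntry *slots[];			/* one pointer per possible entry */
} SketchState;

/* Overflow-checked addition, standing in for add_size(). */
static size_t
sketch_add_size(size_t a, size_t b)
{
	assert(b <= SIZE_MAX - a);
	return a + b;
}

/* Overflow-checked multiplication, standing in for mul_size(). */
static size_t
sketch_mul_size(size_t a, size_t b)
{
	assert(a == 0 || b <= SIZE_MAX / a);
	return a * b;
}

/* Byte offset, from the start of the block, of the first SketchEntry. */
static size_t
sketch_entries_offset(size_t nslots)
{
	size_t		size = offsetof(SketchState, slots);

	size = sketch_add_size(size, sketch_mul_size(nslots, sizeof(SketchEntry *)));
	return SKETCH_MAXALIGN(size);
}

/* Total bytes needed: header, pointer array, padding, then the entries. */
static size_t
sketch_shmem_size(size_t nslots)
{
	return sketch_add_size(sketch_entries_offset(nslots),
						   sketch_mul_size(nslots, sizeof(SketchEntry)));
}
```

/*
 * Rounding up before the entries are added means the first entry starts on
 * an aligned boundary, which is why the initialization code can recompute
 * the same offset to locate the entries.
 */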

void
TwoPhaseShmemInit(void)
{
	bool		found;

	TwoPhaseState = ShmemInitStruct("Prepared Transaction Table",
									TwoPhaseShmemSize(),
									&found);
	if (!IsUnderPostmaster)
	{
		GlobalTransaction gxacts;
		int			i;

		Assert(!found);
		TwoPhaseState->freeGXacts = NULL;
		TwoPhaseState->numPrepXacts = 0;

		/*
		 * Initialize the linked list of free GlobalTransactionData structs
		 */
		gxacts = (GlobalTransaction)
			((char *) TwoPhaseState +
			 MAXALIGN(offsetof(TwoPhaseStateData, prepXacts) +
					  sizeof(GlobalTransaction) * max_prepared_xacts));
		for (i = 0; i < max_prepared_xacts; i++)
		{
			/* insert into linked list */
			gxacts[i].next = TwoPhaseState->freeGXacts;
			TwoPhaseState->freeGXacts = &gxacts[i];

			/* associate it with a PGPROC assigned by InitProcGlobal */
			gxacts[i].pgprocno = PreparedXactProcs[i].pgprocno;

			/*
			 * Assign a unique ID for each dummy proc, so that the range of
			 * dummy backend IDs immediately follows the range of normal
			 * backend IDs. We don't dare to assign a real backend ID to dummy
			 * procs, because prepared transactions don't take part in cache
			 * invalidation like a real backend ID would imply, but having a
			 * unique ID for them is nevertheless handy. This arrangement
			 * allows you to allocate an array of size (MaxBackends +
			 * max_prepared_xacts + 1), and have a slot for every backend and
			 * prepared transaction. Currently multixact.c uses that
			 * technique.
			 */
			gxacts[i].dummyBackendId = MaxBackends + 1 + i;
		}
	}
	else
		Assert(found);
}
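
/*
 * Illustration only, not part of twophase.c: a standalone sketch of the
 * free-list threading and dummy backend ID numbering done in the loop above.
 * SketchSlot, sketch_init_freelist, and sketch_pop are hypothetical names,
 * and the max_backends parameter stands in for MaxBackends.
 */

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchSlot
{
	struct SketchSlot *next;		/* list link for free list */
	int			dummyBackendId;		/* ID just past the normal backend IDs */
} SketchSlot;

/*
 * Push every slot onto a singly linked free list and number the slots so
 * that their IDs run from max_backends + 1 to max_backends + nslots.
 * Returns the head of the list, which is the last slot pushed.
 */
static SketchSlot *
sketch_init_freelist(SketchSlot *slots, int nslots, int max_backends)
{
	SketchSlot *head = NULL;
	int			i;

	for (i = 0; i < nslots; i++)
	{
		slots[i].next = head;
		head = &slots[i];
		slots[i].dummyBackendId = max_backends + 1 + i;
	}
	return head;
}

/* Take one slot off the free list, or return NULL if the list is empty. */
static SketchSlot *
sketch_pop(SketchSlot **head)
{
	SketchSlot *slot = *head;

	if (slot != NULL)
	{
		*head = slot->next;
		slot->next = NULL;
	}
	return slot;
}
```

/*
 * Because each slot is pushed at the head, the list hands slots out in
 * reverse array order. With IDs numbered this way, an array of
 * (max_backends + nslots + 1) elements indexed by ID has room for every
 * backend and every prepared transaction.
 */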

/*
 * Exit hook to unlock the global transaction entry we're working on.
 */
static void
AtProcExit_Twophase(int code, Datum arg)
{
	/* same logic as abort */
	AtAbort_Twophase();
}

/*
 * Abort hook to unlock the global transaction entry we're working on.
 */
void
AtAbort_Twophase(void)
{
	if (MyLockedGxact == NULL)
		return;

	/*
	 * What to do with the locked global transaction entry? If we were in
	 * the process of preparing the transaction, but haven't written the WAL
	 * record and state file yet, the transaction must not be considered as
	 * prepared. Likewise, if we are in the process of finishing an
	 * already-prepared transaction, and fail after having already written
	 * the 2nd phase commit or rollback record to the WAL, the transaction
	 * should not be considered as prepared anymore. In those cases, just
	 * remove the entry from shared memory.
	 *
	 * Otherwise, the entry must be left in place so that the transaction
	 * can be finished later, so just unlock it.
	 *
	 * If we abort during prepare, after having written the WAL record, we
	 * might not have transferred all locks and other state to the prepared
	 * transaction yet. Likewise, if we abort during commit or rollback,
	 * after having written the WAL record, we might not have released
	 * all the resources held by the transaction yet. In those cases, the
	 * in-memory state can be wrong, but it's too late to back out.
	 */
	if (!MyLockedGxact->valid)
	{
		RemoveGXact(MyLockedGxact);
	}
	else
	{
		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);

		MyLockedGxact->locking_backend = InvalidBackendId;

		LWLockRelease(TwoPhaseStateLock);
	}
	MyLockedGxact = NULL;
}

/*
 * This is called after we have finished transferring state to the prepared
 * PGXACT entry.
 */
void
PostPrepare_Twophase(void)
{
	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
	MyLockedGxact->locking_backend = InvalidBackendId;
	LWLockRelease(TwoPhaseStateLock);

	MyLockedGxact = NULL;
}


/*
 * MarkAsPreparing
 *      Reserve the GID for the given transaction.
 *
 * Internally, this creates a gxact struct and puts it into the active array.
 * NOTE: this is also used when reloading a gxact after a crash; so avoid
 * assuming that we can use very much backend context.
 */
GlobalTransaction
MarkAsPreparing(TransactionId xid, const char *gid,
                TimestampTz prepared_at, Oid owner, Oid databaseid)
{
    GlobalTransaction gxact;
    PGPROC     *proc;
    PGXACT     *pgxact;
    int         i;

    if (strlen(gid) >= GIDSIZE)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("transaction identifier \"%s\" is too long",
                        gid)));

    /* fail immediately if feature is disabled */
    if (max_prepared_xacts == 0)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("prepared transactions are disabled"),
                 errhint("Set max_prepared_transactions to a nonzero value.")));

    /* on first call, register the exit hook */
    if (!twophaseExitRegistered)
    {
        before_shmem_exit(AtProcExit_Twophase, 0);
        twophaseExitRegistered = true;
    }

    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);

    /* Check for conflicting GID */
    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        gxact = TwoPhaseState->prepXacts[i];
        if (strcmp(gxact->gid, gid) == 0)
        {
            ereport(ERROR,
                    (errcode(ERRCODE_DUPLICATE_OBJECT),
                     errmsg("transaction identifier \"%s\" is already in use",
                            gid)));
        }
    }

    /* Get a free gxact from the freelist */
    if (TwoPhaseState->freeGXacts == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("maximum number of prepared transactions reached"),
                 errhint("Increase max_prepared_transactions (currently %d).",
                         max_prepared_xacts)));
    gxact = TwoPhaseState->freeGXacts;
    TwoPhaseState->freeGXacts = gxact->next;

    proc = &ProcGlobal->allProcs[gxact->pgprocno];
    pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];

    /* Initialize the PGPROC entry */
    MemSet(proc, 0, sizeof(PGPROC));
    proc->pgprocno = gxact->pgprocno;
    SHMQueueElemInit(&(proc->links));
    proc->waitStatus = STATUS_OK;
    /* We set up the gxact's VXID as InvalidBackendId/XID */
    proc->lxid = (LocalTransactionId) xid;
    pgxact->xid = xid;
    pgxact->xmin = InvalidTransactionId;
    pgxact->delayChkpt = false;
    pgxact->vacuumFlags = 0;
    proc->pid = 0;
    proc->backendId = InvalidBackendId;
    proc->databaseId = databaseid;
    proc->roleId = owner;
    proc->lwWaiting = false;
    proc->lwWaitMode = 0;
    proc->waitLock = NULL;
    proc->waitProcLock = NULL;
    for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        SHMQueueInit(&(proc->myProcLocks[i]));
    /* subxid data must be filled later by GXactLoadSubxactData */
    pgxact->overflowed = false;
    pgxact->nxids = 0;

    gxact->prepared_at = prepared_at;
    /* initialize LSN to 0 (start of WAL) */
    gxact->prepare_lsn = 0;
    gxact->owner = owner;
    gxact->locking_backend = MyBackendId;
    gxact->valid = false;
    strcpy(gxact->gid, gid);

    /* And insert it into the active array */
    Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
    TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;

    /*
     * Remember that we have this GlobalTransaction entry locked for us.
     * If we abort after this, we must release it.
     */
    MyLockedGxact = gxact;

    LWLockRelease(TwoPhaseStateLock);

    return gxact;
}
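
/*
 * Illustrative sketch, not part of twophase.c: the reservation steps of
 * MarkAsPreparing (length check, duplicate-GID scan, pop from the freelist,
 * append to the active array) reduced to a standalone, single-threaded
 * model.  There is no shared memory, no LWLock and no ereport here; errors
 * are returned as codes instead.  Every Demo* and demo_* name and both
 * DEMO_* constants are hypothetical and chosen only for this example.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_GIDSIZE 200        /* illustrative limit (assumption) */
#define DEMO_MAX_PREPARED 4     /* stands in for max_prepared_xacts */

typedef struct DemoGXact
{
    struct DemoGXact *next;     /* freelist link */
    char        gid[DEMO_GIDSIZE];
    int         valid;          /* 0 until the entry is fully prepared */
} DemoGXact;

typedef struct
{
    DemoGXact  *free_list;      /* unused entries */
    int         num_active;     /* number of entries in active[] */
    DemoGXact  *active[DEMO_MAX_PREPARED];
    DemoGXact   slots[DEMO_MAX_PREPARED];
} DemoState;

typedef enum
{
    DEMO_OK,
    DEMO_GID_TOO_LONG,
    DEMO_DUPLICATE_GID,
    DEMO_NO_FREE_SLOT
} DemoResult;

/* Put every slot on the freelist and empty the active array. */
static void
demo_init(DemoState *st)
{
    int         i;

    st->free_list = NULL;
    st->num_active = 0;
    for (i = 0; i < DEMO_MAX_PREPARED; i++)
    {
        st->slots[i].next = st->free_list;
        st->free_list = &st->slots[i];
    }
}

/* Reserve gid; on success *out points at the new, not-yet-valid entry. */
static DemoResult
demo_reserve(DemoState *st, const char *gid, DemoGXact **out)
{
    DemoGXact  *gxact;
    int         i;

    if (strlen(gid) >= DEMO_GIDSIZE)
        return DEMO_GID_TOO_LONG;

    /* Check for a conflicting GID among the active entries. */
    for (i = 0; i < st->num_active; i++)
    {
        if (strcmp(st->active[i]->gid, gid) == 0)
            return DEMO_DUPLICATE_GID;
    }

    /* Take a free entry from the head of the freelist. */
    if (st->free_list == NULL)
        return DEMO_NO_FREE_SLOT;
    gxact = st->free_list;
    st->free_list = gxact->next;

    gxact->valid = 0;
    strcpy(gxact->gid, gid);    /* safe: length was checked above */

    /* Append to the active array. */
    assert(st->num_active < DEMO_MAX_PREPARED);
    st->active[st->num_active++] = gxact;

    *out = gxact;
    return DEMO_OK;
}
```

/*
 * In this model a caller would run demo_init() once and then call
 * demo_reserve() per GID; a second reservation of the same GID is rejected
 * before any entry is taken from the freelist.
 */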

/*
 * GXactLoadSubxactData
 *
 * If the transaction being persisted had any subtransactions, this must
 * be called before MarkAsPrepared() to load information into the dummy
 * PGPROC.
 */
static void
GXactLoadSubxactData(GlobalTransaction gxact, int nsubxacts,
                     TransactionId *children)
{
    PGPROC     *proc = &ProcGlobal->allProcs[gxact->pgprocno];
    PGXACT     *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];

    /* We need no extra lock since the GXACT isn't valid yet */
    if (nsubxacts > PGPROC_MAX_CACHED_SUBXIDS)
    {
        pgxact->overflowed = true;
        nsubxacts = PGPROC_MAX_CACHED_SUBXIDS;
    }
    if (nsubxacts > 0)
    {
        memcpy(proc->subxids.xids, children,
               nsubxacts * sizeof(TransactionId));
        pgxact->nxids = nsubxacts;
    }
}
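
/*
 * Illustrative sketch, not part of twophase.c: the truncate-and-flag
 * pattern used by GXactLoadSubxactData, as a standalone function.  When
 * more subtransaction XIDs are supplied than the cache can hold, only the
 * first DEMO_MAX_CACHED_SUBXIDS are copied and an "overflowed" flag tells
 * readers that the cache is incomplete.  The Demo* and demo_* names and the
 * cache size are hypothetical.  Unlike the function above, this sketch also
 * resets the flag and count itself so it can be called repeatedly; in the
 * source that reset happens in MarkAsPreparing.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_CACHED_SUBXIDS 64  /* illustrative cache size (assumption) */

typedef uint32_t DemoXid;

typedef struct
{
    DemoXid     xids[DEMO_MAX_CACHED_SUBXIDS];
    int         nxids;          /* number of valid entries in xids[] */
    bool        overflowed;     /* true if some children were dropped */
} DemoSubxidCache;

static void
demo_load_subxids(DemoSubxidCache *cache, int nsubxacts,
                  const DemoXid *children)
{
    assert(nsubxacts >= 0);

    /* Start from an empty, non-overflowed cache. */
    cache->overflowed = false;
    cache->nxids = 0;

    /* Too many children: remember that fact and keep only what fits. */
    if (nsubxacts > DEMO_MAX_CACHED_SUBXIDS)
    {
        cache->overflowed = true;
        nsubxacts = DEMO_MAX_CACHED_SUBXIDS;
    }
    if (nsubxacts > 0)
    {
        memcpy(cache->xids, children,
               (size_t) nsubxacts * sizeof(DemoXid));
        cache->nxids = nsubxacts;
    }
}
```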

/*
 * MarkAsPrepared
 *      Mark the GXACT as fully valid, and enter it into the global ProcArray.
 */
static void
MarkAsPrepared(GlobalTransaction gxact)
{
    /* Lock here may be overkill, but I'm not convinced of that ... */
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    Assert(!gxact->valid);
    gxact->valid = true;
    LWLockRelease(TwoPhaseStateLock);

    /*
     * Put it into the global ProcArray so TransactionIdIsInProgress considers
     * the XID as still running.
     */
    ProcArrayAdd(&ProcGlobal->allProcs[gxact->pgprocno]);
}

/*
 * LockGXact
 *      Locate the prepared transaction and mark it busy for COMMIT PREPARED
 *      or ROLLBACK PREPARED.
 */
static GlobalTransaction
LockGXact(const char *gid, Oid user)
{
    int         i;

    /* on first call, register the exit hook */
    if (!twophaseExitRegistered)
    {
        before_shmem_exit(AtProcExit_Twophase, 0);
        twophaseExitRegistered = true;
    }

    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
        PGPROC     *proc = &ProcGlobal->allProcs[gxact->pgprocno];

        /* Ignore not-yet-valid GIDs */
        if (!gxact->valid)
            continue;
        if (strcmp(gxact->gid, gid) != 0)
            continue;

        /* Found it, but has someone else got it locked? */
        if (gxact->locking_backend != InvalidBackendId)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("prepared transaction with identifier \"%s\" is busy",
                            gid)));

        if (user != gxact->owner && !superuser_arg(user))
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                     errmsg("permission denied to finish prepared transaction"),
                     errhint("Must be superuser or the user that prepared the transaction.")));

        /*
         * Note: it probably would be possible to allow committing from
         * another database; but at the moment NOTIFY is known not to work and
         * there may be some other issues as well.  Hence disallow until
         * someone gets motivated to make it work.
         */
        if (MyDatabaseId != proc->databaseId)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("prepared transaction belongs to another database"),
                     errhint("Connect to the database where the transaction was prepared to finish it.")));

        /* OK for me to lock it */
        gxact->locking_backend = MyBackendId;
        MyLockedGxact = gxact;

        LWLockRelease(TwoPhaseStateLock);

        return gxact;
    }

    LWLockRelease(TwoPhaseStateLock);

    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_OBJECT),
             errmsg("prepared transaction with identifier \"%s\" does not exist",
                    gid)));

    /* NOTREACHED */
    return NULL;
}
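
/*
 * Illustrative sketch, not part of twophase.c: the explicit
 * "locked by backend N, or not locked" flag that LockGXact sets, modelled
 * without shared memory.  An entry is free when its locking_backend holds
 * the invalid ID; a second locker sees the entry as busy until the owner
 * explicitly unlocks it.  The ownership and database checks are left out,
 * and the sketch is single-threaded, whereas the function above does the
 * test-and-set while holding TwoPhaseStateLock.  All Demo* and demo_* names
 * are hypothetical.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_INVALID_BACKEND (-1)   /* plays the role of InvalidBackendId */

typedef struct
{
    const char *gid;
    bool        valid;              /* fully prepared? */
    int         locking_backend;    /* DEMO_INVALID_BACKEND when unlocked */
} DemoEntry;

typedef enum
{
    DEMO_LOCKED,
    DEMO_BUSY,
    DEMO_NOT_FOUND
} DemoLockResult;

static DemoLockResult
demo_lock(DemoEntry *entries, int n, const char *gid, int my_backend,
          DemoEntry **out)
{
    int         i;

    assert(my_backend != DEMO_INVALID_BACKEND);

    for (i = 0; i < n; i++)
    {
        DemoEntry  *e = &entries[i];

        /* Ignore entries that are not yet valid or have another GID. */
        if (!e->valid)
            continue;
        if (strcmp(e->gid, gid) != 0)
            continue;

        /* Found it, but does someone else hold it? */
        if (e->locking_backend != DEMO_INVALID_BACKEND)
            return DEMO_BUSY;

        e->locking_backend = my_backend;
        *out = e;
        return DEMO_LOCKED;
    }
    return DEMO_NOT_FOUND;
}

/* The owner must release the entry explicitly. */
static void
demo_unlock(DemoEntry *e)
{
    e->locking_backend = DEMO_INVALID_BACKEND;
}
```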

/*
 * RemoveGXact
 *      Remove the prepared transaction from the shared memory array.
 *
 * NB: caller should have already removed it from ProcArray
 */
static void
RemoveGXact(GlobalTransaction gxact)
{
    int         i;

    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        if (gxact == TwoPhaseState->prepXacts[i])
        {
            /* remove from the active array */
            TwoPhaseState->numPrepXacts--;
            TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];

            /* and put it back in the freelist */
            gxact->next = TwoPhaseState->freeGXacts;
            TwoPhaseState->freeGXacts = gxact;

            LWLockRelease(TwoPhaseStateLock);

            return;
        }
    }

    LWLockRelease(TwoPhaseStateLock);

    elog(ERROR, "failed to find %p in GlobalTransaction array", gxact);
}
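
/*
 * Illustrative sketch, not part of twophase.c: the constant-time removal
 * used by RemoveGXact for the active array.  Because the array is
 * unordered, a found element is overwritten with the last element and the
 * count is decremented, so nothing has to be shifted.  The function name
 * and the use of plain ints as elements are assumptions for this example.
 */

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Remove the first occurrence of target from items[0..*count-1].
 * Returns true if it was found.  Element order is not preserved.
 */
static bool
demo_remove_unordered(int *items, int *count, int target)
{
    int         i;

    assert(*count >= 0);

    for (i = 0; i < *count; i++)
    {
        if (items[i] == target)
        {
            (*count)--;
            items[i] = items[*count];   /* move the last element into the hole */
            return true;
        }
    }
    return false;
}
```

/*
 * Removing the last element degenerates to a self-assignment followed by
 * the shorter count, which is harmless.
 */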

/*
 * TransactionIdIsPrepared
 *      True iff transaction associated with the identifier is prepared
 *      for two-phase commit
 *
 * Note: only gxacts marked "valid" are considered; but notice we do not
 * check the locking status.
 *
 * This is not currently exported, because it is only needed internally.
 */
static bool
TransactionIdIsPrepared(TransactionId xid)
{
    bool        result = false;
    int         i;

    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
        PGXACT     *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];

        if (gxact->valid && pgxact->xid == xid)
        {
            result = true;
            break;
        }
    }

    LWLockRelease(TwoPhaseStateLock);

    return result;
}

/*
 * Returns an array of all prepared transactions for the user-level
 * function pg_prepared_xact.
 *
 * The returned array and all its elements are copies of internal data
 * structures, to minimize the time we need to hold the TwoPhaseStateLock.
 *
 * WARNING -- we return even those transactions that are not fully prepared
 * yet.  The caller should filter them out if he doesn't want them.
 *
 * The returned array is palloc'd.
 */
static int
GetPreparedTransactionList(GlobalTransaction *gxacts)
{
    GlobalTransaction array;
    int         num;
    int         i;

    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);

    if (TwoPhaseState->numPrepXacts == 0)
    {
        LWLockRelease(TwoPhaseStateLock);

        *gxacts = NULL;
        return 0;
    }

    num = TwoPhaseState->numPrepXacts;
    array = (GlobalTransaction) palloc(sizeof(GlobalTransactionData) * num);
    *gxacts = array;
    for (i = 0; i < num; i++)
        memcpy(array + i, TwoPhaseState->prepXacts[i],
               sizeof(GlobalTransactionData));

    LWLockRelease(TwoPhaseStateLock);

    return num;
}


/* Working status for pg_prepared_xact */
typedef struct
{
    GlobalTransaction array;
    int         ngxacts;
    int         currIdx;
} Working_State;

/*
 * pg_prepared_xact
 *      Produce a view with one row per prepared transaction.
 *
 * This function is here so we don't have to export the
 * GlobalTransactionData struct definition.
 */
Datum
pg_prepared_xact(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;
    Working_State *status;

    if (SRF_IS_FIRSTCALL())
    {
        TupleDesc   tupdesc;
        MemoryContext oldcontext;

        /* create a function context for cross-call persistence */
        funcctx = SRF_FIRSTCALL_INIT();

        /*
         * Switch to memory context appropriate for multiple function calls
         */
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        /* build tupdesc for result tuples */
        /* this had better match pg_prepared_xacts view in system_views.sql */
        tupdesc = CreateTemplateTupleDesc(5, false);
        TupleDescInitEntry(tupdesc, (AttrNumber) 1, "transaction",
                           XIDOID, -1, 0);
        TupleDescInitEntry(tupdesc, (AttrNumber) 2, "gid",
                           TEXTOID, -1, 0);
        TupleDescInitEntry(tupdesc, (AttrNumber) 3, "prepared",
                           TIMESTAMPTZOID, -1, 0);
        TupleDescInitEntry(tupdesc, (AttrNumber) 4, "ownerid",
                           OIDOID, -1, 0);
        TupleDescInitEntry(tupdesc, (AttrNumber) 5, "dbid",
                           OIDOID, -1, 0);

        funcctx->tuple_desc = BlessTupleDesc(tupdesc);

        /*
         * Collect all the 2PC status information that we will format and send
         * out as a result set.
         */
        status = (Working_State *) palloc(sizeof(Working_State));
        funcctx->user_fctx = (void *) status;

        status->ngxacts = GetPreparedTransactionList(&status->array);
        status->currIdx = 0;

        MemoryContextSwitchTo(oldcontext);
    }

    funcctx = SRF_PERCALL_SETUP();
    status = (Working_State *) funcctx->user_fctx;

    while (status->array != NULL && status->currIdx < status->ngxacts)
    {
        GlobalTransaction gxact = &status->array[status->currIdx++];
        PGPROC     *proc = &ProcGlobal->allProcs[gxact->pgprocno];
        PGXACT     *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
        Datum       values[5];
        bool        nulls[5];
        HeapTuple   tuple;
        Datum       result;

        if (!gxact->valid)
            continue;

        /*
         * Form tuple with appropriate data.
         */
        MemSet(values, 0, sizeof(values));
        MemSet(nulls, 0, sizeof(nulls));

        values[0] = TransactionIdGetDatum(pgxact->xid);
        values[1] = CStringGetTextDatum(gxact->gid);
        values[2] = TimestampTzGetDatum(gxact->prepared_at);
        values[3] = ObjectIdGetDatum(gxact->owner);
        values[4] = ObjectIdGetDatum(proc->databaseId);

        tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
        result = HeapTupleGetDatum(tuple);
        SRF_RETURN_NEXT(funcctx, result);
    }

    SRF_RETURN_DONE(funcctx);
}

/*
 * TwoPhaseGetGXact
 *      Get the GlobalTransaction struct for a prepared transaction
 *      specified by XID
 */
static GlobalTransaction
TwoPhaseGetGXact(TransactionId xid)
{
    GlobalTransaction result = NULL;
    int         i;

    static TransactionId cached_xid = InvalidTransactionId;
    static GlobalTransaction cached_gxact = NULL;

    /*
     * During a recovery, COMMIT PREPARED, or ROLLBACK PREPARED, we'll be
     * called repeatedly for the same XID.  We can save work with a simple
     * cache.
     */
    if (xid == cached_xid)
        return cached_gxact;

    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);

    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
        PGXACT     *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];

        if (pgxact->xid == xid)
        {
            result = gxact;
            break;
        }
    }

    LWLockRelease(TwoPhaseStateLock);

    if (result == NULL)         /* should not happen */
        elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);

    cached_xid = xid;
    cached_gxact = result;

    return result;
}
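
/*
 * Illustrative sketch, not part of twophase.c: the one-entry cache in
 * TwoPhaseGetGXact, shown on a plain array.  Function-local static
 * variables remember the last key and its result, so repeated lookups of
 * the same key skip the linear scan.  Demo* and demo_* names are
 * hypothetical; demo_scan_count exists only to make the saved work
 * observable.  Unlike the function above, a miss returns NULL instead of
 * raising an error, and misses are not cached.  The sketch also assumes the
 * array does not change between calls.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t    xid;            /* lookup key; 0 is treated as invalid */
    int         payload;
} DemoRec;

static int  demo_scan_count = 0;    /* number of full scans performed */

static const DemoRec *
demo_lookup(const DemoRec *recs, int n, uint32_t xid)
{
    static uint32_t cached_xid = 0;
    static const DemoRec *cached_rec = NULL;
    const DemoRec *result = NULL;
    int         i;

    assert(xid != 0);

    /* Fast path: same key as the previous successful lookup. */
    if (xid == cached_xid)
        return cached_rec;

    demo_scan_count++;
    for (i = 0; i < n; i++)
    {
        if (recs[i].xid == xid)
        {
            result = &recs[i];
            break;
        }
    }

    if (result == NULL)
        return NULL;            /* leave the cache untouched on a miss */

    cached_xid = xid;
    cached_rec = result;
    return result;
}
```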

/*
 * TwoPhaseGetDummyBackendId
 *      Get the dummy backend ID for prepared transaction specified by XID
 *
 * Dummy backend IDs are similar to real backend IDs of real backends.
 * They start at MaxBackends + 1, and are unique across all currently active
 * real backends and prepared transactions.
 */
BackendId
TwoPhaseGetDummyBackendId(TransactionId xid)
{
    GlobalTransaction gxact = TwoPhaseGetGXact(xid);

    return gxact->dummyBackendId;
}

/*
 * TwoPhaseGetDummyProc
 *      Get the PGPROC that represents a prepared transaction specified by XID
 */
PGPROC *
TwoPhaseGetDummyProc(TransactionId xid)
{
    GlobalTransaction gxact = TwoPhaseGetGXact(xid);

    return &ProcGlobal->allProcs[gxact->pgprocno];
}

/************************************************************************/
/* State file support */
/************************************************************************/

#define TwoPhaseFilePath(path, xid) \
    snprintf(path, MAXPGPATH, TWOPHASE_DIR "/%08X", xid)
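
/*
 * Illustrative sketch, not part of twophase.c: what the TwoPhaseFilePath
 * format string produces.  The XID is printed as eight uppercase hex
 * digits, zero-padded, under the two-phase directory.  DEMO_TWOPHASE_DIR,
 * DEMO_MAXPATH and demo_twophase_path are hypothetical stand-ins; the
 * directory name follows the pg_twophase directory mentioned in the file
 * header, and the buffer size is an assumption.  The wrapper also reports
 * truncation, which the macro itself does not check.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAXPATH 1024               /* assumed buffer size */
#define DEMO_TWOPHASE_DIR "pg_twophase" /* assumed directory name */

/* Build the state file path for xid; returns false if it did not fit. */
static bool
demo_twophase_path(char *path, size_t len, uint32_t xid)
{
    int         n = snprintf(path, len, DEMO_TWOPHASE_DIR "/%08X",
                             (unsigned int) xid);

    if (n < 0 || (size_t) n >= len)
        return false;           /* encoding error or truncated output */
    assert(strlen(path) == (size_t) n);
    return true;
}
```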
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* 2PC state file format:
|
|
|
|
*
|
2005-10-15 04:49:52 +02:00
|
|
|
* 1. TwoPhaseFileHeader
|
|
|
|
* 2. TransactionId[] (subtransactions)
|
2008-11-19 11:34:52 +01:00
|
|
|
* 3. RelFileNode[] (files to be deleted at commit)
|
|
|
|
* 4. RelFileNode[] (files to be deleted at abort)
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
* 5. SharedInvalidationMessage[] (inval messages to be sent at commit)
|
|
|
|
* 6. TwoPhaseRecordOnDisk
|
|
|
|
* 7. ...
|
|
|
|
* 8. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID)
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
* 9. checksum (CRC-32C)
|
2005-06-18 00:32:51 +02:00
|
|
|
*
|
2014-11-04 10:35:15 +01:00
|
|
|
* Each segment except the final checksum is MAXALIGN'd.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Header for a 2PC state file
|
|
|
|
*/
|
2009-09-01 06:15:45 +02:00
|
|
|
#define TWOPHASE_MAGIC 0x57F94532 /* format identifier */
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
typedef struct TwoPhaseFileHeader
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
uint32 magic; /* format identifier */
|
|
|
|
uint32 total_len; /* actual file length */
|
|
|
|
TransactionId xid; /* original transaction XID */
|
|
|
|
Oid database; /* OID of database it was in */
|
|
|
|
TimestampTz prepared_at; /* time of preparation */
|
|
|
|
Oid owner; /* user running the transaction */
|
|
|
|
int32 nsubxacts; /* number of following subxact XIDs */
|
|
|
|
int32 ncommitrels; /* number of delete-on-commit rels */
|
|
|
|
int32 nabortrels; /* number of delete-on-abort rels */
|
2009-12-19 02:32:45 +01:00
|
|
|
int32 ninvalmsgs; /* number of cache invalidation messages */
|
|
|
|
bool initfileinval; /* does relcache init file need invalidation? */
|
2005-10-15 04:49:52 +02:00
|
|
|
char gid[GIDSIZE]; /* GID for transaction */
|
2005-06-18 00:32:51 +02:00
|
|
|
} TwoPhaseFileHeader;
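As an illustration of how the first two header fields might be used when a state file is read back, the sketch below rejects a buffer whose magic number or recorded length does not match. It is a standalone approximation, not code from twophase.c: sketch_header_plausible and SKETCH_TWOPHASE_MAGIC are hypothetical names, and it assumes that magic and total_len are the first two 4-byte fields, in native byte order, as the struct above declares them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same value as TWOPHASE_MAGIC above; the SKETCH_ name is hypothetical. */
#define SKETCH_TWOPHASE_MAGIC 0x57F94532u

/*
 * Return true if buf starts with the expected magic number and its recorded
 * total length equals the number of bytes actually available.
 *
 * Assumption: the two fields sit at offsets 0 and 4 with no padding.
 */
static bool
sketch_header_plausible(const void *buf, size_t file_len)
{
	uint32_t	magic;
	uint32_t	total_len;

	assert(buf != NULL);

	/* Too short to hold even the two leading fields. */
	if (file_len < 2 * sizeof(uint32_t))
		return false;

	/* memcpy avoids any alignment assumption about buf. */
	memcpy(&magic, buf, sizeof(magic));
	memcpy(&total_len, (const char *) buf + sizeof(magic), sizeof(total_len));

	return magic == SKETCH_TWOPHASE_MAGIC && total_len == file_len;
}
```

A check of this kind only catches gross damage such as a truncated or foreign file; the trailing checksum is what protects the contents.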
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Header for each record in a state file
|
|
|
|
*
|
|
|
|
* NOTE: len counts only the rmgr data, not the TwoPhaseRecordOnDisk header.
|
|
|
|
* The rmgr data will be stored starting on a MAXALIGN boundary.
|
|
|
|
*/
|
|
|
|
typedef struct TwoPhaseRecordOnDisk
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
uint32 len; /* length of rmgr data */
|
|
|
|
TwoPhaseRmgrId rmid; /* resource manager for this record */
|
|
|
|
uint16 info; /* flag bits for use by rmgr */
|
2005-06-18 00:32:51 +02:00
|
|
|
} TwoPhaseRecordOnDisk;
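The record layout described above can be walked with a simple loop: read a header, stop at the end sentinel, otherwise skip the MAXALIGN'd header and the MAXALIGN'd rmgr data. The following is a minimal standalone sketch of that idea, not the function twophase.c uses. All sketch_ and SKETCH_ names are hypothetical, and it assumes 8-byte maximum alignment, a one-byte resource manager id, and an end-sentinel id of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 8-byte maximum alignment; the real value is platform-specific. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
/* Assumption: id 0 marks the end sentinel. */
#define SKETCH_RM_END_ID 0

typedef struct SketchRecordHeader
{
	uint32_t	len;			/* length of payload, excluding this header */
	uint8_t		rmid;			/* resource manager id (assumed one byte) */
	uint16_t	info;			/* flag bits */
} SketchRecordHeader;

/* Write one record at offset off; return the offset just past it. */
static size_t
sketch_put_record(char *buf, size_t off, uint8_t rmid, uint16_t info,
				  const void *payload, uint32_t len)
{
	SketchRecordHeader hdr;

	assert(payload != NULL || len == 0);
	memset(&hdr, 0, sizeof(hdr));	/* also zeroes struct padding */
	hdr.len = len;
	hdr.rmid = rmid;
	hdr.info = info;
	memcpy(buf + off, &hdr, sizeof(hdr));
	off += SKETCH_MAXALIGN(sizeof(hdr));
	if (len > 0)
		memcpy(buf + off, payload, len);
	return off + SKETCH_MAXALIGN(len);
}

/*
 * Count records up to the end sentinel and sum their payload lengths.
 * Returns -1 if the buffer ends before a sentinel is found.
 */
static int
sketch_walk_records(const char *buf, size_t buflen, uint32_t *payload_bytes)
{
	size_t		off = 0;
	int			nrecords = 0;

	*payload_bytes = 0;
	for (;;)
	{
		SketchRecordHeader hdr;

		if (off + sizeof(hdr) > buflen)
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.rmid == SKETCH_RM_END_ID)
			return nrecords;

		/* The payload starts on the next MAXALIGN boundary. */
		off += SKETCH_MAXALIGN(sizeof(hdr));
		if (off + hdr.len > buflen)
			return -1;
		*payload_bytes += hdr.len;
		off += SKETCH_MAXALIGN(hdr.len);
		nrecords++;
	}
}
```

Because len counts only the payload, the reader has to add the aligned header size itself; forgetting either alignment step desynchronizes every following record.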
|
|
|
|
|
|
|
|
/*
|
|
|
|
* During prepare, the state file is assembled in memory before writing it
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
* to WAL and the actual state file. We use a chain of StateFileChunk blocks
|
|
|
|
* for that.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
2014-11-20 16:56:26 +01:00
|
|
|
typedef struct StateFileChunk
|
|
|
|
{
|
|
|
|
char *data;
|
|
|
|
uint32 len; /* bytes used in data, including alignment padding */
|
|
|
|
struct StateFileChunk *next;
|
|
|
|
} StateFileChunk;
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
static struct xllist
|
|
|
|
{
|
2014-11-20 16:56:26 +01:00
|
|
|
StateFileChunk *head; /* first data block in the chain */
|
|
|
|
StateFileChunk *tail; /* last block in chain */
|
|
|
|
uint32 num_chunks; /* number of blocks in the chain */
|
2005-10-15 04:49:52 +02:00
|
|
|
uint32 bytes_free; /* free bytes left in tail block */
|
|
|
|
uint32 total_len; /* total data bytes in chain */
|
|
|
|
} records;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Append a block of data to records data structure.
|
|
|
|
*
|
|
|
|
* NB: each block is padded to a MAXALIGN multiple. This must be
|
|
|
|
* accounted for when the file is later read!
|
|
|
|
*
|
|
|
|
* The data is copied, so the caller is free to modify it afterwards.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
save_state_data(const void *data, uint32 len)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
uint32 padlen = MAXALIGN(len);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
if (padlen > records.bytes_free)
|
|
|
|
{
|
2014-11-20 16:56:26 +01:00
|
|
|
records.tail->next = palloc0(sizeof(StateFileChunk));
|
2005-06-18 00:32:51 +02:00
|
|
|
records.tail = records.tail->next;
|
|
|
|
records.tail->len = 0;
|
|
|
|
records.tail->next = NULL;
|
2014-11-20 16:56:26 +01:00
|
|
|
records.num_chunks++;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
records.bytes_free = Max(padlen, 512);
|
|
|
|
records.tail->data = palloc(records.bytes_free);
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(((char *) records.tail->data) + records.tail->len, data, len);
|
|
|
|
records.tail->len += padlen;
|
|
|
|
records.bytes_free -= padlen;
|
|
|
|
records.total_len += padlen;
|
|
|
|
}
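The append logic above can be tried outside the backend with a small standalone approximation. This is a sketch under stated assumptions, not the backend code: it uses calloc where the original uses palloc and palloc0 (so padding bytes here are zeroed, which the original does not guarantee), it assumes 8-byte maximum alignment, and every sketch_ and SKETCH_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 8-byte maximum alignment; the real value is platform-specific. */
#define SKETCH_MAXALIGN(len) (((uint32_t) (len) + 7u) & ~((uint32_t) 7u))
#define SKETCH_MIN_CHUNK 512u

typedef struct SketchChunk
{
	char	   *data;
	uint32_t	len;			/* bytes used, including padding */
	struct SketchChunk *next;
} SketchChunk;

typedef struct SketchChain
{
	SketchChunk *head;
	SketchChunk *tail;
	uint32_t	num_chunks;
	uint32_t	bytes_free;		/* free bytes left in tail chunk */
	uint32_t	total_len;		/* total padded bytes in chain */
} SketchChain;

/* Allocate an empty chunk with room for size bytes; abort on failure. */
static SketchChunk *
sketch_new_chunk(uint32_t size)
{
	SketchChunk *chunk = calloc(1, sizeof(SketchChunk));

	if (chunk == NULL)
		abort();
	chunk->data = calloc(1, size);
	if (chunk->data == NULL)
		abort();
	return chunk;
}

static void
sketch_chain_init(SketchChain *chain)
{
	chain->head = chain->tail = sketch_new_chunk(SKETCH_MIN_CHUNK);
	chain->num_chunks = 1;
	chain->bytes_free = SKETCH_MIN_CHUNK;
	chain->total_len = 0;
}

/* Copy len bytes into the chain, padding the stored size to MAXALIGN. */
static void
sketch_chain_append(SketchChain *chain, const void *data, uint32_t len)
{
	uint32_t	padlen = SKETCH_MAXALIGN(len);

	assert(data != NULL || len == 0);

	if (padlen > chain->bytes_free)
	{
		/* Start a new chunk big enough for this block, at least 512 bytes. */
		uint32_t	size = padlen > SKETCH_MIN_CHUNK ? padlen : SKETCH_MIN_CHUNK;

		chain->tail->next = sketch_new_chunk(size);
		chain->tail = chain->tail->next;
		chain->num_chunks++;
		chain->bytes_free = size;
	}

	if (len > 0)
		memcpy(chain->tail->data + chain->tail->len, data, len);
	chain->tail->len += padlen;
	chain->bytes_free -= padlen;
	chain->total_len += padlen;
}
```

Note that a block never spans two chunks: when the tail is too small, its remaining free bytes are simply left unused and do not count toward total_len.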
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Start preparing a state file.
|
|
|
|
*
|
|
|
|
* Initializes data structure and inserts the 2PC file header record.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
StartPrepare(GlobalTransaction gxact)
|
|
|
|
{
|
2012-06-10 21:20:04 +02:00
|
|
|
PGPROC *proc = &ProcGlobal->allProcs[gxact->pgprocno];
|
|
|
|
PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
|
2011-11-25 14:02:10 +01:00
|
|
|
TransactionId xid = pgxact->xid;
|
2005-06-18 00:32:51 +02:00
|
|
|
TwoPhaseFileHeader hdr;
|
|
|
|
TransactionId *children;
|
2008-11-19 11:34:52 +01:00
|
|
|
RelFileNode *commitrels;
|
|
|
|
RelFileNode *abortrels;
|
2009-12-19 02:32:45 +01:00
|
|
|
SharedInvalidationMessage *invalmsgs;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/* Initialize linked list */
|
2014-11-20 16:56:26 +01:00
|
|
|
records.head = palloc0(sizeof(StateFileChunk));
|
2005-06-18 00:32:51 +02:00
|
|
|
records.head->len = 0;
|
|
|
|
records.head->next = NULL;
|
|
|
|
|
|
|
|
records.bytes_free = Max(sizeof(TwoPhaseFileHeader), 512);
|
|
|
|
records.head->data = palloc(records.bytes_free);
|
|
|
|
|
|
|
|
records.tail = records.head;
|
2014-11-20 16:56:26 +01:00
|
|
|
records.num_chunks = 1;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
records.total_len = 0;
|
|
|
|
|
|
|
|
/* Create header */
|
|
|
|
hdr.magic = TWOPHASE_MAGIC;
|
|
|
|
hdr.total_len = 0; /* EndPrepare will fill this in */
|
|
|
|
hdr.xid = xid;
|
2011-11-25 14:02:10 +01:00
|
|
|
hdr.database = proc->databaseId;
|
2005-06-18 21:33:42 +02:00
|
|
|
hdr.prepared_at = gxact->prepared_at;
|
|
|
|
hdr.owner = gxact->owner;
|
2005-06-18 00:32:51 +02:00
|
|
|
hdr.nsubxacts = xactGetCommittedChildren(&children);
|
2010-08-13 22:10:54 +02:00
|
|
|
hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
|
|
|
|
hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
|
2009-12-19 02:32:45 +01:00
|
|
|
hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
|
|
|
|
&hdr.initfileinval);
|
2005-06-18 00:32:51 +02:00
|
|
|
StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
|
|
|
|
|
|
|
|
save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
|
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* Add the additional info about subxacts, deletable files and cache
|
|
|
|
* invalidation messages.
|
2009-12-19 02:32:45 +01:00
|
|
|
*/
|
2005-06-18 00:32:51 +02:00
|
|
|
if (hdr.nsubxacts > 0)
|
|
|
|
{
|
|
|
|
save_state_data(children, hdr.nsubxacts * sizeof(TransactionId));
|
|
|
|
/* While we have the child-xact data, stuff it in the gxact too */
|
|
|
|
GXactLoadSubxactData(gxact, hdr.nsubxacts, children);
|
|
|
|
}
|
|
|
|
if (hdr.ncommitrels > 0)
|
|
|
|
{
|
2008-11-19 11:34:52 +01:00
|
|
|
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileNode));
|
2005-06-18 00:32:51 +02:00
|
|
|
pfree(commitrels);
|
|
|
|
}
|
|
|
|
if (hdr.nabortrels > 0)
|
|
|
|
{
|
2008-11-19 11:34:52 +01:00
|
|
|
save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode));
|
2005-06-18 00:32:51 +02:00
|
|
|
pfree(abortrels);
|
|
|
|
}
|
2009-12-19 02:32:45 +01:00
|
|
|
if (hdr.ninvalmsgs > 0)
|
|
|
|
{
|
|
|
|
save_state_data(invalmsgs,
|
|
|
|
hdr.ninvalmsgs * sizeof(SharedInvalidationMessage));
|
|
|
|
pfree(invalmsgs);
|
|
|
|
}
|
2005-06-18 00:32:51 +02:00
|
|
|
}
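Since StartPrepare writes the header and then each optional array through save_state_data, a reader can locate every segment from the header counts alone, provided it applies the same MAXALIGN padding. The sketch below shows that offset arithmetic in isolation. It is an illustrative approximation: the sketch_ and Sketch names are hypothetical, the element sizes are passed in by the caller rather than taken from the real types, and 8-byte maximum alignment is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-byte maximum alignment; the real value is platform-specific. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))

typedef struct SketchCounts
{
	int32_t		nsubxacts;
	int32_t		ncommitrels;
	int32_t		nabortrels;
	int32_t		ninvalmsgs;
} SketchCounts;

/* Per-element sizes, supplied by the caller (hypothetical helper type). */
typedef struct SketchElemSizes
{
	size_t		xid;
	size_t		relfilenode;
	size_t		invalmsg;
} SketchElemSizes;

typedef struct SketchOffsets
{
	size_t		subxacts;
	size_t		commitrels;
	size_t		abortrels;
	size_t		invalmsgs;
	size_t		records;		/* first per-rmgr record header */
} SketchOffsets;

/*
 * Compute the byte offset of each segment. An empty array contributes
 * MAXALIGN(0) == 0 bytes, which matches a writer that skips empty arrays.
 */
static SketchOffsets
sketch_segment_offsets(size_t header_size, const SketchCounts *n,
					   const SketchElemSizes *sz)
{
	SketchOffsets off;
	size_t		pos = SKETCH_MAXALIGN(header_size);

	assert(n->nsubxacts >= 0 && n->ncommitrels >= 0);
	assert(n->nabortrels >= 0 && n->ninvalmsgs >= 0);

	off.subxacts = pos;
	pos += SKETCH_MAXALIGN((size_t) n->nsubxacts * sz->xid);
	off.commitrels = pos;
	pos += SKETCH_MAXALIGN((size_t) n->ncommitrels * sz->relfilenode);
	off.abortrels = pos;
	pos += SKETCH_MAXALIGN((size_t) n->nabortrels * sz->relfilenode);
	off.invalmsgs = pos;
	pos += SKETCH_MAXALIGN((size_t) n->ninvalmsgs * sz->invalmsg);
	off.records = pos;

	return off;
}
```

For example, with a 100-byte header, three 4-byte XIDs, one 12-byte commit rel, no abort rels and two 16-byte invalidation messages, the segments would start at offsets 104, 120, 136, 136 and 168 respectively.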
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Finish preparing state file.
|
|
|
|
*
|
|
|
|
* Calculates CRC and writes state file to WAL and in pg_twophase directory.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
EndPrepare(GlobalTransaction gxact)
|
|
|
|
{
|
2011-11-25 14:02:10 +01:00
|
|
|
PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
|
|
|
|
TransactionId xid = pgxact->xid;
|
2005-06-18 00:32:51 +02:00
|
|
|
TwoPhaseFileHeader *hdr;
|
2005-10-15 04:49:52 +02:00
|
|
|
char path[MAXPGPATH];
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
StateFileChunk *record;
|
2005-10-15 04:49:52 +02:00
|
|
|
pg_crc32 statefile_crc;
|
|
|
|
pg_crc32 bogus_crc;
|
|
|
|
int fd;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/* Add the end sentinel to the list of 2PC records */
|
|
|
|
RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0,
|
|
|
|
NULL, 0);
|
|
|
|
|
|
|
|
/* Go back and fill in total_len in the file header record */
|
|
|
|
hdr = (TwoPhaseFileHeader *) records.head->data;
|
|
|
|
Assert(hdr->magic == TWOPHASE_MAGIC);
|
|
|
|
hdr->total_len = records.total_len + sizeof(pg_crc32);
|
|
|
|
|
2008-05-19 20:16:26 +02:00
|
|
|
/*
|
|
|
|
* If the file size exceeds MaxAllocSize, we won't be able to read it in
|
|
|
|
* ReadTwoPhaseFile. Check for that now, rather than fail at commit time.
|
|
|
|
*/
|
|
|
|
if (hdr->total_len > MaxAllocSize)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
|
|
|
|
errmsg("two-phase state file maximum length exceeded")));
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/*
|
|
|
|
* Create the 2PC state file.
|
|
|
|
*/
|
|
|
|
TwoPhaseFilePath(path, xid);
|
|
|
|
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add a function called OpenTransientFile, which returns an
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().
This plugs a few rare fd leaks in error cases:
1. copy_file() - fixed by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
use OpenTransientFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
PathNameOpenFile.
In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.
The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.
2012-11-27 09:25:50 +01:00
|
|
|
fd = OpenTransientFile(path,
|
|
|
|
O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
|
|
|
|
S_IRUSR | S_IWUSR);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (fd < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not create two-phase state file \"%s\": %m",
|
2005-06-18 00:32:51 +02:00
|
|
|
path)));
|
|
|
|
|
|
|
|
/* Write data to file, and calculate CRC as we pass over it */
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, it seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-detecting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
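The init / accumulate / finalize shape used by INIT_CRC32C, COMP_CRC32C and FIN_CRC32C in the surrounding code can be illustrated with a small standalone routine. What follows is only a sketch: it is a plain bit-at-a-time CRC-32C using the reflected Castagnoli polynomial, not the implementation behind the PostgreSQL macros, and the names crc32c_init, crc32c_update and crc32c_final are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical counterpart of the "init" step: start with all bits set. */
uint32_t
crc32c_init(void)
{
    return 0xFFFFFFFFu;
}

/*
 * Hypothetical counterpart of the "accumulate" step: fold len bytes into
 * the running value, one bit at a time.  It may be called repeatedly, once
 * per chunk, which is how a list of records can be checksummed while it is
 * being written out.
 */
uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    }
    return crc;
}

/* Hypothetical counterpart of the "finalize" step: invert all bits. */
uint32_t
crc32c_final(uint32_t crc)
{
    return crc ^ 0xFFFFFFFFu;
}
```

Feeding the ASCII string "123456789" through these three steps, in one call or split across several crc32c_update calls, gives 0xE3069283, the commonly published CRC-32C check value.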
|
|
|
INIT_CRC32C(statefile_crc);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
for (record = records.head; record != NULL; record = record->next)
|
|
|
|
{
|
Switch to CRC-32C in WAL and other places.
2014-11-04 10:35:15 +01:00
|
|
|
COMP_CRC32C(statefile_crc, record->data, record->len);
|
2005-06-18 00:32:51 +02:00
|
|
|
if ((write(fd, record->data, record->len)) != record->len)
|
|
|
|
{
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not write two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Switch to CRC-32C in WAL and other places.
2014-11-04 10:35:15 +01:00
|
|
|
FIN_CRC32C(statefile_crc);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Write a deliberately bogus CRC to the state file; this is just paranoia
|
|
|
|
* to catch the case where four more bytes will run us out of disk space.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
2005-10-15 04:49:52 +02:00
|
|
|
bogus_crc = ~statefile_crc;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
if ((write(fd, &bogus_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
|
|
|
|
{
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not write two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Back up to prepare for rewriting the CRC */
|
|
|
|
if (lseek(fd, -((off_t) sizeof(pg_crc32)), SEEK_CUR) < 0)
|
|
|
|
{
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not seek in two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The state file isn't valid yet, because we haven't written the correct
|
|
|
|
* CRC yet. Before we do that, insert entry in WAL and flush it to disk.
|
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Between the time we have written the WAL entry and the time we write
|
|
|
|
* out the correct state file CRC, we have an inconsistency: the xact is
|
2005-10-15 04:49:52 +02:00
|
|
|
* prepared according to WAL but not according to our on-disk state. We
|
|
|
|
* use a critical section to force a PANIC if we are unable to complete
|
2014-05-06 18:12:18 +02:00
|
|
|
* the write --- then, WAL replay should repair the inconsistency. The
|
2005-06-19 22:00:39 +02:00
|
|
|
* odds of a PANIC actually occurring should be very tiny given that we
|
|
|
|
* were able to write the bogus CRC above.
|
2005-06-18 00:32:51 +02:00
|
|
|
*
|
2012-12-03 14:13:53 +01:00
|
|
|
* We have to set delayChkpt here, too; otherwise a checkpoint starting
|
2007-11-15 22:14:46 +01:00
|
|
|
* immediately after the WAL record is inserted could complete without
|
|
|
|
* fsync'ing our state file. (This is essentially the same kind of race
|
|
|
|
* condition as the COMMIT-to-clog-write case that RecordTransactionCommit
|
2012-12-03 14:13:53 +01:00
|
|
|
* uses delayChkpt for; see notes there.)
|
2005-06-19 22:00:39 +02:00
|
|
|
*
|
|
|
|
* We save the PREPARE record's location in the gxact for later use by
|
|
|
|
* CheckPointTwoPhase.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
Revamp the WAL record format.
2014-11-20 16:56:26 +01:00
|
|
|
XLogEnsureRecordSpace(0, records.num_chunks);
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
START_CRIT_SECTION();
|
|
|
|
|
2012-12-03 14:13:53 +01:00
|
|
|
MyPgXact->delayChkpt = true;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
Revamp the WAL record format.
2014-11-20 16:56:26 +01:00
|
|
|
XLogBeginInsert();
|
|
|
|
for (record = records.head; record != NULL; record = record->next)
|
|
|
|
XLogRegisterData(record->data, record->len);
|
|
|
|
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
|
2005-06-19 22:00:39 +02:00
|
|
|
XLogFlush(gxact->prepare_lsn);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/* If we crash now, we have prepared: WAL replay will fix things */
|
|
|
|
|
2005-06-19 22:00:39 +02:00
|
|
|
/* write correct CRC and close file */
|
2005-06-18 00:32:51 +02:00
|
|
|
if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
|
|
|
|
{
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not write two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
if (CloseTransientFile(fd) != 0)
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not close two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2005-06-19 22:00:39 +02:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Mark the prepared transaction as valid. As soon as xact.c marks
|
2012-06-10 21:20:04 +02:00
|
|
|
* MyPgXact as not running our XID (which it will do immediately after
|
|
|
|
* this function returns), others can commit/rollback the xact.
|
2005-06-19 22:00:39 +02:00
|
|
|
*
|
|
|
|
* NB: a side effect of this is to make a dummy ProcArray entry for the
|
2012-05-14 09:22:44 +02:00
|
|
|
* prepared XID. This must happen before we clear the XID from MyPgXact,
|
2005-06-19 22:00:39 +02:00
|
|
|
* else there is a window where the XID is not running according to
|
2007-09-21 18:32:19 +02:00
|
|
|
* TransactionIdIsInProgress, and onlookers would be entitled to assume
|
|
|
|
* the xact crashed. Instead we have a window where the same XID appears
|
2005-10-15 04:49:52 +02:00
|
|
|
* twice in ProcArray, which is OK.
|
2005-06-19 22:00:39 +02:00
|
|
|
*/
|
|
|
|
MarkAsPrepared(gxact);
|
|
|
|
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* Now we can mark ourselves as out of the commit critical section: a
|
|
|
|
* checkpoint starting after this will certainly see the gxact as a
|
2007-04-03 18:34:36 +02:00
|
|
|
* candidate for fsyncing.
|
2005-06-19 22:00:39 +02:00
|
|
|
*/
|
2012-12-03 14:13:53 +01:00
|
|
|
MyPgXact->delayChkpt = false;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transferring the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.
To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have a simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.
Backpatch to all supported versions.
2014-05-15 15:37:50 +02:00
|
|
|
/*
|
|
|
|
* Remember that we have this GlobalTransaction entry locked for us. If
|
|
|
|
* we crash after this point, it's too late to abort, but we must unlock
|
|
|
|
* it so that the prepared transaction can be committed or rolled back.
|
|
|
|
*/
|
|
|
|
MyLockedGxact = gxact;
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
END_CRIT_SECTION();
|
|
|
|
|
2011-03-06 23:49:16 +01:00
|
|
|
/*
|
|
|
|
* Wait for synchronous replication, if required.
|
|
|
|
*
|
|
|
|
* Note that at this stage we have marked the prepare, but still show as
|
|
|
|
* running in the procarray (twice!) and continue to hold locks.
|
|
|
|
*/
|
|
|
|
SyncRepWaitForLSN(gxact->prepare_lsn);
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
records.tail = records.head = NULL;
|
Revamp the WAL record format.
2014-11-20 16:56:26 +01:00
|
|
|
records.num_chunks = 0;
|
2005-06-18 00:32:51 +02:00
|
|
|
}
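The file-side ordering in EndPrepare (payload, deliberately wrong CRC, seek back, correct CRC last) can be tried out in isolation. The sketch below uses plain POSIX calls and assumes the caller already has the checksum. write_state_file_sketch is a hypothetical helper, not a PostgreSQL function. It leaves out the WAL insert and flush, the critical section and delayChkpt, and marks only with a comment where the WAL step would go.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write payload plus a trailing 4-byte checksum so
 * that the file only becomes valid in the very last step.
 * Returns 0 on success, -1 on any failure.
 */
int
write_state_file_sketch(const char *path, const void *payload, size_t len,
                        uint32_t crc)
{
    uint32_t    bogus_crc = ~crc;   /* always differs from crc */
    int         fd;

    /* O_EXCL: refuse to overwrite an existing state file. */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    /* 1. Payload, then a wrong CRC, so running out of space shows up now. */
    if (write(fd, payload, len) != (ssize_t) len ||
        write(fd, &bogus_crc, sizeof(bogus_crc)) != (ssize_t) sizeof(bogus_crc))
        goto fail;

    /* 2. Step back over the bogus CRC. */
    if (lseek(fd, -((off_t) sizeof(bogus_crc)), SEEK_CUR) < 0)
        goto fail;

    /* 3. (The durable log record would be inserted and flushed here.) */

    /* 4. Overwrite with the correct CRC; only now does the file validate. */
    if (write(fd, &crc, sizeof(crc)) != (ssize_t) sizeof(crc))
        goto fail;

    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Because the four bytes for the checksum are already allocated by step 1, the final write in step 4 only overwrites existing bytes. That matches the reasoning in the comments above for why a failure after the WAL record has been written should be very unlikely.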
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Register a 2PC record to be written to state file.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info,
|
|
|
|
const void *data, uint32 len)
|
|
|
|
{
|
|
|
|
TwoPhaseRecordOnDisk record;
|
|
|
|
|
|
|
|
record.rmid = rmid;
|
|
|
|
record.info = info;
|
|
|
|
record.len = len;
|
|
|
|
save_state_data(&record, sizeof(TwoPhaseRecordOnDisk));
|
|
|
|
if (len > 0)
|
|
|
|
save_state_data(data, len);
|
|
|
|
}
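RegisterTwoPhaseRecord produces a simple layout: a fixed header (rmid, info, len) followed by len bytes of payload, with a zero-length end-sentinel record closing the list. The standalone sketch below mimics that layout in a flat growable buffer. All names in it are hypothetical. It ignores any alignment padding the real save_state_data may apply and replaces the chunk list with a single realloc'd array.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the on-disk record header. */
typedef struct
{
    uint8_t     rmid;
    uint16_t    info;
    uint32_t    len;
} RecordHeaderSketch;

/* Hypothetical flat buffer standing in for the chunk list. */
typedef struct
{
    char       *data;
    size_t      used;
    size_t      cap;
} StateBufSketch;

static int
buf_append(StateBufSketch *b, const void *src, size_t len)
{
    if (b->used + len > b->cap)
    {
        size_t      ncap = b->cap ? b->cap : 64;
        char       *ndata;

        while (ncap < b->used + len)
            ncap *= 2;
        ndata = realloc(b->data, ncap);
        if (ndata == NULL)
            return -1;
        b->data = ndata;
        b->cap = ncap;
    }
    memcpy(b->data + b->used, src, len);
    b->used += len;
    return 0;
}

/* Append one header and, if len > 0, its payload. */
int
register_record_sketch(StateBufSketch *b, uint8_t rmid, uint16_t info,
                       const void *data, uint32_t len)
{
    RecordHeaderSketch hdr;

    memset(&hdr, 0, sizeof(hdr));   /* keep struct padding deterministic */
    hdr.rmid = rmid;
    hdr.info = info;
    hdr.len = len;
    if (buf_append(b, &hdr, sizeof(hdr)) != 0)
        return -1;
    if (len > 0 && buf_append(b, data, len) != 0)
        return -1;
    return 0;
}

/* Count the records before the end sentinel; -1 if the buffer is malformed. */
int
count_records_sketch(const StateBufSketch *b, uint8_t end_rmid)
{
    size_t      off = 0;
    int         n = 0;

    while (off + sizeof(RecordHeaderSketch) <= b->used)
    {
        RecordHeaderSketch hdr;

        memcpy(&hdr, b->data + off, sizeof(hdr));
        off += sizeof(hdr);
        if (hdr.rmid == end_rmid)
            return n;
        if (hdr.len > b->used - off)
            return -1;
        off += hdr.len;
        n++;
    }
    return -1;                      /* ran off the end without a sentinel */
}
```

A reader can walk such a buffer without knowing the payload formats, since each header says how many bytes to skip, and the sentinel tells it where to stop.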
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read and validate the state file for xid.
|
|
|
|
*
|
|
|
|
* If it looks OK (has a valid magic number and CRC), return the palloc'd
|
|
|
|
* contents of the file. Otherwise return NULL.
|
|
|
|
*/
|
|
|
|
static char *
|
Allow read only connections during recovery, known as Hot Standby.
2009-12-19 02:32:45 +01:00
|
|
|
ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
char *buf;
|
|
|
|
TwoPhaseFileHeader *hdr;
|
|
|
|
int fd;
|
2005-10-15 04:49:52 +02:00
|
|
|
struct stat stat;
|
2005-06-18 00:32:51 +02:00
|
|
|
uint32 crc_offset;
|
2005-10-15 04:49:52 +02:00
|
|
|
pg_crc32 calc_crc,
|
|
|
|
file_crc;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
TwoPhaseFilePath(path, xid);
|
|
|
|
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
2012-11-27 09:25:50 +01:00
|
|
|
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (fd < 0)
|
|
|
|
{
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
if (give_warnings)
|
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open two-phase state file \"%s\": %m",
|
|
|
|
path)));
|
2005-06-18 00:32:51 +02:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Check file length. We can determine a lower bound pretty easily. We
|
2008-05-19 20:16:26 +02:00
|
|
|
* set an upper bound to avoid palloc() failure on a corrupt file, though
|
|
|
|
* we can't guarantee that we won't get an out of memory error anyway,
|
|
|
|
* even on a valid file.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
if (fstat(fd, &stat))
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2009-12-19 02:32:45 +01:00
|
|
|
if (give_warnings)
|
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not stat two-phase state file \"%s\": %m",
|
|
|
|
path)));
|
2005-06-18 00:32:51 +02:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (stat.st_size < (MAXALIGN(sizeof(TwoPhaseFileHeader)) +
|
|
|
|
MAXALIGN(sizeof(TwoPhaseRecordOnDisk)) +
|
|
|
|
sizeof(pg_crc32)) ||
|
2008-05-19 20:16:26 +02:00
|
|
|
stat.st_size > MaxAllocSize)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
crc_offset = stat.st_size - sizeof(pg_crc32);
|
|
|
|
if (crc_offset != MAXALIGN(crc_offset))
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* OK, slurp in the file.
|
|
|
|
*/
|
|
|
|
buf = (char *) palloc(stat.st_size);
|
|
|
|
|
|
|
|
if (read(fd, buf, stat.st_size) != stat.st_size)
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2009-12-19 02:32:45 +01:00
|
|
|
if (give_warnings)
|
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not read two-phase state file \"%s\": %m",
|
|
|
|
path)));
|
2005-06-18 00:32:51 +02:00
|
|
|
pfree(buf);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
if (hdr->magic != TWOPHASE_MAGIC || hdr->total_len != stat.st_size)
|
|
|
|
{
|
|
|
|
pfree(buf);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, it seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-detecting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
INIT_CRC32C(calc_crc);
|
|
|
|
COMP_CRC32C(calc_crc, buf, crc_offset);
|
|
|
|
FIN_CRC32C(calc_crc);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
file_crc = *((pg_crc32 *) (buf + crc_offset));
|
|
|
|
|
2014-11-04 10:35:15 +01:00
|
|
|
if (!EQ_CRC32C(calc_crc, file_crc))
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
pfree(buf);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return buf;
|
|
|
|
}
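The checks that ReadTwoPhaseFile applies above (size bounds, an aligned CRC offset, header magic and length, then a trailing CRC-32C) can be sketched outside the server. This is a minimal illustration and not the PostgreSQL implementation: `SketchHeader`, `SKETCH_MAGIC`, `SKETCH_MAXALIGN`, `sketch_crc32c`, `sketch_seal` and `sketch_validate` are all hypothetical names, the alignment of 8 is an assumption, and the lower bound is simplified to header plus CRC (the real code also requires room for one on-disk record).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the PostgreSQL definitions. */
#define SKETCH_ALIGNOF 8
#define SKETCH_MAXALIGN(len) \
    (((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))
#define SKETCH_MAGIC 0x12345678u    /* arbitrary value for illustration */

typedef struct SketchHeader
{
    uint32_t magic;         /* format identifier */
    uint32_t total_len;     /* whole image length, including the CRC */
} SketchHeader;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
sketch_crc32c(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Write the header at the front and the CRC into the last four bytes. */
static void
sketch_seal(unsigned char *buf, size_t size)
{
    SketchHeader hdr = {SKETCH_MAGIC, (uint32_t) size};
    uint32_t crc;

    assert(size >= sizeof(hdr) + sizeof(crc));
    memcpy(buf, &hdr, sizeof(hdr));
    crc = sketch_crc32c(buf, size - sizeof(crc));
    memcpy(buf + size - sizeof(crc), &crc, sizeof(crc));
}

/* Return 1 if the image passes every check, 0 otherwise. */
static int
sketch_validate(const unsigned char *buf, size_t size, size_t max_size)
{
    SketchHeader hdr;
    uint32_t stored_crc;
    size_t crc_offset;

    /* Lower bound: header plus CRC. Upper bound: the caller's allocation cap. */
    if (size < SKETCH_MAXALIGN(sizeof(SketchHeader)) + sizeof(uint32_t) ||
        size > max_size)
        return 0;

    /* The CRC must start on an aligned boundary. */
    crc_offset = size - sizeof(uint32_t);
    if (crc_offset != SKETCH_MAXALIGN(crc_offset))
        return 0;

    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.magic != SKETCH_MAGIC || hdr.total_len != size)
        return 0;

    /* Compare the CRC over everything before the trailer with the stored one. */
    memcpy(&stored_crc, buf + crc_offset, sizeof(stored_crc));
    return sketch_crc32c(buf, crc_offset) == stored_crc;
}
```

A caller would fill a buffer, call `sketch_seal` on it, and later pass the same bytes to `sketch_validate`. Running the cheap size and alignment checks first means the CRC is only computed over a buffer whose length has already been judged plausible, which mirrors the order used in the function above.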
|
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
/*
|
|
|
|
* Confirms an xid is prepared, during recovery
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
StandbyTransactionIdIsPrepared(TransactionId xid)
|
|
|
|
{
|
|
|
|
char *buf;
|
|
|
|
TwoPhaseFileHeader *hdr;
|
|
|
|
bool result;
|
|
|
|
|
|
|
|
Assert(TransactionIdIsValid(xid));
|
|
|
|
|
2010-04-28 02:09:05 +02:00
|
|
|
if (max_prepared_xacts <= 0)
|
2010-07-06 21:19:02 +02:00
|
|
|
return false; /* nothing to do */
|
2010-04-28 02:09:05 +02:00
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
/* Read and validate file */
|
|
|
|
buf = ReadTwoPhaseFile(xid, false);
|
|
|
|
if (buf == NULL)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* Check header also */
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
result = TransactionIdEquals(hdr->xid, xid);
|
|
|
|
pfree(buf);
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
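The buffer returned by ReadTwoPhaseFile is a header followed by several variable-length arrays, each padded to a MAXALIGN boundary; FinishPreparedTransaction walks them in order by advancing a pointer. The sketch below computes the same offsets for given element counts. It is an illustration under assumptions: `LAYOUT_MAXALIGN`, `LayoutCounts`, `LayoutOffsets` and `layout_compute` are hypothetical names, the alignment of 8 is assumed, and the element sizes are passed in by the caller because the real struct sizes are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alignment helper; 8 is an assumed maximum alignment. */
#define LAYOUT_ALIGNOF 8
#define LAYOUT_MAXALIGN(len) \
    (((size_t) (len) + (LAYOUT_ALIGNOF - 1)) & ~((size_t) (LAYOUT_ALIGNOF - 1)))

/* Element counts, as a header would record them. */
typedef struct LayoutCounts
{
    uint32_t nsubxacts;
    uint32_t ncommitrels;
    uint32_t nabortrels;
    uint32_t ninvalmsgs;
} LayoutCounts;

/* Byte offset of each array from the start of the buffer. */
typedef struct LayoutOffsets
{
    size_t children;
    size_t commitrels;
    size_t abortrels;
    size_t invalmsgs;
    size_t end;             /* first byte after the last array */
} LayoutOffsets;

static LayoutOffsets
layout_compute(const LayoutCounts *counts, size_t header_size,
               size_t xid_size, size_t rel_size, size_t msg_size)
{
    LayoutOffsets out;
    size_t off = LAYOUT_MAXALIGN(header_size);

    assert(counts != NULL);

    /* Each array starts where the previous one ended, after padding. */
    out.children = off;
    off += LAYOUT_MAXALIGN(counts->nsubxacts * xid_size);
    out.commitrels = off;
    off += LAYOUT_MAXALIGN(counts->ncommitrels * rel_size);
    out.abortrels = off;
    off += LAYOUT_MAXALIGN(counts->nabortrels * rel_size);
    out.invalmsgs = off;
    off += LAYOUT_MAXALIGN(counts->ninvalmsgs * msg_size);
    out.end = off;
    return out;
}
```

An empty array contributes zero bytes, so two consecutive offsets can be equal; a reader must rely on the counts in the header, not on the offsets, to decide whether an array has any elements.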
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
|
|
|
|
*/
|
|
|
|
void
|
2005-06-18 21:33:42 +02:00
|
|
|
FinishPreparedTransaction(const char *gid, bool isCommit)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
GlobalTransaction gxact;
|
2011-11-25 14:02:10 +01:00
|
|
|
PGPROC *proc;
|
|
|
|
PGXACT *pgxact;
|
2005-06-18 00:32:51 +02:00
|
|
|
TransactionId xid;
|
2005-10-15 04:49:52 +02:00
|
|
|
char *buf;
|
|
|
|
char *bufptr;
|
2005-06-18 00:32:51 +02:00
|
|
|
TwoPhaseFileHeader *hdr;
|
2007-09-08 22:31:15 +02:00
|
|
|
TransactionId latestXid;
|
2005-06-18 00:32:51 +02:00
|
|
|
TransactionId *children;
|
2008-11-19 11:34:52 +01:00
|
|
|
RelFileNode *commitrels;
|
|
|
|
RelFileNode *abortrels;
|
|
|
|
RelFileNode *delrels;
|
|
|
|
int ndelrels;
|
2009-12-19 02:32:45 +01:00
|
|
|
SharedInvalidationMessage *invalmsgs;
|
2005-10-15 04:49:52 +02:00
|
|
|
int i;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Validate the GID, and lock the GXACT to ensure that two backends do not
|
|
|
|
* try to commit the same GID at once.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
gxact = LockGXact(gid, GetUserId());
|
2011-11-25 14:02:10 +01:00
|
|
|
proc = &ProcGlobal->allProcs[gxact->pgprocno];
|
|
|
|
pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
|
|
|
|
xid = pgxact->xid;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Read and validate the state file
|
|
|
|
*/
|
2009-12-19 02:32:45 +01:00
|
|
|
buf = ReadTwoPhaseFile(xid, true);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (buf == NULL)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_DATA_CORRUPTED),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("two-phase state file for transaction %u is corrupt",
|
2005-06-18 00:32:51 +02:00
|
|
|
xid)));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Disassemble the header area
|
|
|
|
*/
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
Assert(TransactionIdEquals(hdr->xid, xid));
|
|
|
|
bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
|
|
|
|
children = (TransactionId *) bufptr;
|
|
|
|
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
|
2008-11-19 11:34:52 +01:00
|
|
|
commitrels = (RelFileNode *) bufptr;
|
|
|
|
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
|
|
|
|
abortrels = (RelFileNode *) bufptr;
|
|
|
|
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
|
2009-12-19 02:32:45 +01:00
|
|
|
invalmsgs = (SharedInvalidationMessage *) bufptr;
|
|
|
|
bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2007-09-08 22:31:15 +02:00
|
|
|
/* compute latestXid among all children */
|
|
|
|
latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/*
|
|
|
|
* The order of operations here is critical: make the XLOG entry for
|
|
|
|
* commit or abort, then mark the transaction committed or aborted in
|
2005-10-15 04:49:52 +02:00
|
|
|
* pg_clog, then remove its PGPROC from the global ProcArray (which means
|
|
|
|
* TransactionIdIsInProgress will stop saying the prepared xact is in
|
|
|
|
* progress), then run the post-commit or post-abort callbacks. The
|
|
|
|
* callbacks will release the locks the transaction held.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
if (isCommit)
|
|
|
|
RecordTransactionCommitPrepared(xid,
|
|
|
|
hdr->nsubxacts, children,
|
2009-12-19 02:32:45 +01:00
|
|
|
hdr->ncommitrels, commitrels,
|
|
|
|
hdr->ninvalmsgs, invalmsgs,
|
|
|
|
hdr->initfileinval);
|
2005-06-18 00:32:51 +02:00
|
|
|
else
|
|
|
|
RecordTransactionAbortPrepared(xid,
|
|
|
|
hdr->nsubxacts, children,
|
|
|
|
hdr->nabortrels, abortrels);
|
|
|
|
|
2011-11-25 14:02:10 +01:00
|
|
|
ProcArrayRemove(proc, latestXid);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* In case we fail while running the callbacks, mark the gxact invalid so
|
Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transferring the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.
To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have a simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.
Backpatch to all supported versions.
2014-05-15 15:37:50 +02:00
|
|
|
* no one else will try to commit/rollback, and so it will be recycled
|
|
|
|
* if we fail after this point. It is still locked by our backend so it
|
|
|
|
* won't go away yet.
|
2005-06-19 22:00:39 +02:00
|
|
|
*
|
|
|
|
* (We assume it's safe to do this without taking TwoPhaseStateLock.)
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
gxact->valid = false;
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We have to remove any files that were supposed to be dropped. For
|
|
|
|
* consistency with the regular xact.c code paths, must do this before
|
|
|
|
* releasing locks, so do it before running the callbacks.
|
2005-06-18 07:21:09 +02:00
|
|
|
*
|
2005-06-18 00:32:51 +02:00
|
|
|
* NB: this code knows that we couldn't be dropping any temp rels ...
|
|
|
|
*/
|
|
|
|
if (isCommit)
|
|
|
|
{
|
2008-11-19 11:34:52 +01:00
|
|
|
delrels = commitrels;
|
|
|
|
ndelrels = hdr->ncommitrels;
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2008-11-19 11:34:52 +01:00
|
|
|
delrels = abortrels;
|
|
|
|
ndelrels = hdr->nabortrels;
|
|
|
|
}
|
|
|
|
for (i = 0; i < ndelrels; i++)
|
|
|
|
{
|
2010-08-13 22:10:54 +02:00
|
|
|
SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
|
2008-11-19 11:34:52 +01:00
|
|
|
|
2012-06-07 23:42:27 +02:00
|
|
|
smgrdounlink(srel, false);
|
2008-11-19 11:34:52 +01:00
|
|
|
smgrclose(srel);
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
/*
|
|
|
|
* Handle cache invalidation messages.
|
|
|
|
*
|
2010-02-26 03:01:40 +01:00
|
|
|
* Relcache init file invalidation requires processing both before and
|
|
|
|
* after we send the SI messages. See AtEOXact_Inval()
|
2009-12-19 02:32:45 +01:00
|
|
|
*/
|
|
|
|
if (hdr->initfileinval)
|
2011-08-16 19:11:54 +02:00
|
|
|
RelationCacheInitFilePreInvalidate();
|
2009-12-19 02:32:45 +01:00
|
|
|
SendSharedInvalidMessages(invalmsgs, hdr->ninvalmsgs);
|
|
|
|
if (hdr->initfileinval)
|
2011-08-16 19:11:54 +02:00
|
|
|
RelationCacheInitFilePostInvalidate();
|
2009-12-19 02:32:45 +01:00
|
|
|
|
2005-06-18 07:21:09 +02:00
|
|
|
/* And now do the callbacks */
|
|
|
|
if (isCommit)
|
|
|
|
ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
|
|
|
|
else
|
|
|
|
ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
|
|
|
|
|
Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with other
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-07 22:46:51 +01:00
|
|
|
PredicateLockTwoPhaseFinish(xid, isCommit);
|
|
|
|
|
2007-05-27 05:50:39 +02:00
|
|
|
/* Count the prepared xact as committed or aborted */
|
|
|
|
AtEOXact_PgStat(isCommit);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* And now we can clean up our mess.
|
|
|
|
*/
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
|
|
|
|
RemoveGXact(gxact);
|
2014-05-15 15:37:50 +02:00
|
|
|
MyLockedGxact = NULL;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
pfree(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scan a 2PC state file (already read into memory by ReadTwoPhaseFile)
|
|
|
|
* and call the indicated callbacks for each 2PC record.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ProcessRecords(char *bufptr, TransactionId xid,
|
|
|
|
const TwoPhaseCallback callbacks[])
|
|
|
|
{
|
|
|
|
for (;;)
|
|
|
|
{
|
|
|
|
TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
|
|
|
|
|
|
|
|
Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
|
|
|
|
if (record->rmid == TWOPHASE_RM_END_ID)
|
|
|
|
break;
|
|
|
|
|
|
|
|
bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
|
|
|
|
|
|
|
|
if (callbacks[record->rmid] != NULL)
|
2005-10-15 04:49:52 +02:00
|
|
|
callbacks[record->rmid] (xid, record->info,
|
|
|
|
(void *) bufptr, record->len);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
bufptr += MAXALIGN(record->len);
|
|
|
|
}
|
|
|
|
}
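The record walk in ProcessRecords can be sketched outside the backend as follows. This is a minimal illustration only: DemoRecordHeader, DEMO_MAXALIGN, the DEMO_RM_* constants, demo_append_record and demo_count_cb are hypothetical stand-ins, not the PostgreSQL definitions, and the header is copied out with memcpy rather than cast in place so that the sketch does not depend on buffer alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for MAXALIGN and the resource manager ids. */
#define DEMO_ALIGNOF 8
#define DEMO_MAXALIGN(len) \
	(((size_t) (len) + (DEMO_ALIGNOF - 1)) & ~((size_t) (DEMO_ALIGNOF - 1)))
#define DEMO_RM_END_ID 0
#define DEMO_RM_MAX_ID 2

/* Hypothetical on-disk record header: payload length, rmgr id, info bits. */
typedef struct DemoRecordHeader
{
	uint32_t	len;
	uint8_t		rmid;
	uint16_t	info;
} DemoRecordHeader;

typedef void (*DemoCallback) (uint32_t xid, uint16_t info,
							  void *recdata, uint32_t len);

/* Append one record (aligned header, then aligned payload); returns new offset. */
static size_t
demo_append_record(char *buf, size_t off, uint8_t rmid, uint16_t info,
				   const void *data, uint32_t len)
{
	DemoRecordHeader hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.len = len;
	hdr.rmid = rmid;
	hdr.info = info;
	memcpy(buf + off, &hdr, sizeof(hdr));
	off += DEMO_MAXALIGN(sizeof(hdr));
	if (len > 0)
		memcpy(buf + off, data, len);
	return off + DEMO_MAXALIGN(len);
}

/* Walk records until the END marker, dispatching on the rmgr id. */
static void
demo_process_records(char *bufptr, uint32_t xid, const DemoCallback callbacks[])
{
	for (;;)
	{
		DemoRecordHeader record;

		memcpy(&record, bufptr, sizeof(record));
		assert(record.rmid <= DEMO_RM_MAX_ID);
		if (record.rmid == DEMO_RM_END_ID)
			break;

		bufptr += DEMO_MAXALIGN(sizeof(DemoRecordHeader));

		/* A NULL slot means this rmgr has nothing to do for this phase. */
		if (callbacks[record.rmid] != NULL)
			callbacks[record.rmid] (xid, record.info, bufptr, record.len);

		bufptr += DEMO_MAXALIGN(record.len);
	}
}

/* Sample callback that only counts invocations and payload bytes. */
static int	demo_calls = 0;
static uint32_t demo_total_len = 0;

static void
demo_count_cb(uint32_t xid, uint16_t info, void *recdata, uint32_t len)
{
	(void) xid;
	(void) info;
	(void) recdata;
	demo_calls++;
	demo_total_len += len;
}
```

Records whose callback slot is NULL are skipped but still advance the pointer by the aligned payload length, which is what keeps the walk in step with the layout written at prepare time.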
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove the 2PC file for the specified XID.
|
|
|
|
*
|
|
|
|
* If giveWarning is false, do not complain about file-not-present;
|
|
|
|
* this is an expected case during WAL replay.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
char path[MAXPGPATH];
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
TwoPhaseFilePath(path, xid);
|
|
|
|
if (unlink(path))
|
|
|
|
if (errno != ENOENT || giveWarning)
|
|
|
|
ereport(WARNING,
|
|
|
|
(errcode_for_file_access(),
|
2007-11-15 22:14:46 +01:00
|
|
|
errmsg("could not remove two-phase state file \"%s\": %m",
|
|
|
|
path)));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
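The giveWarning logic above can be tried in isolation with plain POSIX calls. The sketch below is an assumption-laden stand-in: demo_remove_file is a hypothetical name, it writes to stderr instead of calling ereport, and it returns whether a warning was emitted purely so the behaviour can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Remove a file. A missing file (ENOENT) is reported only when
 * give_warning is true; any other failure is always reported.
 * Returns true if a warning was emitted (hypothetical convention).
 */
static bool
demo_remove_file(const char *path, bool give_warning)
{
	if (unlink(path) != 0)
	{
		if (errno != ENOENT || give_warning)
		{
			fprintf(stderr, "WARNING: could not remove file \"%s\": %s\n",
					path, strerror(errno));
			return true;
		}
	}
	return false;
}
```

With give_warning set to false, a missing file is silently accepted, which matches the WAL replay case described in the function's header comment.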
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Recreates a state file. This is used in WAL replay.
|
|
|
|
*
|
|
|
|
* Note: content and len don't include CRC.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
|
|
|
|
{
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
pg_crc32 statefile_crc;
|
|
|
|
int fd;
|
|
|
|
|
|
|
|
/* Recompute CRC */
|
Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.
Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.
The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 10:35:15 +01:00
|
|
|
INIT_CRC32C(statefile_crc);
|
|
|
|
COMP_CRC32C(statefile_crc, content, len);
|
|
|
|
FIN_CRC32C(statefile_crc);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
TwoPhaseFilePath(path, xid);
|
|
|
|
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add a function called OpenTransientFile, which returns an
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().
This plugs a few rare fd leaks in error cases:
1. copy_file() - fixed by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
use OpenTransientFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
PathNameOpenFile.
In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.
The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.
2012-11-27 09:25:50 +01:00
|
|
|
fd = OpenTransientFile(path,
|
|
|
|
O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
|
|
|
|
S_IRUSR | S_IWUSR);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (fd < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not recreate two-phase state file \"%s\": %m",
|
2005-06-18 00:32:51 +02:00
|
|
|
path)));
|
|
|
|
|
|
|
|
/* Write content and CRC */
|
|
|
|
if (write(fd, content, len) != len)
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not write two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
if (write(fd, &statefile_crc, sizeof(pg_crc32)) != sizeof(pg_crc32))
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not write two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
2005-06-19 22:00:39 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We must fsync the file because the end-of-replay checkpoint will not do
|
|
|
|
* so, there being no GXACT in shared memory yet to tell it to.
|
2005-06-19 22:00:39 +02:00
|
|
|
*/
|
2005-06-18 00:32:51 +02:00
|
|
|
if (pg_fsync(fd) != 0)
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not fsync two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
2012-11-27 09:25:50 +01:00
|
|
|
if (CloseTransientFile(fd) != 0)
|
2005-06-18 00:32:51 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not close two-phase state file: %m")));
|
2005-06-18 00:32:51 +02:00
|
|
|
}
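The write sequence in RecreateTwoPhaseFile (content, then CRC, then fsync, then close) can be sketched with plain POSIX calls. Everything named demo_* here is hypothetical: the sketch uses open/close rather than OpenTransientFile/CloseTransientFile, returns -1 instead of raising an error, and computes CRC-32C with a slow bitwise loop rather than the backend's INIT_CRC32C/COMP_CRC32C/FIN_CRC32C macros.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
demo_crc32c(const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;
	uint32_t	crc = 0xFFFFFFFFu;	/* initial value */

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;		/* final inversion */
}

/*
 * Write content followed by its CRC, then force the file to disk.
 * Returns 0 on success, -1 on failure. A short write is treated as a
 * failure even though it does not set errno.
 */
static int
demo_recreate_state_file(const char *path, const void *content, size_t len)
{
	uint32_t	crc = demo_crc32c(content, len);
	int			fd;

	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
	if (fd < 0)
		return -1;

	if (write(fd, content, len) != (ssize_t) len ||
		write(fd, &crc, sizeof(crc)) != (ssize_t) sizeof(crc) ||
		fsync(fd) != 0)
	{
		int			save_errno = errno;

		/* Close before reporting, preserving the original errno. */
		close(fd);
		errno = save_errno;
		return -1;
	}

	return close(fd);
}
```

The fsync comes before close because, as the comment in the function notes, nothing later in replay will flush this file on the caller's behalf.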
|
|
|
|
|
2005-06-19 22:00:39 +02:00
|
|
|
/*
|
|
|
|
* CheckPointTwoPhase -- handle 2PC component of checkpointing.
|
|
|
|
*
|
|
|
|
* We must fsync the state file of any GXACT that is valid and has a PREPARE
|
|
|
|
* LSN <= the checkpoint's redo horizon. (If the gxact isn't valid yet or
|
|
|
|
* has a later LSN, this checkpoint is not responsible for fsyncing it.)
|
|
|
|
*
|
|
|
|
* This is deliberately run as late as possible in the checkpoint sequence,
|
|
|
|
* because GXACTs ordinarily have short lifespans, and so it is quite
|
|
|
|
* possible that GXACTs that were valid at checkpoint start will no longer
|
|
|
|
* exist if we wait a little bit.
|
|
|
|
*
|
|
|
|
* If a GXACT remains valid across multiple checkpoints, it'll be fsynced
|
|
|
|
* each time. This is considered unusual enough that we don't bother to
|
|
|
|
* expend any extra code to avoid the redundant fsyncs. (They should be
|
|
|
|
* reasonably cheap anyway, since they won't cause I/O.)
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
CheckPointTwoPhase(XLogRecPtr redo_horizon)
|
|
|
|
{
|
|
|
|
TransactionId *xids;
|
|
|
|
int nxids;
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We don't want to hold the TwoPhaseStateLock while doing I/O, so we grab
|
|
|
|
* it just long enough to make a list of the XIDs that require fsyncing,
|
|
|
|
* and then do the I/O afterwards.
|
2005-06-19 22:00:39 +02:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* This approach creates a race condition: someone else could delete a
|
|
|
|
* GXACT between the time we release TwoPhaseStateLock and the time we try
|
2014-05-06 18:12:18 +02:00
|
|
|
* to open its state file. We handle this by special-casing ENOENT
|
2005-11-22 19:17:34 +01:00
|
|
|
* failures: if we see that, we verify that the GXACT is no longer valid,
|
|
|
|
* and if so ignore the failure.
|
2005-06-19 22:00:39 +02:00
|
|
|
*/
|
|
|
|
if (max_prepared_xacts <= 0)
|
|
|
|
return; /* nothing to do */
|
2008-08-01 15:16:09 +02:00
|
|
|
|
|
|
|
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_START();
|
|
|
|
|
2005-06-19 22:00:39 +02:00
|
|
|
xids = (TransactionId *) palloc(max_prepared_xacts * sizeof(TransactionId));
|
|
|
|
nxids = 0;
|
|
|
|
|
|
|
|
LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
|
|
|
|
|
|
|
|
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
|
2012-06-10 21:20:04 +02:00
|
|
|
PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
|
2005-06-19 22:00:39 +02:00
|
|
|
|
2005-10-15 04:49:52 +02:00
|
|
|
if (gxact->valid &&
|
2012-12-28 17:06:15 +01:00
|
|
|
gxact->prepare_lsn <= redo_horizon)
|
2011-11-25 14:02:10 +01:00
|
|
|
xids[nxids++] = pgxact->xid;
|
2005-06-19 22:00:39 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
LWLockRelease(TwoPhaseStateLock);
|
|
|
|
|
|
|
|
for (i = 0; i < nxids; i++)
|
|
|
|
{
|
|
|
|
TransactionId xid = xids[i];
|
2005-10-15 04:49:52 +02:00
|
|
|
int fd;
|
2005-06-19 22:00:39 +02:00
|
|
|
|
|
|
|
TwoPhaseFilePath(path, xid);
|
|
|
|
|
2012-11-27 09:25:50 +01:00
|
|
|
fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
|
2005-06-19 22:00:39 +02:00
|
|
|
if (fd < 0)
|
|
|
|
{
|
|
|
|
if (errno == ENOENT)
|
|
|
|
{
|
|
|
|
/* OK if gxact is no longer valid */
|
|
|
|
if (!TransactionIdIsPrepared(xid))
|
|
|
|
continue;
|
|
|
|
/* Restore errno in case it was changed */
|
|
|
|
errno = ENOENT;
|
|
|
|
}
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not open two-phase state file \"%s\": %m",
|
2005-06-19 22:00:39 +02:00
|
|
|
path)));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pg_fsync(fd) != 0)
|
|
|
|
{
|
2012-11-27 09:25:50 +01:00
|
|
|
CloseTransientFile(fd);
|
2005-06-19 22:00:39 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not fsync two-phase state file \"%s\": %m",
|
2005-06-19 22:00:39 +02:00
|
|
|
path)));
|
|
|
|
}
|
|
|
|
|
2012-11-27 09:25:50 +01:00
|
|
|
if (CloseTransientFile(fd) != 0)
|
2005-06-19 22:00:39 +02:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2006-10-06 19:14:01 +02:00
|
|
|
errmsg("could not close two-phase state file \"%s\": %m",
|
2005-06-19 22:00:39 +02:00
|
|
|
path)));
|
|
|
|
}
|
|
|
|
|
|
|
|
pfree(xids);
|
2008-08-01 15:16:09 +02:00
|
|
|
|
|
|
|
TRACE_POSTGRESQL_TWOPHASE_CHECKPOINT_DONE();
|
2005-06-19 22:00:39 +02:00
|
|
|
}
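The loop above opens each state file, flushes it, and closes it, treating a missing file as a separate case from other failures. A minimal sketch of that pattern follows, assuming plain POSIX `open`/`fsync`/`close` in place of `OpenTransientFile`, `pg_fsync`, and `CloseTransientFile`. The helper name `fsync_path` and its return codes are hypothetical and not part of PostgreSQL. Unlike the real loop, it reports a missing file to the caller instead of checking `TransactionIdIsPrepared`.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: flush one file to disk.
 * Returns 0 on success, 1 if the file does not exist, -1 on any other
 * error (errno describes the failure in that case).
 */
static int
fsync_path(const char *path)
{
	int			fd = open(path, O_RDWR);

	if (fd < 0)
		return (errno == ENOENT) ? 1 : -1;

	if (fsync(fd) != 0)
	{
		/* close() may overwrite errno, so keep the fsync failure code */
		int			save_errno = errno;

		close(fd);
		errno = save_errno;
		return -1;
	}

	if (close(fd) != 0)
		return -1;

	return 0;
}
```

A caller could skip the file when the result is 1 and the transaction is known to be finished, and report an error otherwise.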
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/*
|
|
|
|
* PrescanPreparedTransactions
|
|
|
|
*
|
|
|
|
* Scan the pg_twophase directory and determine the range of valid XIDs
|
|
|
|
* present. This is run during database startup, after we have completed
|
|
|
|
* reading WAL. ShmemVariableCache->nextXid has been set to one more than
|
|
|
|
* the highest XID for which evidence exists in WAL.
|
|
|
|
*
|
|
|
|
* We throw away any prepared xacts with main XID beyond nextXid --- if any
|
|
|
|
* are present, it suggests that the DBA has done a PITR recovery to an
|
2014-05-06 18:12:18 +02:00
|
|
|
* earlier point in time without cleaning out pg_twophase. We dare not
|
2005-06-18 00:32:51 +02:00
|
|
|
* try to recover such prepared xacts since they likely depend on database
|
|
|
|
* state that doesn't exist now.
|
|
|
|
*
|
|
|
|
* However, we will advance nextXid beyond any subxact XIDs belonging to
|
|
|
|
* valid prepared xacts. We need to do this since subxact commit doesn't
|
|
|
|
* write a WAL entry, and so there might be no evidence in WAL of those
|
|
|
|
* subxact XIDs.
|
|
|
|
*
|
|
|
|
* Our other responsibility is to determine and return the oldest valid XID
|
|
|
|
* among the prepared xacts (if none, return ShmemVariableCache->nextXid).
|
|
|
|
* This is needed to synchronize pg_subtrans startup properly.
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
*
|
|
|
|
* If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all
|
|
|
|
* top-level xids is stored in *xids_p. The number of entries in the array
|
|
|
|
* is returned in *nxids_p.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
TransactionId
|
2009-12-19 02:32:45 +01:00
|
|
|
PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
TransactionId origNextXid = ShmemVariableCache->nextXid;
|
|
|
|
TransactionId result = origNextXid;
|
2005-10-15 04:49:52 +02:00
|
|
|
DIR *cldir;
|
2005-06-18 00:32:51 +02:00
|
|
|
struct dirent *clde;
|
2009-12-19 02:32:45 +01:00
|
|
|
TransactionId *xids = NULL;
|
|
|
|
int nxids = 0;
|
|
|
|
int allocsize = 0;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2005-07-04 06:51:52 +02:00
|
|
|
cldir = AllocateDir(TWOPHASE_DIR);
|
|
|
|
while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
if (strlen(clde->d_name) == 8 &&
|
|
|
|
strspn(clde->d_name, "0123456789ABCDEF") == 8)
|
|
|
|
{
|
|
|
|
TransactionId xid;
|
2005-10-15 04:49:52 +02:00
|
|
|
char *buf;
|
|
|
|
TwoPhaseFileHeader *hdr;
|
2005-06-18 00:32:51 +02:00
|
|
|
TransactionId *subxids;
|
2005-10-15 04:49:52 +02:00
|
|
|
int i;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
|
|
|
|
|
|
|
|
/* Reject XID if too new */
|
|
|
|
if (TransactionIdFollowsOrEquals(xid, origNextXid))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
2006-10-06 19:14:01 +02:00
|
|
|
(errmsg("removing future two-phase state file \"%s\"",
|
2005-06-18 00:32:51 +02:00
|
|
|
clde->d_name)));
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: we can't check if already processed because clog
|
|
|
|
* subsystem isn't up yet.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* Read and validate file */
|
2009-12-19 02:32:45 +01:00
|
|
|
buf = ReadTwoPhaseFile(xid, true);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (buf == NULL)
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
2007-11-15 22:14:46 +01:00
|
|
|
(errmsg("removing corrupt two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
2005-06-18 00:32:51 +02:00
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Deconstruct header */
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
if (!TransactionIdEquals(hdr->xid, xid))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
2007-11-15 22:14:46 +01:00
|
|
|
(errmsg("removing corrupt two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
2005-06-18 00:32:51 +02:00
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
pfree(buf);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* OK, we think this file is valid. Incorporate xid into the
|
|
|
|
* running-minimum result.
|
|
|
|
*/
|
|
|
|
if (TransactionIdPrecedes(xid, result))
|
|
|
|
result = xid;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Examine subtransaction XIDs ... they should all follow main
|
|
|
|
* XID, and they may force us to advance nextXid.
|
Add locking around WAL-replay modification of shared-memory variables.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed. That
assumption fails in Hot Standby. While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot. Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock. We were following that coding rule in some
places but not all. This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.
In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.
The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it. The NEXTOID
fix will be back-patched separately.
2012-02-06 18:34:10 +01:00
|
|
|
*
|
|
|
|
* We don't expect anyone else to modify nextXid, hence we don't
|
2014-05-06 18:12:18 +02:00
|
|
|
* need to hold a lock while examining it. We still acquire the
|
2012-02-06 18:34:10 +01:00
|
|
|
* lock to modify it, though.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
|
|
|
subxids = (TransactionId *)
|
|
|
|
(buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
|
|
|
|
for (i = 0; i < hdr->nsubxacts; i++)
|
|
|
|
{
|
|
|
|
TransactionId subxid = subxids[i];
|
|
|
|
|
|
|
|
Assert(TransactionIdFollows(subxid, xid));
|
|
|
|
if (TransactionIdFollowsOrEquals(subxid,
|
|
|
|
ShmemVariableCache->nextXid))
|
|
|
|
{
|
2012-02-06 18:34:10 +01:00
|
|
|
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
|
2005-06-18 00:32:51 +02:00
|
|
|
ShmemVariableCache->nextXid = subxid;
|
|
|
|
TransactionIdAdvance(ShmemVariableCache->nextXid);
|
2012-02-06 18:34:10 +01:00
|
|
|
LWLockRelease(XidGenLock);
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
|
|
|
|
if (xids_p)
|
|
|
|
{
|
|
|
|
if (nxids == allocsize)
|
|
|
|
{
|
|
|
|
if (nxids == 0)
|
|
|
|
{
|
|
|
|
allocsize = 10;
|
|
|
|
xids = palloc(allocsize * sizeof(TransactionId));
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
allocsize = allocsize * 2;
|
|
|
|
xids = repalloc(xids, allocsize * sizeof(TransactionId));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
xids[nxids++] = xid;
|
|
|
|
}
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
pfree(buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
FreeDir(cldir);
|
|
|
|
|
2009-12-19 02:32:45 +01:00
|
|
|
if (xids_p)
|
|
|
|
{
|
|
|
|
*xids_p = xids;
|
|
|
|
*nxids_p = nxids;
|
|
|
|
}
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
return result;
|
|
|
|
}
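The directory scan above accepts only entries whose names are exactly eight uppercase hexadecimal digits, then converts the name to an XID with `strtoul` in base 16. A small standalone sketch of that filter is shown here. The function name `parse_state_file_name` is hypothetical, and `uint32_t` stands in for `TransactionId` as an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.  Returns 1 and stores the
 * parsed value in *xid_out when name looks like a two-phase state file
 * name (eight uppercase hex digits); returns 0 otherwise.
 */
static int
parse_state_file_name(const char *name, uint32_t *xid_out)
{
	if (strlen(name) != 8 ||
		strspn(name, "0123456789ABCDEF") != 8)
		return 0;

	*xid_out = (uint32_t) strtoul(name, NULL, 16);
	return 1;
}
```

Because the accepted character set contains only uppercase digits, a lowercase name such as `0000abcd` is ignored, as are `.` and `..`.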
|
|
|
|
|
2010-04-13 16:17:46 +02:00
|
|
|
/*
|
|
|
|
* StandbyRecoverPreparedTransactions
|
|
|
|
*
|
|
|
|
* Scan the pg_twophase directory and set up all the required information to
|
|
|
|
* allow standby queries to treat prepared transactions as still active.
|
|
|
|
* This is never called at the end of recovery - we use
|
|
|
|
* RecoverPreparedTransactions() at that point.
|
|
|
|
*
|
|
|
|
* Currently we simply call SubTransSetParent() for any subxids of prepared
|
|
|
|
* transactions. If overwriteOK is true, it's OK if some XIDs have already
|
|
|
|
* been marked in pg_subtrans.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
StandbyRecoverPreparedTransactions(bool overwriteOK)
|
|
|
|
{
|
|
|
|
DIR *cldir;
|
|
|
|
struct dirent *clde;
|
|
|
|
|
|
|
|
cldir = AllocateDir(TWOPHASE_DIR);
|
|
|
|
while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
|
|
|
|
{
|
|
|
|
if (strlen(clde->d_name) == 8 &&
|
|
|
|
strspn(clde->d_name, "0123456789ABCDEF") == 8)
|
|
|
|
{
|
|
|
|
TransactionId xid;
|
|
|
|
char *buf;
|
|
|
|
TwoPhaseFileHeader *hdr;
|
|
|
|
TransactionId *subxids;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
|
|
|
|
|
|
|
|
/* Already processed? */
|
|
|
|
if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("removing stale two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Read and validate file */
|
|
|
|
buf = ReadTwoPhaseFile(xid, true);
|
|
|
|
if (buf == NULL)
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("removing corrupt two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Deconstruct header */
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
if (!TransactionIdEquals(hdr->xid, xid))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
|
|
|
(errmsg("removing corrupt two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
pfree(buf);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Examine subtransaction XIDs ... they should all follow main
|
|
|
|
* XID.
|
|
|
|
*/
|
|
|
|
subxids = (TransactionId *)
|
|
|
|
(buf + MAXALIGN(sizeof(TwoPhaseFileHeader)));
|
|
|
|
for (i = 0; i < hdr->nsubxacts; i++)
|
|
|
|
{
|
|
|
|
TransactionId subxid = subxids[i];
|
|
|
|
|
|
|
|
Assert(TransactionIdFollows(subxid, xid));
|
|
|
|
SubTransSetParent(xid, subxid, overwriteOK);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
FreeDir(cldir);
|
|
|
|
}
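The functions above compare XIDs with `TransactionIdPrecedes`, `TransactionIdFollows`, and `TransactionIdFollowsOrEquals` rather than with `<` and `>`, because XIDs are 32-bit counters that wrap around. The sketch below shows the usual modulo-2^32 comparison for ordinary XIDs. It is a simplified illustration under assumptions, not PostgreSQL's code: the names `xid_precedes` and `xid_follows` are hypothetical, and the special handling that the real functions apply to permanent (non-normal) XIDs is omitted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a precedes b if the signed 32-bit difference is
 * negative.  This treats the XID space as a circle, so a value just
 * below the wraparound point precedes a small value just after it.
 * Assumes the usual two's-complement conversion to int32_t.
 */
static int
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Hypothetical sketch: a follows b if the signed difference is positive. */
static int
xid_follows(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) > 0;
}
```

With this ordering, the check that every subtransaction XID follows its parent's XID still holds when the counter wraps between the two assignments.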
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/*
|
|
|
|
* RecoverPreparedTransactions
|
|
|
|
*
|
|
|
|
* Scan the pg_twophase directory and reload shared-memory state for each
|
|
|
|
* prepared transaction (reacquire locks, etc). This is run during database
|
|
|
|
* startup.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
RecoverPreparedTransactions(void)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
char dir[MAXPGPATH];
|
|
|
|
DIR *cldir;
|
2005-06-18 00:32:51 +02:00
|
|
|
struct dirent *clde;
|
2009-12-19 02:32:45 +01:00
|
|
|
bool overwriteOK = false;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2005-07-04 06:51:52 +02:00
|
|
|
snprintf(dir, MAXPGPATH, "%s", TWOPHASE_DIR);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
cldir = AllocateDir(dir);
|
2005-06-19 23:34:03 +02:00
|
|
|
while ((clde = ReadDir(cldir, dir)) != NULL)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
if (strlen(clde->d_name) == 8 &&
|
|
|
|
strspn(clde->d_name, "0123456789ABCDEF") == 8)
|
|
|
|
{
|
|
|
|
TransactionId xid;
|
2005-10-15 04:49:52 +02:00
|
|
|
char *buf;
|
|
|
|
char *bufptr;
|
|
|
|
TwoPhaseFileHeader *hdr;
|
2005-06-18 00:32:51 +02:00
|
|
|
TransactionId *subxids;
|
2005-10-15 04:49:52 +02:00
|
|
|
GlobalTransaction gxact;
|
|
|
|
int i;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
xid = (TransactionId) strtoul(clde->d_name, NULL, 16);
|
|
|
|
|
|
|
|
/* Already processed? */
|
|
|
|
if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
2006-10-06 19:14:01 +02:00
|
|
|
(errmsg("removing stale two-phase state file \"%s\"",
|
2005-06-18 00:32:51 +02:00
|
|
|
clde->d_name)));
|
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Read and validate file */
|
2009-12-19 02:32:45 +01:00
|
|
|
buf = ReadTwoPhaseFile(xid, true);
|
2005-06-18 00:32:51 +02:00
|
|
|
if (buf == NULL)
|
|
|
|
{
|
|
|
|
ereport(WARNING,
|
2007-11-15 22:14:46 +01:00
|
|
|
(errmsg("removing corrupt two-phase state file \"%s\"",
|
|
|
|
clde->d_name)));
|
2005-06-18 00:32:51 +02:00
|
|
|
RemoveTwoPhaseFile(xid, true);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
ereport(LOG,
|
|
|
|
(errmsg("recovering prepared transaction %u", xid)));
|
|
|
|
|
|
|
|
/* Deconstruct header */
|
|
|
|
hdr = (TwoPhaseFileHeader *) buf;
|
|
|
|
Assert(TransactionIdEquals(hdr->xid, xid));
|
|
|
|
bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
|
|
|
|
subxids = (TransactionId *) bufptr;
|
|
|
|
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
|
2008-11-19 11:34:52 +01:00
|
|
|
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
|
|
|
|
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
|
|
|
|
|
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* It's possible that SubTransSetParent has been set before, if
|
|
|
|
* the prepared transaction generated xid assignment records. Test
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
* here must match one used in AssignTransactionId().
|
|
|
|
*/
|
Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-11 00:33:45 +01:00
			if (InHotStandby && (hdr->nsubxacts >= PGPROC_MAX_CACHED_SUBXIDS ||
								 XLogLogicalInfoActive()))
				overwriteOK = true;

			/*
			 * Reconstruct subtrans state for the transaction --- needed
			 * because pg_subtrans is not preserved over a restart. Note that
			 * we are linking all the subtransactions directly to the
			 * top-level XID; there may originally have been a more complex
			 * hierarchy, but there's no need to restore that exactly.
			 */
			for (i = 0; i < hdr->nsubxacts; i++)
				SubTransSetParent(subxids[i], xid, overwriteOK);

			/*
			 * Recreate its GXACT and dummy PGPROC
			 *
			 * Note: since we don't have the PREPARE record's WAL location at
			 * hand, we leave prepare_lsn zeroes. This means the GXACT will
			 * be fsync'd on every future checkpoint. We assume this
			 * situation is infrequent enough that the performance cost is
			 * negligible (especially since we know the state file has already
			 * been fsynced).
			 */
			gxact = MarkAsPreparing(xid, hdr->gid,
									hdr->prepared_at,
									hdr->owner, hdr->database);
			GXactLoadSubxactData(gxact, hdr->nsubxacts, subxids);
			MarkAsPrepared(gxact);

			/*
			 * Recover other state (notably locks) using resource managers
			 */
			ProcessRecords(bufptr, xid, twophase_recover_callbacks);

			/*
			 * Release locks held by the standby process after we process each
			 * prepared transaction. As a result, we don't need too many
			 * additional locks at any one time.
			 */
			if (InHotStandby)
				StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids);

			pfree(buf);
		}
	}
	FreeDir(cldir);
}
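
The recovery loop above relies on two small mechanics: a state file is only considered if its name is exactly eight upper-case hex digits (which is then parsed as the XID), and the resource-manager records are found by skipping the header and four variable-length arrays, each rounded up to the maximum alignment. The following is a minimal standalone sketch of both steps, not the PostgreSQL implementation. Every name prefixed `sketch_`/`SKETCH_` and the `SketchHeader` struct are hypothetical, the alignment of 8 is an assumption (the real `MAXALIGN` is platform-dependent), and element sizes are passed in as parameters rather than taken from the real types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed maximum alignment; the real value depends on the platform. */
#define SKETCH_ALIGNOF 8

/* Round len up to the next multiple of SKETCH_ALIGNOF. */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))

/* Hypothetical stand-in holding only the counts used by the pointer walk. */
typedef struct SketchHeader
{
	uint32_t nsubxacts;
	uint32_t ncommitrels;
	uint32_t nabortrels;
	uint32_t ninvalmsgs;
} SketchHeader;

/*
 * Accept a name only if it is exactly 8 characters, all of them upper-case
 * hex digits, and store the parsed value in *xid.  Mirrors the
 * strlen/strspn/strtoul test used in the loop above.
 */
bool
sketch_parse_state_file_name(const char *name, uint32_t *xid)
{
	if (strlen(name) != 8 || strspn(name, "0123456789ABCDEF") != 8)
		return false;
	*xid = (uint32_t) strtoul(name, NULL, 16);
	return true;
}

/*
 * Compute the byte offset at which the trailing records start: the aligned
 * header, followed by the subxid, commit-rel, abort-rel and invalidation
 * message arrays, each padded to SKETCH_ALIGNOF.
 */
size_t
sketch_records_offset(const SketchHeader *hdr, size_t hdr_size,
					  size_t xid_size, size_t rel_size, size_t msg_size)
{
	size_t off = SKETCH_MAXALIGN(hdr_size);

	off += SKETCH_MAXALIGN(hdr->nsubxacts * xid_size);
	off += SKETCH_MAXALIGN(hdr->ncommitrels * rel_size);
	off += SKETCH_MAXALIGN(hdr->nabortrels * rel_size);
	off += SKETCH_MAXALIGN(hdr->ninvalmsgs * msg_size);
	return off;
}
```

With the made-up sizes 20, 4, 12 and 16 bytes and counts 3, 1, 0 and 2, the padded pieces are 24 + 16 + 16 + 0 + 32, so the records would start at offset 88. Padding each array separately, rather than the total, is what lets every array be addressed through a properly aligned pointer.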

/*
 *	RecordTransactionCommitPrepared
 *
 * This is basically the same as RecordTransactionCommit: in particular,
 * we must set the delayChkpt flag to avoid a race condition.
 *
 * We know the transaction made at least one XLOG entry (its PREPARE),
 * so it is never possible to optimize out the commit record.
 */
static void
RecordTransactionCommitPrepared(TransactionId xid,
								int nchildren,
								TransactionId *children,
								int nrels,
								RelFileNode *rels,
								int ninvalmsgs,
								SharedInvalidationMessage *invalmsgs,
								bool initfileinval)
{
	xl_xact_commit_prepared xlrec;
	XLogRecPtr recptr;

	START_CRIT_SECTION();

	/* See notes in RecordTransactionCommit */
	MyPgXact->delayChkpt = true;

	/* Emit the XLOG commit record */
	xlrec.xid = xid;

	xlrec.crec.xinfo = initfileinval ? XACT_COMPLETION_UPDATE_RELCACHE_FILE : 0;

	xlrec.crec.dbId = MyDatabaseId;
	xlrec.crec.tsId = MyDatabaseTableSpace;

	xlrec.crec.xact_time = GetCurrentTimestamp();
	xlrec.crec.nrels = nrels;
	xlrec.crec.nsubxacts = nchildren;
	xlrec.crec.nmsgs = ninvalmsgs;

Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
XLogBeginInsert();
|
|
|
|
XLogRegisterData((char *) (&xlrec), MinSizeOfXactCommitPrepared);
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/* dump rels to delete */
|
|
|
|
if (nrels > 0)
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
XLogRegisterData((char *) rels, nrels * sizeof(RelFileNode));
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/* dump committed child Xids */
|
|
|
|
if (nchildren > 0)
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
XLogRegisterData((char *) children,
|
|
|
|
nchildren * sizeof(TransactionId));
|
|
|
|
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
/* dump cache invalidation messages */
|
|
|
|
if (ninvalmsgs > 0)
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
XLogRegisterData((char *) invalmsgs,
|
|
|
|
ninvalmsgs * sizeof(SharedInvalidationMessage));
|
2005-06-18 00:32:51 +02:00
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
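The commit message above says that data chunks are not aligned inside a WAL record, and that the reader copies them into MAXALIGNed buffers so redo routines can rely on aligned pointers. A minimal sketch of that idea follows. Everything prefixed `TOY_` or `toy_` is hypothetical and is not PostgreSQL code; the alignment of 8 bytes is an assumption (the real value is platform-dependent), and plain `malloc` stands in for whatever buffer management the reader actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed maximum alignment; illustrative only. */
#define TOY_ALIGNOF_MAX 8

/* Round a length up to the next multiple of TOY_ALIGNOF_MAX. */
#define TOY_MAXALIGN(len) \
	(((size_t) (len) + (TOY_ALIGNOF_MAX - 1)) & ~((size_t) (TOY_ALIGNOF_MAX - 1)))

/*
 * Copy len bytes that start at an arbitrary (possibly unaligned) offset
 * inside a raw record into a freshly allocated buffer.  malloc returns
 * storage suitably aligned for any fundamental type, so the caller can
 * cast the result to a struct pointer.  Returns NULL if allocation fails.
 */
void *
toy_copy_to_aligned(const unsigned char *record, size_t offset, size_t len)
{
	size_t		alloc_len = TOY_MAXALIGN(len);
	void	   *buf;

	if (alloc_len == 0)
		alloc_len = TOY_ALIGNOF_MAX;	/* avoid a zero-size allocation */

	buf = malloc(alloc_len);
	if (buf != NULL)
		memcpy(buf, record + offset, len);
	return buf;
}
```

The caller owns the returned buffer and must `free` it. The point of the copy is that the packed record can stay compact while consumers still get pointers they can dereference as structs.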
|
|
|
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_PREPARED);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2007-08-02 00:45:09 +02:00
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* We don't currently try to sleep before flush here ... nor is there any
|
|
|
|
* support for async commit of a prepared xact (the very idea is probably
|
|
|
|
* a contradiction)
|
2007-08-02 00:45:09 +02:00
|
|
|
*/
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/* Flush XLOG to disk */
|
|
|
|
XLogFlush(recptr);
|
|
|
|
|
|
|
|
/* Mark the transaction committed in pg_clog */
|
2008-10-20 21:18:18 +02:00
|
|
|
TransactionIdCommitTree(xid, nchildren, children);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2007-04-03 18:34:36 +02:00
|
|
|
/* Checkpoint can proceed now */
|
2012-12-03 14:13:53 +01:00
|
|
|
MyPgXact->delayChkpt = false;
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
END_CRIT_SECTION();
|
2011-03-06 23:49:16 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for synchronous replication, if required.
|
|
|
|
*
|
2011-04-10 17:42:00 +02:00
|
|
|
* Note that at this stage we have marked clog, but still show as running
|
|
|
|
* in the procarray and continue to hold locks.
|
2011-03-06 23:49:16 +01:00
|
|
|
*/
|
|
|
|
SyncRepWaitForLSN(recptr);
|
2005-06-18 00:32:51 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* RecordTransactionAbortPrepared
|
|
|
|
*
|
|
|
|
* This is basically the same as RecordTransactionAbort.
|
|
|
|
*
|
|
|
|
* We know the transaction made at least one XLOG entry (its PREPARE),
|
|
|
|
* so it is never possible to optimize out the abort record.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
RecordTransactionAbortPrepared(TransactionId xid,
|
|
|
|
int nchildren,
|
|
|
|
TransactionId *children,
|
|
|
|
int nrels,
|
2008-11-19 11:34:52 +01:00
|
|
|
RelFileNode *rels)
|
2005-06-18 00:32:51 +02:00
|
|
|
{
|
|
|
|
xl_xact_abort_prepared xlrec;
|
|
|
|
XLogRecPtr recptr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Catch the scenario where we aborted partway through
|
|
|
|
* RecordTransactionCommitPrepared ...
|
|
|
|
*/
|
|
|
|
if (TransactionIdDidCommit(xid))
|
|
|
|
elog(PANIC, "cannot abort transaction %u, it was already committed",
|
|
|
|
xid);
|
|
|
|
|
|
|
|
START_CRIT_SECTION();
|
|
|
|
|
|
|
|
/* Emit the XLOG abort record */
|
|
|
|
xlrec.xid = xid;
|
2007-04-30 23:01:53 +02:00
|
|
|
xlrec.arec.xact_time = GetCurrentTimestamp();
|
2005-06-18 00:32:51 +02:00
|
|
|
xlrec.arec.nrels = nrels;
|
|
|
|
xlrec.arec.nsubxacts = nchildren;
|
2014-11-20 16:56:26 +01:00
|
|
|
|
|
|
|
XLogBeginInsert();
|
|
|
|
XLogRegisterData((char *) (&xlrec), MinSizeOfXactAbortPrepared);
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/* dump rels to delete */
|
|
|
|
if (nrels > 0)
|
2014-11-20 16:56:26 +01:00
|
|
|
XLogRegisterData((char *) rels, nrels * sizeof(RelFileNode));
|
|
|
|
|
2005-06-18 00:32:51 +02:00
|
|
|
/* dump committed child Xids */
|
|
|
|
if (nchildren > 0)
|
2014-11-20 16:56:26 +01:00
|
|
|
XLogRegisterData((char *) children,
|
|
|
|
nchildren * sizeof(TransactionId));
|
2005-06-18 00:32:51 +02:00
|
|
|
|
2014-11-20 16:56:26 +01:00
|
|
|
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
/* Always flush, since we're about to remove the 2PC state file */
|
|
|
|
XLogFlush(recptr);
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Mark the transaction aborted in clog. This is not absolutely necessary
|
|
|
|
* but we may as well do it while we are here.
|
2005-06-18 00:32:51 +02:00
|
|
|
*/
|
2008-10-20 21:18:18 +02:00
|
|
|
TransactionIdAbortTree(xid, nchildren, children);
|
2005-06-18 00:32:51 +02:00
|
|
|
|
|
|
|
END_CRIT_SECTION();
|
2011-03-06 23:49:16 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Wait for synchronous replication, if required.
|
|
|
|
*
|
2011-04-10 17:42:00 +02:00
|
|
|
* Note that at this stage we have marked clog, but still show as running
|
|
|
|
* in the procarray and continue to hold locks.
|
2011-03-06 23:49:16 +01:00
|
|
|
*/
|
|
|
|
SyncRepWaitForLSN(recptr);
|
2005-06-18 00:32:51 +02:00
|
|
|
}
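RecordTransactionAbortPrepared above builds its record by registering a fixed-size header, then the relations to delete (if any), then the child XIDs (if any), before a single insert call. The sketch below models only that begin/register/register pattern with a flat byte buffer. All `Toy*` types and `toy_*` functions are hypothetical stand-ins, not the PostgreSQL API; the real XLogRegisterData and XLogInsert do considerably more (header construction, CRC, locking).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are PostgreSQL definitions. */
typedef uint32_t ToyXid;

typedef struct ToyRelFileNode
{
	uint32_t	spcNode;
	uint32_t	dbNode;
	uint32_t	relNode;
} ToyRelFileNode;

typedef struct ToyAbortHeader
{
	ToyXid		xid;
	int32_t		nrels;
	int32_t		nsubxacts;
} ToyAbortHeader;

typedef struct ToyRecord
{
	unsigned char data[256];
	size_t		len;
} ToyRecord;

/* Start a new record: discard anything registered earlier. */
void
toy_begin_insert(ToyRecord *rec)
{
	rec->len = 0;
}

/* Append one chunk of payload, in registration order. */
void
toy_register_data(ToyRecord *rec, const void *chunk, size_t chunk_len)
{
	assert(rec->len + chunk_len <= sizeof(rec->data));
	memcpy(rec->data + rec->len, chunk, chunk_len);
	rec->len += chunk_len;
}

/* Assemble header + rels + child XIDs; returns the payload length. */
size_t
toy_build_abort_record(ToyRecord *rec, ToyXid xid,
					   int nchildren, const ToyXid *children,
					   int nrels, const ToyRelFileNode *rels)
{
	ToyAbortHeader hdr;

	hdr.xid = xid;
	hdr.nrels = nrels;
	hdr.nsubxacts = nchildren;

	toy_begin_insert(rec);
	toy_register_data(rec, &hdr, sizeof(hdr));
	if (nrels > 0)
		toy_register_data(rec, rels, (size_t) nrels * sizeof(ToyRelFileNode));
	if (nchildren > 0)
		toy_register_data(rec, children, (size_t) nchildren * sizeof(ToyXid));
	return rec->len;
}
```

Because the header carries both counts, a reader of this toy layout can find the child XIDs by skipping `nrels` entries past the header, which is why the optional arrays can simply be omitted when their count is zero.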
|