1999-09-27 17:48:12 +02:00
|
|
|
/*
|
|
|
|
* xlog.h
|
|
|
|
*
|
2017-05-12 17:49:56 +02:00
|
|
|
* PostgreSQL write-ahead log manager
|
1999-09-27 17:48:12 +02:00
|
|
|
*
|
2021-01-02 19:06:25 +01:00
|
|
|
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
|
XLOG (and related) changes:
* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...
* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes. Polynomial borrowed from ECMA DLT1 standard.
* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.
* Change XID allocation to work more like OID allocation. (This is of
dubious necessity, but I think it's a good idea anyway.)
* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.
* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
* Checkpoint will now occur every CHECKPOINT_SEGMENTS log segments or
every CHECKPOINT_TIMEOUT seconds, whichever comes first. It is also
possible to force a checkpoint by sending SIGUSR1 to the postmaster
(undocumented feature...)
* Defend against kill -9 postmaster by storing shmem block's key and ID
in postmaster.pid lockfile, and checking at startup to ensure that no
processes are still connected to old shmem block (if it still exists).
* Switch backends to accept SIGQUIT rather than SIGUSR1 for emergency
stop, for symmetry with postmaster and xlog utilities. Clean up signal
handling in bootstrap.c so that xlog utilities launched by postmaster
will react to signals better.
* Standalone bootstrap now grabs lockfile in target directory, as added
insurance against running it in parallel with live postmaster.
2001-03-13 02:17:06 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
|
|
|
*
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/include/access/xlog.h
|
1999-09-27 17:48:12 +02:00
|
|
|
*/
|
|
|
|
#ifndef XLOG_H
|
|
|
|
#define XLOG_H
|
|
|
|
|
|
|
|
#include "access/rmgr.h"
|
2000-10-28 18:21:00 +02:00
|
|
|
#include "access/xlogdefs.h"
|
2014-11-06 12:52:08 +01:00
|
|
|
#include "access/xloginsert.h"
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
#include "access/xlogreader.h"
|
2011-09-09 19:23:41 +02:00
|
|
|
#include "datatype/timestamp.h"
|
2006-03-24 05:32:13 +01:00
|
|
|
#include "lib/stringinfo.h"
|
2015-05-12 15:29:10 +02:00
|
|
|
#include "nodes/pg_list.h"
|
|
|
|
#include "storage/fd.h"
|
1999-09-27 17:48:12 +02:00
|
|
|
|
2007-05-20 23:08:19 +02:00
|
|
|
|
2005-05-20 16:53:26 +02:00
|
|
|
/* Sync methods */
|
|
|
|
#define SYNC_METHOD_FSYNC 0
|
|
|
|
#define SYNC_METHOD_FDATASYNC 1
|
2008-05-12 10:35:05 +02:00
|
|
|
#define SYNC_METHOD_OPEN 2 /* for O_SYNC */
|
2005-05-20 16:53:26 +02:00
|
|
|
#define SYNC_METHOD_FSYNC_WRITETHROUGH 3
|
2008-05-12 10:35:05 +02:00
|
|
|
#define SYNC_METHOD_OPEN_DSYNC 4 /* for O_DSYNC */
|
2005-05-20 16:53:26 +02:00
|
|
|
extern int sync_method;
|
|
|
|
|
2010-01-16 01:04:41 +01:00
|
|
|
extern PGDLLIMPORT TimeLineID ThisTimeLineID; /* current TLI */
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
|
|
|
|
/*
|
2010-07-03 22:43:58 +02:00
|
|
|
* Prior to 8.4, all activity during recovery was carried out by the startup
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
* process. This local variable continues to be used in many parts of the
|
2010-07-03 22:43:58 +02:00
|
|
|
* code to indicate actions taken by RecoveryManagers. Other processes that
|
|
|
|
* potentially perform work during recovery should check RecoveryInProgress().
|
|
|
|
* See XLogCtl notes in xlog.c.
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
*/
|
XLOG (and related) changes:
* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...
* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes. Polynomial borrowed from ECMA DLT1 standard.
* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.
* Change XID allocation to work more like OID allocation. (This is of
dubious necessity, but I think it's a good idea anyway.)
* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.
* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
* Checkpoint will now occur every CHECKPOINT_SEGMENTS log segments or
every CHECKPOINT_TIMEOUT seconds, whichever comes first. It is also
possible to force a checkpoint by sending SIGUSR1 to the postmaster
(undocumented feature...)
* Defend against kill -9 postmaster by storing shmem block's key and ID
in postmaster.pid lockfile, and checking at startup to ensure that no
processes are still connected to old shmem block (if it still exists).
* Switch backends to accept SIGQUIT rather than SIGUSR1 for emergency
stop, for symmetry with postmaster and xlog utilities. Clean up signal
handling in bootstrap.c so that xlog utilities launched by postmaster
will react to signals better.
* Standalone bootstrap now grabs lockfile in target directory, as added
insurance against running it in parallel with live postmaster.
2001-03-13 02:17:06 +01:00
|
|
|
extern bool InRecovery;
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Like InRecovery, standbyState is only valid in the startup process.
|
2010-07-03 22:43:58 +02:00
|
|
|
* In all other processes it will have the value STANDBY_DISABLED (so
|
2017-08-16 06:22:32 +02:00
|
|
|
* InHotStandby will read as false).
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
*
|
|
|
|
* In DISABLED state, we're performing crash recovery or hot standby was
|
2010-08-13 01:24:54 +02:00
|
|
|
* disabled in postgresql.conf.
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
*
|
2010-07-03 22:43:58 +02:00
|
|
|
* In INITIALIZED state, we've run InitRecoveryTransactionEnvironment, but
|
|
|
|
* we haven't yet processed a RUNNING_XACTS or shutdown-checkpoint WAL record
|
2020-06-14 23:05:18 +02:00
|
|
|
* to initialize our primary-transaction tracking system.
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
*
|
|
|
|
* When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING
|
|
|
|
* state. The tracked information might still be incomplete, so we can't allow
|
|
|
|
* connections yet, but redo functions must update the in-memory state when
|
|
|
|
* appropriate.
|
|
|
|
*
|
|
|
|
* In SNAPSHOT_READY mode, we have full knowledge of transactions that are
|
2020-06-14 23:05:18 +02:00
|
|
|
* (or were) running on the primary at the current WAL location. Snapshots
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
* can be taken, and read-only queries can be run.
|
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
STANDBY_DISABLED,
|
|
|
|
STANDBY_INITIALIZED,
|
|
|
|
STANDBY_SNAPSHOT_PENDING,
|
|
|
|
STANDBY_SNAPSHOT_READY
|
|
|
|
} HotStandbyState;
|
2010-07-03 22:43:58 +02:00
|
|
|
|
Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 02:32:45 +01:00
|
|
|
extern HotStandbyState standbyState;
|
|
|
|
|
|
|
|
#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING)
|
|
|
|
|
2010-03-19 12:05:15 +01:00
|
|
|
/*
|
|
|
|
* Recovery target type.
|
2018-11-25 16:31:16 +01:00
|
|
|
* Only set during a Point in Time recovery, not when in standby mode.
|
2010-03-19 12:05:15 +01:00
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
RECOVERY_TARGET_UNSET,
|
|
|
|
RECOVERY_TARGET_XID,
|
2011-02-08 20:39:08 +01:00
|
|
|
RECOVERY_TARGET_TIME,
|
2014-01-25 16:34:04 +01:00
|
|
|
RECOVERY_TARGET_NAME,
|
2016-09-03 18:48:01 +02:00
|
|
|
RECOVERY_TARGET_LSN,
|
2014-01-25 16:34:04 +01:00
|
|
|
RECOVERY_TARGET_IMMEDIATE
|
2010-03-19 12:05:15 +01:00
|
|
|
} RecoveryTargetType;
|
|
|
|
|
2018-11-25 16:31:16 +01:00
|
|
|
/*
|
|
|
|
* Recovery target TimeLine goal
|
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
RECOVERY_TARGET_TIMELINE_CONTROLFILE,
|
|
|
|
RECOVERY_TARGET_TIMELINE_LATEST,
|
|
|
|
RECOVERY_TARGET_TIMELINE_NUMERIC
|
|
|
|
} RecoveryTargetTimeLineGoal;
|
|
|
|
|
2016-01-21 03:40:44 +01:00
|
|
|
extern XLogRecPtr ProcLastRecPtr;
|
2007-09-05 20:10:48 +02:00
|
|
|
extern XLogRecPtr XactLastRecEnd;
|
Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
|
|
|
extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd;
|
XLOG (and related) changes:
* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...
* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes. Polynomial borrowed from ECMA DLT1 standard.
* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.
* Change XID allocation to work more like OID allocation. (This is of
dubious necessity, but I think it's a good idea anyway.)
* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.
* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
* Checkpoint will now occur every CHECKPOINT_SEGMENTS log segments or
every CHECKPOINT_TIMEOUT seconds, whichever comes first. It is also
possible to force a checkpoint by sending SIGUSR1 to the postmaster
(undocumented feature...)
* Defend against kill -9 postmaster by storing shmem block's key and ID
in postmaster.pid lockfile, and checking at startup to ensure that no
processes are still connected to old shmem block (if it still exists).
* Switch backends to accept SIGQUIT rather than SIGUSR1 for emergency
stop, for symmetry with postmaster and xlog utilities. Clean up signal
handling in bootstrap.c so that xlog utilities launched by postmaster
will react to signals better.
* Standalone bootstrap now grabs lockfile in target directory, as added
insurance against running it in parallel with live postmaster.
2001-03-13 02:17:06 +01:00
|
|
|
|
2011-12-09 13:32:42 +01:00
|
|
|
extern bool reachedConsistency;
|
2011-12-02 09:49:54 +01:00
|
|
|
|
2001-03-16 06:44:33 +01:00
|
|
|
/* these variables are GUC parameters related to XLOG */
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
extern int wal_segment_size;
|
2017-04-05 00:00:01 +02:00
|
|
|
extern int min_wal_size_mb;
|
|
|
|
extern int max_wal_size_mb;
|
2020-07-20 06:30:18 +02:00
|
|
|
extern int wal_keep_size_mb;
|
2020-04-08 00:35:00 +02:00
|
|
|
extern int max_slot_wal_keep_size_mb;
|
2001-03-16 06:44:33 +01:00
|
|
|
extern int XLOGbuffers;
|
2010-04-29 23:36:19 +02:00
|
|
|
extern int XLogArchiveTimeout;
|
2015-02-23 12:55:17 +01:00
|
|
|
extern int wal_retrieve_retry_interval;
|
2004-07-19 04:47:16 +02:00
|
|
|
extern char *XLogArchiveCommand;
|
2010-04-29 23:36:19 +02:00
|
|
|
extern bool EnableHotStandby;
|
2012-01-25 19:02:04 +01:00
|
|
|
extern bool fullPageWrites;
|
2014-01-02 02:17:00 +01:00
|
|
|
extern bool wal_log_hints;
|
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
|
|
|
extern bool wal_compression;
|
2019-04-02 03:37:14 +02:00
|
|
|
extern bool wal_init_zero;
|
|
|
|
extern bool wal_recycle;
|
2017-02-08 21:45:30 +01:00
|
|
|
extern bool *wal_consistency_checking;
|
|
|
|
extern char *wal_consistency_checking_string;
|
2010-04-29 23:36:19 +02:00
|
|
|
extern bool log_checkpoints;
|
2018-11-25 16:31:16 +01:00
|
|
|
extern char *recoveryRestoreCommand;
|
|
|
|
extern char *recoveryEndCommand;
|
|
|
|
extern char *archiveCleanupCommand;
|
|
|
|
extern bool recoveryTargetInclusive;
|
|
|
|
extern int recoveryTargetAction;
|
|
|
|
extern int recovery_min_apply_delay;
|
|
|
|
extern char *PrimaryConnInfo;
|
|
|
|
extern char *PrimarySlotName;
|
2020-03-27 20:04:52 +01:00
|
|
|
extern bool wal_receiver_create_temp_slot;
|
Track total amounts of times spent writing and syncing WAL data to disk.
This commit adds new GUC track_wal_io_timing. When this is enabled,
the total amounts of time XLogWrite writes and issue_xlog_fsync syncs
WAL data to disk are counted in pg_stat_wal. This information would be
useful to check how much WAL write and sync affect the performance.
Enabling track_wal_io_timing will make the server query the operating
system for the current time every time WAL is written or synced,
which may cause significant overhead on some platforms. To avoid such
additional overhead in the server with track_io_timing enabled,
this commit introduces track_wal_io_timing as a separate parameter from
track_io_timing.
Note that WAL write and sync activity by walreceiver has not been tracked yet.
This commit makes the server also track the numbers of times XLogWrite
writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal,
regardless of the setting of track_wal_io_timing. This counters can be
used to calculate the WAL write and sync time per request, for example.
Bump PGSTAT_FILE_FORMAT_ID.
Bump catalog version.
Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
2021-03-09 08:52:06 +01:00
|
|
|
extern bool track_wal_io_timing;
|
2018-11-25 16:31:16 +01:00
|
|
|
|
|
|
|
/* indirectly set via GUC system */
|
|
|
|
extern TransactionId recoveryTargetXid;
|
2019-06-30 10:15:25 +02:00
|
|
|
extern char *recovery_target_time_string;
|
2019-03-16 10:13:03 +01:00
|
|
|
extern const char *recoveryTargetName;
|
2018-11-25 16:31:16 +01:00
|
|
|
extern XLogRecPtr recoveryTargetLSN;
|
|
|
|
extern RecoveryTargetType recoveryTarget;
|
|
|
|
extern char *PromoteTriggerFile;
|
|
|
|
extern RecoveryTargetTimeLineGoal recoveryTargetTimeLineGoal;
|
|
|
|
extern TimeLineID recoveryTargetTLIRequested;
|
|
|
|
extern TimeLineID recoveryTargetTLI;
|
2010-04-29 23:36:19 +02:00
|
|
|
|
2015-02-23 17:53:02 +01:00
|
|
|
extern int CheckPointSegments;
|
|
|
|
|
2018-11-25 16:31:16 +01:00
|
|
|
/* option set locally in startup process only when signal files exist */
|
|
|
|
extern bool StandbyModeRequested;
|
|
|
|
extern bool StandbyMode;
|
|
|
|
|
2015-05-15 17:55:24 +02:00
|
|
|
/* Archive modes */
|
|
|
|
typedef enum ArchiveMode
|
|
|
|
{
|
|
|
|
ARCHIVE_MODE_OFF = 0, /* disabled */
|
|
|
|
ARCHIVE_MODE_ON, /* enabled while server is running normally */
|
|
|
|
ARCHIVE_MODE_ALWAYS /* enabled always (even during recovery) */
|
|
|
|
} ArchiveMode;
|
|
|
|
extern int XLogArchiveMode;
|
|
|
|
|
Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 18:10:43 +02:00
|
|
|
/* WAL levels */
|
|
|
|
typedef enum WalLevel
|
|
|
|
{
|
|
|
|
WAL_LEVEL_MINIMAL = 0,
|
2016-03-01 02:01:54 +01:00
|
|
|
WAL_LEVEL_REPLICA,
|
Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-11 00:33:45 +01:00
|
|
|
WAL_LEVEL_LOGICAL
|
Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 18:10:43 +02:00
|
|
|
} WalLevel;
|
2016-05-25 04:48:47 +02:00
|
|
|
|
2020-04-24 01:48:28 +02:00
|
|
|
/* Recovery states */
|
|
|
|
typedef enum RecoveryState
|
|
|
|
{
|
|
|
|
RECOVERY_STATE_CRASH = 0, /* crash recovery */
|
|
|
|
RECOVERY_STATE_ARCHIVE, /* archive recovery */
|
|
|
|
RECOVERY_STATE_DONE /* currently in production */
|
|
|
|
} RecoveryState;
|
|
|
|
|
Be clear about whether a recovery pause has taken effect.
Previously, the code and documentation seem to have essentially
assumed than a call to pg_wal_replay_pause() would take place
immediately, but that's not the case, because we only check for a
pause in certain places. This means that a tool that uses this
function and then wants to do something else afterward that is
dependent on the pause having taken effect doesn't know how long it
needs to wait to be sure that no more WAL is going to be replayed.
To avoid that, add a new function pg_get_wal_replay_pause_state()
which returns either 'not paused', 'paused requested', or 'paused'.
After calling pg_wal_replay_pause() the status will immediate change
from 'not paused' to 'pause requested'; when the startup process
has noticed this, the status will change to 'pause'. For backward
compatibility, pg_is_wal_replay_paused() still exists and returns
the same thing as before: true if a pause has been requested,
whether or not it has taken effect yet; and false if not.
The documentation is updated to clarify.
To improve the changes that a pause request is quickly confirmed
effective, adjust things so that WaitForWALToBecomeAvailable will
swiftly reach a call to recoveryPausesHere() when a pause request
is made.
Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata,
Masahiko Sawada, and Bharath Rupireddy.
Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
2021-03-11 20:52:32 +01:00
|
|
|
/* Recovery pause states */
|
|
|
|
typedef enum RecoveryPauseState
|
|
|
|
{
|
|
|
|
RECOVERY_NOT_PAUSED, /* pause not requested */
|
|
|
|
RECOVERY_PAUSE_REQUESTED, /* pause requested, but not yet paused */
|
|
|
|
RECOVERY_PAUSED /* recovery is paused */
|
|
|
|
} RecoveryPauseState;
|
|
|
|
|
2016-05-25 04:48:47 +02:00
|
|
|
extern PGDLLIMPORT int wal_level;
|
2001-03-16 06:44:33 +01:00
|
|
|
|
2015-06-12 16:11:51 +02:00
|
|
|
/* Is WAL archiving enabled (always or only while server is running normally)? */
|
2015-05-15 17:55:24 +02:00
|
|
|
#define XLogArchivingActive() \
|
2016-03-01 02:01:54 +01:00
|
|
|
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode > ARCHIVE_MODE_OFF)
|
2015-06-12 16:11:51 +02:00
|
|
|
/* Is WAL archiving enabled always (even during recovery)? */
|
|
|
|
#define XLogArchivingAlways() \
|
2016-03-01 02:01:54 +01:00
|
|
|
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
|
2007-09-27 00:36:30 +02:00
|
|
|
#define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
|
2004-07-19 04:47:16 +02:00
|
|
|
|
2010-01-15 10:19:10 +01:00
|
|
|
/*
|
Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 18:10:43 +02:00
|
|
|
* Is WAL-logging necessary for archival or log-shipping, or can we skip
|
|
|
|
* WAL-logging if we fsync() the data before committing instead?
|
2010-01-15 10:19:10 +01:00
|
|
|
*/
|
2016-03-01 02:01:54 +01:00
|
|
|
#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
|
2010-01-15 10:19:10 +01:00
|
|
|
|
2013-12-13 15:26:14 +01:00
|
|
|
/*
|
|
|
|
* Is a full-page image needed for hint bit updates?
|
|
|
|
*
|
|
|
|
* Normally, we don't WAL-log hint bit updates, but if checksums are enabled,
|
|
|
|
* we have to protect them against torn page writes. When you only set
|
|
|
|
* individual bits on a page, it's still consistent no matter what combination
|
|
|
|
* of the bits make it to disk, but the checksum wouldn't match. Also WAL-log
|
2013-12-20 19:33:16 +01:00
|
|
|
* them if forced by wal_log_hints=on.
|
2013-12-13 15:26:14 +01:00
|
|
|
*/
|
2018-04-09 19:02:42 +02:00
|
|
|
#define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)
|
2013-12-13 15:26:14 +01:00
|
|
|
|
Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-11 00:33:45 +01:00
|
|
|
/* Do we need to WAL-log information required only for Hot Standby and logical replication? */
|
2016-03-01 02:01:54 +01:00
|
|
|
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_REPLICA)
|
2010-01-28 08:31:42 +01:00
|
|
|
|
Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-11 00:33:45 +01:00
|
|
|
/* Do we need to WAL-log information required only for logical replication? */
|
|
|
|
#define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
|
|
|
|
|
2004-01-06 18:26:23 +01:00
|
|
|
#ifdef WAL_DEBUG
|
|
|
|
extern bool XLOG_DEBUG;
|
|
|
|
#endif
|
2001-03-16 06:44:33 +01:00
|
|
|
|
2007-06-30 21:12:02 +02:00
|
|
|
/*
|
|
|
|
* OR-able request flag bits for checkpoints. The "cause" bits are used only
|
|
|
|
* for logging purposes. Note: the flags must be defined so that it's
|
|
|
|
* sensible to OR together request flags arising from different requestors.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* These directly affect the behavior of CreateCheckPoint and subsidiaries */
|
2007-06-28 02:02:40 +02:00
|
|
|
#define CHECKPOINT_IS_SHUTDOWN 0x0001 /* Checkpoint is for shutdown */
|
2009-06-26 22:29:04 +02:00
|
|
|
#define CHECKPOINT_END_OF_RECOVERY 0x0002 /* Like shutdown checkpoint, but
|
|
|
|
* issued at end of WAL recovery */
|
|
|
|
#define CHECKPOINT_IMMEDIATE 0x0004 /* Do it without delays */
|
|
|
|
#define CHECKPOINT_FORCE 0x0008 /* Force even if no activity */
|
2014-10-21 00:20:08 +02:00
|
|
|
#define CHECKPOINT_FLUSH_ALL 0x0010 /* Flush all pages, including those
|
|
|
|
* belonging to unlogged tables */
|
2007-06-30 21:12:02 +02:00
|
|
|
/* These are important to RequestCheckpoint */
|
2014-10-21 00:20:08 +02:00
|
|
|
#define CHECKPOINT_WAIT 0x0020 /* Wait for completion */
|
Make checkpoint requests more robust.
Commit 6f6a6d8b1 introduced a delay of up to 2 seconds if we're trying
to request a checkpoint but the checkpointer hasn't started yet (or,
much less likely, our kill() call fails). However buildfarm experience
shows that that's not quite enough for slow or heavily-loaded machines.
There's no good reason to assume that the checkpointer won't start
eventually, so we may as well make the timeout much longer, say 60 sec.
However, if the caller didn't say CHECKPOINT_WAIT, it seems like a bad
idea to be waiting at all, much less for as long as 60 sec. We can
remove the need for that, and make this whole thing more robust, by
adjusting the code so that the existence of a pending checkpoint
request is clear from the contents of shared memory, and making sure
that the checkpointer process will notice it at startup even if it did
not get a signal. In this way there's no need for a non-CHECKPOINT_WAIT
call to wait at all; if it can't send the signal, it can nonetheless
assume that the checkpointer will eventually service the request.
A potential downside of this change is that "kill -INT" on the checkpointer
process is no longer enough to trigger a checkpoint, should anyone be
relying on something so hacky. But there's no obvious reason to do it
like that rather than issuing a plain old CHECKPOINT command, so we'll
assume that nobody is. There doesn't seem to be a way to preserve this
undocumented quasi-feature without introducing race conditions.
Since a principal reason for messing with this is to prevent intermittent
buildfarm failures, back-patch to all supported branches.
Discussion: https://postgr.es/m/27830.1552752475@sss.pgh.pa.us
2019-03-19 17:49:27 +01:00
|
|
|
#define CHECKPOINT_REQUESTED 0x0040 /* Checkpoint request has been made */
|
2007-06-30 21:12:02 +02:00
|
|
|
/* These indicate the cause of a checkpoint request */
|
Make checkpoint requests more robust.
Commit 6f6a6d8b1 introduced a delay of up to 2 seconds if we're trying
to request a checkpoint but the checkpointer hasn't started yet (or,
much less likely, our kill() call fails). However buildfarm experience
shows that that's not quite enough for slow or heavily-loaded machines.
There's no good reason to assume that the checkpointer won't start
eventually, so we may as well make the timeout much longer, say 60 sec.
However, if the caller didn't say CHECKPOINT_WAIT, it seems like a bad
idea to be waiting at all, much less for as long as 60 sec. We can
remove the need for that, and make this whole thing more robust, by
adjusting the code so that the existence of a pending checkpoint
request is clear from the contents of shared memory, and making sure
that the checkpointer process will notice it at startup even if it did
not get a signal. In this way there's no need for a non-CHECKPOINT_WAIT
call to wait at all; if it can't send the signal, it can nonetheless
assume that the checkpointer will eventually service the request.
A potential downside of this change is that "kill -INT" on the checkpointer
process is no longer enough to trigger a checkpoint, should anyone be
relying on something so hacky. But there's no obvious reason to do it
like that rather than issuing a plain old CHECKPOINT command, so we'll
assume that nobody is. There doesn't seem to be a way to preserve this
undocumented quasi-feature without introducing race conditions.
Since a principal reason for messing with this is to prevent intermittent
buildfarm failures, back-patch to all supported branches.
Discussion: https://postgr.es/m/27830.1552752475@sss.pgh.pa.us
2019-03-19 17:49:27 +01:00
|
|
|
#define CHECKPOINT_CAUSE_XLOG 0x0080 /* XLOG consumption */
|
|
|
|
#define CHECKPOINT_CAUSE_TIME 0x0100 /* Elapsed time */
|
2007-06-30 21:12:02 +02:00
|
|
|
|
Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.
To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.
Use the new facility for checkpoints, archive timeout and standby
snapshots.
The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.
Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 20:31:50 +01:00
|
|
|
/*
|
|
|
|
* Flag bits for the record being inserted, set using XLogSetRecordFlags().
|
|
|
|
*/
|
|
|
|
#define XLOG_INCLUDE_ORIGIN 0x01 /* include the replication origin */
|
|
|
|
#define XLOG_MARK_UNIMPORTANT 0x02 /* record not important for durability */
|
2020-07-20 05:18:26 +02:00
|
|
|
#define XLOG_INCLUDE_XID 0x04 /* include XID of top-level xact */
|
Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.
To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.
Use the new facility for checkpoints, archive timeout and standby
snapshots.
The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.
Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 20:31:50 +01:00
|
|
|
|
|
|
|
|
2007-06-30 21:12:02 +02:00
|
|
|
/* Checkpoint statistics */
|
|
|
|
typedef struct CheckpointStatsData
|
|
|
|
{
|
|
|
|
TimestampTz ckpt_start_t; /* start of checkpoint */
|
|
|
|
TimestampTz ckpt_write_t; /* start of flushing buffers */
|
|
|
|
TimestampTz ckpt_sync_t; /* start of fsyncs */
|
|
|
|
TimestampTz ckpt_sync_end_t; /* end of fsyncs */
|
|
|
|
TimestampTz ckpt_end_t; /* end of checkpoint */
|
|
|
|
|
|
|
|
int ckpt_bufs_written; /* # of buffers written */
|
|
|
|
|
|
|
|
int ckpt_segs_added; /* # of new xlog segments created */
|
|
|
|
int ckpt_segs_removed; /* # of xlog segments deleted */
|
|
|
|
int ckpt_segs_recycled; /* # of xlog segments recycled */
|
2010-12-14 15:25:25 +01:00
|
|
|
|
|
|
|
int ckpt_sync_rels; /* # of relations synced */
|
|
|
|
uint64 ckpt_longest_sync; /* Longest sync for one relation */
|
|
|
|
uint64 ckpt_agg_sync_time; /* The sum of all the individual sync
|
|
|
|
* times, which is not necessarily the
|
|
|
|
* same as the total elapsed time for the
|
|
|
|
* entire sync phase. */
|
2007-06-30 21:12:02 +02:00
|
|
|
} CheckpointStatsData;
|
|
|
|
|
|
|
|
extern CheckpointStatsData CheckpointStats;
|
2007-06-28 02:02:40 +02:00
|
|
|
|
2020-04-08 00:35:00 +02:00
|
|
|
/*
|
|
|
|
* GetWALAvailability return codes
|
|
|
|
*/
|
|
|
|
typedef enum WALAvailability
|
|
|
|
{
|
|
|
|
WALAVAIL_INVALID_LSN, /* parameter error */
|
2020-06-24 20:23:39 +02:00
|
|
|
WALAVAIL_RESERVED, /* WAL segment is within max_wal_size */
|
|
|
|
WALAVAIL_EXTENDED, /* WAL segment is reserved by a slot or
|
2020-07-20 06:30:18 +02:00
|
|
|
* wal_keep_size */
|
2020-06-24 20:23:39 +02:00
|
|
|
WALAVAIL_UNRESERVED, /* no longer reserved, but not removed yet */
|
2020-04-08 00:35:00 +02:00
|
|
|
WALAVAIL_REMOVED /* WAL segment has been removed */
|
|
|
|
} WALAvailability;
|
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
struct XLogRecData;
|
|
|
|
|
Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.
To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.
Use the new facility for checkpoints, archive timeout and standby
snapshots.
The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.
Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 20:31:50 +01:00
|
|
|
extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
|
|
|
|
XLogRecPtr fpw_lsn,
|
2020-04-04 06:32:08 +02:00
|
|
|
uint8 flags,
|
2020-05-05 04:30:53 +02:00
|
|
|
int num_fpi);
|
1999-09-27 17:48:12 +02:00
|
|
|
extern void XLogFlush(XLogRecPtr RecPtr);
|
Reduce idle power consumption of walwriter and checkpointer processes.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption. It reverts to
normal as soon as it's done any successful flushes. It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.
Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for
reducing the server's power consumption when idle.
In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch. Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.
Peter Geoghegan, somewhat simplified by Tom
2012-05-09 02:03:26 +02:00
|
|
|
extern bool XLogBackgroundFlush(void);
|
2007-05-30 22:12:03 +02:00
|
|
|
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
|
2021-06-29 03:34:55 +02:00
|
|
|
extern int XLogFileInit(XLogSegNo segno, bool *use_existent);
|
2012-06-24 17:06:38 +02:00
|
|
|
extern int XLogFileOpen(XLogSegNo segno);
|
2010-01-15 10:19:10 +01:00
|
|
|
|
2013-01-03 18:51:00 +01:00
|
|
|
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
extern XLogSegNo XLogGetLastRemovedSegno(void);
|
2010-07-30 00:27:27 +02:00
|
|
|
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
|
2014-02-01 04:45:17 +01:00
|
|
|
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
|
2007-08-02 00:45:09 +02:00
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
extern void xlog_redo(XLogReaderState *record);
|
|
|
|
extern void xlog_desc(StringInfo buf, XLogReaderState *record);
|
2014-09-19 15:17:12 +02:00
|
|
|
extern const char *xlog_identify(uint8 info);
|
2000-11-21 22:16:06 +01:00
|
|
|
|
2012-06-24 17:06:38 +02:00
|
|
|
extern void issue_xlog_fsync(int fd, XLogSegNo segno);
|
2010-01-15 10:19:10 +01:00
|
|
|
|
Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.
This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.
The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.
Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.
log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.
This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.
Simon Riggs, with lots of modifications by me.
2009-02-18 16:58:41 +01:00
|
|
|
extern bool RecoveryInProgress(void);
|
2020-04-24 01:48:28 +02:00
|
|
|
extern RecoveryState GetRecoveryState(void);
|
2011-02-16 20:29:37 +01:00
|
|
|
extern bool HotStandbyActive(void);
|
Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted. (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.) In Hot Standby,
the WAL replay process must likewise lock every leaf page. There were
several bugs in the code for that:
* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error. This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages". To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.
* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about). Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.
* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page. However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.
The first two of these bugs were diagnosed by Andres Freund, the other one
by me. Fixes based on ideas from Heikki Linnakangas and myself.
This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-14 23:34:47 +01:00
|
|
|
extern bool HotStandbyActiveInReplay(void);
|
2009-06-26 22:29:04 +02:00
|
|
|
extern bool XLogInsertAllowed(void);
|
2010-07-03 22:43:58 +02:00
|
|
|
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
|
Follow TLI of last replayed record, not recovery target TLI, in walsenders.
Most of the time, the last replayed record comes from the recovery target
timeline, but there is a corner case where it makes a difference. When
the startup process scans for a new timeline, and decides to change recovery
target timeline, there is a window where the recovery target TLI has already
been bumped, but there are no WAL segments from the new timeline in pg_xlog
yet. For example, if we have just replayed up to point 0/30002D8, on
timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog
that contains the WAL up to that point. When recovery switches recovery
target timeline to 2, a walsender can immediately try to read WAL from
0/30002D8, from timeline 2, so it will try to open WAL file
000000020000000000000003. However, that doesn't exist yet - the startup
process hasn't copied that file from the archive yet nor has the walreceiver
streamed it yet, so walsender fails with error "requested WAL segment
000000020000000000000003 has already been removed". That's harmless, in that
the standby will try to reconnect later and by that time the segment is
already created, but error messages that should be ignored are not good.
To fix that, have walsender track the TLI of the last replayed record,
instead of the recovery target timeline. That way walsender will not try to
read anything from timeline 2, until the WAL segment has been created and at
least one record has been replayed from it. The recovery target timeline is
now xlog.c's internal affair, it doesn't need to be exposed in shared memory
anymore.
This fixes the error reported by Thom Brown. depesz the same error message,
but I'm not sure if this fixes his scenario.
2012-12-20 13:23:31 +01:00
|
|
|
extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
|
2012-01-11 10:00:53 +01:00
|
|
|
extern XLogRecPtr GetXLogInsertRecPtr(void);
|
2011-11-04 10:37:17 +01:00
|
|
|
extern XLogRecPtr GetXLogWriteRecPtr(void);
|
Be clear about whether a recovery pause has taken effect.
Previously, the code and documentation seem to have essentially
assumed than a call to pg_wal_replay_pause() would take place
immediately, but that's not the case, because we only check for a
pause in certain places. This means that a tool that uses this
function and then wants to do something else afterward that is
dependent on the pause having taken effect doesn't know how long it
needs to wait to be sure that no more WAL is going to be replayed.
To avoid that, add a new function pg_get_wal_replay_pause_state()
which returns either 'not paused', 'paused requested', or 'paused'.
After calling pg_wal_replay_pause() the status will immediate change
from 'not paused' to 'pause requested'; when the startup process
has noticed this, the status will change to 'pause'. For backward
compatibility, pg_is_wal_replay_paused() still exists and returns
the same thing as before: true if a pause has been requested,
whether or not it has taken effect yet; and false if not.
The documentation is updated to clarify.
To improve the changes that a pause request is quickly confirmed
effective, adjust things so that WaitForWALToBecomeAvailable will
swiftly reach a call to recoveryPausesHere() when a pause request
is made.
Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata,
Masahiko Sawada, and Bharath Rupireddy.
Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
2021-03-11 20:52:32 +01:00
|
|
|
extern RecoveryPauseState GetRecoveryPauseState(void);
|
2011-11-04 10:37:17 +01:00
|
|
|
extern void SetRecoveryPause(bool recoveryPause);
|
|
|
|
extern TimestampTz GetLatestXTime(void);
|
2011-12-31 14:30:26 +01:00
|
|
|
extern TimestampTz GetCurrentChunkReplayStartTime(void);
|
Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.
This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.
The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.
Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.
log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.
This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.
Simon Riggs, with lots of modifications by me.
2009-02-18 16:58:41 +01:00
|
|
|
|
2000-11-21 22:16:06 +01:00
|
|
|
extern void UpdateControlFile(void);
|
2010-01-15 10:19:10 +01:00
|
|
|
extern uint64 GetSystemIdentifier(void);
|
Support SCRAM-SHA-256 authentication (RFC 5802 and 7677).
This introduces a new generic SASL authentication method, similar to the
GSS and SSPI methods. The server first tells the client which SASL
authentication mechanism to use, and then the mechanism-specific SASL
messages are exchanged in AuthenticationSASLcontinue and PasswordMessage
messages. Only SCRAM-SHA-256 is supported at the moment, but this allows
adding more SASL mechanisms in the future, without changing the overall
protocol.
Support for channel binding, aka SCRAM-SHA-256-PLUS is left for later.
The SASLPrep algorithm, for pre-processing the password, is not yet
implemented. That could cause trouble, if you use a password with
non-ASCII characters, and a client library that does implement SASLprep.
That will hopefully be added later.
Authorization identities, as specified in the SCRAM-SHA-256 specification,
are ignored. SET SESSION AUTHORIZATION provides more or less the same
functionality, anyway.
If a user doesn't exist, perform a "mock" authentication, by constructing
an authentic-looking challenge on the fly. The challenge is derived from
a new system-wide random value, "mock authentication nonce", which is
created at initdb, and stored in the control file. We go through these
motions, in order to not give away the information on whether the user
exists, to unauthenticated users.
Bumps PG_CONTROL_VERSION, because of the new field in control file.
Patch by Michael Paquier and Heikki Linnakangas, reviewed at different
stages by Robert Haas, Stephen Frost, David Steele, Aleksander Alekseev,
and many others.
Discussion: https://www.postgresql.org/message-id/CAB7nPqRbR3GmFYdedCAhzukfKrgBLTLtMvENOmPrVWREsZkF8g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAB7nPqSMXU35g%3DW9X74HVeQp0uvgJxvYOuA4A-A3M%2B0wfEBv-w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/55192AFE.6080106@iki.fi
2017-03-07 13:25:40 +01:00
|
|
|
extern char *GetMockAuthenticationNonce(void);
|
2018-04-09 19:02:42 +02:00
|
|
|
extern bool DataChecksumsEnabled(void);
|
2013-02-11 21:50:15 +01:00
|
|
|
extern XLogRecPtr GetFakeLSNForUnloggedRel(void);
|
2005-08-21 01:26:37 +02:00
|
|
|
extern Size XLOGShmemSize(void);
|
2000-11-21 22:16:06 +01:00
|
|
|
extern void XLOGShmemInit(void);
|
|
|
|
extern void BootStrapXLOG(void);
|
2017-09-17 10:00:39 +02:00
|
|
|
extern void LocalProcessControlFile(bool reset);
|
2000-11-21 22:16:06 +01:00
|
|
|
extern void StartupXLOG(void);
|
2003-12-12 19:45:10 +01:00
|
|
|
extern void ShutdownXLOG(int code, Datum arg);
|
2004-05-27 19:12:57 +02:00
|
|
|
extern void InitXLOGAccess(void);
|
2007-06-28 02:02:40 +02:00
|
|
|
extern void CreateCheckPoint(int flags);
|
Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.
This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.
The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.
Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.
log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.
This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.
Simon Riggs, with lots of modifications by me.
2009-02-18 16:58:41 +01:00
|
|
|
extern bool CreateRestartPoint(int flags);
|
2020-06-24 20:15:17 +02:00
|
|
|
extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN);
|
2020-04-08 00:35:00 +02:00
|
|
|
extern XLogRecPtr CalculateMaxmumSafeLSN(void);
|
XLOG (and related) changes:
* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...
* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes. Polynomial borrowed from ECMA DLT1 standard.
* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.
* Change XID allocation to work more like OID allocation. (This is of
dubious necessity, but I think it's a good idea anyway.)
* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.
* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
* Checkpoint will now occur every CHECKPOINT_SEGMENTS log segments or
every CHECKPOINT_TIMEOUT seconds, whichever comes first. It is also
possible to force a checkpoint by sending SIGUSR1 to the postmaster
(undocumented feature...)
* Defend against kill -9 postmaster by storing shmem block's key and ID
in postmaster.pid lockfile, and checking at startup to ensure that no
processes are still connected to old shmem block (if it still exists).
* Switch backends to accept SIGQUIT rather than SIGUSR1 for emergency
stop, for symmetry with postmaster and xlog utilities. Clean up signal
handling in bootstrap.c so that xlog utilities launched by postmaster
will react to signals better.
* Standalone bootstrap now grabs lockfile in target directory, as added
insurance against running it in parallel with live postmaster.
2001-03-13 02:17:06 +01:00
|
|
|
extern void XLogPutNextOid(Oid nextOid);
|
2011-02-08 20:39:08 +01:00
|
|
|
extern XLogRecPtr XLogRestorePoint(const char *rpName);
|
2012-01-25 19:02:04 +01:00
|
|
|
extern void UpdateFullPageWrites(void);
|
2014-11-06 12:52:08 +01:00
|
|
|
extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p);
|
2002-03-15 20:20:36 +01:00
|
|
|
extern XLogRecPtr GetRedoRecPtr(void);
|
2007-06-28 02:02:40 +02:00
|
|
|
extern XLogRecPtr GetInsertRecPtr(void);
|
2010-06-17 18:41:25 +02:00
|
|
|
extern XLogRecPtr GetFlushRecPtr(void);
|
Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.
To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.
Use the new facility for checkpoints, archive timeout and standby
snapshots.
The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.
Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 20:31:50 +01:00
|
|
|
extern XLogRecPtr GetLastImportantRecPtr(void);
|
2015-09-09 15:51:44 +02:00
|
|
|
extern void RemovePromoteSignalFiles(void);
|
2000-11-21 22:16:06 +01:00
|
|
|
|
2020-03-24 04:46:48 +01:00
|
|
|
extern bool PromoteIsTriggered(void);
|
2011-02-16 03:28:48 +01:00
|
|
|
extern bool CheckPromoteSignal(void);
|
2010-09-15 12:35:05 +02:00
|
|
|
extern void WakeupRecovery(void);
|
2012-05-09 05:05:58 +02:00
|
|
|
extern void SetWalWriterSleeping(bool sleeping);
|
Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.
This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.
The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.
Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.
log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.
This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.
Simon Riggs, with lots of modifications by me.
2009-02-18 16:58:41 +01:00
|
|
|
|
2020-03-27 23:43:41 +01:00
|
|
|
extern void StartupRequestWalReceiverRestart(void);
|
2016-03-30 03:16:12 +02:00
|
|
|
extern void XLogRequestWalReceiverReply(void);
|
|
|
|
|
2015-02-23 17:53:02 +01:00
|
|
|
extern void assign_max_wal_size(int newval, void *extra);
|
|
|
|
extern void assign_checkpoint_completion_target(double newval, void *extra);
|
|
|
|
|
2011-01-31 17:13:01 +01:00
|
|
|
/*
|
2017-03-24 11:53:40 +01:00
|
|
|
* Routines to start, stop, and get status of a base backup.
|
2011-01-31 17:13:01 +01:00
|
|
|
*/
|
2017-03-24 11:53:40 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Session-level status of base backups
|
|
|
|
*
|
|
|
|
* This is used in parallel with the shared memory status to control parallel
|
|
|
|
* execution of base backup functions for a given session, be it a backend
|
|
|
|
* dedicated to replication or a normal backend connected to a database. The
|
|
|
|
* update of the session-level status happens at the same time as the shared
|
|
|
|
* memory counters to keep a consistent global and local state of the backups
|
|
|
|
* running.
|
|
|
|
*/
|
|
|
|
typedef enum SessionBackupState
|
|
|
|
{
|
|
|
|
SESSION_BACKUP_NONE,
|
|
|
|
SESSION_BACKUP_EXCLUSIVE,
|
|
|
|
SESSION_BACKUP_NON_EXCLUSIVE
|
|
|
|
} SessionBackupState;
|
|
|
|
|
Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.
When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.
This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
2013-01-17 19:23:00 +01:00
|
|
|
extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
|
2017-12-05 00:37:54 +01:00
|
|
|
TimeLineID *starttli_p, StringInfo labelfile,
|
Code review for server's handling of "tablespace map" files.
While looking at Robert Foggia's report, I noticed a passel of
other issues in the same area:
* The scheme for backslash-quoting newlines in pathnames is just
wrong; it will misbehave if the last ordinary character in a pathname
is a backslash. I'm not sure why we're bothering to allow newlines
in tablespace paths, but if we're going to do it we should do it
without introducing other problems. Hence, backslashes themselves
have to be backslashed too.
* The author hadn't read the sscanf man page very carefully, because
this code would drop any leading whitespace from the path. (I doubt
that a tablespace path with leading whitespace could happen in
practice; but if we're bothering to allow newlines in the path, it
sure seems like leading whitespace is little less implausible.) Using
sscanf for the task of finding the first space is overkill anyway.
* While I'm not 100% sure what the rationale for escaping both \r and
\n is, if the idea is to allow Windows newlines in the file then this
code failed, because it'd throw an error if it saw \r followed by \n.
* There's no cross-check for an incomplete final line in the map file,
which would be a likely apparent symptom of the improper-escaping
bug.
On the generation end, aside from the escaping issue we have:
* If needtblspcmapfile is true then do_pg_start_backup will pass back
escaped strings in tablespaceinfo->path values, which no caller wants
or is prepared to deal with. I'm not sure if there's a live bug from
that, but it looks like there might be (given the dubious assumption
that anyone actually has newlines in their tablespace paths).
* It's not being very paranoid about the possibility of random stuff
in the pg_tblspc directory. IMO we should ignore anything without an
OID-like name.
The escaping rule change doesn't seem back-patchable: it'll require
doubling of backslashes in the tablespace_map file, which is basically
a basebackup format change. The odds of that causing trouble are
considerably more than the odds of the existing bug causing trouble.
The rest of this seems somewhat unlikely to cause problems too,
so no back-patch.
2021-03-17 21:18:46 +01:00
|
|
|
List **tablespaces, StringInfo tblspcmapfile);
|
Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.
When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.
This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
2013-01-17 19:23:00 +01:00
|
|
|
extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
|
|
|
|
TimeLineID *stoptli_p);
|
Fix minor problems with non-exclusive backup cleanup.
The previous coding imagined that it could call before_shmem_exit()
when a non-exclusive backup began and then remove the previously-added
handler by calling cancel_before_shmem_exit() when that backup
ended. However, this only works provided that nothing else in the
system has registered a before_shmem_exit() hook in the interim,
because cancel_before_shmem_exit() is documented to remove a callback
only if it is the latest callback registered. It also only works
if nothing can ERROR out between the time that sessionBackupState
is reset and the time that cancel_before_shmem_exit(), which doesn't
seem to be strictly true.
To fix, leave the handler installed for the lifetime of the session,
arrange to install it just once, and teach it to quietly do nothing if
there isn't a non-exclusive backup in process.
This is a bug, but for now I'm not going to back-patch, because the
consequences are minor. It's possible to cause a spurious warning
to be generated, but that doesn't really matter. It's also possible
to trigger an assertion failure, but production builds shouldn't
have assertions enabled.
Patch by me, reviewed by Kyotaro Horiguchi, Michael Paquier (who
preferred a different approach, but got outvoted), Fujii Masao,
and Tom Lane, and with comments by various others.
Discussion: http://postgr.es/m/CA+TgmobMjnyBfNhGTKQEDbqXYE3_rXWpc4CM63fhyerNCes3mA@mail.gmail.com
2019-12-19 15:06:54 +01:00
|
|
|
extern void do_pg_abort_backup(int code, Datum arg);
|
|
|
|
extern void register_persistent_abort_backup_handler(void);
|
2017-03-24 11:53:40 +01:00
|
|
|
extern SessionBackupState get_backup_status(void);
|
2011-01-09 21:00:28 +01:00
|
|
|
|
2011-01-31 17:13:01 +01:00
|
|
|
/* File path names (all relative to $PGDATA) */
|
2018-11-25 16:31:16 +01:00
|
|
|
#define RECOVERY_SIGNAL_FILE "recovery.signal"
|
|
|
|
#define STANDBY_SIGNAL_FILE "standby.signal"
|
2011-01-31 17:13:01 +01:00
|
|
|
#define BACKUP_LABEL_FILE "backup_label"
|
|
|
|
#define BACKUP_LABEL_OLD "backup_label.old"
|
|
|
|
|
2015-05-12 15:29:10 +02:00
|
|
|
#define TABLESPACE_MAP "tablespace_map"
|
|
|
|
#define TABLESPACE_MAP_OLD "tablespace_map.old"
|
|
|
|
|
2018-10-25 02:46:00 +02:00
|
|
|
/* files to signal promotion to primary */
|
|
|
|
#define PROMOTE_SIGNAL_FILE "promote"
|
|
|
|
|
1999-09-27 17:48:12 +02:00
|
|
|
#endif /* XLOG_H */
|