/*
 * xlog_internal.h
 *
 * PostgreSQL write-ahead log internal declarations
 *
 * NOTE: this file is intended to contain declarations useful for
 * manipulating the XLOG files directly, but it is not supposed to be
 * needed by rmgr routines (redo support for individual record types).
 * So the XLogRecord typedef and associated stuff appear in xlogrecord.h.
 *
 * Note: This file must be includable in both frontend and backend contexts,
 * to allow stand-alone tools like pg_receivewal to deal with WAL files.
 *
 * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/access/xlog_internal.h
 */
#ifndef XLOG_INTERNAL_H
#define XLOG_INTERNAL_H

#include "access/xlogdefs.h"
#include "access/xlogreader.h"
#include "datatype/timestamp.h"
#include "lib/stringinfo.h"
#include "pgtime.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"


/*
 * Each page of XLOG file has a header like this:
 */
#define XLOG_PAGE_MAGIC 0xD110	/* can be used as WAL version indicator */

typedef struct XLogPageHeaderData
{
	uint16		xlp_magic;		/* magic value for correctness checks */
	uint16		xlp_info;		/* flag bits, see below */
	TimeLineID	xlp_tli;		/* TimeLineID of first record on page */
	XLogRecPtr	xlp_pageaddr;	/* XLOG address of this page */

	/*
	 * When there is not enough space on current page for whole record, we
	 * continue on the next page.  xlp_rem_len is the number of bytes
	 * remaining from a previous page; it tracks xl_tot_len in the initial
	 * header.  Note that the continuation data isn't necessarily aligned.
	 */
	uint32		xlp_rem_len;	/* total len of remaining data for record */
} XLogPageHeaderData;

#define SizeOfXLogShortPHD	MAXALIGN(sizeof(XLogPageHeaderData))

typedef XLogPageHeaderData *XLogPageHeader;

/*
 * When the XLP_LONG_HEADER flag is set, we store additional fields in the
 * page header.  (This is ordinarily done just in the first page of an
 * XLOG file.)	The additional fields serve to identify the file accurately.
 */
typedef struct XLogLongPageHeaderData
{
	XLogPageHeaderData std;		/* standard header fields */
	uint64		xlp_sysid;		/* system identifier from pg_control */
	uint32		xlp_seg_size;	/* just as a cross-check */
	uint32		xlp_xlog_blcksz;	/* just as a cross-check */
} XLogLongPageHeaderData;

#define SizeOfXLogLongPHD	MAXALIGN(sizeof(XLogLongPageHeaderData))

typedef XLogLongPageHeaderData *XLogLongPageHeader;

/* When record crosses page boundary, set this flag in new page's header */
#define XLP_FIRST_IS_CONTRECORD		0x0001
/* This flag indicates a "long" page header */
#define XLP_LONG_HEADER				0x0002
/* This flag indicates backup blocks starting in this page are optional */
#define XLP_BKP_REMOVABLE			0x0004
/* Replaces a missing contrecord; see CreateOverwriteContrecordRecord */
#define XLP_FIRST_IS_OVERWRITE_CONTRECORD 0x0008
/* All defined flag bits in xlp_info (used for validity checking of header) */
#define XLP_ALL_FLAGS				0x000F

#define XLogPageHeaderSize(hdr)		\
	(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)

/* wal_segment_size can range from 1MB to 1GB */
#define WalSegMinSize (1024 * 1024)
#define WalSegMaxSize (1024 * 1024 * 1024)
/* default number of min and max wal segments */
#define DEFAULT_MIN_WAL_SEGS 5
#define DEFAULT_MAX_WAL_SEGS 64

/* check that the given size is a valid wal_segment_size */
#define IsPowerOf2(x) ((x) > 0 && ((x) & ((x)-1)) == 0)
#define IsValidWalSegSize(size) \
	 (IsPowerOf2(size) && \
	 ((size) >= WalSegMinSize && (size) <= WalSegMaxSize))
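
/*
 * Illustrative sketch only, not part of PostgreSQL: the validity rule above
 * (a power of two between 1MB and 1GB) written as a standalone function.
 * The demo_ and DEMO_ names are hypothetical stand-ins so the example
 * compiles without the PostgreSQL headers.
 */
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for WalSegMinSize and WalSegMaxSize. */
#define DEMO_WAL_SEG_MIN_SIZE (1024 * 1024)
#define DEMO_WAL_SEG_MAX_SIZE (1024 * 1024 * 1024)

/* Returns true if size is a power of two within [1MB, 1GB]. */
static bool
demo_is_valid_wal_seg_size(int64_t size)
{
	/* A power of two has exactly one bit set, so size & (size - 1) is 0. */
	bool		power_of_2 = size > 0 && (size & (size - 1)) == 0;

	return power_of_2 &&
		size >= DEMO_WAL_SEG_MIN_SIZE &&
		size <= DEMO_WAL_SEG_MAX_SIZE;
}
```
/*
 * Under this rule 16MB and 1GB are accepted, while 3MB (not a power of two)
 * and 512kB (below the minimum) are rejected.
 */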

#define XLogSegmentsPerXLogId(wal_segsz_bytes)	\
	(UINT64CONST(0x100000000) / (wal_segsz_bytes))

#define XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, dest) \
		(dest) = (segno) * (wal_segsz_bytes) + (offset)

#define XLogSegmentOffset(xlogptr, wal_segsz_bytes)	\
	((xlogptr) & ((wal_segsz_bytes) - 1))

/*
 * Compute a segment number from an XLogRecPtr.
 *
 * For XLByteToSeg, do the computation at face value.  For XLByteToPrevSeg,
 * a boundary byte is taken to be in the previous segment.  This is suitable
 * for deciding which segment to write given a pointer to a record end,
 * for example.
 */
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = (xlrp) / (wal_segsz_bytes)

#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)
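
/*
 * Illustrative sketch only, not part of PostgreSQL: the two segment-number
 * computations above as plain functions, to show how they differ exactly on
 * a segment boundary.  DemoRecPtr, DemoSegNo and the demo_ functions are
 * hypothetical stand-ins for XLogRecPtr, XLogSegNo and the macros.  The
 * "previous segment" variant assumes ptr > 0, since ptr - 1 would wrap
 * around for an unsigned zero.
 */
```c
#include <stdint.h>

typedef uint64_t DemoRecPtr;	/* stand-in for XLogRecPtr */
typedef uint64_t DemoSegNo;		/* stand-in for XLogSegNo */

/* Face-value computation: the segment that contains byte ptr. */
static DemoSegNo
demo_byte_to_seg(DemoRecPtr ptr, uint64_t seg_size)
{
	return ptr / seg_size;
}

/* A boundary byte is attributed to the previous segment (needs ptr > 0). */
static DemoSegNo
demo_byte_to_prev_seg(DemoRecPtr ptr, uint64_t seg_size)
{
	return (ptr - 1) / seg_size;
}
```
/*
 * With 16MB segments, a pointer of 0x2000000 sits exactly on a boundary:
 * the face-value form yields segment 2, the "previous" form yields 1.  One
 * byte further on, both forms agree on segment 2.
 */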

/*
 * Convert values of GUCs measured in megabytes to equiv. segment count.
 * Rounds down.
 */
#define XLogMBVarToSegs(mbvar, wal_segsz_bytes) \
	((mbvar) / ((wal_segsz_bytes) / (1024 * 1024)))
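
/*
 * Illustrative sketch only, not part of PostgreSQL: the megabytes-to-segments
 * conversion above as a function, to make the round-down behavior concrete.
 * demo_mb_to_segs is a hypothetical name.  The sketch assumes the segment
 * size is at least 1MB (as IsValidWalSegSize requires); a smaller value
 * would make the inner division yield zero.
 */
```c
/* Number of whole segments that fit into mb megabytes (rounds down). */
static int
demo_mb_to_segs(int mb, int seg_size_bytes)
{
	int			seg_size_mb = seg_size_bytes / (1024 * 1024);

	return mb / seg_size_mb;
}
```
/*
 * With 16MB segments, 1024MB maps to 64 segments and 80MB to 5, while 100MB
 * maps to 6 because the remaining 4MB is discarded.
 */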

/*
 * Is an XLogRecPtr within a particular XLOG segment?
 *
 * For XLByteInSeg, do the computation at face value.  For XLByteInPrevSeg,
 * a boundary byte is taken to be in the previous segment.
 */
#define XLByteInSeg(xlrp, logSegNo, wal_segsz_bytes) \
	(((xlrp) / (wal_segsz_bytes)) == (logSegNo))

#define XLByteInPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
	((((xlrp) - 1) / (wal_segsz_bytes)) == (logSegNo))

/* Check if an XLogRecPtr value is in a plausible range */
#define XRecOffIsValid(xlrp) \
		((xlrp) % XLOG_BLCKSZ >= SizeOfXLogShortPHD)

/*
 * The XLog directory and control file (relative to $PGDATA)
 */
#define XLOGDIR				"pg_wal"
#define XLOG_CONTROL_FILE	"global/pg_control"

/*
 * These macros encapsulate knowledge about the exact layout of XLog file
 * names, timeline history file names, and archive-status file names.
 */
#define MAXFNAMELEN		64

/* Length of XLog file name */
#define XLOG_FNAME_LEN	   24

/*
 * Generate a WAL segment file name.  Do not use this function in a helper
 * function allocating the result generated.
 */
static inline void
XLogFileName(char *fname, TimeLineID tli, XLogSegNo logSegNo, int wal_segsz_bytes)
{
	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,
			 (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)));
}
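
/*
 * Illustrative sketch only, not part of PostgreSQL: how a 24-character WAL
 * segment file name is composed from a timeline ID and a segment number.
 * demo_xlog_file_name and DEMO_MAXFNAMELEN are hypothetical stand-ins for
 * XLogFileName and MAXFNAMELEN; plain <stdint.h> types replace TimeLineID
 * and XLogSegNo so the example compiles on its own.
 */
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_MAXFNAMELEN 64		/* stand-in for MAXFNAMELEN */

/* Writes "TTTTTTTTLLLLLLLLSSSSSSSS" (three 8-digit hex fields) into fname. */
static void
demo_xlog_file_name(char *fname, uint32_t tli, uint64_t segno,
					uint64_t seg_size)
{
	/* Segments per 4GB "xlog id", as in XLogSegmentsPerXLogId. */
	uint64_t	segs_per_id = UINT64_C(0x100000000) / seg_size;

	snprintf(fname, DEMO_MAXFNAMELEN,
			 "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
			 tli,
			 (uint32_t) (segno / segs_per_id),
			 (uint32_t) (segno % segs_per_id));
}
```
/*
 * With 16MB segments there are 256 segments per xlog id, so segment 255 on
 * timeline 1 is named 0000000100000000000000FF and segment 256 rolls over
 * to 000000010000000100000000.
 */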

static inline void
XLogFileNameById(char *fname, TimeLineID tli, uint32 log, uint32 seg)
{
	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg);
}

static inline bool
IsXLogFileName(const char *fname)
{
	return (strlen(fname) == XLOG_FNAME_LEN &&
			strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN);
}

/*
 * XLOG segment with .partial suffix.  Used by pg_receivewal and at end of
 * archive recovery, when we want to archive a WAL segment but it might not
 * be complete yet.
 */
static inline bool
IsPartialXLogFileName(const char *fname)
{
	return (strlen(fname) == XLOG_FNAME_LEN + strlen(".partial") &&
			strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN &&
			strcmp(fname + XLOG_FNAME_LEN, ".partial") == 0);
}

static inline void
XLogFromFileName(const char *fname, TimeLineID *tli, XLogSegNo *logSegNo, int wal_segsz_bytes)
{
	uint32		log;
	uint32		seg;

	sscanf(fname, "%08X%08X%08X", tli, &log, &seg);
	*logSegNo = (uint64) log * XLogSegmentsPerXLogId(wal_segsz_bytes) + seg;
}
|
|
|
|
|
|
|
|
static inline void
XLogFilePath(char *path, TimeLineID tli, XLogSegNo logSegNo, int wal_segsz_bytes)
{
	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli,
			 (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)));
}
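The two helpers above are inverses of each other. The following is a minimal standalone sketch of that arithmetic, not the header's own code. It assumes that `XLogSegmentsPerXLogId(sz)` evaluates to `0x100000000 / sz`; the `demo_*` names are hypothetical, and the directory prefix is left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: mirrors XLogSegmentsPerXLogId(), i.e. 4 GiB / segment size. */
static uint64_t
demo_segments_per_xlogid(int wal_segsz_bytes)
{
	return UINT64_C(0x100000000) / (uint64_t) wal_segsz_bytes;
}

/* Build the 24-character segment file name (hypothetical helper). */
static void
demo_seg_file_name(char *fname, size_t len, uint32_t tli,
				   uint64_t segno, int wal_segsz_bytes)
{
	uint64_t	per_id = demo_segments_per_xlogid(wal_segsz_bytes);

	snprintf(fname, len, "%08X%08X%08X",
			 (unsigned int) tli,
			 (unsigned int) (segno / per_id),
			 (unsigned int) (segno % per_id));
}

/*
 * Inverse of demo_seg_file_name().  Unlike the header's version, this
 * sketch checks the sscanf() result: it returns 1 on success and 0 if
 * the name does not contain three hexadecimal fields.
 */
static int
demo_seg_from_file_name(const char *fname, uint32_t *tli,
						uint64_t *segno, int wal_segsz_bytes)
{
	unsigned int t;
	unsigned int log;
	unsigned int seg;

	if (sscanf(fname, "%08X%08X%08X", &t, &log, &seg) != 3)
		return 0;
	*tli = t;
	*segno = (uint64_t) log * demo_segments_per_xlogid(wal_segsz_bytes) + seg;
	return 1;
}
```

With 16 MB segments there are 256 segments per "xlogid", so segment number 256 on timeline 1 maps to the name `000000010000000100000000`, and parsing that name yields 256 again.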
|
|
|
|
|
|
|
|
static inline void
TLHistoryFileName(char *fname, TimeLineID tli)
{
	snprintf(fname, MAXFNAMELEN, "%08X.history", tli);
}
|
|
|
|
|
|
|
|
static inline bool
IsTLHistoryFileName(const char *fname)
{
	return (strlen(fname) == 8 + strlen(".history") &&
			strspn(fname, "0123456789ABCDEF") == 8 &&
			strcmp(fname + 8, ".history") == 0);
}
|
|
|
|
|
|
|
|
static inline void
TLHistoryFilePath(char *path, TimeLineID tli)
{
	snprintf(path, MAXPGPATH, XLOGDIR "/%08X.history", tli);
}
|
|
|
|
|
|
|
|
static inline void
StatusFilePath(char *path, const char *xlog, const char *suffix)
{
	snprintf(path, MAXPGPATH, XLOGDIR "/archive_status/%s%s", xlog, suffix);
}
|
|
|
|
|
|
|
|
static inline void
BackupHistoryFileName(char *fname, TimeLineID tli, XLogSegNo logSegNo, XLogRecPtr startpoint, int wal_segsz_bytes)
{
	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X.%08X.backup", tli,
			 (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (XLogSegmentOffset(startpoint, wal_segsz_bytes)));
}
|
|
|
|
|
|
|
|
static inline bool
IsBackupHistoryFileName(const char *fname)
{
	return (strlen(fname) > XLOG_FNAME_LEN &&
			strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN &&
			strcmp(fname + strlen(fname) - strlen(".backup"), ".backup") == 0);
}
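The name tests above all follow one pattern: check the total length, check the length of the leading run of uppercase hex digits with `strspn`, then compare the suffix. The sketch below combines them into a single classifier for illustration. It assumes `XLOG_FNAME_LEN` is 24 (three fields of 8 hex digits) and that a plain segment name is exactly 24 hex digits; `DemoWalFileKind` and `demo_classify` are hypothetical names that do not exist in PostgreSQL.

```c
#include <string.h>

#define DEMO_XLOG_FNAME_LEN 24	/* assumption: 3 fields of 8 hex digits */
#define DEMO_HEX "0123456789ABCDEF"

typedef enum
{
	DEMO_OTHER,
	DEMO_SEGMENT,
	DEMO_PARTIAL,
	DEMO_TL_HISTORY,
	DEMO_BACKUP_HISTORY
} DemoWalFileKind;

/* Classify a bare file name (no directory part) by the rules shown above. */
static DemoWalFileKind
demo_classify(const char *fname)
{
	size_t		len = strlen(fname);
	size_t		hex = strspn(fname, DEMO_HEX);	/* leading hex digits */

	if (len == DEMO_XLOG_FNAME_LEN && hex == DEMO_XLOG_FNAME_LEN)
		return DEMO_SEGMENT;
	if (len == DEMO_XLOG_FNAME_LEN + strlen(".partial") &&
		hex == DEMO_XLOG_FNAME_LEN &&
		strcmp(fname + DEMO_XLOG_FNAME_LEN, ".partial") == 0)
		return DEMO_PARTIAL;
	if (len == 8 + strlen(".history") && hex == 8 &&
		strcmp(fname + 8, ".history") == 0)
		return DEMO_TL_HISTORY;
	/* len > 24 guarantees the suffix comparison stays inside the string */
	if (len > DEMO_XLOG_FNAME_LEN && hex == DEMO_XLOG_FNAME_LEN &&
		strcmp(fname + len - strlen(".backup"), ".backup") == 0)
		return DEMO_BACKUP_HISTORY;
	return DEMO_OTHER;
}
```

Because the accepted digit set contains only uppercase letters, a name with lowercase hex digits falls through to `DEMO_OTHER`.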
|
|
|
|
|
|
|
|
static inline void
BackupHistoryFilePath(char *path, TimeLineID tli, XLogSegNo logSegNo, XLogRecPtr startpoint, int wal_segsz_bytes)
{
	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli,
			 (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (XLogSegmentOffset((startpoint), wal_segsz_bytes)));
}
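A backup history file name is the segment name followed by the byte offset of the backup start point within that segment. The sketch below illustrates this under two assumptions: that `XLogSegmentOffset(ptr, sz)` is `ptr & (sz - 1)` for a power-of-two segment size, and that the segment number is derived from the same start point by integer division (the real functions take it as a separate argument). `demo_backup_history_name` is a hypothetical helper.

```c
#include <stdint.h>
#include <stdio.h>

/* Format "<segment name>.<offset>.backup" for a given start position. */
static void
demo_backup_history_name(char *fname, size_t len, uint32_t tli,
						 uint64_t startpoint, int wal_segsz_bytes)
{
	uint64_t	segsz = (uint64_t) wal_segsz_bytes;
	uint64_t	per_id = UINT64_C(0x100000000) / segsz;	/* segments per xlogid */
	uint64_t	segno = startpoint / segsz;	/* assumption: segment holding startpoint */
	uint32_t	offset = (uint32_t) (startpoint & (segsz - 1));

	snprintf(fname, len, "%08X%08X%08X.%08X.backup",
			 (unsigned int) tli,
			 (unsigned int) (segno / per_id),
			 (unsigned int) (segno % per_id),
			 (unsigned int) offset);
}
```

For example, with 16 MB segments a start point of `0x3000028` on timeline 1 lies 0x28 bytes into segment 3, giving `000000010000000000000003.00000028.backup`.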
|
2004-07-22 00:31:26 +02:00
|
|
|
|
2012-11-28 16:35:01 +01:00
|
|
|
/*
|
|
|
|
* Information logged when we detect a change in one of the parameters
|
|
|
|
* important for Hot Standby.
|
|
|
|
*/
|
|
|
|
typedef struct xl_parameter_change
|
|
|
|
{
|
|
|
|
int MaxConnections;
|
Add new GUC, max_worker_processes, limiting number of bgworkers.
In 9.3, there's no particular limit on the number of bgworkers;
instead, we just count up the number that are actually registered,
and use that to set MaxBackends. However, that approach causes
problems for Hot Standby, which needs both MaxBackends and the
size of the lock table to be the same on the standby as on the
master, yet it may not be desirable to run the same bgworkers in
both places. 9.3 handles that by failing to notice the problem,
which will probably work fine in nearly all cases anyway, but is
not theoretically sound.
A further problem with simply counting the number of registered
workers is that new workers can't be registered without a
postmaster restart. This is inconvenient for administrators,
since bouncing the postmaster causes an interruption of service.
Moreover, there are a number of applications for background
processes where, by necessity, the background process must be
started on the fly (e.g. parallel query). While this patch
doesn't actually make it possible to register new background
workers after startup time, it's a necessary prerequisite.
Patch by me. Review by Michael Paquier.
2013-07-04 17:24:24 +02:00
|
|
|
int max_worker_processes;
|
Move max_wal_senders out of max_connections for connection slot handling
Since its introduction, max_wal_senders is counted as part of
max_connections when it comes to defining how many connection slots can be
used for replication connections with a WAL sender context. This can
lead to confusion for some users, as it could be possible to block a
base backup or replication from happening because other backend sessions
are already taken for other purposes by an application, and
superuser-only connection slots are not a correct solution to handle
that case.
This commit makes max_wal_senders independent of max_connections for its
handling of PGPROC entries in ProcGlobal, meaning that connection slots
for WAL senders are handled using their own free queue, like autovacuum
workers and bgworkers.
One compatibility issue that this change creates is that a standby now
requires a value of max_wal_senders at least equal to that of its
primary. So, if a standby is created with a value of max_wal_senders
lower than that, then this could break failovers.
Normally this should not be an issue though, as any settings of a
standby are inherited from its primary as postgresql.conf gets normally
copied as part of a base backup, so parameters would be consistent.
Author: Alexander Kukushkin
Reviewed-by: Kyotaro Horiguchi, Petr Jelínek, Masahiko Sawada, Oleksii
Kliukin
Discussion: https://postgr.es/m/CAFh8B=nBzHQeYAu0b8fjK-AF1X4+_p6GRtwG+cCgs6Vci2uRuQ@mail.gmail.com
2019-02-12 02:07:56 +01:00
|
|
|
int max_wal_senders;
|
2012-11-28 16:35:01 +01:00
|
|
|
int max_prepared_xacts;
|
|
|
|
int max_locks_per_xact;
|
|
|
|
int wal_level;
|
2013-12-20 19:33:16 +01:00
|
|
|
bool wal_log_hints;
|
Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData(). This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.
This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.
A new test in src/test/modules is included.
Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.
Authors: Álvaro Herrera and Petr Jelínek
Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 15:53:02 +01:00
|
|
|
bool track_commit_timestamp;
|
2012-11-28 16:35:01 +01:00
|
|
|
} xl_parameter_change;
|
|
|
|
|
|
|
|
/* logs restore point */
typedef struct xl_restore_point
{
	TimestampTz rp_time;
	char		rp_name[MAXFNAMELEN];
} xl_restore_point;
|
|
|
|
|
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 16:21:51 +02:00
|
|
|
/* Overwrite of prior contrecord */
typedef struct xl_overwrite_contrecord
{
	XLogRecPtr	overwritten_lsn;
	TimestampTz overwrite_time;
} xl_overwrite_contrecord;
|
|
|
|
|
2013-01-29 01:06:15 +01:00
|
|
|
/* End of recovery mark, when we don't do an END_OF_RECOVERY checkpoint */
|
|
|
|
typedef struct xl_end_of_recovery
|
|
|
|
{
|
|
|
|
TimestampTz end_time;
|
2013-02-11 17:13:09 +01:00
|
|
|
TimeLineID ThisTimeLineID; /* new TLI */
|
|
|
|
TimeLineID PrevTimeLineID; /* previous TLI we forked off from */
|
2013-01-29 01:06:15 +01:00
|
|
|
} xl_end_of_recovery;
|
2004-07-22 00:31:26 +02:00
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
/*
 * The functions in xloginsert.c construct a chain of XLogRecData structs
 * to represent the final WAL record.
 */
typedef struct XLogRecData
{
	struct XLogRecData *next;	/* next struct in chain, or NULL */
	char	   *data;			/* start of rmgr data to include */
	uint32		len;			/* length of rmgr data to include */
} XLogRecData;
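To illustrate how such a chain can be consumed, here is a minimal sketch that walks a NULL-terminated list of chunks, sums their lengths, and copies them into one contiguous buffer. `DemoRecData` is a stand-in with the same three fields, and the `demo_chain_*` helpers are hypothetical. This is not how xloginsert.c assembles records, only a demonstration of the data structure.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for XLogRecData with the same layout of fields. */
typedef struct DemoRecData
{
	struct DemoRecData *next;	/* next struct in chain, or NULL */
	char	   *data;			/* start of this chunk */
	uint32_t	len;			/* length of this chunk */
} DemoRecData;

/* Total number of payload bytes in the chain. */
static uint32_t
demo_chain_length(const DemoRecData *rdt)
{
	uint32_t	total = 0;

	for (; rdt != NULL; rdt = rdt->next)
		total += rdt->len;
	return total;
}

/*
 * Copy all chunks, in order, into dst.  Returns the number of bytes
 * copied, or 0 if the chain does not fit into dstlen bytes.
 */
static uint32_t
demo_chain_copy(const DemoRecData *rdt, char *dst, uint32_t dstlen)
{
	uint32_t	total = demo_chain_length(rdt);
	uint32_t	off = 0;

	if (total > dstlen)
		return 0;
	for (; rdt != NULL; rdt = rdt->next)
	{
		memcpy(dst + off, rdt->data, rdt->len);
		off += rdt->len;
	}
	return total;
}
```

Checking the total length before copying means a buffer that is too small is left untouched rather than partially filled.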
|
|
|
|
|
2014-11-25 21:13:30 +01:00
|
|
|
/*
|
|
|
|
* Recovery target action.
|
|
|
|
*/
|
|
|
|
typedef enum
|
|
|
|
{
|
|
|
|
RECOVERY_TARGET_ACTION_PAUSE,
|
|
|
|
RECOVERY_TARGET_ACTION_PROMOTE,
|
2016-04-15 03:54:06 +02:00
|
|
|
RECOVERY_TARGET_ACTION_SHUTDOWN
|
2014-11-25 21:13:30 +01:00
|
|
|
} RecoveryTargetAction;
|
|
|
|
|
2022-01-19 23:58:04 +01:00
|
|
|
struct LogicalDecodingContext;
|
|
|
|
struct XLogRecordBuffer;
|
|
|
|
|
2004-07-22 00:31:26 +02:00
|
|
|
/*
|
|
|
|
* Method table for resource managers.
|
|
|
|
*
|
2013-02-05 21:21:29 +01:00
|
|
|
* This struct must be kept in sync with the PG_RMGR definition in
|
|
|
|
* rmgr.c.
|
|
|
|
*
|
2014-09-19 15:17:12 +02:00
|
|
|
* rm_identify must return a name for the record based on xl_info (without
|
|
|
|
* reference to the rmid). For example, XLOG_BTREE_VACUUM would be named
|
|
|
|
* "VACUUM". rm_desc can then be called to obtain additional detail for the
|
|
|
|
* record, if available (e.g. the last block).
|
|
|
|
*
|
2017-02-08 21:45:30 +01:00
|
|
|
* rm_mask takes as input a page modified by the resource manager and masks
|
|
|
|
* out bits that shouldn't be flagged by wal_consistency_checking.
|
|
|
|
*
|
2022-04-07 07:26:43 +02:00
|
|
|
* RmgrTable[] is indexed by RmgrId values (see rmgrlist.h). If rm_name is
|
|
|
|
* NULL, the corresponding RmgrTable entry is considered invalid.
|
2004-07-22 00:31:26 +02:00
|
|
|
*/
|
|
|
|
typedef struct RmgrData
|
|
|
|
{
|
|
|
|
const char *rm_name;
|
2014-11-20 16:56:26 +01:00
|
|
|
void (*rm_redo) (XLogReaderState *record);
|
|
|
|
void (*rm_desc) (StringInfo buf, XLogReaderState *record);
|
2014-09-19 15:17:12 +02:00
|
|
|
const char *(*rm_identify) (uint8 info);
|
2004-07-22 00:31:26 +02:00
|
|
|
void (*rm_startup) (void);
|
|
|
|
void (*rm_cleanup) (void);
|
2017-02-08 21:45:30 +01:00
|
|
|
void (*rm_mask) (char *pagedata, BlockNumber blkno);
|
2022-01-19 23:58:04 +01:00
|
|
|
void (*rm_decode) (struct LogicalDecodingContext *ctx,
|
|
|
|
struct XLogRecordBuffer *buf);
|
2004-07-22 00:31:26 +02:00
|
|
|
} RmgrData;
|
|
|
|
|
2022-04-08 09:02:10 +02:00
|
|
|
extern PGDLLIMPORT RmgrData RmgrTable[];
|
2022-04-07 07:26:43 +02:00
|
|
|
extern void RmgrStartup(void);
|
|
|
|
extern void RmgrCleanup(void);
|
|
|
|
extern void RmgrNotFound(RmgrId rmid);
|
|
|
|
extern void RegisterCustomRmgr(RmgrId rmid, RmgrData *rmgr);
|
|
|
|
|
2022-04-07 17:40:16 +02:00
|
|
|
#ifndef FRONTEND
|
2022-04-07 07:26:43 +02:00
|
|
|
static inline bool
RmgrIdExists(RmgrId rmid)
{
	return RmgrTable[rmid].rm_name != NULL;
}
|
|
|
|
|
|
|
|
static inline RmgrData
GetRmgr(RmgrId rmid)
{
	if (unlikely(!RmgrIdExists(rmid)))
		RmgrNotFound(rmid);
	return RmgrTable[rmid];
}
|
2022-04-07 17:40:16 +02:00
|
|
|
#endif
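The lookup pattern above (a table of method structs indexed by ID, where a NULL name marks an unused slot) can be sketched in isolation as follows. Everything here is a simplified assumption for illustration: the table has four slots, the only callback is an `rm_identify`-style function, the info value `0x10` is made up, and an unknown ID yields NULL, whereas the header calls `RmgrNotFound()`.

```c
#include <stddef.h>

#define DEMO_MAX_RMGR 4			/* hypothetical table size */

typedef struct DemoRmgr
{
	const char *rm_name;		/* NULL means "slot not registered" */
	const char *(*rm_identify) (unsigned char info);
} DemoRmgr;

/* Name a record from its info bits only; 0x10 is a made-up value. */
static const char *
demo_btree_identify(unsigned char info)
{
	return (info == 0x10) ? "VACUUM" : NULL;
}

/* Only slot 1 is populated; the remaining slots are zero-initialized. */
static DemoRmgr demo_table[DEMO_MAX_RMGR] = {
	[1] = {"Btree", demo_btree_identify},
};

/* The bounds check is needed here because the demo table is tiny. */
static int
demo_rmgr_exists(unsigned int rmid)
{
	return rmid < DEMO_MAX_RMGR && demo_table[rmid].rm_name != NULL;
}

/* Dispatch through the table; returns NULL for an unregistered ID. */
static const char *
demo_record_name(unsigned int rmid, unsigned char info)
{
	if (!demo_rmgr_exists(rmid))
		return NULL;
	return demo_table[rmid].rm_identify(info);
}
```

Using the name field as the validity marker avoids a separate "registered" flag: a zero-initialized table starts out with every slot invalid.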
|
2004-07-22 00:31:26 +02:00
|
|
|
|
2006-08-18 01:04:10 +02:00
|
|
|
/*
|
2011-11-01 18:14:47 +01:00
|
|
|
* Exported to support xlog switching from checkpointer
|
2006-08-18 01:04:10 +02:00
|
|
|
*/
|
Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.
To make that easier, allow marking individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.
Use the new facility for checkpoints, archive timeout and standby
snapshots.
The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superfluous checkpoints / archived segments)
aren't grave enough to warrant backpatching.
Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 20:31:50 +01:00
|
|
|
extern pg_time_t GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN);
|
2017-03-14 17:57:10 +01:00
|
|
|
extern XLogRecPtr RequestXLogSwitch(bool mark_unimportant);
|
2006-08-18 01:04:10 +02:00
|
|
|
|
2012-10-02 12:37:19 +02:00
|
|
|
extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
|
|
|
|
|
2022-04-08 09:02:10 +02:00
|
|
|
extern void XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
								   bool detailed_format, StringInfo buf,
								   uint32 *fpi_len);
|
|
|
|
|
2012-10-02 12:37:19 +02:00
|
|
|
/*
|
|
|
|
* Exported for the functions in timeline.c and xlogarchive.c. Only valid
|
|
|
|
* in the startup process.
|
|
|
|
*/
|
2022-04-08 14:16:38 +02:00
|
|
|
extern PGDLLIMPORT bool ArchiveRecoveryRequested;
|
|
|
|
extern PGDLLIMPORT bool InArchiveRecovery;
|
|
|
|
extern PGDLLIMPORT bool StandbyMode;
|
|
|
|
extern PGDLLIMPORT char *recoveryRestoreCommand;
|
2012-10-02 12:37:19 +02:00
|
|
|
|
2004-07-22 00:31:26 +02:00
|
|
|
#endif /* XLOG_INTERNAL_H */
|