/*-------------------------------------------------------------------------
 *
 * xlogreader.h
 *		Definitions for the generic XLog reading facility
 *
 * Portions Copyright (c) 2013-2022, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/include/access/xlogreader.h
 *
 * NOTES
 *		See the definition of the XLogReaderState struct for instructions on
 *		how to use the XLogReader infrastructure.
 *
 *		The basic idea is to allocate an XLogReaderState via
 *		XLogReaderAllocate(), position the reader to the first record with
 *		XLogBeginRead() or XLogFindNextRecord(), and call XLogReadRecord()
 *		until it returns NULL.
 *
 *		Callers supply a page_read callback if they want to call
 *		XLogReadRecord or XLogFindNextRecord; it can be passed in as NULL
 *		otherwise.  The WALRead function can be used as a helper to write
 *		page_read callbacks, but it is not mandatory; callers that use it
 *		must supply segment_open callbacks.  The segment_close callback
 *		must always be supplied.
 *
 *		After reading a record with XLogReadRecord(), it's decomposed into
 *		the per-block and main data parts, and the parts can be accessed
 *		with the XLogRec* macros and functions.  You can also decode a
 *		record that's already constructed in memory, without reading from
 *		disk, by calling the DecodeXLogRecord() function.
 *-------------------------------------------------------------------------
 */
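
/*
 * Usage sketch (illustrative only; not part of the upstream header).  It
 * assumes that XLogReaderAllocate() takes the WAL segment size, an optional
 * WAL directory, the callback routine and the private data, and that
 * XLogReadRecord() takes the reader and an error-message output pointer.
 * my_segment_size, my_page_read, my_segment_open, my_segment_close,
 * my_private and start_lsn are hypothetical names supplied by the caller:
 *
 *		XLogReaderState *reader;
 *		XLogRecord *record;
 *		char	   *errormsg;
 *
 *		reader = XLogReaderAllocate(my_segment_size, NULL,
 *									XL_ROUTINE(.page_read = my_page_read,
 *											   .segment_open = my_segment_open,
 *											   .segment_close = my_segment_close),
 *									my_private);
 *		if (reader == NULL)
 *			... report an allocation failure ...
 *
 *		XLogBeginRead(reader, start_lsn);
 *		while ((record = XLogReadRecord(reader, &errormsg)) != NULL)
 *			... inspect the record with the XLogRecGet* accessors ...
 *
 *		XLogReaderFree(reader);
 *
 * When the loop ends, a non-NULL errormsg would indicate why reading
 * stopped; a NULL one would indicate that no further record was available.
 */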
#ifndef XLOGREADER_H
#define XLOGREADER_H

#ifndef FRONTEND
#include "access/transam.h"
#endif

#include "access/xlogrecord.h"
#include "storage/buf.h"

/* WALOpenSegment represents a WAL segment being read. */
typedef struct WALOpenSegment
{
	int			ws_file;		/* segment file descriptor */
	XLogSegNo	ws_segno;		/* segment number */
	TimeLineID	ws_tli;			/* timeline ID of the currently open file */
} WALOpenSegment;

/* WALSegmentContext carries context information about WAL segments to read */
typedef struct WALSegmentContext
{
	char		ws_dir[MAXPGPATH];
	int			ws_segsize;
} WALSegmentContext;

typedef struct XLogReaderState XLogReaderState;

/* Function type definitions for various xlogreader interactions */
typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
							   XLogRecPtr targetPagePtr,
							   int reqLen,
							   XLogRecPtr targetRecPtr,
							   char *readBuf);
typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
								  XLogSegNo nextSegNo,
								  TimeLineID *tli_p);
typedef void (*WALSegmentCloseCB) (XLogReaderState *xlogreader);

typedef struct XLogReaderRoutine
{
	/*
	 * Data input callback
	 *
	 * This callback shall read at least reqLen valid bytes of the xlog page
	 * starting at targetPagePtr, and store them in readBuf.  The callback
	 * shall return the number of bytes read (never more than XLOG_BLCKSZ), or
	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
	 * requested bytes to become available.  The callback will not be invoked
	 * again for the same page unless more than the returned number of bytes
	 * are needed.
	 *
	 * targetRecPtr is the position of the WAL record we're reading.  Usually
	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
	 * to read and verify the page or segment header, before it reads the
	 * actual WAL record it's interested in.  In that case, targetRecPtr can
	 * be used to determine which timeline to read the page from.
	 *
	 * The callback shall set ->seg.ws_tli to the TLI of the file the page was
	 * read from.
	 */
	XLogPageReadCB page_read;

	/*
	 * Callback to open the specified WAL segment for reading.  ->seg.ws_file
	 * shall be set to the file descriptor of the opened segment.  In case of
	 * failure, an error shall be raised by the callback and it shall not
	 * return.
	 *
	 * "nextSegNo" is the number of the segment to be opened.
	 *
	 * "tli_p" is an input/output argument.  WALRead() uses it to pass the
	 * timeline in which the new segment should be found, but the callback can
	 * use it to return the TLI that it actually opened.
	 */
	WALSegmentOpenCB segment_open;

	/*
	 * WAL segment close callback.  ->seg.ws_file shall be set to a negative
	 * number.
	 */
	WALSegmentCloseCB segment_close;
} XLogReaderRoutine;

#define XL_ROUTINE(...) &(XLogReaderRoutine){__VA_ARGS__}
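
/*
 * Sketch of a page_read callback (illustrative only; not part of the
 * upstream header).  my_read_wal_bytes() is a hypothetical caller-provided
 * helper that copies WAL bytes into a buffer, reports the timeline it read
 * from, and returns false on failure.  The sketch also assumes that the
 * whole page is already available, so it always reads XLOG_BLCKSZ bytes,
 * which satisfies the "at least reqLen" requirement:
 *
 *		static int
 *		my_page_read(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 *					 int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
 *		{
 *			TimeLineID	tli;
 *
 *			if (!my_read_wal_bytes(xlogreader->private_data, targetPagePtr,
 *								   XLOG_BLCKSZ, readBuf, &tli))
 *				return -1;
 *
 *			xlogreader->seg.ws_tli = tli;
 *			return XLOG_BLCKSZ;
 *		}
 *
 * A callback that may be asked for bytes not yet written would instead wait
 * until at least reqLen bytes are available and return the count it read.
 */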

typedef struct
{
	/* Is this block ref in use? */
	bool		in_use;

	/* Identify the block this refers to */
	RelFileLocator rlocator;
	ForkNumber	forknum;
	BlockNumber blkno;

	/* Prefetching workspace. */
	Buffer		prefetch_buffer;

	/* copy of the fork_flags field from the XLogRecordBlockHeader */
	uint8		flags;

	/* Information on full-page image, if any */
	bool		has_image;		/* has image, even for consistency checking */
	bool		apply_image;	/* has image that should be restored */
	char	   *bkp_image;
	uint16		hole_offset;
	uint16		hole_length;
	uint16		bimg_len;
	uint8		bimg_info;

	/* Buffer holding the rmgr-specific data associated with this block */
	bool		has_data;
	char	   *data;
	uint16		data_len;
	uint16		data_bufsz;
} DecodedBkpBlock;

/*
 * The decoded contents of a record.  This occupies a contiguous region of
 * memory, with main_data and blocks[n].data pointing to memory after the
 * members declared here.
 */
typedef struct DecodedXLogRecord
{
	/* Private member used for resource management. */
	size_t		size;			/* total size of decoded record */
	bool		oversized;		/* outside the regular decode buffer? */
	struct DecodedXLogRecord *next; /* decoded record queue link */

	/* Public members. */
	XLogRecPtr	lsn;			/* location */
	XLogRecPtr	next_lsn;		/* location of next record */
	XLogRecord	header;			/* header */
	RepOriginId record_origin;
	TransactionId toplevel_xid; /* XID of top-level transaction */
	char	   *main_data;		/* record's main data portion */
	uint32		main_data_len;	/* main data portion's length */
	int			max_block_id;	/* highest block_id in use (-1 if none) */
	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
} DecodedXLogRecord;
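
/*
 * Sketch of walking the block references of a decoded record (illustrative
 * only; not part of the upstream header).  "record" is assumed to be a
 * DecodedXLogRecord pointer, and handle_block() is a hypothetical
 * caller-provided function.  Code that holds an XLogReaderState should
 * normally go through the XLogRecGet* accessors rather than reading these
 * fields itself:
 *
 *		int			block_id;
 *
 *		for (block_id = 0; block_id <= record->max_block_id; block_id++)
 *		{
 *			DecodedBkpBlock *blk = &record->blocks[block_id];
 *
 *			if (!blk->in_use)
 *				continue;
 *			handle_block(blk->rlocator, blk->forknum, blk->blkno);
 *		}
 *
 * Because max_block_id is -1 when no block is referenced, the loop body is
 * skipped entirely for such records.
 */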
|
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
struct XLogReaderState
|
|
|
|
{
|
2020-05-08 21:30:34 +02:00
|
|
|
/*
|
|
|
|
* Operational callbacks
|
|
|
|
*/
|
2021-05-10 06:00:53 +02:00
|
|
|
XLogReaderRoutine routine;
|
2020-05-08 21:30:34 +02:00
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
/* ----------------------------------------
|
|
|
|
* Public parameters
|
|
|
|
* ----------------------------------------
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* System identifier of the xlog files we're about to read. Set to zero
|
|
|
|
* (the default value) if unknown or unimportant.
|
|
|
|
*/
|
|
|
|
uint64 system_identifier;
|
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
/*
|
|
|
|
* Opaque data for callbacks to use. Not used by XLogReader.
|
|
|
|
*/
|
|
|
|
void *private_data;
|
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
/*
|
|
|
|
* Start and end point of last record read. EndRecPtr is also used as the
|
2020-01-26 10:39:00 +01:00
|
|
|
* position to read next. Calling XLogBeginRead() sets EndRecPtr to the
|
|
|
|
* starting position and ReadRecPtr to invalid.
|
2022-03-18 05:45:04 +01:00
|
|
|
*
|
|
|
|
* Start and end point of last record returned by XLogReadRecord(). These
|
|
|
|
* are also available as record->lsn and record->next_lsn.
|
2013-01-16 20:12:53 +01:00
|
|
|
*/
|
2021-05-10 06:00:53 +02:00
|
|
|
XLogRecPtr ReadRecPtr; /* start of last record read */
|
2013-01-16 20:12:53 +01:00
|
|
|
XLogRecPtr EndRecPtr; /* end+1 of last record read */
|
|
|
|
|
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 16:21:51 +02:00
|
|
|
/*
|
|
|
|
* Set at the end of recovery: the start point of a partial record at the
|
|
|
|
* end of WAL (InvalidXLogRecPtr if there wasn't one), and the start
|
|
|
|
* location of its first contrecord that went missing.
|
|
|
|
*/
|
|
|
|
XLogRecPtr abortedRecPtr;
|
|
|
|
XLogRecPtr missingContrecPtr;
|
|
|
|
/* Set when XLP_FIRST_IS_OVERWRITE_CONTRECORD is found */
|
|
|
|
XLogRecPtr overwrittenRecPtr;
|
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
|
|
|
|
/* ----------------------------------------
|
|
|
|
* Decoded representation of current record
|
|
|
|
*
|
|
|
|
* Use XLogRecGet* functions to investigate the record; these fields
|
|
|
|
* should not be accessed directly.
|
|
|
|
* ----------------------------------------
|
2022-03-18 05:45:04 +01:00
|
|
|
* Start and end point of the last record read and decoded by
|
|
|
|
* XLogReadRecordInternal(). NextRecPtr is also used as the position to
|
|
|
|
* decode next. Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
|
|
|
|
* the requested starting position.
|
2014-11-20 16:56:26 +01:00
|
|
|
*/
|
2022-03-18 05:45:04 +01:00
|
|
|
XLogRecPtr DecodeRecPtr; /* start of last record decoded */
|
|
|
|
XLogRecPtr NextRecPtr; /* end+1 of last record decoded */
|
|
|
|
XLogRecPtr PrevRecPtr; /* start of previous record decoded */
|
2021-05-10 06:00:53 +02:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/* Last record returned by XLogReadRecord(). */
|
|
|
|
DecodedXLogRecord *record;
|
2014-11-20 16:56:26 +01:00
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
/* ----------------------------------------
|
|
|
|
* private/internal state
|
|
|
|
* ----------------------------------------
|
|
|
|
*/
|
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/*
|
|
|
|
* Buffer for decoded records. This is a circular buffer, though
|
|
|
|
* individual records can't be split in the middle, so some space is often
|
|
|
|
* wasted at the end. Oversized records that don't fit in this space are
|
|
|
|
* allocated separately.
|
|
|
|
*/
|
|
|
|
char *decode_buffer;
|
|
|
|
size_t decode_buffer_size;
|
|
|
|
bool free_decode_buffer; /* need to free? */
|
|
|
|
char *decode_buffer_head; /* data is read from the head */
|
|
|
|
char *decode_buffer_tail; /* new data is written at the tail */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Queue of records that have been decoded. This is a linked list that
|
|
|
|
* usually consists of consecutive records in decode_buffer, but may also
|
|
|
|
* contain oversized records allocated with palloc().
|
|
|
|
*/
|
|
|
|
DecodedXLogRecord *decode_queue_head; /* oldest decoded record */
|
|
|
|
DecodedXLogRecord *decode_queue_tail; /* newest decoded record */
|
|
|
|
|
2021-04-08 13:03:34 +02:00
|
|
|
/*
|
2021-05-10 06:00:53 +02:00
|
|
|
* Buffer for currently read page (XLOG_BLCKSZ bytes, valid up to at least
|
|
|
|
* readLen bytes)
|
2021-04-08 13:03:34 +02:00
|
|
|
*/
|
2021-05-10 06:00:53 +02:00
|
|
|
char *readBuf;
|
|
|
|
uint32 readLen;
|
2021-04-08 13:03:34 +02:00
|
|
|
|
2019-09-24 21:08:31 +02:00
|
|
|
/* last read XLOG position for data currently in readBuf */
|
|
|
|
WALSegmentContext segcxt;
|
|
|
|
WALOpenSegment seg;
|
2019-11-25 19:04:54 +01:00
|
|
|
uint32 segoff;
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2016-03-30 23:56:13 +02:00
|
|
|
/*
|
|
|
|
* beginning of prior page read, and its TLI. Doesn't necessarily
|
|
|
|
* correspond to what's in readBuf; used for timeline sanity checks.
|
|
|
|
*/
|
2013-01-16 20:12:53 +01:00
|
|
|
XLogRecPtr latestPagePtr;
|
|
|
|
TimeLineID latestPageTLI;
|
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
/* beginning of the WAL record being read. */
|
|
|
|
XLogRecPtr currRecPtr;
|
2017-03-22 08:05:12 +01:00
|
|
|
/* timeline to read it from, 0 if a lookup is required */
|
|
|
|
TimeLineID currTLI;
|
2017-05-17 22:31:56 +02:00
|
|
|
|
2017-03-22 08:05:12 +01:00
|
|
|
/*
|
|
|
|
* Safe point to read to in currTLI if current TLI is historical
|
|
|
|
* (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
|
|
|
|
*
|
|
|
|
* Actually set to the start of the segment containing the timeline switch
|
|
|
|
* that ends currTLI's validity, not the LSN of the switch itself, since
|
|
|
|
* we can't assume the old segment will be present.
|
|
|
|
*/
|
|
|
|
XLogRecPtr currTLIValidUntil;
|
2017-05-17 22:31:56 +02:00
|
|
|
|
2017-03-22 08:05:12 +01:00
|
|
|
/*
|
|
|
|
* If currTLI is not the most recent known timeline, the next timeline to
|
|
|
|
* read from when currTLIValidUntil is reached.
|
|
|
|
*/
|
|
|
|
TimeLineID nextTLI;
|
Use the right timeline when beginning to stream from master.
The xlogreader refactoring broke the logic to decide which timeline to start
streaming from. XLogPageRead() uses the timeline history to check which
timeline the requested WAL position falls into. However, after the
refactoring, XLogPageRead() is always first called with the first page in
the segment, to verify the segment header, and only then with the actual WAL
position we're interested in. That first read of the segment's header made
XLogPageRead() always start streaming from the old timeline containing
the segment header, not the timeline containing the actual record, if there
was a timeline switch within the segment.
I thought I fixed this yesterday, but that fix was too narrow and only fixed
this for the corner-case that the timeline switch happened in the first page
of the segment. To fix this more robustly, pass explicitly the position of
the record we're actually interested in to XLogPageRead, and use that to
decide which timeline to read from, rather than deduce it from the page and
offset.
Per report from Fujii Masao.
2013-01-18 10:41:36 +01:00
|
|
|
|
2018-11-21 00:43:32 +01:00
|
|
|
/*
|
|
|
|
* Buffer for current ReadRecord result (expandable), used when a record
|
|
|
|
* crosses a page boundary.
|
|
|
|
*/
|
2013-01-16 20:12:53 +01:00
|
|
|
char *readRecordBuf;
|
|
|
|
uint32 readRecordBufSize;
|
|
|
|
|
|
|
|
/* Buffer to hold error message */
|
|
|
|
char *errormsg_buf;
|
2022-03-18 05:45:04 +01:00
|
|
|
bool errormsg_deferred;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flag to indicate to XLogPageReadCB that it should not block waiting for
|
|
|
|
* data.
|
|
|
|
*/
|
|
|
|
bool nonblocking;
|
2013-01-16 20:12:53 +01:00
|
|
|
};
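The comment on currTLIValidUntil says the field holds the start of the segment containing the timeline switch rather than the switch LSN itself. A minimal sketch of that rounding follows, assuming a 64-bit LSN and a caller-supplied segment size; demo_segment_start and demo_tli_valid_until are hypothetical helper names, not functions from xlogreader.c.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Round an LSN down to the first byte of the segment that contains it. */
XLogRecPtr
demo_segment_start(XLogRecPtr lsn, uint64_t wal_segment_size)
{
    return lsn - (lsn % wal_segment_size);
}

/*
 * Hypothetical helper: switch_point is the LSN at which the timeline being
 * read stops being valid, or InvalidXLogRecPtr when reading the current
 * timeline. The result is the "safe point to read to" described above.
 */
XLogRecPtr
demo_tli_valid_until(XLogRecPtr switch_point, uint64_t wal_segment_size)
{
    if (switch_point == InvalidXLogRecPtr)
        return InvalidXLogRecPtr;
    return demo_segment_start(switch_point, wal_segment_size);
}
```

Stopping at the segment boundary is the conservative choice: the reader never depends on the old timeline's copy of the segment in which the switch happened.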
|
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/*
|
|
|
|
* Check if XLogNextRecord() has any more queued records or an error to return.
|
|
|
|
*/
|
|
|
|
static inline bool
|
|
|
|
XLogReaderHasQueuedRecordOrError(XLogReaderState *state)
|
|
|
|
{
|
|
|
|
return (state->decode_queue_head != NULL) || state->errormsg_deferred;
|
|
|
|
}
|
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
/* Get a new XLogReader */
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time, large segment sizes are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting a
particular load. Therefore change it to an initdb-time setting.
This includes a breaking change to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Haas, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
|
2019-09-24 21:08:31 +02:00
|
|
|
const char *waldir,
|
2021-05-10 06:00:53 +02:00
|
|
|
XLogReaderRoutine *routine,
|
|
|
|
void *private_data);
|
|
|
|
extern XLogReaderRoutine *LocalXLogReaderRoutine(void);
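Because XLogReaderAllocate() takes wal_segment_size as an argument, callers need to know the segment size before they can map an LSN to a segment. The sketch below shows that arithmetic under stated assumptions: the size is a power of two between 1MB and 1GB (the range initdb accepts, as far as I know), and the demo_ function names are hypothetical rather than the macros PostgreSQL itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */

/* Assumed validity rule: power of two, between 1MB and 1GB inclusive. */
bool
demo_wal_segment_size_is_valid(int64_t size)
{
    const int64_t min_size = 1024 * 1024;
    const int64_t max_size = 1024 * 1024 * 1024;
    bool        power_of_two = size > 0 && (size & (size - 1)) == 0;

    return power_of_two && size >= min_size && size <= max_size;
}

/* Segment number that contains the given LSN. */
uint64_t
demo_lsn_to_segno(XLogRecPtr lsn, int wal_segment_size)
{
    return lsn / (uint64_t) wal_segment_size;
}

/* Byte offset of the given LSN within its segment. */
uint32_t
demo_lsn_to_segoff(XLogRecPtr lsn, int wal_segment_size)
{
    return (uint32_t) (lsn % (uint64_t) wal_segment_size);
}
```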
|
2013-01-16 20:12:53 +01:00
|
|
|
|
|
|
|
/* Free an XLogReader */
|
|
|
|
extern void XLogReaderFree(XLogReaderState *state);
|
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/* Optionally provide a circular decoding buffer to allow readahead. */
|
|
|
|
extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
|
|
|
|
void *buffer,
|
|
|
|
size_t size);
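The decode buffer is described as circular, with the restriction that a record is never split across the wrap point, so some space at the end can go unused. The following is a simplified sketch of an allocator with that property, not the code in xlogreader.c; DecodeRing and the ring_ functions are hypothetical names, and a NULL return stands for the case where the real reader would allocate an oversized record separately.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DecodeRing
{
    char       *buf;            /* start of the buffer */
    size_t      size;           /* total bytes in the buffer */
    char       *head;           /* oldest live record; data is read here */
    char       *tail;           /* next free byte; new data is written here */
} DecodeRing;

void
ring_init(DecodeRing *r, char *buf, size_t size)
{
    r->buf = buf;
    r->size = size;
    r->head = r->tail = buf;    /* head == tail means "empty" */
}

/* Reserve `need` contiguous bytes, or return NULL if they do not fit. */
char *
ring_alloc(DecodeRing *r, size_t need)
{
    char       *end = r->buf + r->size;
    char       *p = NULL;

    if (r->tail >= r->head)
    {
        if ((size_t) (end - r->tail) >= need)
            p = r->tail;        /* fits between tail and end of buffer */
        else if ((size_t) (r->head - r->buf) > need)
            p = r->buf;         /* wrap; bytes from old tail to end are wasted */
    }
    else if ((size_t) (r->head - r->tail) > need)
        p = r->tail;            /* fits in the gap before head */

    if (p != NULL)
        r->tail = p + need;
    return p;
}

/* Release everything older than next_live (NULL: nothing is live anymore). */
void
ring_consume(DecodeRing *r, char *next_live)
{
    if (next_live == NULL)
        r->head = r->tail = r->buf;
    else
        r->head = next_live;
}
```

The strict comparisons in the wrap and gap cases keep tail from ever catching up with head, so head == tail can only mean an empty buffer.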
|
|
|
|
|
2020-01-26 10:39:00 +01:00
|
|
|
/* Position the XLogReader to the given record */
|
|
|
|
extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
|
2021-05-10 06:00:53 +02:00
|
|
|
extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
|
2020-01-26 10:39:00 +01:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/* Return values from XLogPageReadCB. */
|
|
|
|
typedef enum XLogPageReadResult
|
|
|
|
{
|
|
|
|
XLREAD_SUCCESS = 0, /* record is successfully read */
|
|
|
|
XLREAD_FAIL = -1, /* failed during reading a record */
|
|
|
|
XLREAD_WOULDBLOCK = -2 /* nonblocking mode only, no data */
|
|
|
|
} XLogPageReadResult;
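To illustrate how the nonblocking flag and XLREAD_WOULDBLOCK fit together, here is a small sketch of a page-read style callback. It is an assumption-laden model: DemoWalSource and demo_page_read are hypothetical, the enum is copied from the declaration above so the block compiles alone, and a real blocking callback would wait for more WAL where this sketch simply reports failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum XLogPageReadResult
{
    XLREAD_SUCCESS = 0,         /* record is successfully read */
    XLREAD_FAIL = -1,           /* failed during reading a record */
    XLREAD_WOULDBLOCK = -2      /* nonblocking mode only, no data */
} XLogPageReadResult;

/* Hypothetical WAL source: everything before available_upto can be read. */
typedef struct DemoWalSource
{
    uint64_t    available_upto;
} DemoWalSource;

/*
 * Returns the number of bytes read on success, or a negative
 * XLogPageReadResult code when the request cannot be satisfied.
 */
int
demo_page_read(const DemoWalSource *src, uint64_t page_start,
               int req_len, bool nonblocking)
{
    if (page_start + (uint64_t) req_len <= src->available_upto)
        return req_len;         /* the whole request is available */

    if (nonblocking)
        return XLREAD_WOULDBLOCK;   /* caller should retry later */

    /* A real callback would wait here; the sketch has nothing to wait on. */
    return XLREAD_FAIL;
}
```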
|
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
/* Read the next XLog record. Returns NULL on end-of-WAL or failure */
|
|
|
|
extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
|
|
|
|
char **errormsg);
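The usage pattern from the file header (position the reader, then call XLogReadRecord() until it returns NULL) can be sketched with a stand-in reader. Everything prefixed demo_ or Demo is hypothetical and only mimics the shape of the real API; the sketch also assumes that *errormsg stays NULL when the reader simply runs out of data and is set when a record is invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoRecord
{
    uint32_t    tot_len;        /* total record length in bytes */
} DemoRecord;

typedef struct DemoReader
{
    const DemoRecord *records;  /* pretend WAL contents */
    size_t      nrecords;
    size_t      next;           /* index of the next record to return */
    size_t      bad_index;      /* index of an invalid record, or nrecords */
} DemoReader;

static char demo_errbuf[] = "invalid record";

/* Position the reader, in the spirit of XLogBeginRead(). */
void
demo_begin_read(DemoReader *r, size_t start)
{
    r->next = start;
}

/* Return the next record, or NULL on end of data or on an invalid record. */
const DemoRecord *
demo_read_record(DemoReader *r, char **errormsg)
{
    *errormsg = NULL;
    if (r->next >= r->nrecords)
        return NULL;            /* end of data: no message */
    if (r->next == r->bad_index)
    {
        *errormsg = demo_errbuf;
        return NULL;            /* failure: message is set */
    }
    return &r->records[r->next++];
}

/* The read loop itself: sum record lengths until NULL comes back. */
uint64_t
demo_total_len(DemoReader *r, size_t start, char **errormsg)
{
    const DemoRecord *rec;
    uint64_t    total = 0;

    demo_begin_read(r, start);
    while ((rec = demo_read_record(r, errormsg)) != NULL)
        total += rec->tot_len;
    return total;
}
```

After the loop ends, the caller inspects errormsg to tell a clean stop from a failure.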
|
2021-04-08 13:03:34 +02:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/* Consume the next record or error. */
|
|
|
|
extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
|
|
|
|
char **errormsg);
|
|
|
|
|
|
|
|
/* Release the previously returned record, if necessary. */
|
|
|
|
extern void XLogReleasePreviousRecord(XLogReaderState *state);
|
|
|
|
|
|
|
|
/* Try to read ahead, if there is data and space. */
|
|
|
|
extern DecodedXLogRecord *XLogReadAhead(XLogReaderState *state,
|
|
|
|
bool nonblocking);
|
|
|
|
|
Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 00:34:53 +02:00
|
|
|
/* Validate a page */
|
|
|
|
extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
|
|
|
|
XLogRecPtr recptr, char *phdr);
|
|
|
|
|
2019-11-25 19:04:54 +01:00
|
|
|
/*
|
|
|
|
* Error information from WALRead that both backend and frontend callers can
|
2022-08-04 23:42:31 +02:00
|
|
|
* process. Currently only errors from pread can be reported.
|
2019-11-25 19:04:54 +01:00
|
|
|
*/
|
|
|
|
typedef struct WALReadError
|
|
|
|
{
|
2022-08-04 23:42:31 +02:00
|
|
|
int wre_errno; /* errno set by the last pread() */
|
2019-11-25 19:04:54 +01:00
|
|
|
int wre_off; /* Offset we tried to read from. */
|
|
|
|
int wre_req; /* Bytes requested to be read. */
|
|
|
|
int wre_read; /* Bytes read by the last pread(). */
|
|
|
|
WALOpenSegment wre_seg; /* Segment we tried to read from. */
|
|
|
|
} WALReadError;
|
|
|
|
|
2020-05-08 21:30:34 +02:00
|
|
|
extern bool WALRead(XLogReaderState *state,
|
|
|
|
char *buf, XLogRecPtr startptr, Size count,
|
2020-05-13 18:17:08 +02:00
|
|
|
TimeLineID tli, WALReadError *errinfo);
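Since WALRead() reports failures through WALReadError instead of raising them, each caller decides how to present the error. A minimal sketch of that step follows, assuming a negative wre_read means the system call failed (so wre_errno is meaningful) and any other value means a short read. DemoWALReadError omits wre_seg, and demo_format_wal_read_error and its message wording are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for WALReadError without the wre_seg member. */
typedef struct DemoWALReadError
{
    int         wre_errno;      /* errno set by the last pread() */
    int         wre_off;        /* offset we tried to read from */
    int         wre_req;        /* bytes requested to be read */
    int         wre_read;       /* bytes read by the last pread() */
} DemoWALReadError;

/* Format a message into out; returns what snprintf() returns. */
int
demo_format_wal_read_error(const DemoWALReadError *e, const char *segname,
                           char *out, size_t outlen)
{
    if (e->wre_read < 0)
        return snprintf(out, outlen,
                        "could not read from WAL segment %s, offset %d: %s",
                        segname, e->wre_off, strerror(e->wre_errno));

    /* No system error: the read returned fewer bytes than requested. */
    return snprintf(out, outlen,
                    "could not read from WAL segment %s, offset %d: read %d of %d",
                    segname, e->wre_off, e->wre_read, e->wre_req);
}
```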
|
2019-11-25 19:04:54 +01:00
|
|
|
|
2014-11-20 16:56:26 +01:00
|
|
|
/* Functions for decoding an XLogRecord */
|
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
|
|
|
|
extern bool DecodeXLogRecord(XLogReaderState *state,
|
|
|
|
DecodedXLogRecord *decoded,
|
|
|
|
XLogRecord *record,
|
|
|
|
XLogRecPtr lsn,
|
2014-11-20 16:56:26 +01:00
|
|
|
char **errmsg);
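Decoding copies the main data and the per-block data into MAXALIGNed buffers, which is why the decoded form can need more space than the raw record. The sketch below shows only the alignment arithmetic, assuming a maximum alignment of 8 bytes (typical on 64-bit platforms); DEMO_MAXALIGN and demo_place_chunk are hypothetical names, not the PostgreSQL macros or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAXIMUM_ALIGNOF 8  /* assumption: typical 64-bit platform */

/* Round len up to the next multiple of DEMO_MAXIMUM_ALIGNOF. */
#define DEMO_MAXALIGN(len) \
    (((uintptr_t) (len) + (DEMO_MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (DEMO_MAXIMUM_ALIGNOF - 1)))

/*
 * Place a chunk of `len` bytes at the first aligned offset at or after
 * *cursor. Returns that offset and advances *cursor past the chunk.
 */
size_t
demo_place_chunk(size_t *cursor, size_t len)
{
    size_t      start = (size_t) DEMO_MAXALIGN(*cursor);

    *cursor = start + len;
    return start;
}
```

Each chunk can cost up to DEMO_MAXIMUM_ALIGNOF - 1 bytes of padding, so a space estimate has to budget for that padding once per chunk.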
|
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/*
|
|
|
|
* Macros that provide access to parts of the record most recently returned by
|
|
|
|
* XLogReadRecord() or XLogNextRecord().
|
|
|
|
*/
|
|
|
|
#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
|
|
|
|
#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
|
|
|
|
#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
|
|
|
|
#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
|
|
|
|
#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
|
|
|
|
#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
|
|
|
|
#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
|
|
|
|
#define XLogRecGetData(decoder) ((decoder)->record->main_data)
|
|
|
|
#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
|
|
|
|
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
|
|
|
|
#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
|
|
|
|
#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
|
|
|
|
#define XLogRecHasBlockRef(decoder, block_id) \
|
|
|
|
(((decoder)->record->max_block_id >= (block_id)) && \
|
|
|
|
((decoder)->record->blocks[block_id].in_use))
|
|
|
|
#define XLogRecHasBlockImage(decoder, block_id) \
|
|
|
|
((decoder)->record->blocks[block_id].has_image)
|
|
|
|
#define XLogRecBlockImageApply(decoder, block_id) \
|
|
|
|
((decoder)->record->blocks[block_id].apply_image)
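A typical consumer of these macros walks block IDs from 0 to XLogRecMaxBlockId() and skips IDs that are not in use. The sketch below copies four of the macros verbatim and runs them against stand-in structs so it compiles on its own; the Demo types carry only the fields the macros touch, and the array bound of 32 is an assumption about the maximum block ID.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_BLOCK_ID 32    /* assumed upper bound on block IDs */

typedef struct DemoBlock
{
    bool        in_use;
    bool        has_image;
    bool        apply_image;
} DemoBlock;

typedef struct DemoDecodedRecord
{
    int         max_block_id;   /* -1 when the record has no block refs */
    DemoBlock   blocks[DEMO_MAX_BLOCK_ID + 1];
} DemoDecodedRecord;

typedef struct DemoReaderState
{
    DemoDecodedRecord *record;
} DemoReaderState;

/* Macro bodies copied from the header above. */
#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
#define XLogRecHasBlockRef(decoder, block_id) \
    (((decoder)->record->max_block_id >= (block_id)) && \
     ((decoder)->record->blocks[block_id].in_use))
#define XLogRecHasBlockImage(decoder, block_id) \
    ((decoder)->record->blocks[block_id].has_image)
#define XLogRecBlockImageApply(decoder, block_id) \
    ((decoder)->record->blocks[block_id].apply_image)

/* Count referenced blocks whose full-page image should be applied. */
int
demo_count_images_to_apply(DemoReaderState *r)
{
    int         n = 0;

    for (int id = 0; id <= XLogRecMaxBlockId(r); id++)
    {
        if (!XLogRecHasBlockRef(r, id))
            continue;           /* gaps in the ID sequence are allowed */
        if (XLogRecHasBlockImage(r, id) && XLogRecBlockImageApply(r, id))
            n++;
    }
    return n;
}
```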
|
2014-11-20 16:56:26 +01:00
|
|
|
|
2019-07-15 07:03:46 +02:00
|
|
|
#ifndef FRONTEND
|
|
|
|
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
|
|
|
|
#endif
|
|
|
|
|
2019-08-13 06:53:41 +02:00
|
|
|
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
|
2014-11-20 16:56:26 +01:00
|
|
|
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
|
2022-04-11 23:43:46 +02:00
|
|
|
extern void XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from their
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
RelFileLocator *rlocator, ForkNumber *forknum,
|
2014-11-20 16:56:26 +01:00
|
|
|
BlockNumber *blknum);
|
2022-04-07 09:28:40 +02:00
|
|
|
extern bool XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
|
2022-07-06 17:39:09 +02:00
|
|
|
RelFileLocator *rlocator, ForkNumber *forknum,
|
2022-04-07 09:28:40 +02:00
|
|
|
BlockNumber *blknum,
|
|
|
|
Buffer *prefetch_buffer);
|
2014-11-20 16:56:26 +01:00
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
#endif /* XLOGREADER_H */
|