/*-------------------------------------------------------------------------
 *
 * xlogreader.h
 *		Definitions for the generic XLog reading facility
 *
 * Portions Copyright (c) 2013-2021, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/include/access/xlogreader.h
 *
 * NOTES
 *		See the definition of the XLogReaderState struct for instructions on
 *		how to use the XLogReader infrastructure.
 *
 *		The basic idea is to allocate an XLogReaderState via
 *		XLogReaderAllocate(), position the reader to the first record with
 *		XLogBeginRead() or XLogFindNextRecord(), and call XLogReadRecord()
 *		until it returns NULL.
 *
 *		Callers supply a page_read callback if they want to call
 *		XLogReadRecord or XLogFindNextRecord; it can be passed in as NULL
 *		otherwise.  The WALRead function can be used as a helper to write
 *		page_read callbacks, but it is not mandatory; callers that use it
 *		must supply segment_open callbacks.  The segment_close callback
 *		must always be supplied.
 *
 *		After reading a record with XLogReadRecord(), it's decomposed into
 *		the per-block and main data parts, and the parts can be accessed
 *		with the XLogRec* macros and functions.  You can also decode a
 *		record that's already constructed in memory, without reading from
 *		disk, by calling the DecodeXLogRecord() function.
 *-------------------------------------------------------------------------
 */
#ifndef XLOGREADER_H
#define XLOGREADER_H

#ifndef FRONTEND
#include "access/transam.h"
#endif

#include "access/xlogrecord.h"
#include "storage/buf.h"

/* WALOpenSegment represents a WAL segment being read. */
typedef struct WALOpenSegment
{
	int			ws_file;		/* segment file descriptor */
	XLogSegNo	ws_segno;		/* segment number */
	TimeLineID	ws_tli;			/* timeline ID of the currently open file */
} WALOpenSegment;

/* WALSegmentContext carries context information about WAL segments to read */
typedef struct WALSegmentContext
{
	char		ws_dir[MAXPGPATH];	/* directory containing the WAL segments */
	int			ws_segsize;		/* WAL segment size, in bytes */
} WALSegmentContext;

typedef struct XLogReaderState XLogReaderState;
typedef struct XLogFindNextRecordState XLogFindNextRecordState;

/* Function type definition for the segment cleanup callback */
typedef void (*WALSegmentCleanupCB) (XLogReaderState *xlogreader);

/* Function type definitions for the open/close callbacks for WALRead() */
typedef void (*WALSegmentOpenCB) (XLogReaderState *xlogreader,
								  XLogSegNo nextSegNo,
								  TimeLineID *tli_p);
typedef void (*WALSegmentCloseCB) (XLogReaderState *xlogreader);

typedef struct
{
	/* Is this block ref in use? */
	bool		in_use;

	/* Identify the block this refers to */
	RelFileNode rnode;
	ForkNumber	forknum;
	BlockNumber blkno;

	/* Workspace for remembering last known buffer holding this block. */
	Buffer		recent_buffer;

	/* copy of the fork_flags field from the XLogRecordBlockHeader */
	uint8		flags;

	/* Information on full-page image, if any */
	bool		has_image;		/* has image, even for consistency checking */
	bool		apply_image;	/* has image that should be restored */
	char	   *bkp_image;
	uint16		hole_offset;
	uint16		hole_length;
	uint16		bimg_len;
	uint8		bimg_info;

	/* Buffer holding the rmgr-specific data associated with this block */
	bool		has_data;
	char	   *data;
	uint16		data_len;
	uint16		data_bufsz;
} DecodedBkpBlock;

/* Return code from XLogReadRecord */
typedef enum XLogReadRecordResult
{
	XLREAD_SUCCESS,				/* record was read successfully */
	XLREAD_NEED_DATA,			/* need more data; see XLogReadRecord */
	XLREAD_FULL,				/* cannot hold more data while reading ahead */
	XLREAD_FAIL					/* failed while reading a record */
} XLogReadRecordResult;

/*
 * Internal state of XLogReadRecord
 *
 * XLogReadRecord runs a state machine while reading a record.  These states
 * are not seen outside the function.  Each state may repeat several times,
 * exiting to request new data from the caller.  See the comment of
 * XLogReadRecord for details.
 */
typedef enum XLogReadRecordState
{
	XLREAD_NEXT_RECORD,
	XLREAD_TOT_LEN,
	XLREAD_FIRST_FRAGMENT,
	XLREAD_CONTINUATION
} XLogReadRecordState;

/*
 * The decoded contents of a record.  This occupies a contiguous region of
 * memory, with main_data and blocks[n].data pointing to memory after the
 * members declared here.
 */
typedef struct DecodedXLogRecord
{
	/* Private members used for resource management. */
	size_t		size;			/* total size of decoded record */
	bool		oversized;		/* outside the regular decode buffer? */
	struct DecodedXLogRecord *next; /* decoded record queue link */

	/* Public members. */
	XLogRecPtr	lsn;			/* location */
	XLogRecPtr	next_lsn;		/* location of next record */
	XLogRecord	header;			/* header */
	RepOriginId record_origin;
	TransactionId toplevel_xid; /* XID of top-level transaction */
	char	   *main_data;		/* record's main data portion */
	uint32		main_data_len;	/* main data portion's length */
	int			max_block_id;	/* highest block_id in use (-1 if none) */
	DecodedBkpBlock blocks[FLEXIBLE_ARRAY_MEMBER];
} DecodedXLogRecord;

struct XLogReaderState
{
	/*
	 * Operational callbacks
	 */
	WALSegmentCleanupCB cleanup_cb;

	/* ----------------------------------------
	 * Public parameters
	 * ----------------------------------------
	 */

	/*
	 * System identifier of the xlog files we're about to read.  Set to zero
	 * (the default value) if unknown or unimportant.
	 */
	uint64		system_identifier;

	/*
	 * Start and end point of last record read.  EndRecPtr is also used as the
	 * position to read next.  Calling XLogBeginRead() sets EndRecPtr to the
	 * starting position and ReadRecPtr to invalid.
	 *
	 * Start and end point of last record returned by XLogReadRecord().  These
	 * are also available as record->lsn and record->next_lsn.
	 */
	XLogRecPtr	ReadRecPtr;		/* start of last record read or being read */
	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */

	/* ----------------------------------------
	 * Communication with page reader
	 * readBuf is XLOG_BLCKSZ bytes, valid up to at least reqLen bytes.
	 * ----------------------------------------
	 */
	/* variables the clients of xlogreader can examine */
	XLogRecPtr	readPagePtr;	/* page pointer to read */
	int32		reqLen;			/* bytes requested from the caller */
	char	   *readBuf;		/* buffer to store data */
	bool		page_verified;	/* is the page header in the buffer verified? */
	bool		record_verified;	/* is the current record header verified? */

	/* variables set by the client of xlogreader */
	int32		readLen;		/* actual bytes copied into readBuf by client,
								 * which should be >= reqLen.  Client should
								 * use XLogReaderSetInputData() to set. */

	/* ----------------------------------------
	 * Decoded representation of current record
	 *
	 * Use XLogRecGet* functions to investigate the record; these fields
	 * should not be accessed directly.
	 * ----------------------------------------
	 * Start and end point of the last record read and decoded by
	 * XLogReadRecordInternal().  NextRecPtr is also used as the position to
	 * decode next.  Calling XLogBeginRead() sets NextRecPtr and EndRecPtr to
	 * the requested starting position.
	 */
	XLogRecPtr	DecodeRecPtr;	/* start of last record decoded */
	XLogRecPtr	NextRecPtr;		/* end+1 of last record decoded */
	XLogRecPtr	PrevRecPtr;		/* start of previous record decoded */

	/* Last record returned by XLogReadRecord(). */
	DecodedXLogRecord *record;

	/* ----------------------------------------
	 * private/internal state
	 * ----------------------------------------
	 */

	/*
	 * Buffer for decoded records.  This is a circular buffer, though
	 * individual records can't be split in the middle, so some space is often
	 * wasted at the end.  Oversized records that don't fit in this space are
	 * allocated separately.
	 */
	char	   *decode_buffer;
	size_t		decode_buffer_size;
	bool		free_decode_buffer; /* need to free? */
	char	   *decode_buffer_head; /* write head */
	char	   *decode_buffer_tail; /* read head */

	/*
|
|
|
|
* Queue of records that have been decoded. This is a linked list that
|
|
|
|
* usually consists of consecutive records in decode_buffer, but may also
|
|
|
|
* contain oversized records allocated with palloc().
|
|
|
|
*/
|
|
|
|
DecodedXLogRecord *decode_queue_head; /* newest decoded record */
|
|
|
|
DecodedXLogRecord *decode_queue_tail; /* oldest decoded record */
|
|
|
|
|
2019-09-24 21:08:31 +02:00
	/* last read XLOG position for data currently in readBuf */
	WALSegmentContext segcxt;
	WALOpenSegment seg;
2019-11-25 19:04:54 +01:00
	uint32		segoff;
2013-01-16 20:12:53 +01:00

2016-03-30 23:56:13 +02:00
	/*
	 * beginning of prior page read, and its TLI.  Doesn't necessarily
	 * correspond to what's in readBuf; used for timeline sanity checks.
	 */
2013-01-16 20:12:53 +01:00
	XLogRecPtr	latestPagePtr;
	TimeLineID	latestPageTLI;

2017-03-22 08:05:12 +01:00
	/* timeline to read it from, 0 if a lookup is required */
	TimeLineID	currTLI;
2017-05-17 22:31:56 +02:00

2017-03-22 08:05:12 +01:00
	/*
	 * Safe point to read to in currTLI if current TLI is historical
	 * (tliSwitchPoint) or InvalidXLogRecPtr if on current timeline.
	 *
2017-05-17 22:31:56 +02:00
	 * Actually set to the start of the segment containing the timeline switch
	 * that ends currTLI's validity, not the LSN of the switch itself, since
	 * we can't assume the old segment will be present.
2017-03-22 08:05:12 +01:00
	 */
	XLogRecPtr	currTLIValidUntil;
2017-05-17 22:31:56 +02:00

2017-03-22 08:05:12 +01:00
	/*
	 * If currTLI is not the most recent known timeline, the next timeline to
	 * read from when currTLIValidUntil is reached.
	 */
	TimeLineID	nextTLI;
Use the right timeline when beginning to stream from master.
The xlogreader refactoring broke the logic to decide which timeline to start
streaming from. XLogPageRead() uses the timeline history to check which
timeline the requested WAL position falls into. However, after the
refactoring, XLogPageRead() is always first called with the first page in
the segment, to verify the segment header, and only then with the actual WAL
position we're interested in. That first read of the segment's header made
XLogPageRead() always start streaming from the old timeline containing
the segment header, not the timeline containing the actual record, if there
was a timeline switch within the segment.
I thought I fixed this yesterday, but that fix was too narrow and only fixed
this for the corner-case that the timeline switch happened in the first page
of the segment. To fix this more robustly, pass explicitly the position of
the record we're actually interested in to XLogPageRead, and use that to
decide which timeline to read from, rather than deduce it from the page and
offset.
Per report from Fujii Masao.
2013-01-18 10:41:36 +01:00
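The fix described above comes down to choosing the timeline from the position of the record of interest rather than from the start of the segment or page that contains it. A minimal sketch of that lookup follows. The types and the function `sketch_tli_for_lsn` are hypothetical and only stand in for the real timeline history machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and TimeLineID. */
typedef uint64_t SketchLSN;
typedef uint32_t SketchTLI;

typedef struct SketchTimelineEntry
{
    SketchTLI   tli;
    SketchLSN   begin;      /* first LSN that belongs to this timeline */
    SketchLSN   end;        /* first LSN that no longer belongs; 0 = open */
} SketchTimelineEntry;

/*
 * Return the timeline that contains "target", or 0 if none does.  The
 * caller passes the LSN of the record it actually wants, not the LSN of
 * the segment header that happens to be read first.
 */
static SketchTLI
sketch_tli_for_lsn(const SketchTimelineEntry *history, size_t n,
                   SketchLSN target)
{
    for (size_t i = 0; i < n; i++)
    {
        if (target >= history[i].begin &&
            (history[i].end == 0 || target < history[i].end))
            return history[i].tli;
    }
    return 0;
}
```

With a timeline switch at 0x1000500, the segment start 0x1000000 maps to the old timeline while a record at 0x1000800 maps to the new one, which is exactly the case where deducing the timeline from the segment header goes wrong.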
2018-11-21 00:43:32 +01:00
	/*
	 * Buffer for current ReadRecord result (expandable), used when a record
	 * crosses a page boundary.
	 */
2013-01-16 20:12:53 +01:00
	char	   *readRecordBuf;
	uint32		readRecordBufSize;

2021-04-08 13:03:23 +02:00
	/*
2021-04-08 13:03:34 +02:00
	 * XLogReadRecordInternal() state
2021-04-08 13:03:23 +02:00
	 */
	XLogReadRecordState readRecordState;	/* state machine state */
	int			recordGotLen;	/* amount of current record that has already
								 * been read */
	int			recordRemainLen;	/* length of current record that remains */
	XLogRecPtr	recordContRecPtr;	/* where the current record continues */

2021-04-08 13:03:34 +02:00
	DecodedXLogRecord *decoding;	/* record currently being decoded */

2013-01-16 20:12:53 +01:00
	/* Buffer to hold error message */
	char	   *errormsg_buf;
2021-04-08 13:03:34 +02:00
	bool		errormsg_deferred;
2013-01-16 20:12:53 +01:00
};
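The decode_buffer fields in the struct above describe a circular buffer in which a record is never split across the wrap point, so the tail end of the buffer is sometimes left unused. A minimal sketch of such an allocator follows. `SketchRing`, `sketch_ring_alloc` and `sketch_ring_release_oldest` are hypothetical names and the logic is an illustration under those assumptions, not the reader's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_SPACE ((size_t) -1)

typedef struct SketchRing
{
    size_t      size;       /* total bytes in the buffer */
    size_t      head;       /* write offset (next free byte) */
    size_t      tail;       /* read offset (start of oldest live record) */
    int         live;       /* number of records not yet released */
} SketchRing;

/*
 * Reserve "need" contiguous bytes (need > 0).  Returns the offset of the
 * reservation, or SKETCH_NO_SPACE if no contiguous run is free; a caller
 * could then fall back to a separate allocation for the record.
 */
static size_t
sketch_ring_alloc(SketchRing *r, size_t need)
{
    size_t      off;

    if (r->live == 0)
        r->head = r->tail = 0;  /* empty: start again from the beginning */

    if (r->head >= r->tail)
    {
        if (r->head + need <= r->size)
            off = r->head;      /* fits between head and end of buffer */
        else if (need < r->tail)
            off = 0;            /* wrap; bytes after head stay unused */
        else
            return SKETCH_NO_SPACE;
    }
    else
    {
        if (r->head + need < r->tail)
            off = r->head;      /* fits between head and tail */
        else
            return SKETCH_NO_SPACE;
    }

    r->head = off + need;
    r->live++;
    return off;
}

/* Release the oldest record; "next_tail" is where the next oldest starts. */
static void
sketch_ring_release_oldest(SketchRing *r, size_t next_tail)
{
    r->live--;
    r->tail = next_tail;
}
```

The strict comparisons keep head and tail from becoming equal while records are live, so an empty buffer is recognised by the live count rather than by comparing the two offsets.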

2021-04-08 13:03:23 +02:00
struct XLogFindNextRecordState
{
	XLogReaderState *reader_state;
	XLogRecPtr	targetRecPtr;
	XLogRecPtr	currRecPtr;
};

/* Report that data is available for decoding. */
static inline void
XLogReaderSetInputData(XLogReaderState *state, int32 len)
{
	state->readLen = len;
}

2013-01-16 20:12:53 +01:00
/* Get a new XLogReader */
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment sizes are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particular load. Therefore change it to an initdb time setting.
This includes a breaking change to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
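Because the segment size is now a run-time value, code that maps a WAL position to a segment has to take the size as a parameter, as XLogReaderAllocate() does with wal_segment_size. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the validity rule (a power of two between 1MB and 1GB) is stated here as an assumption about the accepted sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed rule: a power of two between 1MB and 1GB inclusive. */
static bool
sketch_seg_size_valid(uint64_t seg_size)
{
    return seg_size >= (UINT64_C(1) << 20) &&
           seg_size <= (UINT64_C(1) << 30) &&
           (seg_size & (seg_size - 1)) == 0;
}

/* Number of the segment that contains byte position "lsn". */
static uint64_t
sketch_seg_number(uint64_t lsn, uint64_t seg_size)
{
    return lsn / seg_size;
}

/* Offset of "lsn" within its segment; relies on a power-of-two size. */
static uint64_t
sketch_seg_offset(uint64_t lsn, uint64_t seg_size)
{
    return lsn & (seg_size - 1);
}
```

The same position lands in a different segment, at a different offset, depending on the configured size, which is why tools reading WAL must learn the size before interpreting positions.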
extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
2019-09-24 21:08:31 +02:00
										   const char *waldir,
2021-04-08 13:03:23 +02:00
										   WALSegmentCleanupCB cleanup_cb);
2013-01-16 20:12:53 +01:00

/* Free an XLogReader */
extern void XLogReaderFree(XLogReaderState *state);

2021-04-08 13:03:34 +02:00
/* Optionally provide a circular decoding buffer to allow readahead. */
extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
									  void *buffer,
									  size_t size);

2020-01-26 10:39:00 +01:00
/* Position the XLogReader to given record */
extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
#ifdef FRONTEND
2021-04-08 13:03:23 +02:00
extern XLogFindNextRecordState *InitXLogFindNextRecord(XLogReaderState *reader_state, XLogRecPtr start_ptr);
extern bool XLogFindNextRecord(XLogFindNextRecordState *state);
2020-01-26 10:39:00 +01:00
#endif							/* FRONTEND */

2021-04-08 13:03:34 +02:00
/* Read the next record's header.  Sets *record to NULL on end-of-WAL or failure. */
2021-04-08 13:03:23 +02:00
extern XLogReadRecordResult XLogReadRecord(XLogReaderState *state,
										   XLogRecord **record,
										   char **errormsg);
2013-01-16 20:12:53 +01:00

2021-04-08 13:03:34 +02:00
/* Read the next decoded record.  Sets *record to NULL on end-of-WAL or failure. */
extern XLogReadRecordResult XLogNextRecord(XLogReaderState *state,
										   DecodedXLogRecord **record,
										   char **errormsg);

/* Try to read ahead, if there is space in the decoding buffer. */
extern XLogReadRecordResult XLogReadAhead(XLogReaderState *state,
										  DecodedXLogRecord **record,
										  char **errormsg);

Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 00:34:53 +02:00
/* Validate a page */
extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
2019-05-22 19:04:48 +02:00
										 XLogRecPtr recptr, char *phdr);
2018-05-05 00:34:53 +02:00

2019-11-25 19:04:54 +01:00
/*
 * Error information from WALRead that both backend and frontend callers can
 * process.  Currently only errors from pg_pread can be reported.
 */
typedef struct WALReadError
{
	int			wre_errno;		/* errno set by the last pg_pread() */
	int			wre_off;		/* Offset we tried to read from. */
	int			wre_req;		/* Bytes requested to be read. */
	int			wre_read;		/* Bytes read by the last read(). */
	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
} WALReadError;

2020-05-08 21:30:34 +02:00
extern bool WALRead(XLogReaderState *state,
2021-04-08 13:03:23 +02:00
					WALSegmentOpenCB segopenfn, WALSegmentCloseCB sgclosefn,
2020-05-08 21:30:34 +02:00
					char *buf, XLogRecPtr startptr, Size count,
2020-05-13 18:17:08 +02:00
					TimeLineID tli, WALReadError *errinfo);
2019-11-25 19:04:54 +01:00

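WALReadError carries enough detail for a caller to tell a failed read from a short one. A minimal sketch of turning those fields into a message follows. `SketchReadError` is a hypothetical mirror of the struct that omits wre_seg, `sketch_describe_read_error` is a hypothetical helper, and the convention that a negative wre_read means the read call itself failed is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of WALReadError, without the segment field. */
typedef struct SketchReadError
{
    int         wre_errno;  /* errno saved from the failed read */
    int         wre_off;    /* offset we tried to read from */
    int         wre_req;    /* bytes requested */
    int         wre_read;   /* bytes actually read; < 0 assumed to mean error */
} SketchReadError;

/* Format a human-readable description of the error into "out". */
static void
sketch_describe_read_error(const SketchReadError *err, const char *segname,
                           char *out, size_t outlen)
{
    if (err->wre_read < 0)
        snprintf(out, outlen,
                 "could not read from segment %s, offset %d: %s",
                 segname, err->wre_off, strerror(err->wre_errno));
    else
        snprintf(out, outlen,
                 "could not read from segment %s, offset %d: read %d of %d",
                 segname, err->wre_off, err->wre_read, err->wre_req);
}
```

Keeping the raw numbers in a struct, rather than formatting inside the read routine, lets backend and frontend callers report the failure through their own error facilities.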
2014-11-20 16:56:26 +01:00
/* Functions for decoding an XLogRecord */

2021-04-08 13:03:34 +02:00
extern size_t DecodeXLogRecordRequiredSpace(size_t xl_tot_len);
extern bool DecodeXLogRecord(XLogReaderState *state,
							 DecodedXLogRecord *decoded,
							 XLogRecord *record,
							 XLogRecPtr lsn,
2019-05-22 19:04:48 +02:00
							 char **errmsg);
2014-11-20 16:56:26 +01:00

2021-04-08 13:03:34 +02:00
#define XLogRecGetTotalLen(decoder) ((decoder)->record->header.xl_tot_len)
#define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
#define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
#define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
#define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
#define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
#define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
#define XLogRecGetData(decoder) ((decoder)->record->main_data)
#define XLogRecGetDataLen(decoder) ((decoder)->record->main_data_len)
#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->record->max_block_id >= 0)
#define XLogRecMaxBlockId(decoder) ((decoder)->record->max_block_id)
#define XLogRecGetBlock(decoder, i) (&(decoder)->record->blocks[(i)])
2014-11-20 16:56:26 +01:00
#define XLogRecHasBlockRef(decoder, block_id) \
2021-04-08 13:03:34 +02:00
	(((decoder)->record->max_block_id >= (block_id)) && \
	 ((decoder)->record->blocks[block_id].in_use))
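A macro whose expansion is an `&&` expression, such as XLogRecHasBlockRef, needs one pair of parentheses around the whole expansion to compose safely with operators like `!`. The sketch below contrasts the two forms on a hypothetical `SketchRecord` type. All names in it are illustrative and not part of this header.

```c
#include <assert.h>

/* Hypothetical, much-reduced stand-in for a decoded record. */
typedef struct SketchRecord
{
    int         max_block_id;
    int         in_use[4];
} SketchRecord;

/* Expansion is not wrapped: "!" binds only to the first comparison. */
#define SKETCH_HAS_REF_UNSAFE(r, id) \
    ((r)->max_block_id >= (id)) && \
    ((r)->in_use[id])

/* Expansion is wrapped: the macro behaves like a single expression. */
#define SKETCH_HAS_REF_SAFE(r, id) \
    (((r)->max_block_id >= (id)) && \
     ((r)->in_use[id]))

/* Expands to !(max >= id) && in_use[id], which is not the negation. */
static int
sketch_missing_unsafe(const SketchRecord *r, int id)
{
    return !SKETCH_HAS_REF_UNSAFE(r, id);
}

/* Expands to !((max >= id) && in_use[id]), the intended negation. */
static int
sketch_missing_safe(const SketchRecord *r, int id)
{
    return !SKETCH_HAS_REF_SAFE(r, id);
}
```

For a record whose only block reference is block 0, asking whether block 1 is missing yields true with the wrapped form and false with the unwrapped one.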
2014-11-20 16:56:26 +01:00
#define XLogRecHasBlockImage(decoder, block_id) \
2021-04-08 13:03:34 +02:00
	((decoder)->record->blocks[block_id].has_image)
2017-02-08 21:45:30 +01:00
#define XLogRecBlockImageApply(decoder, block_id) \
2021-04-08 13:03:34 +02:00
	((decoder)->record->blocks[block_id].apply_image)
2014-11-20 16:56:26 +01:00

2019-07-15 07:03:46 +02:00
#ifndef FRONTEND
extern FullTransactionId XLogRecGetFullXid(XLogReaderState *record);
#endif

2019-08-13 06:53:41 +02:00
extern bool RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page);
2014-11-20 16:56:26 +01:00
extern char *XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len);
extern bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
2019-05-22 19:04:48 +02:00
							   RelFileNode *rnode, ForkNumber *forknum,
							   BlockNumber *blknum);
2021-04-08 13:03:43 +02:00
extern bool XLogRecGetRecentBuffer(XLogReaderState *record, uint8 block_id,
								   RelFileNode *rnode, ForkNumber *forknum,
								   BlockNumber *blknum, Buffer *recent_buffer);
2014-11-20 16:56:26 +01:00
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
#endif							/* XLOGREADER_H */