postgresql/src/include/access/xlogrecord.h
Andres Freund 5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consist out of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00

218 lines
7.8 KiB
C

/*
* xlogrecord.h
*
* Definitions for the WAL record format.
*
* Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* src/include/access/xlogrecord.h
*/
#ifndef XLOGRECORD_H
#define XLOGRECORD_H
#include "access/rmgr.h"
#include "access/xlogdefs.h"
#include "port/pg_crc32c.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
/*
* The overall layout of an XLOG record is:
* Fixed-size header (XLogRecord struct)
* XLogRecordBlockHeader struct
* XLogRecordBlockHeader struct
* ...
* XLogRecordDataHeader[Short|Long] struct
* block data
* block data
* ...
* main data
*
* There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
* rmgr-specific data not associated with a block. XLogRecord structs
* always start on MAXALIGN boundaries in the WAL files, but the rest of
* the fields are not aligned.
*
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
*/
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
/*
* The high 4 bits in xl_info may be used freely by rmgr. The
* XLR_SPECIAL_REL_UPDATE bit can be passed by XLogInsert caller. The rest
* are set internally by XLogInsert.
*/
#define XLR_INFO_MASK 0x0F
#define XLR_RMGR_INFO_MASK 0xF0
/*
* If a WAL record modifies any relation files, in ways not covered by the
* usual block references, this flag is set. This is not used for anything
* by PostgreSQL itself, but it allows external tools that read WAL and keep
* track of modified blocks to recognize such special record types.
*/
#define XLR_SPECIAL_REL_UPDATE 0x01
/*
* Header info for block data appended to an XLOG record.
*
* 'data_length' is the length of the rmgr-specific payload data associated
* with this block. It does not include the possible full page image, nor
* XLogRecordBlockHeader struct itself.
*
* Note that we don't attempt to align the XLogRecordBlockHeader struct!
* So, the struct must be copied to aligned local storage before use.
*/
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including page
* image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;
#define SizeOfXLogRecordBlockHeader (offsetof(XLogRecordBlockHeader, data_length) + sizeof(uint16))
/*
* Additional header information when a full-page image is included
* (i.e. when BKPBLOCK_HAS_IMAGE is set).
*
* As a trivial form of data compression, the XLOG code is aware that
* PG data pages usually contain an unused "hole" in the middle, which
* contains only zero bytes. If the length of "hole" > 0 then we have removed
* such a "hole" from the stored data (and it's not counted in the
* XLOG record's CRC, either). Hence, the amount of block data actually
* present is BLCKSZ - the length of "hole" bytes.
*
* When wal_compression is enabled, a full page image which "hole" was
* removed is additionally compressed using PGLZ compression algorithm.
* This can reduce the WAL volume, but at some extra cost of CPU spent
* on the compression during WAL logging. In this case, since the "hole"
* length cannot be calculated by subtracting the number of page image bytes
* from BLCKSZ, basically it needs to be stored as an extra information.
* But when no "hole" exists, we can assume that the "hole" length is zero
* and no such an extra information needs to be stored. Note that
* the original version of page image is stored in WAL instead of the
* compressed one if the number of bytes saved by compression is less than
* the length of extra information. Hence, when a page image is successfully
* compressed, the amount of block data actually present is less than
* BLCKSZ - the length of "hole" bytes - the length of extra information.
*/
typedef struct XLogRecordBlockImageHeader
{
uint16 length; /* number of page image bytes */
uint16 hole_offset; /* number of bytes before "hole" */
uint8 bimg_info; /* flag bits, see below */
/*
* If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
* an XLogRecordBlockCompressHeader struct follows.
*/
} XLogRecordBlockImageHeader;
#define SizeOfXLogRecordBlockImageHeader \
(offsetof(XLogRecordBlockImageHeader, bimg_info) + sizeof(uint8))
/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE 0x01 /* page image has "hole" */
#define BKPIMAGE_IS_COMPRESSED 0x02 /* page image is compressed */
/*
* Extra header information used when page image has "hole" and
* is compressed.
*/
typedef struct XLogRecordBlockCompressHeader
{
uint16 hole_length; /* number of bytes in "hole" */
} XLogRecordBlockCompressHeader;
#define SizeOfXLogRecordBlockCompressHeader \
sizeof(XLogRecordBlockCompressHeader)
/*
* Maximum size of the header for a block reference. This is used to size a
* temporary buffer for constructing the header.
*/
#define MaxSizeOfXLogRecordBlockHeader \
(SizeOfXLogRecordBlockHeader + \
SizeOfXLogRecordBlockImageHeader + \
SizeOfXLogRecordBlockCompressHeader + \
sizeof(RelFileNode) + \
sizeof(BlockNumber))
/*
* The fork number fits in the lower 4 bits in the fork_flags field. The upper
* bits are used for flags.
*/
#define BKPBLOCK_FORK_MASK 0x0F
#define BKPBLOCK_FLAG_MASK 0xF0
#define BKPBLOCK_HAS_IMAGE 0x10 /* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA 0x20
#define BKPBLOCK_WILL_INIT 0x40 /* redo will re-init the page */
#define BKPBLOCK_SAME_REL 0x80 /* RelFileNode omitted, same as previous */
/*
* XLogRecordDataHeaderShort/Long are used for the "main data" portion of
* the record. If the length of the data is less than 256 bytes, the short
* form is used, with a single byte to hold the length. Otherwise the long
* form is used.
*
* (These structs are currently not used in the code, they are here just for
* documentation purposes).
*/
typedef struct XLogRecordDataHeaderShort
{
uint8 id; /* XLR_BLOCK_ID_DATA_SHORT */
uint8 data_length; /* number of payload bytes */
} XLogRecordDataHeaderShort;
#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)
typedef struct XLogRecordDataHeaderLong
{
uint8 id; /* XLR_BLOCK_ID_DATA_LONG */
/* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;
#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))
/*
* Block IDs used to distinguish different kinds of record fragments. Block
* references are numbered from 0 to XLR_MAX_BLOCK_ID. A rmgr is free to use
* any ID number in that range (although you should stick to small numbers,
* because the WAL machinery is optimized for that case). A couple of ID
* numbers are reserved to denote the "main" data portion of the record.
*
* The maximum is currently set at 32, quite arbitrarily. Most records only
* need a handful of block references, but there are a few exceptions that
* need more.
*/
#define XLR_MAX_BLOCK_ID 32
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG 254
#define XLR_BLOCK_ID_ORIGIN 253
#endif /* XLOGRECORD_H */