/*-------------------------------------------------------------------------
 *
 * xlogreader.c
 *		Generic XLog reading facility
 *
 * Portions Copyright (c) 2013-2015, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/backend/access/transam/xlogreader.c
 *
 * NOTES
 *		See xlogreader.h for more notes on this facility.
 *
 *-------------------------------------------------------------------------
 */

#include "postgres.h"

#include "access/transam.h"
#include "access/xlogrecord.h"
#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "replication/origin.h"

static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
					XLogPageHeader hdr);
static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
					  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
				XLogRecPtr recptr);
static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
				 int reqLen);
static void report_invalid_record(XLogReaderState *state, const char *fmt,...) pg_attribute_printf(2, 3);
static void ResetDecoder(XLogReaderState *state);

/* size of the buffer allocated for error message. */
#define MAX_ERRORMSG_LEN 1000

/*
 * Construct a string in state->errormsg_buf explaining what's wrong with
 * the current record being read.
 */
static void
report_invalid_record(XLogReaderState *state, const char *fmt,...)
{
	va_list		args;

	fmt = _(fmt);

	va_start(args, fmt);
	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
	va_end(args);
}

/*
 * Allocate and initialize a new XLogReader.
 *
 * Returns NULL if the xlogreader couldn't be allocated.
 */
XLogReaderState *
XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
{
	XLogReaderState *state;

	state = (XLogReaderState *)
		palloc_extended(sizeof(XLogReaderState),
						MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
	if (!state)
		return NULL;

	state->max_block_id = -1;

	/*
	 * Permanently allocate readBuf.  We do it this way, rather than just
	 * making a static array, for two reasons: (1) no need to waste the
	 * storage in most instantiations of the backend; (2) a static char array
	 * isn't guaranteed to have any particular alignment, whereas
	 * palloc_extended() will provide MAXALIGN'd storage.
	 */
	state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ,
											  MCXT_ALLOC_NO_OOM);
	if (!state->readBuf)
	{
		pfree(state);
		return NULL;
	}

	state->read_page = pagereadfunc;
	/* system_identifier initialized to zeroes above */
	state->private_data = private_data;
	/* ReadRecPtr and EndRecPtr initialized to zeroes above */
	/* readSegNo, readOff, readLen, readPageTLI initialized to zeroes above */
	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
										  MCXT_ALLOC_NO_OOM);
	if (!state->errormsg_buf)
	{
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}
	state->errormsg_buf[0] = '\0';

	/*
	 * Allocate an initial readRecordBuf of minimal size, which can later be
	 * enlarged if necessary.
	 */
	if (!allocate_recordbuf(state, 0))
	{
		pfree(state->errormsg_buf);
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}

	return state;
}
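
/*
 * Example usage (a sketch, not part of the facility itself): callers supply
 * their own XLogPageReadCB implementation; the callback "my_read_page" and
 * the start position "startptr" below are hypothetical, caller-side names.
 *
 *		XLogReaderState *reader;
 *		XLogRecord *record;
 *		char	   *errormsg;
 *
 *		reader = XLogReaderAllocate(my_read_page, NULL);
 *		if (reader == NULL)
 *			elog(ERROR, "out of memory");
 *		record = XLogReadRecord(reader, startptr, &errormsg);
 *		if (record == NULL && errormsg != NULL)
 *			elog(ERROR, "%s", errormsg);
 *		...
 *		XLogReaderFree(reader);
 */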

void
XLogReaderFree(XLogReaderState *state)
{
	int			block_id;

	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
	{
		if (state->blocks[block_id].data)
			pfree(state->blocks[block_id].data);
	}
	if (state->main_data)
		pfree(state->main_data);

	pfree(state->errormsg_buf);
	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	pfree(state->readBuf);
	pfree(state);
}

/*
 * Allocate readRecordBuf to fit a record of at least the given length.
 * Returns true if successful, false if out of memory.
 *
 * readRecordBufSize is set to the new buffer size.
 *
 * To avoid useless small increases, round its size to a multiple of
 * XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
 * with.  (That is enough for all "normal" records, but very large commit or
 * abort records might need more space.)
 */
static bool
allocate_recordbuf(XLogReaderState *state, uint32 reclength)
{
	uint32		newSize = reclength;

	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	state->readRecordBuf =
		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
	if (state->readRecordBuf == NULL)
	{
		state->readRecordBufSize = 0;
		return false;
	}
	state->readRecordBufSize = newSize;
	return true;
}

/*
 * Attempt to read an XLOG record.
 *
 * If RecPtr is valid, try to read a record at that position.  Otherwise
 * try to read a record just after the last one previously read.
 *
 * If the read_page callback fails to read the requested data, NULL is
 * returned.  The callback is expected to have reported the error; errormsg
 * is set to NULL.
 *
 * If the reading fails for some other reason, NULL is also returned, and
 * *errormsg is set to a string with details of the failure.
 *
 * The returned pointer (or *errormsg) points to an internal buffer that's
 * valid until the next call to XLogReadRecord.
 */
XLogRecord *
XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
{
	XLogRecord *record;
	XLogRecPtr	targetPagePtr;
	bool		randAccess = false;
	uint32		len,
				total_len;
	uint32		targetRecOff;
	uint32		pageHeaderSize;
	bool		gotheader;
	int			readOff;

	/* reset error state */
	*errormsg = NULL;
	state->errormsg_buf[0] = '\0';

	ResetDecoder(state);

	if (RecPtr == InvalidXLogRecPtr)
	{
		RecPtr = state->EndRecPtr;

		if (state->ReadRecPtr == InvalidXLogRecPtr)
			randAccess = true;

		/*
		 * RecPtr is pointing to end+1 of the previous WAL record.  If we're
		 * at a page boundary, no more records can fit on the current page. We
		 * must skip over the page header, but we can't do that until we've
		 * read in the page, since the header size is variable.
		 */
	}
	else
	{
		/*
		 * In this case, the passed-in record pointer should already be
		 * pointing to a valid record starting position.
		 */
		Assert(XRecOffIsValid(RecPtr));
		randAccess = true;		/* allow readPageTLI to go backwards too */
	}

	state->currRecPtr = RecPtr;

	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/*
	 * Read the page containing the record into state->readBuf.  Request
	 * enough bytes to cover the whole record header, or at least the part of
	 * it that fits on the same page.
	 */
	readOff = ReadPageInternal(state,
							   targetPagePtr,
							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
	if (readOff < 0)
		goto err;

	/*
	 * ReadPageInternal always returns at least the page header, so we can
	 * examine it now.
	 */
	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
	if (targetRecOff == 0)
	{
		/*
		 * At page start, so skip over page header.
		 */
		RecPtr += pageHeaderSize;
		targetRecOff = pageHeaderSize;
	}
	else if (targetRecOff < pageHeaderSize)
	{
		report_invalid_record(state, "invalid record offset at %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}

	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
		targetRecOff == pageHeaderSize)
	{
		report_invalid_record(state, "contrecord is requested by %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}

	/* ReadPageInternal has verified the page header */
	Assert(pageHeaderSize <= readOff);

	/*
	 * Read the record length.
	 *
	 * NB: Even though we use an XLogRecord pointer here, the whole record
	 * header might not fit on this page.  xl_tot_len is the first field of
	 * the struct, so it must be on this page (the records are MAXALIGNed),
	 * but we cannot access any other fields until we've verified that we got
	 * the whole header.
	 */
	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
	total_len = record->xl_tot_len;
|
|
|
|
|
|
|
/*
|
|
|
|
* If the whole record header is on this page, validate it immediately.
|
|
|
|
* Otherwise do just a basic sanity check on xl_tot_len, and validate the
|
2014-05-06 18:12:18 +02:00
|
|
|
* rest of the header after reading it from the next page. The xl_tot_len
|
2013-01-16 20:12:53 +01:00
|
|
|
* check is necessary here to ensure that we enter the "Need to reassemble
|
|
|
|
* record" code path below; otherwise we might fail to apply
|
|
|
|
* ValidXLogRecordHeader at all.
|
|
|
|
*/
|
|
|
|
if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
|
|
|
|
{
|
|
|
|
if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
|
|
|
|
randAccess))
|
2013-01-17 18:05:19 +01:00
|
|
|
goto err;
|
2013-01-16 20:12:53 +01:00
|
|
|
gotheader = true;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* XXX: more validation should be done here */
|
|
|
|
if (total_len < SizeOfXLogRecord)
|
|
|
|
{
|
|
|
|
report_invalid_record(state, "invalid record length at %X/%X",
|
|
|
|
(uint32) (RecPtr >> 32), (uint32) RecPtr);
|
2013-01-17 18:05:19 +01:00
|
|
|
goto err;
|
2013-01-16 20:12:53 +01:00
|
|
|
}
|
|
|
|
gotheader = false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Enlarge readRecordBuf as needed.
|
|
|
|
*/
|
|
|
|
if (total_len > state->readRecordBufSize &&
|
|
|
|
!allocate_recordbuf(state, total_len))
|
|
|
|
{
|
|
|
|
/* We treat this as a "bogus data" condition */
|
|
|
|
report_invalid_record(state, "record length %u at %X/%X too long",
|
|
|
|
total_len,
|
|
|
|
(uint32) (RecPtr >> 32), (uint32) RecPtr);
|
2013-01-17 18:05:19 +01:00
|
|
|
goto err;
|
2013-01-16 20:12:53 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
|
|
|
|
if (total_len > len)
|
|
|
|
{
|
|
|
|
/* Need to reassemble record */
|
|
|
|
char *contdata;
|
|
|
|
XLogPageHeader pageHeader;
|
|
|
|
char *buffer;
|
|
|
|
uint32 gotlen;
|
|
|
|
|
|
|
|
/* Copy the first fragment of the record from the first page. */
|
|
|
|
memcpy(state->readRecordBuf,
|
|
|
|
state->readBuf + RecPtr % XLOG_BLCKSZ, len);
|
|
|
|
buffer = state->readRecordBuf + len;
|
|
|
|
gotlen = len;
|
|
|
|
|
|
|
|
do
|
|
|
|
{
|
|
|
|
/* Calculate pointer to beginning of next page */
|
|
|
|
targetPagePtr += XLOG_BLCKSZ;
|
|
|
|
|
|
|
|
/* Wait for the next page to become available */
|
|
|
|
readOff = ReadPageInternal(state, targetPagePtr,
|
|
|
|
Min(total_len - gotlen + SizeOfXLogShortPHD,
|
|
|
|
XLOG_BLCKSZ));
|
|
|
|
|
|
|
|
if (readOff < 0)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
Assert(SizeOfXLogShortPHD <= readOff);
|
|
|
|
|
|
|
|
/* Check that the continuation on next page looks valid */
|
|
|
|
pageHeader = (XLogPageHeader) state->readBuf;
|
|
|
|
if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
|
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"there is no contrecord flag at %X/%X",
|
|
|
|
(uint32) (RecPtr >> 32), (uint32) RecPtr);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cross-check that xlp_rem_len agrees with how much of the record
|
|
|
|
* we expect there to be left.
|
|
|
|
*/
|
|
|
|
if (pageHeader->xlp_rem_len == 0 ||
|
|
|
|
total_len != (pageHeader->xlp_rem_len + gotlen))
|
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"invalid contrecord length %u at %X/%X",
|
|
|
|
pageHeader->xlp_rem_len,
|
|
|
|
(uint32) (RecPtr >> 32), (uint32) RecPtr);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Append the continuation from this page to the buffer */
|
|
|
|
pageHeaderSize = XLogPageHeaderSize(pageHeader);
|
|
|
|
|
|
|
|
if (readOff < pageHeaderSize)
|
|
|
|
readOff = ReadPageInternal(state, targetPagePtr,
|
|
|
|
pageHeaderSize);
|
|
|
|
|
|
|
|
Assert(pageHeaderSize <= readOff);
|
|
|
|
|
|
|
|
contdata = (char *) state->readBuf + pageHeaderSize;
|
|
|
|
len = XLOG_BLCKSZ - pageHeaderSize;
|
|
|
|
if (pageHeader->xlp_rem_len < len)
|
|
|
|
len = pageHeader->xlp_rem_len;
|
|
|
|
|
|
|
|
if (readOff < pageHeaderSize + len)
|
|
|
|
readOff = ReadPageInternal(state, targetPagePtr,
|
|
|
|
pageHeaderSize + len);
|
|
|
|
|
|
|
|
memcpy(buffer, (char *) contdata, len);
|
|
|
|
buffer += len;
|
|
|
|
gotlen += len;
|
|
|
|
|
|
|
|
/* If we just reassembled the record header, validate it. */
|
|
|
|
if (!gotheader)
|
|
|
|
{
|
|
|
|
record = (XLogRecord *) state->readRecordBuf;
|
|
|
|
if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
|
|
|
|
record, randAccess))
|
|
|
|
goto err;
|
|
|
|
gotheader = true;
|
|
|
|
}
|
|
|
|
} while (gotlen < total_len);
|
|
|
|
|
|
|
|
Assert(gotheader);
|
|
|
|
|
|
|
|
record = (XLogRecord *) state->readRecordBuf;
|
|
|
|
if (!ValidXLogRecord(state, record, RecPtr))
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
|
|
|
|
state->ReadRecPtr = RecPtr;
|
|
|
|
state->EndRecPtr = targetPagePtr + pageHeaderSize
|
|
|
|
+ MAXALIGN(pageHeader->xlp_rem_len);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Wait for the record data to become available */
|
|
|
|
readOff = ReadPageInternal(state, targetPagePtr,
|
|
|
|
Min(targetRecOff + total_len, XLOG_BLCKSZ));
|
|
|
|
if (readOff < 0)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
/* Record does not cross a page boundary */
|
|
|
|
if (!ValidXLogRecord(state, record, RecPtr))
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
state->EndRecPtr = RecPtr + MAXALIGN(total_len);
|
|
|
|
|
|
|
|
state->ReadRecPtr = RecPtr;
|
|
|
|
memcpy(state->readRecordBuf, record, total_len);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Special processing if it's an XLOG SWITCH record
|
|
|
|
*/
|
|
|
|
if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
|
|
|
|
{
|
|
|
|
/* Pretend it extends to end of segment */
|
|
|
|
state->EndRecPtr += XLogSegSize - 1;
|
|
|
|
state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
|
|
|
|
}
|
|
|
|
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
if (DecodeXLogRecord(state, record, errormsg))
|
|
|
|
return record;
|
|
|
|
else
|
|
|
|
return NULL;
|
2013-01-16 20:12:53 +01:00
|
|
|
|
|
|
|
err:
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Invalidate the xlog page we've cached. We might read from a different
|
|
|
|
* source after failure.
|
|
|
|
*/
|
|
|
|
state->readSegNo = 0;
|
|
|
|
state->readOff = 0;
|
|
|
|
state->readLen = 0;
|
|
|
|
|
|
|
|
if (state->errormsg_buf[0] != '\0')
|
|
|
|
*errormsg = state->errormsg_buf;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}

/*
 * Read a single xlog page including at least [pageptr, reqLen] of valid data
 * via the read_page() callback.
 *
 * Returns -1 if the required page cannot be read for some reason; errormsg_buf
 * is set in that case (unless the error occurs in the read_page callback).
 *
 * We fetch the page from a reader-local cache if we know we have the required
 * data and if there hasn't been any error since caching the data.
 */
static int
ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
{
	int			readLen;
	uint32		targetPageOff;
	XLogSegNo	targetSegNo;
	XLogPageHeader hdr;

	Assert((pageptr % XLOG_BLCKSZ) == 0);

	XLByteToSeg(pageptr, targetSegNo);
	targetPageOff = (pageptr % XLogSegSize);

	/* check whether we have all the requested data already */
	if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
		reqLen < state->readLen)
		return state->readLen;

	/*
	 * Data is not in our buffer.
	 *
	 * Every time we actually read the page, even if we looked at parts of it
	 * before, we need to do verification as the read_page callback might now
	 * be rereading data from a different source.
	 *
	 * Whenever switching to a new WAL segment, we read the first page of the
	 * file and validate its header, even if that's not where the target
	 * record is. This is so that we can check the additional identification
	 * info that is present in the first page's "long" header.
	 */
	if (targetSegNo != state->readSegNo && targetPageOff != 0)
	{
		XLogPageHeader hdr;
		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;

		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
								   state->currRecPtr,
								   state->readBuf, &state->readPageTLI);
		if (readLen < 0)
			goto err;

		/* we can be sure to have enough WAL available, we scrolled back */
		Assert(readLen == XLOG_BLCKSZ);

		hdr = (XLogPageHeader) state->readBuf;

		if (!ValidXLogPageHeader(state, targetSegmentPtr, hdr))
			goto err;
	}

	/*
	 * First, read the requested data length, but at least a short page header
	 * so that we can validate it.
	 */
	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
							   state->currRecPtr,
							   state->readBuf, &state->readPageTLI);
	if (readLen < 0)
		goto err;

	Assert(readLen <= XLOG_BLCKSZ);

	/* Do we have enough data to check the header length? */
	if (readLen <= SizeOfXLogShortPHD)
		goto err;

	Assert(readLen >= reqLen);

	hdr = (XLogPageHeader) state->readBuf;

	/* still not enough */
	if (readLen < XLogPageHeaderSize(hdr))
	{
		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
								   state->currRecPtr,
								   state->readBuf, &state->readPageTLI);
		if (readLen < 0)
			goto err;
	}

	/*
	 * Now that we know we have the full header, validate it.
	 */
	if (!ValidXLogPageHeader(state, pageptr, hdr))
		goto err;

	/* update cache information */
	state->readSegNo = targetSegNo;
	state->readOff = targetPageOff;
	state->readLen = readLen;

	return readLen;

err:
	state->readSegNo = 0;
	state->readOff = 0;
	state->readLen = 0;
	return -1;
}

/*
 * Validate an XLOG record header.
 *
 * This is just a convenience subroutine to avoid duplicated code in
 * XLogReadRecord. It's not intended for use from anywhere else.
 */
static bool
ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
					  XLogRecPtr PrevRecPtr, XLogRecord *record,
					  bool randAccess)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
	{
		report_invalid_record(state,
							  "invalid record length at %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		return false;
	}
	if (record->xl_rmid > RM_MAX_ID)
	{
		report_invalid_record(state,
							  "invalid resource manager ID %u at %X/%X",
							  record->xl_rmid, (uint32) (RecPtr >> 32),
							  (uint32) RecPtr);
		return false;
	}
	if (randAccess)
	{
		/*
		 * We can't exactly verify the prev-link, but surely it should be less
		 * than the record's own address.
		 */
		if (!(record->xl_prev < RecPtr))
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  (uint32) (record->xl_prev >> 32),
								  (uint32) record->xl_prev,
								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			return false;
		}
	}
	else
	{
		/*
		 * Record's prev-link should exactly match our previous location. This
		 * check guards against torn WAL pages where a stale but valid-looking
		 * WAL record starts on a sector boundary.
		 */
		if (record->xl_prev != PrevRecPtr)
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  (uint32) (record->xl_prev >> 32),
								  (uint32) record->xl_prev,
								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			return false;
		}
	}

	return true;
}

/*
 * CRC-check an XLOG record. We do not believe the contents of an XLOG
 * record (other than to the minimal extent of computing the amount of
 * data to read in) until we've checked the CRCs.
 *
 * We assume all of the record (that is, xl_tot_len bytes) has been read
 * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
 * record's header, which means in particular that xl_tot_len is at least
 * SizeOfXLogRecord.
 */
static bool
ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
{
	pg_crc32c	crc;

	/* Calculate the CRC */
	INIT_CRC32C(crc);
	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
	/* include the record header last */
	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32C(crc);

	if (!EQ_CRC32C(record->xl_crc, crc))
	{
		report_invalid_record(state,
							  "incorrect resource manager data checksum in record at %X/%X",
							  (uint32) (recptr >> 32), (uint32) recptr);
		return false;
	}

	return true;
}
|
|
|
|
|
|
|
|
/*
 * Validate a page header
 */
static bool
ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
					XLogPageHeader hdr)
{
	XLogRecPtr	recaddr;
	XLogSegNo	segno;
	int32		offset;

	Assert((recptr % XLOG_BLCKSZ) == 0);

	XLByteToSeg(recptr, segno);
	offset = recptr % XLogSegSize;

	XLogSegNoOffsetToRecPtr(segno, offset, recaddr);

	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->readPageTLI, segno);

		report_invalid_record(state,
							  "invalid magic number %04X in log segment %s, offset %u",
							  hdr->xlp_magic,
							  fname,
							  offset);
		return false;
	}

	if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->readPageTLI, segno);

		report_invalid_record(state,
							  "invalid info bits %04X in log segment %s, offset %u",
							  hdr->xlp_info,
							  fname,
							  offset);
		return false;
	}

	if (hdr->xlp_info & XLP_LONG_HEADER)
	{
		XLogLongPageHeader longhdr = (XLogLongPageHeader) hdr;

		if (state->system_identifier &&
			longhdr->xlp_sysid != state->system_identifier)
		{
			char		fhdrident_str[32];
			char		sysident_str[32];

			/*
			 * Format sysids separately to keep platform-dependent format code
			 * out of the translatable message string.
			 */
			snprintf(fhdrident_str, sizeof(fhdrident_str), UINT64_FORMAT,
					 longhdr->xlp_sysid);
			snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
					 state->system_identifier);
			report_invalid_record(state,
								  "WAL file is from different database system: WAL file database system identifier is %s, pg_control database system identifier is %s.",
								  fhdrident_str, sysident_str);
			return false;
		}
		else if (longhdr->xlp_seg_size != XLogSegSize)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: Incorrect XLOG_SEG_SIZE in page header.");
			return false;
		}
		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: Incorrect XLOG_BLCKSZ in page header.");
			return false;
		}
	}
	else if (offset == 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->readPageTLI, segno);

		/* hmm, first page of file doesn't have a long header? */
		report_invalid_record(state,
							  "invalid info bits %04X in log segment %s, offset %u",
							  hdr->xlp_info,
							  fname,
							  offset);
		return false;
	}

	if (hdr->xlp_pageaddr != recaddr)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->readPageTLI, segno);

		report_invalid_record(state,
							  "unexpected pageaddr %X/%X in log segment %s, offset %u",
							  (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
							  fname,
							  offset);
		return false;
	}

	/*
	 * Since child timelines are always assigned a TLI greater than their
	 * immediate parent's TLI, we should never see TLI go backwards across
	 * successive pages of a consistent WAL sequence.
	 *
	 * Sometimes we re-read a segment that's already been (partially) read. So
	 * we only verify TLIs for pages that are later than the last remembered
	 * LSN.
	 */
	if (recptr > state->latestPagePtr)
	{
		if (hdr->xlp_tli < state->latestPageTLI)
		{
			char		fname[MAXFNAMELEN];

			XLogFileName(fname, state->readPageTLI, segno);

			report_invalid_record(state,
								  "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
								  hdr->xlp_tli,
								  state->latestPageTLI,
								  fname,
								  offset);
			return false;
		}
	}
	state->latestPagePtr = recptr;
	state->latestPageTLI = hdr->xlp_tli;

	return true;
}

#ifdef FRONTEND
/*
 * Functions that are currently not needed in the backend, but are better
 * implemented inside xlogreader.c because of the internal facilities available
 * here.
 */

/*
 * Find the first record with an lsn >= RecPtr.
 *
 * Useful for checking whether RecPtr is a valid xlog address for reading, and
 * to find the first valid address after some address when dumping records for
 * debugging purposes.
 */
XLogRecPtr
XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
{
	XLogReaderState saved_state = *state;
	XLogRecPtr	targetPagePtr;
	XLogRecPtr	tmpRecPtr;
	int			targetRecOff;
	XLogRecPtr	found = InvalidXLogRecPtr;
	uint32		pageHeaderSize;
	XLogPageHeader header;
	int			readLen;
	char	   *errormsg;

	Assert(!XLogRecPtrIsInvalid(RecPtr));

	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/* scroll back to page boundary */
	targetPagePtr = RecPtr - targetRecOff;

	/* Read the page containing the record */
	readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
	if (readLen < 0)
		goto err;

	header = (XLogPageHeader) state->readBuf;

	pageHeaderSize = XLogPageHeaderSize(header);

	/* make sure we have enough data for the page header */
	readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
	if (readLen < 0)
		goto err;

	/* skip over potential continuation data */
	if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
	{
		/* record headers are MAXALIGN'ed */
		tmpRecPtr = targetPagePtr + pageHeaderSize
			+ MAXALIGN(header->xlp_rem_len);
	}
	else
	{
		tmpRecPtr = targetPagePtr + pageHeaderSize;
	}

	/*
	 * we know now that tmpRecPtr is an address pointing to a valid XLogRecord
	 * because either we're at the first record after the beginning of a page
	 * or we just jumped over the remaining data of a continuation.
	 */
	while (XLogReadRecord(state, tmpRecPtr, &errormsg) != NULL)
	{
		/* continue after the record */
		tmpRecPtr = InvalidXLogRecPtr;

		/* past the record we've found, break out */
		if (RecPtr <= state->ReadRecPtr)
		{
			found = state->ReadRecPtr;
			goto out;
		}
	}

err:
out:
	/* Reset state to what we had before finding the record */
	state->readSegNo = 0;
	state->readOff = 0;
	state->readLen = 0;
	state->ReadRecPtr = saved_state.ReadRecPtr;
	state->EndRecPtr = saved_state.EndRecPtr;

	return found;
}

#endif   /* FRONTEND */
/* ----------------------------------------
 * Functions for decoding the data and block references in a record.
 * ----------------------------------------
 */

/* private function to reset the state between records */
static void
ResetDecoder(XLogReaderState *state)
{
	int			block_id;

	state->decoded_record = NULL;

	state->main_data_len = 0;

	for (block_id = 0; block_id <= state->max_block_id; block_id++)
	{
		state->blocks[block_id].in_use = false;
		state->blocks[block_id].has_image = false;
		state->blocks[block_id].has_data = false;
	}
	state->max_block_id = -1;
}

/*
 * Decode the previously read record.
 *
 * On error, a human-readable error message is returned in *errormsg, and
 * the return value is false.
 */
bool
DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
{
	/*
	 * read next _size bytes from record buffer, but check for overrun first.
	 */
#define COPY_HEADER_FIELD(_dst, _size)			\
	do {										\
		if (remaining < _size)					\
			goto shortdata_err;					\
		memcpy(_dst, ptr, _size);				\
		ptr += _size;							\
		remaining -= _size;						\
	} while(0)

	char	   *ptr;
	uint32		remaining;
	uint32		datatotal;
	RelFileNode *rnode = NULL;
	uint8		block_id;

	ResetDecoder(state);

	state->decoded_record = record;
	state->record_origin = InvalidRepOriginId;

	ptr = (char *) record;
	ptr += SizeOfXLogRecord;
	remaining = record->xl_tot_len - SizeOfXLogRecord;

	/* Decode the headers */
	datatotal = 0;
	while (remaining > datatotal)
	{
		COPY_HEADER_FIELD(&block_id, sizeof(uint8));

		if (block_id == XLR_BLOCK_ID_DATA_SHORT)
		{
			/* XLogRecordDataHeaderShort */
			uint8		main_data_len;

			COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));

			state->main_data_len = main_data_len;
			datatotal += main_data_len;
			break;				/* by convention, the main data fragment is
								 * always last */
		}
		else if (block_id == XLR_BLOCK_ID_DATA_LONG)
		{
			/* XLogRecordDataHeaderLong */
			uint32		main_data_len;

			COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
			state->main_data_len = main_data_len;
			datatotal += main_data_len;
			break;				/* by convention, the main data fragment is
								 * always last */
		}
		else if (block_id == XLR_BLOCK_ID_ORIGIN)
		{
			COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
		}
else if (block_id <= XLR_MAX_BLOCK_ID)
|
|
|
|
{
|
|
|
|
/* XLogRecordBlockHeader */
|
|
|
|
DecodedBkpBlock *blk;
|
|
|
|
uint8 fork_flags;
|
|
|
|
|
|
|
|
if (block_id <= state->max_block_id)
|
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"out-of-order block_id %u at %X/%X",
|
|
|
|
block_id,
|
|
|
|
(uint32) (state->ReadRecPtr >> 32),
|
|
|
|
(uint32) state->ReadRecPtr);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
state->max_block_id = block_id;
|
|
|
|
|
|
|
|
blk = &state->blocks[block_id];
|
|
|
|
blk->in_use = true;
|
|
|
|
|
|
|
|
COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
|
|
|
|
blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
|
|
|
|
blk->flags = fork_flags;
|
|
|
|
blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
|
|
|
|
blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
|
|
|
|
|
|
|
|
COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
|
|
|
|
/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
|
|
|
|
if (blk->has_data && blk->data_len == 0)
|
2015-03-09 06:31:10 +01:00
|
|
|
{
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
report_invalid_record(state,
|
|
|
|
"BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
|
|
|
|
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
|
2015-03-09 06:31:10 +01:00
|
|
|
goto err;
|
|
|
|
}
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
if (!blk->has_data && blk->data_len != 0)
|
2015-03-09 06:31:10 +01:00
|
|
|
{
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
report_invalid_record(state,
|
|
|
|
"BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
|
|
|
|
(unsigned int) blk->data_len,
|
|
|
|
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
|
2015-03-09 06:31:10 +01:00
|
|
|
goto err;
|
|
|
|
}
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
datatotal += blk->data_len;
|
|
|
|
|
|
|
|
if (blk->has_image)
|
|
|
|
{
|
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
|
|
|
COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
|
Add GUC to enable compression of full page images stored in WAL.
When the newly-added GUC parameter, wal_compression, is on, the PostgreSQL
server compresses full page images written to WAL when full_page_writes is on
or during a base backup. A compressed page image is decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, at the cost of some extra CPU spent
on compression during WAL logging and on decompression during WAL replay.
This commit changes the WAL format (bumping the WAL version number) so that
a one-byte flag indicating whether a full page image is compressed is included
in its header information. This means that the commit increases the WAL volume
by one byte per full page image even if WAL compression is not used at all.
We could save that byte by borrowing a bit from an existing header field such
as hole_offset and using it as the flag, but that would reduce the readability
of the code and the extensibility of the feature. Per discussion, it's not
worth paying those prices to save only one byte, so we decided to add the
one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithms. Of course, in the
future, it's worth considering support for other compression algorithms for
better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
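The hole_offset/hole_length bookkeeping that the decoder validates below can be illustrated with a small self-contained sketch; PAGESZ and restore_page_with_hole are illustrative stand-ins for BLCKSZ and the uncompressed-with-hole branch of RestoreBlockImage, not PostgreSQL's actual symbols:

```c
#include <stdint.h>
#include <string.h>

#define PAGESZ 8192				/* illustrative stand-in for BLCKSZ */

/*
 * Rebuild a full page from a backup image that omits the "hole" (the unused
 * space in the middle of the page).  The stored image holds only
 * PAGESZ - hole_length bytes: the part before the hole, then the part after
 * it.  The hole itself is refilled with zeros.
 */
static void
restore_page_with_hole(char *page, const char *image,
					   uint16_t hole_offset, uint16_t hole_length)
{
	memcpy(page, image, hole_offset);
	memset(page + hole_offset, 0, hole_length);
	memcpy(page + hole_offset + hole_length,
		   image + hole_offset,
		   PAGESZ - (hole_offset + hole_length));
}
```

This is the inverse of what the insertion side does when it drops the hole from the image, and it is why the decoder below insists that hole_offset and hole_length are consistent with the BKPIMAGE_HAS_HOLE flag.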
				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
				if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
				{
					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
						COPY_HEADER_FIELD(&blk->hole_length, sizeof(uint16));
					else
						blk->hole_length = 0;
				}
				else
					blk->hole_length = BLCKSZ - blk->bimg_len;
				datatotal += blk->bimg_len;

				/*
				 * cross-check that hole_offset > 0, hole_length > 0 and
				 * bimg_len < BLCKSZ if the HAS_HOLE flag is set.
				 */
				if ((blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					(blk->hole_offset == 0 ||
					 blk->hole_length == 0 ||
					 blk->bimg_len == BLCKSZ))
				{
					report_invalid_record(state,
										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
										  (unsigned int) blk->hole_offset,
										  (unsigned int) blk->hole_length,
										  (unsigned int) blk->bimg_len,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				/*
				 * cross-check that hole_offset == 0 and hole_length == 0 if
				 * the HAS_HOLE flag is not set.
				 */
				if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					(blk->hole_offset != 0 || blk->hole_length != 0))
				{
					report_invalid_record(state,
										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
										  (unsigned int) blk->hole_offset,
										  (unsigned int) blk->hole_length,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				/*
				 * cross-check that bimg_len < BLCKSZ if the IS_COMPRESSED
				 * flag is set.
				 */
				if ((blk->bimg_info & BKPIMAGE_IS_COMPRESSED) &&
					blk->bimg_len == BLCKSZ)
				{
					report_invalid_record(state,
										  "BKPIMAGE_IS_COMPRESSED set, but block image length %u at %X/%X",
										  (unsigned int) blk->bimg_len,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				/*
				 * cross-check that bimg_len = BLCKSZ if neither HAS_HOLE nor
				 * IS_COMPRESSED flag is set.
				 */
				if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					!(blk->bimg_info & BKPIMAGE_IS_COMPRESSED) &&
					blk->bimg_len != BLCKSZ)
				{
					report_invalid_record(state,
										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_IS_COMPRESSED set, but block image length is %u at %X/%X",
										  (unsigned int) blk->bimg_len,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}
			}

			if (!(fork_flags & BKPBLOCK_SAME_REL))
			{
				COPY_HEADER_FIELD(&blk->rnode, sizeof(RelFileNode));
				rnode = &blk->rnode;
			}
			else
			{
				if (rnode == NULL)
				{
					report_invalid_record(state,
										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				blk->rnode = *rnode;
			}
			COPY_HEADER_FIELD(&blk->blkno, sizeof(BlockNumber));
		}
		else
		{
			report_invalid_record(state,
								  "invalid block_id %u at %X/%X",
								  block_id,
								  (uint32) (state->ReadRecPtr >> 32),
								  (uint32) state->ReadRecPtr);
			goto err;
		}
	}

	if (remaining != datatotal)
		goto shortdata_err;

	/*
	 * Ok, we've parsed the fragment headers, and verified that the total
	 * length of the payload in the fragments is equal to the amount of data
	 * left. Copy the data of each fragment to a separate buffer.
	 *
	 * We could just set up pointers into readRecordBuf, but we want to align
	 * the data for the convenience of the callers. Backup images are not
	 * copied, however; they don't need alignment.
	 */

	/* block data first */
	for (block_id = 0; block_id <= state->max_block_id; block_id++)
	{
		DecodedBkpBlock *blk = &state->blocks[block_id];

		if (!blk->in_use)
			continue;
		if (blk->has_image)
		{
			blk->bkp_image = ptr;
			ptr += blk->bimg_len;
		}
		if (blk->has_data)
		{
			if (!blk->data || blk->data_len > blk->data_bufsz)
			{
				if (blk->data)
					pfree(blk->data);
				blk->data_bufsz = blk->data_len;
				blk->data = palloc(blk->data_bufsz);
			}
			memcpy(blk->data, ptr, blk->data_len);
			ptr += blk->data_len;
		}
	}

	/* and finally, the main data */
	if (state->main_data_len > 0)
	{
		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
		{
			if (state->main_data)
				pfree(state->main_data);
			state->main_data_bufsz = state->main_data_len;
			state->main_data = palloc(state->main_data_bufsz);
		}
		memcpy(state->main_data, ptr, state->main_data_len);
		ptr += state->main_data_len;
	}

	return true;

shortdata_err:
	report_invalid_record(state,
						  "record with invalid length at %X/%X",
						  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
err:
	*errormsg = state->errormsg_buf;

	return false;
}
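The blk->data and main_data handling above uses a grow-only buffer: the old buffer is freed and a larger one allocated only when the incoming chunk doesn't fit, so steady-state decoding of similarly-sized records does no allocation at all. A minimal standalone sketch of the same pattern, with malloc/free in place of palloc/pfree (ScratchBuf and scratch_store are hypothetical names):

```c
#include <stdlib.h>
#include <string.h>

/* Grow-only scratch buffer: reallocated only when too small, never shrunk. */
typedef struct ScratchBuf
{
	char	   *data;
	size_t		bufsz;
} ScratchBuf;

static void
scratch_store(ScratchBuf *buf, const char *src, size_t len)
{
	if (buf->data == NULL || len > buf->bufsz)
	{
		free(buf->data);
		buf->bufsz = len;
		buf->data = malloc(len);
	}
	memcpy(buf->data, src, len);
}
```

Keeping the oversized buffer around trades a little memory for avoiding an allocate/free cycle on every record, which matters on the hot redo path.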

/*
 * Returns information about the block that a block reference refers to.
 *
 * If the WAL record contains a block reference with the given ID, *rnode,
 * *forknum, and *blknum are filled in (if not NULL), and returns TRUE.
 * Otherwise returns FALSE.
 */
bool
XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
{
	DecodedBkpBlock *bkpb;

	if (!record->blocks[block_id].in_use)
		return false;

	bkpb = &record->blocks[block_id];
	if (rnode)
		*rnode = bkpb->rnode;
	if (forknum)
		*forknum = bkpb->forknum;
	if (blknum)
		*blknum = bkpb->blkno;
	return true;
}

/*
 * Returns the data associated with a block reference, or NULL if there is
 * no data (e.g. because a full-page image was taken instead). The returned
 * pointer points to a MAXALIGNed buffer.
 */
char *
XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
{
	DecodedBkpBlock *bkpb;

	if (!record->blocks[block_id].in_use)
		return NULL;

	bkpb = &record->blocks[block_id];

	if (!bkpb->has_data)
	{
		if (len)
			*len = 0;
		return NULL;
	}
	else
	{
		if (len)
			*len = bkpb->data_len;
		return bkpb->data;
	}
}

/*
 * Restore a full-page image from a backup block attached to an XLOG record.
 *
 * Returns true if a full-page image is successfully restored into *page,
 * false otherwise.
 */
bool
RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
{
	DecodedBkpBlock *bkpb;
	char	   *ptr;
	char		tmp[BLCKSZ];

	if (!record->blocks[block_id].in_use)
		return false;
	if (!record->blocks[block_id].has_image)
		return false;

	bkpb = &record->blocks[block_id];
	ptr = bkpb->bkp_image;

	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
	{
		/* If a backup block image is compressed, decompress it */
		if (pglz_decompress(ptr, bkpb->bimg_len, tmp,
							BLCKSZ - bkpb->hole_length) < 0)
		{
			report_invalid_record(record, "invalid compressed image at %X/%X, block %d",
								  (uint32) (record->ReadRecPtr >> 32),
								  (uint32) record->ReadRecPtr,
								  block_id);
			return false;
		}
		ptr = tmp;
	}

	/* generate page, taking into account hole if necessary */
	if (bkpb->hole_length == 0)
	{
		memcpy(page, ptr, BLCKSZ);
	}
	else
	{
		memcpy(page, ptr, bkpb->hole_offset);
		/* must zero-fill the hole */
		MemSet(page + bkpb->hole_offset, 0, bkpb->hole_length);
		memcpy(page + (bkpb->hole_offset + bkpb->hole_length),
			   ptr + bkpb->hole_offset,
			   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
	}

	return true;
}