/*-------------------------------------------------------------------------
 *
 * xlogreader.c
 *
 * Generic XLog reading facility
 *
 * Portions Copyright (c) 2013-2019, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/backend/access/transam/xlogreader.c
 *
 * NOTES
 *		See xlogreader.h for more notes on this facility.
 *
 *		This file is compiled as both front-end and backend code, so it
 *		may not use ereport, server-defined static variables, etc.
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <unistd.h>

#include "access/transam.h"
#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "replication/origin.h"

#ifndef FRONTEND
#include "miscadmin.h"
#include "pgstat.h"
#include "utils/memutils.h"
#endif

static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
			pg_attribute_printf(2, 3);
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
							 int reqLen);
static void XLogReaderInvalReadState(XLogReaderState *state);
static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
								  XLogRecPtr PrevRecPtr, XLogRecord *record,
								  bool randAccess);
static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
							XLogRecPtr recptr);
static void ResetDecoder(XLogReaderState *state);

/* size of the buffer allocated for error message. */
#define MAX_ERRORMSG_LEN 1000

/*
 * Construct a string in state->errormsg_buf explaining what's wrong with
 * the current record being read.
 */
static void
report_invalid_record(XLogReaderState *state, const char *fmt,...)
{
	va_list		args;

	fmt = _(fmt);

	va_start(args, fmt);
	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
	va_end(args);
}

/*
 * Allocate and initialize a new XLogReader.
 *
 * Returns NULL if the xlogreader couldn't be allocated.
 */
XLogReaderState *
XLogReaderAllocate(int wal_segment_size, const char *waldir,
				   XLogPageReadCB pagereadfunc, void *private_data)
{
	XLogReaderState *state;

	state = (XLogReaderState *)
		palloc_extended(sizeof(XLogReaderState),
						MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
	if (!state)
		return NULL;

	state->max_block_id = -1;

	/*
	 * Permanently allocate readBuf.  We do it this way, rather than just
	 * making a static array, for two reasons: (1) no need to waste the
	 * storage in most instantiations of the backend; (2) a static char array
	 * isn't guaranteed to have any particular alignment, whereas
	 * palloc_extended() will provide MAXALIGN'd storage.
	 */
	state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ,
											  MCXT_ALLOC_NO_OOM);
	if (!state->readBuf)
	{
		pfree(state);
		return NULL;
	}

	/* Initialize segment info. */
	WALOpenSegmentInit(&state->seg, &state->segcxt, wal_segment_size,
					   waldir);

	state->read_page = pagereadfunc;
	/* system_identifier initialized to zeroes above */
	state->private_data = private_data;
	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
										  MCXT_ALLOC_NO_OOM);
	if (!state->errormsg_buf)
	{
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}
	state->errormsg_buf[0] = '\0';

	/*
	 * Allocate an initial readRecordBuf of minimal size, which can later be
	 * enlarged if necessary.
	 */
	if (!allocate_recordbuf(state, 0))
	{
		pfree(state->errormsg_buf);
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}

	return state;
}

void
XLogReaderFree(XLogReaderState *state)
{
	int			block_id;

	for (block_id = 0; block_id <= XLR_MAX_BLOCK_ID; block_id++)
	{
		if (state->blocks[block_id].data)
			pfree(state->blocks[block_id].data);
	}
	if (state->main_data)
		pfree(state->main_data);

	pfree(state->errormsg_buf);
	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	pfree(state->readBuf);
	pfree(state);
}

/*
 * Allocate readRecordBuf to fit a record of at least the given length.
 * Returns true if successful, false if out of memory.
 *
 * readRecordBufSize is set to the new buffer size.
 *
 * To avoid useless small increases, round its size to a multiple of
 * XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
 * with.  (That is enough for all "normal" records, but very large commit or
 * abort records might need more space.)
 */
static bool
allocate_recordbuf(XLogReaderState *state, uint32 reclength)
{
	uint32		newSize = reclength;

	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

#ifndef FRONTEND

	/*
	 * Note that in very unlucky circumstances, the random data read from a
	 * recycled segment can cause this routine to be called with a size
	 * causing a hard failure at allocation.  For a standby, this would cause
	 * the instance to stop suddenly with a hard failure, preventing it from
	 * retrying to fetch WAL from one of its sources, which could allow it to
	 * move on with replay without a manual restart.  If the data comes from a
	 * past recycled segment and is still valid, then the allocation may
	 * succeed but record checks are going to fail, so this would be
	 * short-lived.  If the allocation fails because of a memory shortage,
	 * then this is not a hard failure either per the guarantee given by
	 * MCXT_ALLOC_NO_OOM.
	 */
	if (!AllocSizeIsValid(newSize))
		return false;

#endif

	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	state->readRecordBuf =
		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
	if (state->readRecordBuf == NULL)
	{
		state->readRecordBufSize = 0;
		return false;
	}
	state->readRecordBufSize = newSize;
	return true;
}

/*
 * Initialize the passed segment structs.
 */
void
WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
				   int segsize, const char *waldir)
{
	seg->ws_file = -1;
	seg->ws_segno = 0;
	seg->ws_tli = 0;

	segcxt->ws_segsize = segsize;
	if (waldir)
		snprintf(segcxt->ws_dir, MAXPGPATH, "%s", waldir);
}

/*
 * Attempt to read an XLOG record.
 *
 * If RecPtr is valid, try to read a record at that position.  Otherwise
 * try to read a record just after the last one previously read.
 *
 * If the read_page callback fails to read the requested data, NULL is
 * returned.  The callback is expected to have reported the error; errormsg
 * is set to NULL.
 *
 * If the reading fails for some other reason, NULL is also returned, and
 * *errormsg is set to a string with details of the failure.
 *
 * The returned pointer (or *errormsg) points to an internal buffer that's
 * valid until the next call to XLogReadRecord.
 */
XLogRecord *
XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
{
	XLogRecord *record;
	XLogRecPtr	targetPagePtr;
	bool		randAccess;
	uint32		len,
				total_len;
	uint32		targetRecOff;
	uint32		pageHeaderSize;
	bool		gotheader;
	int			readOff;

	/*
	 * randAccess indicates whether to verify the previous-record pointer of
	 * the record we're reading.  We only do this if we're reading
	 * sequentially, which is what we initially assume.
	 */
	randAccess = false;

	/* reset error state */
	*errormsg = NULL;
	state->errormsg_buf[0] = '\0';

	ResetDecoder(state);

	if (RecPtr == InvalidXLogRecPtr)
	{
		/* No explicit start point; read the record after the one we just read */
		RecPtr = state->EndRecPtr;

		if (state->ReadRecPtr == InvalidXLogRecPtr)
			randAccess = true;

		/*
		 * RecPtr is pointing to end+1 of the previous WAL record.  If we're
		 * at a page boundary, no more records can fit on the current page. We
		 * must skip over the page header, but we can't do that until we've
		 * read in the page, since the header size is variable.
		 */
	}
	else
	{
		/*
		 * Caller supplied a position to start at.
		 *
		 * In this case, the passed-in record pointer should already be
		 * pointing to a valid record starting position.
		 */
		Assert(XRecOffIsValid(RecPtr));
		randAccess = true;
	}

	state->currRecPtr = RecPtr;

	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/*
	 * Read the page containing the record into state->readBuf. Request enough
	 * bytes to cover the whole record header, or at least the part of it that
	 * fits on the same page.
	 */
	readOff = ReadPageInternal(state, targetPagePtr,
							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
	if (readOff < 0)
		goto err;

	/*
	 * ReadPageInternal always returns at least the page header, so we can
	 * examine it now.
	 */
	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
	if (targetRecOff == 0)
	{
		/*
		 * At page start, so skip over page header.
		 */
		RecPtr += pageHeaderSize;
		targetRecOff = pageHeaderSize;
	}
	else if (targetRecOff < pageHeaderSize)
	{
		report_invalid_record(state, "invalid record offset at %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}

	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
		targetRecOff == pageHeaderSize)
	{
		report_invalid_record(state, "contrecord is requested by %X/%X",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		goto err;
	}

	/* ReadPageInternal has verified the page header */
	Assert(pageHeaderSize <= readOff);

	/*
	 * Read the record length.
	 *
	 * NB: Even though we use an XLogRecord pointer here, the whole record
	 * header might not fit on this page. xl_tot_len is the first field of the
	 * struct, so it must be on this page (the records are MAXALIGNed), but we
	 * cannot access any other fields until we've verified that we got the
	 * whole header.
	 */
	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
	total_len = record->xl_tot_len;

	/*
	 * If the whole record header is on this page, validate it immediately.
	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
	 * rest of the header after reading it from the next page.  The xl_tot_len
	 * check is necessary here to ensure that we enter the "Need to reassemble
	 * record" code path below; otherwise we might fail to apply
	 * ValidXLogRecordHeader at all.
	 */
	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
	{
		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
								   randAccess))
			goto err;
		gotheader = true;
	}
	else
	{
		/* XXX: more validation should be done here */
		if (total_len < SizeOfXLogRecord)
		{
			report_invalid_record(state,
								  "invalid record length at %X/%X: wanted %u, got %u",
								  (uint32) (RecPtr >> 32), (uint32) RecPtr,
								  (uint32) SizeOfXLogRecord, total_len);
			goto err;
		}
		gotheader = false;
	}

	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
	if (total_len > len)
	{
		/* Need to reassemble record */
		char	   *contdata;
		XLogPageHeader pageHeader;
		char	   *buffer;
		uint32		gotlen;

		/*
		 * Enlarge readRecordBuf as needed.
		 */
		if (total_len > state->readRecordBufSize &&
			!allocate_recordbuf(state, total_len))
		{
			/* We treat this as a "bogus data" condition */
			report_invalid_record(state, "record length %u at %X/%X too long",
								  total_len,
								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			goto err;
		}

		/* Copy the first fragment of the record from the first page. */
		memcpy(state->readRecordBuf,
			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
		buffer = state->readRecordBuf + len;
		gotlen = len;

		do
		{
			/* Calculate pointer to beginning of next page */
			targetPagePtr += XLOG_BLCKSZ;

			/* Wait for the next page to become available */
			readOff = ReadPageInternal(state, targetPagePtr,
									   Min(total_len - gotlen + SizeOfXLogShortPHD,
										   XLOG_BLCKSZ));

			if (readOff < 0)
				goto err;

			Assert(SizeOfXLogShortPHD <= readOff);

			/* Check that the continuation on next page looks valid */
			pageHeader = (XLogPageHeader) state->readBuf;
			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
			{
				report_invalid_record(state,
									  "there is no contrecord flag at %X/%X",
									  (uint32) (RecPtr >> 32), (uint32) RecPtr);
				goto err;
			}

			/*
			 * Cross-check that xlp_rem_len agrees with how much of the record
			 * we expect there to be left.
			 */
			if (pageHeader->xlp_rem_len == 0 ||
				total_len != (pageHeader->xlp_rem_len + gotlen))
			{
				report_invalid_record(state,
									  "invalid contrecord length %u at %X/%X",
									  pageHeader->xlp_rem_len,
									  (uint32) (RecPtr >> 32), (uint32) RecPtr);
				goto err;
			}

			/* Append the continuation from this page to the buffer */
			pageHeaderSize = XLogPageHeaderSize(pageHeader);

			if (readOff < pageHeaderSize)
				readOff = ReadPageInternal(state, targetPagePtr,
										   pageHeaderSize);

			Assert(pageHeaderSize <= readOff);

			contdata = (char *) state->readBuf + pageHeaderSize;
			len = XLOG_BLCKSZ - pageHeaderSize;
			if (pageHeader->xlp_rem_len < len)
				len = pageHeader->xlp_rem_len;

			if (readOff < pageHeaderSize + len)
				readOff = ReadPageInternal(state, targetPagePtr,
										   pageHeaderSize + len);

			memcpy(buffer, (char *) contdata, len);
			buffer += len;
			gotlen += len;

			/* If we just reassembled the record header, validate it. */
			if (!gotheader)
			{
				record = (XLogRecord *) state->readRecordBuf;
				if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
										   record, randAccess))
					goto err;
				gotheader = true;
			}
		} while (gotlen < total_len);

		Assert(gotheader);

		record = (XLogRecord *) state->readRecordBuf;
		if (!ValidXLogRecord(state, record, RecPtr))
			goto err;

		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
		state->ReadRecPtr = RecPtr;
		state->EndRecPtr = targetPagePtr + pageHeaderSize
			+ MAXALIGN(pageHeader->xlp_rem_len);
	}
	else
	{
		/* Wait for the record data to become available */
		readOff = ReadPageInternal(state, targetPagePtr,
								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
		if (readOff < 0)
			goto err;

		/* Record does not cross a page boundary */
		if (!ValidXLogRecord(state, record, RecPtr))
			goto err;

		state->EndRecPtr = RecPtr + MAXALIGN(total_len);

		state->ReadRecPtr = RecPtr;
	}

	/*
	 * Special processing if it's an XLOG SWITCH record
	 */
	if (record->xl_rmid == RM_XLOG_ID &&
		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
	{
		/* Pretend it extends to end of segment */
		state->EndRecPtr += state->segcxt.ws_segsize - 1;
		state->EndRecPtr -= XLogSegmentOffset(state->EndRecPtr, state->segcxt.ws_segsize);
	}

	if (DecodeXLogRecord(state, record, errormsg))
		return record;
	else
		return NULL;

err:

	/*
	 * Invalidate the read state. We might read from a different source after
	 * failure.
	 */
	XLogReaderInvalReadState(state);

	if (state->errormsg_buf[0] != '\0')
		*errormsg = state->errormsg_buf;

	return NULL;
}

/*
 * Read a single xlog page including at least [pageptr, reqLen] of valid data
 * via the read_page() callback.
 *
 * Returns -1 if the required page cannot be read for some reason; errormsg_buf
 * is set in that case (unless the error occurs in the read_page callback).
 *
 * We fetch the page from a reader-local cache if we know we have the required
 * data and if there hasn't been any error since caching the data.
 */
static int
ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
{
	int			readLen;
	uint32		targetPageOff;
	XLogSegNo	targetSegNo;
	XLogPageHeader hdr;

	Assert((pageptr % XLOG_BLCKSZ) == 0);

	XLByteToSeg(pageptr, targetSegNo, state->segcxt.ws_segsize);
	targetPageOff = XLogSegmentOffset(pageptr, state->segcxt.ws_segsize);

	/* check whether we have all the requested data already */
	if (targetSegNo == state->seg.ws_segno &&
		targetPageOff == state->segoff && reqLen <= state->readLen)
		return state->readLen;

	/*
	 * Data is not in our buffer.
	 *
	 * Every time we actually read the page, even if we looked at parts of it
	 * before, we need to do verification as the read_page callback might now
	 * be rereading data from a different source.
	 *
	 * Whenever switching to a new WAL segment, we read the first page of the
	 * file and validate its header, even if that's not where the target
	 * record is.  This is so that we can check the additional identification
	 * info that is present in the first page's "long" header.
	 */
	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
	{
		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;

		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
								   state->currRecPtr,
								   state->readBuf);
		if (readLen < 0)
			goto err;

		/* we can be sure to have enough WAL available, we scrolled back */
		Assert(readLen == XLOG_BLCKSZ);

		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
										  state->readBuf))
			goto err;
	}

	/*
	 * First, read the requested data length, but at least a short page header
	 * so that we can validate it.
	 */
	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
							   state->currRecPtr,
							   state->readBuf);
	if (readLen < 0)
		goto err;

	Assert(readLen <= XLOG_BLCKSZ);

	/* Do we have enough data to check the header length? */
	if (readLen <= SizeOfXLogShortPHD)
		goto err;

	Assert(readLen >= reqLen);

	hdr = (XLogPageHeader) state->readBuf;

	/* still not enough */
	if (readLen < XLogPageHeaderSize(hdr))
	{
		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
								   state->currRecPtr,
								   state->readBuf);
		if (readLen < 0)
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we know we have the full header, validate it.
|
|
|
|
*/
|
Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 00:34:53 +02:00
|
|
|
if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
|
2013-01-16 20:12:53 +01:00
|
|
|
goto err;
|
|
|
|
|
2016-03-30 23:56:13 +02:00
|
|
|
/* update read state information */
|
2019-09-24 21:08:31 +02:00
|
|
|
state->seg.ws_segno = targetSegNo;
|
2019-11-25 19:04:54 +01:00
|
|
|
state->segoff = targetPageOff;
|
2013-01-16 20:12:53 +01:00
|
|
|
state->readLen = readLen;
|
|
|
|
|
|
|
|
return readLen;
|
|
|
|
|
|
|
|
err:
|
2016-03-30 23:56:13 +02:00
|
|
|
XLogReaderInvalReadState(state);
|
|
|
|
return -1;
|
|
|
|
}

/*
 * Invalidate the xlogreader's read state to force a re-read.
 */
static void
XLogReaderInvalReadState(XLogReaderState *state)
{
	state->seg.ws_segno = 0;
	state->segoff = 0;
	state->readLen = 0;
}

/*
 * Validate an XLOG record header.
 *
 * This is just a convenience subroutine to avoid duplicated code in
 * XLogReadRecord.  It's not intended for use from anywhere else.
 */
static bool
ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
					  XLogRecPtr PrevRecPtr, XLogRecord *record,
					  bool randAccess)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
	{
		report_invalid_record(state,
							  "invalid record length at %X/%X: wanted %u, got %u",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr,
							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
		return false;
	}
	if (record->xl_rmid > RM_MAX_ID)
	{
		report_invalid_record(state,
							  "invalid resource manager ID %u at %X/%X",
							  record->xl_rmid, (uint32) (RecPtr >> 32),
							  (uint32) RecPtr);
		return false;
	}
	if (randAccess)
	{
		/*
		 * We can't exactly verify the prev-link, but surely it should be less
		 * than the record's own address.
		 */
		if (!(record->xl_prev < RecPtr))
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  (uint32) (record->xl_prev >> 32),
								  (uint32) record->xl_prev,
								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			return false;
		}
	}
	else
	{
		/*
		 * Record's prev-link should exactly match our previous location. This
		 * check guards against torn WAL pages where a stale but valid-looking
		 * WAL record starts on a sector boundary.
		 */
		if (record->xl_prev != PrevRecPtr)
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  (uint32) (record->xl_prev >> 32),
								  (uint32) record->xl_prev,
								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
			return false;
		}
	}

	return true;
}

/*
 * CRC-check an XLOG record.  We do not believe the contents of an XLOG
 * record (other than to the minimal extent of computing the amount of
 * data to read in) until we've checked the CRCs.
 *
 * We assume all of the record (that is, xl_tot_len bytes) has been read
 * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
 * record's header, which means in particular that xl_tot_len is at least
 * SizeOfXLogRecord.
 */
static bool
ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
{
	pg_crc32c	crc;

	/* Calculate the CRC */
	INIT_CRC32C(crc);
	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
	/* include the record header last */
	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32C(crc);

	if (!EQ_CRC32C(record->xl_crc, crc))
	{
		report_invalid_record(state,
							  "incorrect resource manager data checksum in record at %X/%X",
							  (uint32) (recptr >> 32), (uint32) recptr);
		return false;
	}

	return true;
}

/*
 * Validate a page header.
 *
 * Check if 'phdr' is valid as the header of the XLog page at position
 * 'recptr'.
 */
bool
XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
							 char *phdr)
{
	XLogRecPtr	recaddr;
	XLogSegNo	segno;
	int32		offset;
	XLogPageHeader hdr = (XLogPageHeader) phdr;

	Assert((recptr % XLOG_BLCKSZ) == 0);

	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);

	XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);

	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		report_invalid_record(state,
							  "invalid magic number %04X in log segment %s, offset %u",
							  hdr->xlp_magic,
							  fname,
							  offset);
		return false;
	}

	if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		report_invalid_record(state,
							  "invalid info bits %04X in log segment %s, offset %u",
							  hdr->xlp_info,
							  fname,
							  offset);
		return false;
	}

	if (hdr->xlp_info & XLP_LONG_HEADER)
	{
		XLogLongPageHeader longhdr = (XLogLongPageHeader) hdr;

		if (state->system_identifier &&
			longhdr->xlp_sysid != state->system_identifier)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
								  (unsigned long long) longhdr->xlp_sysid,
								  (unsigned long long) state->system_identifier);
			return false;
		}
		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: incorrect segment size in page header");
			return false;
		}
		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
			return false;
		}
	}
	else if (offset == 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		/* hmm, first page of file doesn't have a long header? */
		report_invalid_record(state,
							  "invalid info bits %04X in log segment %s, offset %u",
							  hdr->xlp_info,
							  fname,
							  offset);
		return false;
	}
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 00:34:53 +02:00
|
|
|
/*
|
2018-06-30 18:25:49 +02:00
|
|
|
* Check that the address on the page agrees with what we expected. This
|
|
|
|
* check typically fails when an old WAL segment is recycled, and hasn't
|
|
|
|
* yet been overwritten with new data yet.
|
Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 00:34:53 +02:00
|
|
|
*/
|
2013-01-16 20:12:53 +01:00
|
|
|
if (hdr->xlp_pageaddr != recaddr)
|
|
|
|
{
|
|
|
|
char fname[MAXFNAMELEN];
|
|
|
|
|
2019-09-24 21:08:31 +02:00
|
|
|
XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
|
2013-01-16 20:12:53 +01:00
|
|
|
|
|
|
|
report_invalid_record(state,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
"unexpected pageaddr %X/%X in log segment %s, offset %u",
|
|
|
|
(uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
|
2013-01-16 20:12:53 +01:00
|
|
|
fname,
|
|
|
|
offset);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since child timelines are always assigned a TLI greater than their
|
|
|
|
* immediate parent's TLI, we should never see TLI go backwards across
|
|
|
|
* successive pages of a consistent WAL sequence.
|
|
|
|
*
|
|
|
|
* Sometimes we re-read a segment that's already been (partially) read. So
|
|
|
|
* we only verify TLIs for pages that are later than the last remembered
|
|
|
|
* LSN.
|
|
|
|
*/
|
|
|
|
if (recptr > state->latestPagePtr)
|
|
|
|
{
|
|
|
|
if (hdr->xlp_tli < state->latestPageTLI)
|
|
|
|
{
|
|
|
|
char fname[MAXFNAMELEN];
|
|
|
|
|
2019-09-24 21:08:31 +02:00
|
|
|
XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
|
2013-01-16 20:12:53 +01:00
|
|
|
|
|
|
|
report_invalid_record(state,
|
|
|
|
"out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
|
|
|
|
hdr->xlp_tli,
|
|
|
|
state->latestPageTLI,
|
|
|
|
fname,
|
|
|
|
offset);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
state->latestPagePtr = recptr;
|
|
|
|
state->latestPageTLI = hdr->xlp_tli;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}

#ifdef FRONTEND

/*
 * Functions that are currently not needed in the backend, but are better
 * implemented inside xlogreader.c because of the internal facilities available
 * here.
 */

/*
 * Find the first record with an lsn >= RecPtr.
 *
 * Useful for checking whether RecPtr is a valid xlog address for reading, and
 * to find the first valid address after some address when dumping records for
 * debugging purposes.
 */
XLogRecPtr
XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
{
	XLogReaderState saved_state = *state;
	XLogRecPtr	tmpRecPtr;
	XLogRecPtr	found = InvalidXLogRecPtr;
	XLogPageHeader header;
	char	   *errormsg;

	Assert(!XLogRecPtrIsInvalid(RecPtr));

	/*
	 * Skip over potential continuation data, keeping in mind that it may span
	 * multiple pages.
	 */
	tmpRecPtr = RecPtr;
	while (true)
	{
		XLogRecPtr	targetPagePtr;
		int			targetRecOff;
		uint32		pageHeaderSize;
		int			readLen;

		/*
		 * Compute targetRecOff. It should typically be equal or greater than
		 * the short page header, since a valid record can't start anywhere
		 * before that, except when the caller has explicitly specified an
		 * offset that falls somewhere in there or when we are skipping a
		 * multi-page continuation record. It doesn't matter though, because
		 * ReadPageInternal() is prepared to handle that and will read at
		 * least a short page header's worth of data.
		 */
		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;

		/* scroll back to page boundary */
		targetPagePtr = tmpRecPtr - targetRecOff;

		/* Read the page containing the record */
		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
		if (readLen < 0)
			goto err;

		header = (XLogPageHeader) state->readBuf;

		pageHeaderSize = XLogPageHeaderSize(header);

		/* make sure we have enough data for the page header */
		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
		if (readLen < 0)
			goto err;

		/* skip over potential continuation data */
		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
		{
			/*
			 * If the length of the remaining continuation data is more than
			 * what can fit in this page, the continuation record crosses over
			 * this page. Read the next page and try again. xlp_rem_len in the
			 * next page header will contain the remaining length of the
			 * continuation data.
			 *
			 * Note that record headers are MAXALIGN'ed.
			 */
			if (MAXALIGN(header->xlp_rem_len) >= (XLOG_BLCKSZ - pageHeaderSize))
				tmpRecPtr = targetPagePtr + XLOG_BLCKSZ;
			else
			{
				/*
				 * The previous continuation record ends in this page. Set
				 * tmpRecPtr to point to the first valid record.
				 */
				tmpRecPtr = targetPagePtr + pageHeaderSize
					+ MAXALIGN(header->xlp_rem_len);
				break;
			}
		}
		else
		{
			tmpRecPtr = targetPagePtr + pageHeaderSize;
			break;
		}
	}

	/*
	 * We now know that tmpRecPtr is an address pointing to a valid XLogRecord
	 * because either we're at the first record after the beginning of a page
	 * or we just jumped over the remaining data of a continuation.
	 */
	while (XLogReadRecord(state, tmpRecPtr, &errormsg) != NULL)
	{
		/* continue after the record */
		tmpRecPtr = InvalidXLogRecPtr;

		/* past the record we've found, break out */
		if (RecPtr <= state->ReadRecPtr)
		{
			found = state->ReadRecPtr;
			goto out;
		}
	}

err:
out:
	/* Reset state to what we had before finding the record */
	state->ReadRecPtr = saved_state.ReadRecPtr;
	state->EndRecPtr = saved_state.EndRecPtr;
	XLogReaderInvalReadState(state);

	return found;
}
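
/*
 * Usage sketch (hypothetical caller, not part of this file): a frontend
 * tool handed an arbitrary LSN would first align it to a record boundary
 * with XLogFindNextRecord() before iterating with XLogReadRecord():
 *
 *		XLogRecPtr	first;
 *
 *		first = XLogFindNextRecord(xlogreader_state, start_lsn);
 *		if (XLogRecPtrIsInvalid(first))
 *			pg_log_error("could not find a valid record after %X/%X",
 *						 (uint32) (start_lsn >> 32), (uint32) start_lsn);
 *
 * Because the reader state is reset before returning, 'first' can be
 * passed straight to XLogReadRecord() afterwards.
 */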

#endif							/* FRONTEND */

/*
 * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
 * fetched from timeline 'tli'.
 *
 * 'seg/segcxt' identify the last segment used.  'openSegment' is a callback
 * to open the next segment, if necessary.
 *
 * Returns true if succeeded, false if an error occurs, in which case
 * 'errinfo' receives error details.
 *
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
		WALOpenSegment *seg, WALSegmentContext *segcxt,
		WALSegmentOpen openSegment, WALReadError *errinfo)
{
	char	   *p;
	XLogRecPtr	recptr;
	Size		nbytes;

	p = buf;
	recptr = startptr;
	nbytes = count;

	while (nbytes > 0)
	{
		uint32		startoff;
		int			segbytes;
		int			readbytes;

		startoff = XLogSegmentOffset(recptr, segcxt->ws_segsize);

		/*
		 * If the data we want is not in a segment we have open, close what we
		 * have (if anything) and open the next one, using the caller's
		 * provided openSegment callback.
		 */
		if (seg->ws_file < 0 ||
			!XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
			tli != seg->ws_tli)
		{
			XLogSegNo	nextSegNo;

			if (seg->ws_file >= 0)
				close(seg->ws_file);

			XLByteToSeg(recptr, nextSegNo, segcxt->ws_segsize);
			seg->ws_file = openSegment(nextSegNo, segcxt, &tli);

			/* Update the current segment info. */
			seg->ws_tli = tli;
			seg->ws_segno = nextSegNo;
		}

		/* How many bytes are within this segment? */
		if (nbytes > (segcxt->ws_segsize - startoff))
			segbytes = segcxt->ws_segsize - startoff;
		else
			segbytes = nbytes;

#ifndef FRONTEND
		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
#endif

		/* Reset errno first; eases reporting non-errno-affecting errors */
		errno = 0;
		readbytes = pg_pread(seg->ws_file, p, segbytes, (off_t) startoff);

#ifndef FRONTEND
		pgstat_report_wait_end();
#endif

		if (readbytes <= 0)
		{
			errinfo->wre_errno = errno;
			errinfo->wre_req = segbytes;
			errinfo->wre_read = readbytes;
			errinfo->wre_off = startoff;
			errinfo->wre_seg = *seg;
			return false;
		}

		/* Update state for read */
		recptr += readbytes;
		nbytes -= readbytes;
		p += readbytes;
	}

	return true;
}
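
/*
 * Usage sketch (hypothetical caller, not part of this file): a read_page
 * callback built on top of WALRead() typically looks like this, where
 * 'my_open_segment' is a caller-provided WALSegmentOpen callback and
 * 'errinfo' is a local WALReadError:
 *
 *		if (!WALRead(cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
 *					 &state->seg, &state->segcxt, my_open_segment, &errinfo))
 *			... report the failure using errinfo.wre_errno, wre_seg, etc. ...
 *
 * Note that WALRead() loops until all 'count' bytes are read or an error
 * occurs, transparently crossing segment boundaries, so the caller never
 * has to handle a short read itself.
 */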

/* ----------------------------------------
 * Functions for decoding the data and block references in a record.
 * ----------------------------------------
 */

/* private function to reset the state between records */
static void
ResetDecoder(XLogReaderState *state)
{
	int			block_id;

	state->decoded_record = NULL;

	state->main_data_len = 0;

	for (block_id = 0; block_id <= state->max_block_id; block_id++)
	{
		state->blocks[block_id].in_use = false;
		state->blocks[block_id].has_image = false;
		state->blocks[block_id].has_data = false;
		state->blocks[block_id].apply_image = false;
	}
	state->max_block_id = -1;
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Decode the previously read record.
|
|
|
|
*
|
|
|
|
* On error, a human-readable error message is returned in *errormsg, and
|
|
|
|
* the return value is false.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* read next _size bytes from record buffer, but check for overrun first.
|
|
|
|
*/
|
|
|
|
#define COPY_HEADER_FIELD(_dst, _size) \
|
|
|
|
do { \
|
|
|
|
if (remaining < _size) \
|
|
|
|
goto shortdata_err; \
|
|
|
|
memcpy(_dst, ptr, _size); \
|
|
|
|
ptr += _size; \
|
|
|
|
remaining -= _size; \
|
|
|
|
} while(0)
|
|
|
|
|
|
|
|
char *ptr;
|
|
|
|
uint32 remaining;
|
|
|
|
uint32 datatotal;
|
|
|
|
RelFileNode *rnode = NULL;
|
|
|
|
uint8 block_id;
|
|
|
|
|
|
|
|
ResetDecoder(state);
|
|
|
|
|
|
|
|
state->decoded_record = record;
|
Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
|
|
|
state->record_origin = InvalidRepOriginId;
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
|
|
|
|
ptr = (char *) record;
|
|
|
|
ptr += SizeOfXLogRecord;
|
|
|
|
remaining = record->xl_tot_len - SizeOfXLogRecord;
|
|
|
|
|
|
|
|
/* Decode the headers */
|
|
|
|
datatotal = 0;
|
|
|
|
while (remaining > datatotal)
|
|
|
|
{
|
|
|
|
COPY_HEADER_FIELD(&block_id, sizeof(uint8));
|
|
|
|
|
|
|
|
if (block_id == XLR_BLOCK_ID_DATA_SHORT)
|
|
|
|
{
|
|
|
|
/* XLogRecordDataHeaderShort */
|
|
|
|
uint8 main_data_len;
|
|
|
|
|
|
|
|
COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
|
|
|
|
|
|
|
|
state->main_data_len = main_data_len;
|
|
|
|
datatotal += main_data_len;
|
|
|
|
break; /* by convention, the main data fragment is
|
|
|
|
* always last */
|
|
|
|
}
|
|
|
|
else if (block_id == XLR_BLOCK_ID_DATA_LONG)
|
|
|
|
{
|
|
|
|
/* XLogRecordDataHeaderLong */
|
|
|
|
uint32 main_data_len;
|
|
|
|
|
|
|
|
COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
|
|
|
|
state->main_data_len = main_data_len;
|
|
|
|
datatotal += main_data_len;
|
|
|
|
break; /* by convention, the main data fragment is
|
|
|
|
* always last */
|
|
|
|
}
|
Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
|
|
|
else if (block_id == XLR_BLOCK_ID_ORIGIN)
|
|
|
|
{
|
|
|
|
COPY_HEADER_FIELD(&state->record_origin, sizeof(RepOriginId));
|
|
|
|
}
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed-size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
else if (block_id <= XLR_MAX_BLOCK_ID)
{
    /* XLogRecordBlockHeader */
    DecodedBkpBlock *blk;
    uint8       fork_flags;

    if (block_id <= state->max_block_id)
    {
        report_invalid_record(state,
                              "out-of-order block_id %u at %X/%X",
                              block_id,
                              (uint32) (state->ReadRecPtr >> 32),
                              (uint32) state->ReadRecPtr);
        goto err;
    }
    state->max_block_id = block_id;

    blk = &state->blocks[block_id];
    blk->in_use = true;
    blk->apply_image = false;
    COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
    blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
    blk->flags = fork_flags;
    blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
    blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);

    COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
    /* cross-check that the HAS_DATA flag is set iff data_length > 0 */
    if (blk->has_data && blk->data_len == 0)
    {
        report_invalid_record(state,
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
                              "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                              (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
        goto err;
    }
    if (!blk->has_data && blk->data_len != 0)
    {
        report_invalid_record(state,
                              "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
                              (unsigned int) blk->data_len,
                              (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
        goto err;
    }
    datatotal += blk->data_len;

    if (blk->has_image)
    {
Add GUC to enable compression of full page images stored in WAL.
When the newly-added GUC parameter wal_compression is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume by one byte per full page image even if WAL compression is not used
at all. We could save that byte by borrowing one bit from an existing field
like hole_offset in the header and using it as the flag, but that would reduce
the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithms. Of course,
in the future, it's worth considering support for other compression
algorithms for better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
        COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
        COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
        COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));

        blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);

        if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED)
        {
            if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
                COPY_HEADER_FIELD(&blk->hole_length, sizeof(uint16));
            else
                blk->hole_length = 0;
        }
        else
            blk->hole_length = BLCKSZ - blk->bimg_len;
        datatotal += blk->bimg_len;

        /*
         * cross-check that hole_offset > 0, hole_length > 0 and
         * bimg_len < BLCKSZ if the HAS_HOLE flag is set.
         */
        if ((blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
            (blk->hole_offset == 0 ||
             blk->hole_length == 0 ||
             blk->bimg_len == BLCKSZ))
        {
            report_invalid_record(state,
                                  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
                                  (unsigned int) blk->hole_offset,
                                  (unsigned int) blk->hole_length,
                                  (unsigned int) blk->bimg_len,
                                  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
            goto err;
        }
        /*
         * cross-check that hole_offset == 0 and hole_length == 0 if
         * the HAS_HOLE flag is not set.
         */
        if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
            (blk->hole_offset != 0 || blk->hole_length != 0))
        {
            report_invalid_record(state,
                                  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
|
|
|
(unsigned int) blk->hole_offset,
|
|
|
|
(unsigned int) blk->hole_length,
|
|
|
|
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
|
|
|
|
goto err;
|
|
|
|
}

				/*
				 * cross-check that bimg_len < BLCKSZ if the IS_COMPRESSED
				 * flag is set.
				 */
				if ((blk->bimg_info & BKPIMAGE_IS_COMPRESSED) &&
					blk->bimg_len == BLCKSZ)
				{
					report_invalid_record(state,
										  "BKPIMAGE_IS_COMPRESSED set, but block image length %u at %X/%X",
										  (unsigned int) blk->bimg_len,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				/*
				 * cross-check that bimg_len = BLCKSZ if neither HAS_HOLE nor
				 * IS_COMPRESSED flag is set.
				 */
				if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					!(blk->bimg_info & BKPIMAGE_IS_COMPRESSED) &&
					blk->bimg_len != BLCKSZ)
				{
					report_invalid_record(state,
										  "neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_IS_COMPRESSED set, but block image length is %u at %X/%X",
										  (unsigned int) blk->data_len,
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}
			}

			if (!(fork_flags & BKPBLOCK_SAME_REL))
			{
				COPY_HEADER_FIELD(&blk->rnode, sizeof(RelFileNode));
				rnode = &blk->rnode;
			}
			else
			{
				if (rnode == NULL)
				{
					report_invalid_record(state,
										  "BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
										  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
					goto err;
				}

				blk->rnode = *rnode;
			}
			COPY_HEADER_FIELD(&blk->blkno, sizeof(BlockNumber));
		}
		else
		{
			report_invalid_record(state,
								  "invalid block_id %u at %X/%X",
								  block_id,
								  (uint32) (state->ReadRecPtr >> 32),
								  (uint32) state->ReadRecPtr);
			goto err;
		}
	}

	if (remaining != datatotal)
		goto shortdata_err;

	/*
	 * Ok, we've parsed the fragment headers, and verified that the total
	 * length of the payload in the fragments is equal to the amount of data
	 * left. Copy the data of each fragment to a separate buffer.
	 *
	 * We could just set up pointers into readRecordBuf, but we want to align
	 * the data for the convenience of the callers. Backup images are not
	 * copied, however; they don't need alignment.
	 */

	/* block data first */
	for (block_id = 0; block_id <= state->max_block_id; block_id++)
	{
		DecodedBkpBlock *blk = &state->blocks[block_id];

		if (!blk->in_use)
			continue;

		Assert(blk->has_image || !blk->apply_image);

		if (blk->has_image)
		{
			blk->bkp_image = ptr;
			ptr += blk->bimg_len;
		}
		if (blk->has_data)
		{
			if (!blk->data || blk->data_len > blk->data_bufsz)
			{
				if (blk->data)
					pfree(blk->data);

				/*
				 * Force the initial request to be BLCKSZ so that we don't
				 * waste time with lots of trips through this stanza as a
				 * result of WAL compression.
				 */
				blk->data_bufsz = MAXALIGN(Max(blk->data_len, BLCKSZ));
				blk->data = palloc(blk->data_bufsz);
			}
			memcpy(blk->data, ptr, blk->data_len);
			ptr += blk->data_len;
		}
	}

	/* and finally, the main data */
	if (state->main_data_len > 0)
	{
		if (!state->main_data || state->main_data_len > state->main_data_bufsz)
		{
			if (state->main_data)
				pfree(state->main_data);

			/*
			 * main_data_bufsz must be MAXALIGN'ed.  In many xlog record
			 * types, we omit trailing struct padding on-disk to save a few
			 * bytes; but compilers may generate accesses to the xlog struct
			 * that assume that padding bytes are present.  If the palloc
			 * request is not large enough to include such padding bytes then
			 * we'll get valgrind complaints due to otherwise-harmless fetches
			 * of the padding bytes.
			 *
			 * In addition, force the initial request to be reasonably large
			 * so that we don't waste time with lots of trips through this
			 * stanza.  BLCKSZ / 2 seems like a good compromise choice.
			 */
			state->main_data_bufsz = MAXALIGN(Max(state->main_data_len,
												  BLCKSZ / 2));
			state->main_data = palloc(state->main_data_bufsz);
		}
		memcpy(state->main_data, ptr, state->main_data_len);
		ptr += state->main_data_len;
	}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
|
|
|
|
shortdata_err:
|
|
|
|
report_invalid_record(state,
|
|
|
|
"record with invalid length at %X/%X",
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
						  (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
err:
	*errormsg = state->errormsg_buf;

	return false;
}

/*
 * Returns information about the block that a block reference refers to.
 *
 * If the WAL record contains a block reference with the given ID, *rnode,
 * *forknum, and *blknum are filled in (if not NULL), and returns true.
 * Otherwise returns false.
 */
bool
XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
				   RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
{
	DecodedBkpBlock *bkpb;

	if (!record->blocks[block_id].in_use)
		return false;

	bkpb = &record->blocks[block_id];
	if (rnode)
		*rnode = bkpb->rnode;
	if (forknum)
		*forknum = bkpb->forknum;
	if (blknum)
		*blknum = bkpb->blkno;
	return true;
}

/*
 * Returns the data associated with a block reference, or NULL if there is
 * no data (e.g. because a full-page image was taken instead). The returned
 * pointer points to a MAXALIGNed buffer.
 */
char *
XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
{
	DecodedBkpBlock *bkpb;

	if (!record->blocks[block_id].in_use)
		return NULL;

	bkpb = &record->blocks[block_id];

	if (!bkpb->has_data)
	{
		if (len)
			*len = 0;
		return NULL;
	}
	else
	{
		if (len)
			*len = bkpb->data_len;
		return bkpb->data;
	}
}

/*
 * Restore a full-page image from a backup block attached to an XLOG record.
 *
 * Returns true if a full-page image was successfully restored into *page,
 * false otherwise.
 */
bool
RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
{
	DecodedBkpBlock *bkpb;
	char	   *ptr;
	PGAlignedBlock tmp;

	if (!record->blocks[block_id].in_use)
		return false;
	if (!record->blocks[block_id].has_image)
		return false;

	bkpb = &record->blocks[block_id];
Add GUC to enable compression of full page images stored in WAL.
When the newly-added GUC parameter wal_compression is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping the WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume by one byte per full page image even if WAL compression is not used
at all. We could save that byte by borrowing a bit from an existing field,
such as hole_offset in the header, and using it as the flag, but that would
reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithms. Of course,
in the future, it's worth considering support for other compression
algorithms for better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
	ptr = bkpb->bkp_image;

	if (bkpb->bimg_info & BKPIMAGE_IS_COMPRESSED)
	{
		/* If a backup block image is compressed, decompress it */
		if (pglz_decompress(ptr, bkpb->bimg_len, tmp.data,
							BLCKSZ - bkpb->hole_length, true) < 0)
		{
			report_invalid_record(record, "invalid compressed image at %X/%X, block %d",
								  (uint32) (record->ReadRecPtr >> 32),
								  (uint32) record->ReadRecPtr,
								  block_id);
			return false;
		}
		ptr = tmp.data;
	}

	/* generate page, taking into account hole if necessary */
	if (bkpb->hole_length == 0)
	{
		memcpy(page, ptr, BLCKSZ);
	}
	else
	{
		memcpy(page, ptr, bkpb->hole_offset);
		/* must zero-fill the hole */
		MemSet(page + bkpb->hole_offset, 0, bkpb->hole_length);
		memcpy(page + (bkpb->hole_offset + bkpb->hole_length),
|
|
|
ptr + bkpb->hole_offset,
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed-size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
2019-07-15 07:03:46 +02:00
|
|
|
|
|
|
|
#ifndef FRONTEND
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extract the FullTransactionId from a WAL record.
|
|
|
|
*/
|
|
|
|
FullTransactionId
|
|
|
|
XLogRecGetFullXid(XLogReaderState *record)
|
|
|
|
{
|
|
|
|
TransactionId xid,
|
|
|
|
next_xid;
|
|
|
|
uint32 epoch;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is only safe during replay, because it depends on the
|
|
|
|
* replay state. See AdvanceNextFullTransactionIdPastXid() for more.
|
|
|
|
*/
|
|
|
|
Assert(AmStartupProcess() || !IsUnderPostmaster);
|
|
|
|
|
|
|
|
xid = XLogRecGetXid(record);
|
|
|
|
next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
|
|
|
|
epoch = EpochFromFullTransactionId(ShmemVariableCache->nextFullXid);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If xid is numerically greater than next_xid, it has to be from the
|
|
|
|
* last epoch.
|
|
|
|
*/
|
|
|
|
if (unlikely(xid > next_xid))
|
|
|
|
--epoch;
|
|
|
|
|
|
|
|
return FullTransactionIdFromEpochAndXid(epoch, xid);
|
2019-07-31 03:29:55 +02:00
|
|
|
}
|
2019-07-15 07:03:46 +02:00
|
|
|
|
|
|
|
#endif
|