/*-------------------------------------------------------------------------
 *
 * xlogreader.c
 *		Generic XLog reading facility
 *
 * Portions Copyright (c) 2013-2023, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/backend/access/transam/xlogreader.c
 *
 * NOTES
 *		See xlogreader.h for more notes on this facility.
 *
 *		This file is compiled as both front-end and backend code, so it
 *		may not use ereport, server-defined static variables, etc.
 *-------------------------------------------------------------------------
 */
#include "postgres.h"

#include <unistd.h>
#ifdef USE_LZ4
#include <lz4.h>
#endif
#ifdef USE_ZSTD
#include <zstd.h>
#endif

#include "access/transam.h"
#include "access/xlog_internal.h"
#include "access/xlogreader.h"
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "replication/origin.h"

#ifndef FRONTEND
#include "miscadmin.h"
#include "pgstat.h"
#include "utils/memutils.h"
#else
#include "common/logging.h"
#endif

static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
			pg_attribute_printf(2, 3);
static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
							 int reqLen);
static void XLogReaderInvalReadState(XLogReaderState *state);
static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
								  XLogRecPtr PrevRecPtr, XLogRecord *record,
								  bool randAccess);
static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
							XLogRecPtr recptr);
static void ResetDecoder(XLogReaderState *state);
static void WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
							   int segsize, const char *waldir);

/* size of the buffer allocated for error message. */
#define MAX_ERRORMSG_LEN 1000

/*
 * Default size; large enough that typical users of XLogReader won't often need
 * to use the 'oversized' memory allocation code path.
 */
#define DEFAULT_DECODE_BUFFER_SIZE (64 * 1024)

/*
 * Construct a string in state->errormsg_buf explaining what's wrong with
 * the current record being read.
 */
static void
report_invalid_record(XLogReaderState *state, const char *fmt,...)
{
	va_list		args;

	fmt = _(fmt);

	va_start(args, fmt);
	vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
	va_end(args);

	state->errormsg_deferred = true;
}

/*
 * Set the size of the decoding buffer.  A pointer to a caller supplied memory
 * region may also be passed in, in which case non-oversized records will be
 * decoded there.
 */
void
XLogReaderSetDecodeBuffer(XLogReaderState *state, void *buffer, size_t size)
{
	Assert(state->decode_buffer == NULL);

	state->decode_buffer = buffer;
	state->decode_buffer_size = size;
	state->decode_buffer_tail = buffer;
	state->decode_buffer_head = buffer;
}
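
/*
 * Illustrative sketch (an editorial example, not code compiled as part of
 * this file): a caller that wants non-oversized records decoded into its own
 * memory can install that region right after allocating the reader.  The
 * names "reader" and "my_buf" are hypothetical.
 *
 *		static char my_buf[DEFAULT_DECODE_BUFFER_SIZE];
 *
 *		XLogReaderSetDecodeBuffer(reader, my_buf, sizeof(my_buf));
 *
 * If no buffer is ever supplied, the reader is expected to allocate one
 * lazily on first use, using decode_buffer_size if set, else the default.
 */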

/*
 * Allocate and initialize a new XLogReader.
 *
 * Returns NULL if the xlogreader couldn't be allocated.
 */
XLogReaderState *
XLogReaderAllocate(int wal_segment_size, const char *waldir,
				   XLogReaderRoutine *routine, void *private_data)
{
	XLogReaderState *state;

	state = (XLogReaderState *)
		palloc_extended(sizeof(XLogReaderState),
						MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
	if (!state)
		return NULL;

	/* initialize caller-provided support functions */
	state->routine = *routine;

	/*
	 * Permanently allocate readBuf.  We do it this way, rather than just
	 * making a static array, for two reasons: (1) no need to waste the
	 * storage in most instantiations of the backend; (2) a static char array
	 * isn't guaranteed to have any particular alignment, whereas
	 * palloc_extended() will provide MAXALIGN'd storage.
	 */
	state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ,
											  MCXT_ALLOC_NO_OOM);
	if (!state->readBuf)
	{
		pfree(state);
		return NULL;
	}

	/* Initialize segment info. */
	WALOpenSegmentInit(&state->seg, &state->segcxt, wal_segment_size,
					   waldir);

	/* system_identifier initialized to zeroes above */
	state->private_data = private_data;
	/* ReadRecPtr, EndRecPtr and readLen initialized to zeroes above */
	state->errormsg_buf = palloc_extended(MAX_ERRORMSG_LEN + 1,
										  MCXT_ALLOC_NO_OOM);
	if (!state->errormsg_buf)
	{
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}
	state->errormsg_buf[0] = '\0';

	/*
	 * Allocate an initial readRecordBuf of minimal size, which can later be
	 * enlarged if necessary.
	 */
	if (!allocate_recordbuf(state, 0))
	{
		pfree(state->errormsg_buf);
		pfree(state->readBuf);
		pfree(state);
		return NULL;
	}

	return state;
}
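
/*
 * Illustrative sketch (an editorial example, not code compiled as part of
 * this file): the typical allocate/free pairing, checking the NULL return
 * for out-of-memory.  The callback names are hypothetical caller-supplied
 * implementations of the XLogReaderRoutine members; XL_ROUTINE() is the
 * convenience macro from xlogreader.h for building the routine struct.
 *
 *		XLogReaderState *reader;
 *
 *		reader = XLogReaderAllocate(wal_segment_size, NULL,
 *									XL_ROUTINE(.page_read = &my_page_read,
 *											   .segment_open = &my_seg_open,
 *											   .segment_close = &my_seg_close),
 *									NULL);
 *		if (reader == NULL)
 *			return NULL;			(or otherwise report out of memory)
 *		...
 *		XLogReaderFree(reader);
 */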

void
XLogReaderFree(XLogReaderState *state)
{
	if (state->seg.ws_file != -1)
		state->routine.segment_close(state);

	if (state->decode_buffer && state->free_decode_buffer)
		pfree(state->decode_buffer);

	pfree(state->errormsg_buf);
	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	pfree(state->readBuf);
	pfree(state);
}
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate readRecordBuf to fit a record of at least the given length.
|
|
|
|
* Returns true if successful, false if out of memory.
|
|
|
|
*
|
|
|
|
* readRecordBufSize is set to the new buffer size.
|
|
|
|
*
|
|
|
|
* To avoid useless small increases, round its size to a multiple of
|
|
|
|
* XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
|
|
|
|
* with. (That is enough for all "normal" records, but very large commit or
|
|
|
|
* abort records might need more space.)
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
allocate_recordbuf(XLogReaderState *state, uint32 reclength)
|
|
|
|
{
|
|
|
|
uint32 newSize = reclength;
|
|
|
|
|
|
|
|
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
|
|
|
|
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
|
|
|
|
|

#ifndef FRONTEND

	/*
	 * Note that in very unlucky circumstances, the random data read from a
	 * recycled segment can cause this routine to be called with a size
	 * causing a hard failure at allocation.  For a standby, this would cause
	 * the instance to stop suddenly with a hard failure, preventing it from
	 * retrying to fetch WAL from one of its sources, which could otherwise
	 * allow it to move on with replay without a manual restart.  If the data
	 * comes from a past recycled segment and is still valid, then the
	 * allocation may succeed, but record checks are going to fail, so this
	 * would be short-lived.  If the allocation fails because of a memory
	 * shortage, then this is not a hard failure either, per the guarantee
	 * given by MCXT_ALLOC_NO_OOM.
	 */
	if (!AllocSizeIsValid(newSize))
		return false;

#endif

	if (state->readRecordBuf)
		pfree(state->readRecordBuf);
	state->readRecordBuf =
		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
	if (state->readRecordBuf == NULL)
	{
		state->readRecordBufSize = 0;
		return false;
	}
	state->readRecordBufSize = newSize;
	return true;
}


/*
 * Initialize the passed segment structs.
 */
static void
WALOpenSegmentInit(WALOpenSegment *seg, WALSegmentContext *segcxt,
				   int segsize, const char *waldir)
{
	seg->ws_file = -1;
	seg->ws_segno = 0;
	seg->ws_tli = 0;

	segcxt->ws_segsize = segsize;
	if (waldir)
		snprintf(segcxt->ws_dir, MAXPGPATH, "%s", waldir);
}

/*
 * Begin reading WAL at 'RecPtr'.
 *
 * 'RecPtr' should point to the beginning of a valid WAL record.  Pointing at
 * the beginning of a page is also OK, if there is a new record right after
 * the page header, i.e. not a continuation.
 *
 * This does not make any attempt to read the WAL yet, and hence cannot fail.
 * If the starting address is not correct, the first call to XLogReadRecord()
 * will error out.
 */
void
XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr)
{
	Assert(!XLogRecPtrIsInvalid(RecPtr));

	ResetDecoder(state);

	/* Begin at the passed-in record pointer. */
	state->EndRecPtr = RecPtr;
	state->NextRecPtr = RecPtr;
	state->ReadRecPtr = InvalidXLogRecPtr;
	state->DecodeRecPtr = InvalidXLogRecPtr;
}

/*
 * Release the last record that was returned by XLogNextRecord(), if any, to
 * free up space.  Returns the LSN past the end of the record.
 */
XLogRecPtr
XLogReleasePreviousRecord(XLogReaderState *state)
{
	DecodedXLogRecord *record;
	XLogRecPtr	next_lsn;

	if (!state->record)
		return InvalidXLogRecPtr;

	/*
	 * Remove it from the decoded record queue.  It must be the oldest item
	 * decoded, decode_queue_head.
	 */
	record = state->record;
	next_lsn = record->next_lsn;
	Assert(record == state->decode_queue_head);
	state->record = NULL;
	state->decode_queue_head = record->next;

	/* It might also be the newest item decoded, decode_queue_tail. */
	if (state->decode_queue_tail == record)
		state->decode_queue_tail = NULL;

	/* Release the space. */
	if (unlikely(record->oversized))
	{
		/* It's not in the decode buffer, so free it to release space. */
		pfree(record);
	}
	else
	{
		/* It must be the head (oldest) record in the decode buffer. */
		Assert(state->decode_buffer_head == (char *) record);

		/*
		 * We need to update head to point to the next record that is in the
		 * decode buffer, if any, being careful to skip oversized ones
		 * (they're not in the decode buffer).
		 */
		record = record->next;
		while (unlikely(record && record->oversized))
			record = record->next;

		if (record)
		{
			/* Adjust head to release space up to the next record. */
			state->decode_buffer_head = (char *) record;
		}
		else
		{
			/*
			 * Otherwise we might as well just reset head and tail to the
			 * start of the buffer space, because we're empty.  This means
			 * we'll keep overwriting the same piece of memory if we're not
			 * doing any prefetching.
			 */
			state->decode_buffer_head = state->decode_buffer;
			state->decode_buffer_tail = state->decode_buffer;
		}
	}

	return next_lsn;
}

/*
 * Attempt to read an XLOG record.
 *
 * XLogBeginRead() or XLogFindNextRecord() and then XLogReadAhead() must be
 * called before the first call to XLogNextRecord().  This function returns
 * records and errors that were put into an internal queue by XLogReadAhead().
 *
 * On success, a record is returned.
 *
 * The returned record (or *errormsg) points to an internal buffer that's
 * valid until the next call to XLogNextRecord.
 */
DecodedXLogRecord *
XLogNextRecord(XLogReaderState *state, char **errormsg)
{
	/* Release the last record returned by XLogNextRecord(). */
	XLogReleasePreviousRecord(state);

	if (state->decode_queue_head == NULL)
	{
		*errormsg = NULL;
		if (state->errormsg_deferred)
		{
			if (state->errormsg_buf[0] != '\0')
				*errormsg = state->errormsg_buf;
			state->errormsg_deferred = false;
		}

		/*
		 * state->EndRecPtr is expected to have been set by the last call to
		 * XLogBeginRead() or XLogNextRecord(), and is the location of the
		 * error.
		 */
		Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));

		return NULL;
	}

	/*
	 * Record this as the most recent record returned, so that we'll release
	 * it next time.  This also exposes it to the traditional
	 * XLogRecXXX(xlogreader) macros, which work with the decoder rather than
	 * the record for historical reasons.
	 */
	state->record = state->decode_queue_head;

	/*
	 * Update the pointers to the beginning and one-past-the-end of this
	 * record, again for the benefit of historical code that expected the
	 * decoder to track this rather than accessing these fields of the record
	 * itself.
	 */
	state->ReadRecPtr = state->record->lsn;
	state->EndRecPtr = state->record->next_lsn;

	*errormsg = NULL;

	return state->record;
}

/*
 * Attempt to read an XLOG record.
 *
 * XLogBeginRead() or XLogFindNextRecord() must be called before the first call
 * to XLogReadRecord().
 *
 * If the page_read callback fails to read the requested data, NULL is
 * returned.  The callback is expected to have reported the error; errormsg
 * is set to NULL.
 *
 * If the reading fails for some other reason, NULL is also returned, and
 * *errormsg is set to a string with details of the failure.
 *
 * The returned pointer (or *errormsg) points to an internal buffer that's
 * valid until the next call to XLogReadRecord.
 */
XLogRecord *
XLogReadRecord(XLogReaderState *state, char **errormsg)
{
	DecodedXLogRecord *decoded;

	/*
	 * Release last returned record, if there is one.  We need to do this so
	 * that we can check for an empty decode queue accurately.
	 */
	XLogReleasePreviousRecord(state);

	/*
	 * Call XLogReadAhead() in blocking mode to make sure there is something
	 * in the queue, though we don't use the result.
	 */
	if (!XLogReaderHasQueuedRecordOrError(state))
		XLogReadAhead(state, false /* nonblocking */ );

	/* Consume the head record or error. */
	decoded = XLogNextRecord(state, errormsg);
	if (decoded)
	{
		/*
		 * This function returns a pointer to the record's header, not the
		 * actual decoded record.  The caller will access the decoded record
		 * through the XLogRecGetXXX() macros, which reach the decoded record
		 * via xlogreader->record.
		 */
		Assert(state->record == decoded);
		return &decoded->header;
	}

	return NULL;
}
/*
 * Allocate space for a decoded record.  The only member of the returned
 * object that is initialized is the 'oversized' flag, indicating that the
 * decoded record wouldn't fit in the decode buffer and must eventually be
 * freed explicitly.
 *
 * The caller is responsible for adjusting decode_buffer_tail with the real
 * size after successfully decoding a record into this space.  This way, if
 * decoding fails, then there is nothing to undo unless the 'oversized' flag
 * was set and pfree() must be called.
 *
 * Return NULL if there is no space in the decode buffer and allow_oversized
 * is false, or if memory allocation fails for an oversized buffer.
 */
static DecodedXLogRecord *
XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)
{
	size_t		required_space = DecodeXLogRecordRequiredSpace(xl_tot_len);
	DecodedXLogRecord *decoded = NULL;

	/* Allocate a circular decode buffer if we don't have one already. */
	if (unlikely(state->decode_buffer == NULL))
	{
		if (state->decode_buffer_size == 0)
			state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
		state->decode_buffer = palloc(state->decode_buffer_size);
		state->decode_buffer_head = state->decode_buffer;
		state->decode_buffer_tail = state->decode_buffer;
		state->free_decode_buffer = true;
	}

	/* Try to allocate space in the circular decode buffer. */
	if (state->decode_buffer_tail >= state->decode_buffer_head)
	{
		/* Empty, or tail is to the right of head. */
		if (state->decode_buffer_tail + required_space <=
			state->decode_buffer + state->decode_buffer_size)
		{
			/* There is space between tail and end. */
			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
			decoded->oversized = false;
			return decoded;
		}
		else if (state->decode_buffer + required_space <
				 state->decode_buffer_head)
		{
			/* There is space between start and head. */
			decoded = (DecodedXLogRecord *) state->decode_buffer;
			decoded->oversized = false;
			return decoded;
		}
	}
	else
	{
		/* Tail is to the left of head. */
		if (state->decode_buffer_tail + required_space <
			state->decode_buffer_head)
		{
			/* There is space between tail and head. */
			decoded = (DecodedXLogRecord *) state->decode_buffer_tail;
			decoded->oversized = false;
			return decoded;
		}
	}

	/* Not enough space in the decode buffer.  Are we allowed to allocate? */
	if (allow_oversized)
	{
		decoded = palloc_extended(required_space, MCXT_ALLOC_NO_OOM);
		if (decoded == NULL)
			return NULL;
		decoded->oversized = true;
		return decoded;
	}

	return NULL;
}

static XLogPageReadResult
XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
{
	XLogRecPtr	RecPtr;
	XLogRecord *record;
	XLogRecPtr	targetPagePtr;
	bool		randAccess;
	uint32		len,
				total_len;
	uint32		targetRecOff;
	uint32		pageHeaderSize;
	bool		assembled;
	bool		gotheader;
	int			readOff;
	DecodedXLogRecord *decoded;
	char	   *errormsg;		/* not used */

	/*
	 * randAccess indicates whether to verify the previous-record pointer of
	 * the record we're reading.  We only do this if we're reading
	 * sequentially, which is what we initially assume.
	 */
	randAccess = false;

	/* reset error state */
	state->errormsg_buf[0] = '\0';
	decoded = NULL;

	state->abortedRecPtr = InvalidXLogRecPtr;
	state->missingContrecPtr = InvalidXLogRecPtr;

	RecPtr = state->NextRecPtr;

	if (state->DecodeRecPtr != InvalidXLogRecPtr)
	{
		/* read the record after the one we just read */

		/*
		 * NextRecPtr is pointing to end+1 of the previous WAL record.  If
		 * we're at a page boundary, no more records can fit on the current
		 * page.  We must skip over the page header, but we can't do that
		 * until we've read in the page, since the header size is variable.
		 */
	}
	else
	{
		/*
		 * Caller supplied a position to start at.
		 *
		 * In this case, NextRecPtr should already be pointing either to a
		 * valid record starting position or alternatively to the beginning
		 * of a page.  See the header comments for XLogBeginRead.
		 */
		Assert(RecPtr % XLOG_BLCKSZ == 0 || XRecOffIsValid(RecPtr));
		randAccess = true;
	}

restart:
	state->nonblocking = nonblocking;
	state->currRecPtr = RecPtr;
	assembled = false;

	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/*
	 * Read the page containing the record into state->readBuf.  Request
	 * enough bytes to cover the whole record header, or at least the part of
	 * it that fits on the same page.
	 */
	readOff = ReadPageInternal(state, targetPagePtr,
							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
	if (readOff == XLREAD_WOULDBLOCK)
		return XLREAD_WOULDBLOCK;
	else if (readOff < 0)
		goto err;

	/*
	 * ReadPageInternal always returns at least the page header, so we can
	 * examine it now.
	 */
	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
	if (targetRecOff == 0)
	{
		/*
		 * At page start, so skip over page header.
		 */
		RecPtr += pageHeaderSize;
		targetRecOff = pageHeaderSize;
	}
	else if (targetRecOff < pageHeaderSize)
	{
		report_invalid_record(state, "invalid record offset at %X/%X: expected at least %u, got %u",
							  LSN_FORMAT_ARGS(RecPtr),
							  pageHeaderSize, targetRecOff);
		goto err;
	}

	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
		targetRecOff == pageHeaderSize)
	{
		report_invalid_record(state, "contrecord is requested by %X/%X",
							  LSN_FORMAT_ARGS(RecPtr));
		goto err;
	}

	/* ReadPageInternal has verified the page header */
	Assert(pageHeaderSize <= readOff);

	/*
	 * Read the record length.
	 *
	 * NB: Even though we use an XLogRecord pointer here, the whole record
	 * header might not fit on this page.  xl_tot_len is the first field of
	 * the struct, so it must be on this page (the records are MAXALIGNed),
	 * but we cannot access any other fields until we've verified that we got
	 * the whole header.
	 */
	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
	total_len = record->xl_tot_len;

	/*
	 * If the whole record header is on this page, validate it immediately.
	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
	 * rest of the header after reading it from the next page.  The
	 * xl_tot_len check is necessary here to ensure that we enter the "Need
	 * to reassemble record" code path below; otherwise we might fail to
	 * apply ValidXLogRecordHeader at all.
	 */
	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
	{
		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
								   randAccess))
			goto err;
		gotheader = true;
	}
	else
	{
		/* XXX: more validation should be done here */
		if (total_len < SizeOfXLogRecord)
		{
			report_invalid_record(state,
								  "invalid record length at %X/%X: expected at least %u, got %u",
								  LSN_FORMAT_ARGS(RecPtr),
								  (uint32) SizeOfXLogRecord, total_len);
			goto err;
		}
		gotheader = false;
	}
2022-03-18 05:45:04 +01:00
|
|
|
/*
|
|
|
|
* Find space to decode this record. Don't allow oversized allocation if
|
|
|
|
* the caller requested nonblocking. Otherwise, we *have* to try to
|
|
|
|
* decode the record now because the caller has nothing else to do, so
|
|
|
|
* allow an oversized record to be palloc'd if that turns out to be
|
|
|
|
* necessary.
|
|
|
|
*/
|
|
|
|
decoded = XLogReadRecordAlloc(state,
|
|
|
|
total_len,
|
|
|
|
!nonblocking /* allow_oversized */ );
|
|
|
|
if (decoded == NULL)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* There is no space in the decode buffer. The caller should help
|
|
|
|
* with that problem by consuming some records.
|
|
|
|
*/
|
|
|
|
if (nonblocking)
|
|
|
|
return XLREAD_WOULDBLOCK;
|
|
|
|
|
|
|
|
/* We failed to allocate memory for an oversized record. */
|
|
|
|
report_invalid_record(state,
|
|
|
|
"out of memory while trying to decode a record of length %u", total_len);
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
|
|
|
|
if (total_len > len)
|
2021-04-08 13:03:34 +02:00
|
|
|
{
|
2021-05-10 06:00:53 +02:00
|
|
|
/* Need to reassemble record */
|
|
|
|
char *contdata;
|
|
|
|
XLogPageHeader pageHeader;
|
|
|
|
char *buffer;
|
|
|
|
uint32 gotlen;
|
2021-04-08 13:03:34 +02:00
|
|
|
|
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 16:21:51 +02:00
|
|
|
		assembled = true;

		/*
		 * Enlarge readRecordBuf as needed.
		 */
		if (total_len > state->readRecordBufSize &&
			!allocate_recordbuf(state, total_len))
		{
			/* We treat this as a "bogus data" condition */
			report_invalid_record(state, "record length %u at %X/%X too long",
								  total_len, LSN_FORMAT_ARGS(RecPtr));
			goto err;
		}

		/* Copy the first fragment of the record from the first page. */
		memcpy(state->readRecordBuf,
			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
		buffer = state->readRecordBuf + len;
		gotlen = len;

		do
		{
			/* Calculate pointer to beginning of next page */
			targetPagePtr += XLOG_BLCKSZ;

			/* Wait for the next page to become available */
			readOff = ReadPageInternal(state, targetPagePtr,
									   Min(total_len - gotlen + SizeOfXLogShortPHD,
										   XLOG_BLCKSZ));

			if (readOff == XLREAD_WOULDBLOCK)
				return XLREAD_WOULDBLOCK;
			else if (readOff < 0)
				goto err;

			Assert(SizeOfXLogShortPHD <= readOff);

			pageHeader = (XLogPageHeader) state->readBuf;

			/*
			 * If we were expecting a continuation record and got an
			 * "overwrite contrecord" flag, that means the continuation record
			 * was overwritten with a different record.  Restart the read by
			 * assuming the address to read is the location where we found
			 * this flag; but keep track of the LSN of the record we were
			 * reading, for later verification.
			 */
			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
			{
				state->overwrittenRecPtr = RecPtr;
				RecPtr = targetPagePtr;
				goto restart;
			}

			/* Check that the continuation on next page looks valid */
			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
			{
				report_invalid_record(state,
									  "there is no contrecord flag at %X/%X",
									  LSN_FORMAT_ARGS(RecPtr));
				goto err;
			}

			/*
			 * Cross-check that xlp_rem_len agrees with how much of the record
			 * we expect there to be left.
			 */
			if (pageHeader->xlp_rem_len == 0 ||
				total_len != (pageHeader->xlp_rem_len + gotlen))
			{
				report_invalid_record(state,
									  "invalid contrecord length %u (expected %lld) at %X/%X",
									  pageHeader->xlp_rem_len,
									  ((long long) total_len) - gotlen,
									  LSN_FORMAT_ARGS(RecPtr));
				goto err;
			}

			/* Append the continuation from this page to the buffer */
			pageHeaderSize = XLogPageHeaderSize(pageHeader);

			if (readOff < pageHeaderSize)
				readOff = ReadPageInternal(state, targetPagePtr,
										   pageHeaderSize);

			Assert(pageHeaderSize <= readOff);

			contdata = (char *) state->readBuf + pageHeaderSize;
			len = XLOG_BLCKSZ - pageHeaderSize;
			if (pageHeader->xlp_rem_len < len)
				len = pageHeader->xlp_rem_len;

			if (readOff < pageHeaderSize + len)
				readOff = ReadPageInternal(state, targetPagePtr,
										   pageHeaderSize + len);

			memcpy(buffer, (char *) contdata, len);
			buffer += len;
			gotlen += len;

			/* If we just reassembled the record header, validate it. */
			if (!gotheader)
			{
				record = (XLogRecord *) state->readRecordBuf;
				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
										   record, randAccess))
					goto err;
				gotheader = true;
			}
		} while (gotlen < total_len);

		Assert(gotheader);

		record = (XLogRecord *) state->readRecordBuf;
		if (!ValidXLogRecord(state, record, RecPtr))
			goto err;

		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
		state->DecodeRecPtr = RecPtr;
		state->NextRecPtr = targetPagePtr + pageHeaderSize
			+ MAXALIGN(pageHeader->xlp_rem_len);
	}
	else
	{
		/* Wait for the record data to become available */
		readOff = ReadPageInternal(state, targetPagePtr,
								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
		if (readOff == XLREAD_WOULDBLOCK)
			return XLREAD_WOULDBLOCK;
		else if (readOff < 0)
			goto err;

		/* Record does not cross a page boundary */
		if (!ValidXLogRecord(state, record, RecPtr))
			goto err;

		state->NextRecPtr = RecPtr + MAXALIGN(total_len);

		state->DecodeRecPtr = RecPtr;
	}

	/*
	 * Special processing if it's an XLOG SWITCH record
	 */
	if (record->xl_rmid == RM_XLOG_ID &&
		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
	{
		/* Pretend it extends to end of segment */
		state->NextRecPtr += state->segcxt.ws_segsize - 1;
		state->NextRecPtr -= XLogSegmentOffset(state->NextRecPtr, state->segcxt.ws_segsize);
	}

	if (DecodeXLogRecord(state, decoded, record, RecPtr, &errormsg))
	{
		/* Record the location of the next record. */
		decoded->next_lsn = state->NextRecPtr;

		/*
		 * If it's in the decode buffer, mark the decode buffer space as
		 * occupied.
		 */
		if (!decoded->oversized)
		{
			/* The new decode buffer head must be MAXALIGNed. */
			Assert(decoded->size == MAXALIGN(decoded->size));
			if ((char *) decoded == state->decode_buffer)
				state->decode_buffer_tail = state->decode_buffer + decoded->size;
			else
				state->decode_buffer_tail += decoded->size;
		}

		/* Insert it into the queue of decoded records. */
		Assert(state->decode_queue_tail != decoded);
		if (state->decode_queue_tail)
			state->decode_queue_tail->next = decoded;
		state->decode_queue_tail = decoded;
		if (!state->decode_queue_head)
			state->decode_queue_head = decoded;
		return XLREAD_SUCCESS;
	}
	else
		return XLREAD_FAIL;

err:
|
|
|
if (assembled)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We get here when a record that spans multiple pages needs to be
|
|
|
|
* assembled, but something went wrong -- perhaps a contrecord piece
|
|
|
|
* was lost. If caller is WAL replay, it will know where the aborted
|
|
|
|
* record was and where to direct followup WAL to be written, marking
|
|
|
|
* the next piece with XLP_FIRST_IS_OVERWRITE_CONTRECORD, which will
|
|
|
|
* in turn signal downstream WAL consumers that the broken WAL record
|
|
|
|
* is to be ignored.
|
|
|
|
*/
|
|
|
|
state->abortedRecPtr = RecPtr;
|
|
|
|
state->missingContrecPtr = targetPagePtr;
|
2022-09-08 10:25:20 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we got here without reporting an error, report one now so that
|
|
|
|
* XLogPrefetcherReadRecord() doesn't bring us back a second time and
|
|
|
|
* clobber the above state. Otherwise, the existing error takes
|
|
|
|
* precedence.
|
|
|
|
*/
|
|
|
|
if (!state->errormsg_buf[0])
|
|
|
|
report_invalid_record(state,
|
|
|
|
"missing contrecord at %X/%X",
|
|
|
|
LSN_FORMAT_ARGS(RecPtr));
|
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 16:21:51 +02:00
|
|
|
}
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
if (decoded && decoded->oversized)
|
|
|
|
pfree(decoded);
|
|
|
|
|
2013-01-16 20:12:53 +01:00
|
|
|
/*
|
2021-05-10 06:00:53 +02:00
|
|
|
* Invalidate the read state. We might read from a different source after
|
2016-03-30 23:56:13 +02:00
|
|
|
* failure.
|
2013-01-16 20:12:53 +01:00
|
|
|
*/
|
2016-03-30 23:56:13 +02:00
|
|
|
XLogReaderInvalReadState(state);
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
/*
|
|
|
|
* If an error was written to errmsg_buf, it'll be returned to the caller
|
|
|
|
* of XLogReadRecord() after all successfully decoded records from the
|
|
|
|
* read queue.
|
|
|
|
*/
|
|
|
|
|
|
|
|
return XLREAD_FAIL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to decode the next available record, and return it. The record will
|
|
|
|
* also be returned to XLogNextRecord(), which must be called to 'consume'
|
|
|
|
* each record.
|
|
|
|
*
|
|
|
|
* If nonblocking is true, may return NULL due to lack of data or WAL decoding
|
|
|
|
* space.
|
|
|
|
*/
|
|
|
|
DecodedXLogRecord *
|
|
|
|
XLogReadAhead(XLogReaderState *state, bool nonblocking)
|
|
|
|
{
|
|
|
|
XLogPageReadResult result;
|
|
|
|
|
|
|
|
if (state->errormsg_deferred)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
result = XLogDecodeNextRecord(state, nonblocking);
|
|
|
|
if (result == XLREAD_SUCCESS)
|
|
|
|
{
|
|
|
|
Assert(state->decode_queue_tail != NULL);
|
|
|
|
return state->decode_queue_tail;
|
|
|
|
}
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
return NULL;
|
2013-01-16 20:12:53 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-05-10 06:00:53 +02:00
|
|
|
* Read a single xlog page including at least [pageptr, reqLen] of valid data
|
|
|
|
* via the page_read() callback.
|
2013-01-16 20:12:53 +01:00
|
|
|
*
|
2022-03-18 05:45:04 +01:00
|
|
|
* Returns XLREAD_FAIL if the required page cannot be read for some
|
|
|
|
* reason; errormsg_buf is set in that case (unless the error occurs in the
|
|
|
|
* page_read callback).
|
|
|
|
*
|
|
|
|
* Returns XLREAD_WOULDBLOCK if the requested data can't be read without
|
|
|
|
* waiting. This can be returned only if the installed page_read callback
|
|
|
|
* respects the state->nonblocking flag, and cannot read the requested data
|
|
|
|
* immediately.
|
2013-01-16 20:12:53 +01:00
|
|
|
*
|
2021-05-10 06:00:53 +02:00
|
|
|
* We fetch the page from a reader-local cache if we know we have the required
|
|
|
|
* data and if there hasn't been any error since caching the data.
|
2013-01-16 20:12:53 +01:00
|
|
|
*/
|
2021-05-10 06:00:53 +02:00
|
|
|
static int
|
|
|
|
ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
|
2013-01-16 20:12:53 +01:00
|
|
|
{
|
2021-05-10 06:00:53 +02:00
|
|
|
int readLen;
|
2013-01-16 20:12:53 +01:00
|
|
|
uint32 targetPageOff;
|
|
|
|
XLogSegNo targetSegNo;
|
2021-05-10 06:00:53 +02:00
|
|
|
XLogPageHeader hdr;
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
Assert((pageptr % XLOG_BLCKSZ) == 0);
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
XLByteToSeg(pageptr, targetSegNo, state->segcxt.ws_segsize);
|
|
|
|
targetPageOff = XLogSegmentOffset(pageptr, state->segcxt.ws_segsize);
|
2013-01-16 20:12:53 +01:00
|
|
|
|
2021-05-10 06:00:53 +02:00
|
|
|
/* check whether we have all the requested data already */
|
|
|
|
if (targetSegNo == state->seg.ws_segno &&
|
|
|
|
targetPageOff == state->segoff && reqLen <= state->readLen)
|
|
|
|
return state->readLen;
|
2013-01-16 20:12:53 +01:00
|
|
|
|
	/*
	 * Invalidate contents of internal buffer before read attempt.  Just set
	 * the length to 0, rather than a full XLogReaderInvalReadState(), so we
	 * don't forget the segment we last successfully read.
	 */
	state->readLen = 0;
	/*
	 * Data is not in our buffer.
	 *
	 * Every time we actually read the segment, even if we looked at parts of
	 * it before, we need to do verification as the page_read callback might
	 * now be rereading data from a different source.
	 *
	 * Whenever switching to a new WAL segment, we read the first page of the
	 * file and validate its header, even if that's not where the target
	 * record is.  This is so that we can check the additional identification
	 * info that is present in the first page's "long" header.
	 */
	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
	{
		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;

		readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
										   state->currRecPtr,
										   state->readBuf);
		if (readLen == XLREAD_WOULDBLOCK)
			return XLREAD_WOULDBLOCK;
		else if (readLen < 0)
			goto err;

		/* we can be sure to have enough WAL available, we scrolled back */
		Assert(readLen == XLOG_BLCKSZ);

		if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
										  state->readBuf))
			goto err;
	}

	/*
	 * First, read the requested data length, but at least a short page header
	 * so that we can validate it.
	 */
	readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
									   state->currRecPtr,
									   state->readBuf);
	if (readLen == XLREAD_WOULDBLOCK)
		return XLREAD_WOULDBLOCK;
	else if (readLen < 0)
		goto err;

	Assert(readLen <= XLOG_BLCKSZ);

	/* Do we have enough data to check the header length? */
	if (readLen <= SizeOfXLogShortPHD)
		goto err;

	Assert(readLen >= reqLen);

	hdr = (XLogPageHeader) state->readBuf;

	/* still not enough */
	if (readLen < XLogPageHeaderSize(hdr))
	{
		readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
										   state->currRecPtr,
										   state->readBuf);
		if (readLen == XLREAD_WOULDBLOCK)
			return XLREAD_WOULDBLOCK;
		else if (readLen < 0)
			goto err;
	}

	/*
	 * Now that we know we have the full header, validate it.
	 */
	if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
		goto err;

	/* update read state information */
	state->seg.ws_segno = targetSegNo;
	state->segoff = targetPageOff;
	state->readLen = readLen;

	return readLen;

err:
	XLogReaderInvalReadState(state);

	return XLREAD_FAIL;
}
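
/*
 * Illustrative sketch only (hypothetical caller, not part of this file's
 * API): decoding code typically calls ReadPageInternal() twice for the same
 * page, first for at least the record header, then, once xl_tot_len is
 * known, for the full portion of the record on that page.  The cache check
 * at the top of ReadPageInternal() makes the second call cheap when the
 * first already fetched enough data.  The names targetPagePtr, targetRecOff
 * and total_len below are assumptions for the sketch:
 *
 *		readLen = ReadPageInternal(state, targetPagePtr,
 *								   Min(targetRecOff + SizeOfXLogRecord,
 *									   XLOG_BLCKSZ));
 *		if (readLen < 0)
 *			handle XLREAD_FAIL / XLREAD_WOULDBLOCK;
 *		... determine total_len from the record header ...
 *		readLen = ReadPageInternal(state, targetPagePtr,
 *								   Min(targetRecOff + total_len, XLOG_BLCKSZ));
 */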

/*
 * Invalidate the xlogreader's read state to force a re-read.
 */
static void
XLogReaderInvalReadState(XLogReaderState *state)
{
	state->seg.ws_segno = 0;
	state->segoff = 0;
	state->readLen = 0;
}

/*
 * Validate an XLOG record header.
 *
 * This is just a convenience subroutine to avoid duplicated code in
 * XLogReadRecord.  It's not intended for use from anywhere else.
 */
static bool
ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
					  XLogRecPtr PrevRecPtr, XLogRecord *record,
					  bool randAccess)
{
	if (record->xl_tot_len < SizeOfXLogRecord)
	{
		report_invalid_record(state,
							  "invalid record length at %X/%X: expected at least %u, got %u",
							  LSN_FORMAT_ARGS(RecPtr),
							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
		return false;
	}
	if (!RmgrIdIsValid(record->xl_rmid))
	{
		report_invalid_record(state,
							  "invalid resource manager ID %u at %X/%X",
							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
		return false;
	}
	if (randAccess)
	{
		/*
		 * We can't exactly verify the prev-link, but surely it should be less
		 * than the record's own address.
		 */
		if (!(record->xl_prev < RecPtr))
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  LSN_FORMAT_ARGS(record->xl_prev),
								  LSN_FORMAT_ARGS(RecPtr));
			return false;
		}
	}
	else
	{
		/*
		 * Record's prev-link should exactly match our previous location. This
		 * check guards against torn WAL pages where a stale but valid-looking
		 * WAL record starts on a sector boundary.
		 */
		if (record->xl_prev != PrevRecPtr)
		{
			report_invalid_record(state,
								  "record with incorrect prev-link %X/%X at %X/%X",
								  LSN_FORMAT_ARGS(record->xl_prev),
								  LSN_FORMAT_ARGS(RecPtr));
			return false;
		}
	}

	return true;
}

/*
 * CRC-check an XLOG record.  We do not believe the contents of an XLOG
 * record (other than to the minimal extent of computing the amount of
 * data to read in) until we've checked the CRCs.
 *
 * We assume all of the record (that is, xl_tot_len bytes) has been read
 * into memory at *record.  Also, ValidXLogRecordHeader() has accepted the
 * record's header, which means in particular that xl_tot_len is at least
 * SizeOfXLogRecord.
 */
static bool
ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
{
	pg_crc32c	crc;

	/* Calculate the CRC */
	INIT_CRC32C(crc);
	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
	/* include the record header last */
	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
	FIN_CRC32C(crc);

	if (!EQ_CRC32C(record->xl_crc, crc))
|
2013-01-16 20:12:53 +01:00
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"incorrect resource manager data checksum in record at %X/%X",
|
2021-02-23 10:14:38 +01:00
|
|
|
LSN_FORMAT_ARGS(recptr));
|
2013-01-16 20:12:53 +01:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
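
/*
 * Layout sketch (illustrative only) of what ValidXLogRecord() sums.  The
 * payload that follows the fixed-size record header is accumulated first,
 * then the header itself up to, but not including, the xl_crc field:
 *
 *		+------------------------------+------------------------------------+
 *		| XLogRecord header ... xl_crc | payload (xl_tot_len -              |
 *		|                              |          SizeOfXLogRecord bytes)   |
 *		+------------------------------+------------------------------------+
 *		  second COMP_CRC32C             first COMP_CRC32C
 *		  (stops at offsetof(xl_crc))
 */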

/*
 * Validate a page header.
 *
 * Check if 'phdr' is valid as the header of the XLog page at position
 * 'recptr'.
 */
bool
XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
							 char *phdr)
{
	XLogSegNo	segno;
	int32		offset;
	XLogPageHeader hdr = (XLogPageHeader) phdr;

	Assert((recptr % XLOG_BLCKSZ) == 0);

	XLByteToSeg(recptr, segno, state->segcxt.ws_segsize);
	offset = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);

	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		report_invalid_record(state,
							  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
							  hdr->xlp_magic,
							  fname,
							  LSN_FORMAT_ARGS(recptr),
							  offset);
		return false;
	}

	if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		report_invalid_record(state,
							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
							  hdr->xlp_info,
							  fname,
							  LSN_FORMAT_ARGS(recptr),
							  offset);
		return false;
	}

	if (hdr->xlp_info & XLP_LONG_HEADER)
	{
		XLogLongPageHeader longhdr = (XLogLongPageHeader) hdr;

		if (state->system_identifier &&
			longhdr->xlp_sysid != state->system_identifier)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: WAL file database system identifier is %llu, pg_control database system identifier is %llu",
								  (unsigned long long) longhdr->xlp_sysid,
								  (unsigned long long) state->system_identifier);
			return false;
		}
		else if (longhdr->xlp_seg_size != state->segcxt.ws_segsize)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: incorrect segment size in page header");
			return false;
		}
		else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
		{
			report_invalid_record(state,
								  "WAL file is from different database system: incorrect XLOG_BLCKSZ in page header");
			return false;
		}
	}
	else if (offset == 0)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		/* hmm, first page of file doesn't have a long header? */
		report_invalid_record(state,
							  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
							  hdr->xlp_info,
							  fname,
							  LSN_FORMAT_ARGS(recptr),
							  offset);
		return false;
	}

	/*
	 * Check that the address on the page agrees with what we expected. This
	 * check typically fails when an old WAL segment is recycled, and hasn't
	 * yet been overwritten with new data.
	 */
	if (hdr->xlp_pageaddr != recptr)
	{
		char		fname[MAXFNAMELEN];

		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

		report_invalid_record(state,
							  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
							  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
							  fname,
							  LSN_FORMAT_ARGS(recptr),
							  offset);
		return false;
	}

	/*
	 * Since child timelines are always assigned a TLI greater than their
	 * immediate parent's TLI, we should never see TLI go backwards across
	 * successive pages of a consistent WAL sequence.
	 *
	 * Sometimes we re-read a segment that's already been (partially) read. So
	 * we only verify TLIs for pages that are later than the last remembered
	 * LSN.
	 */
	if (recptr > state->latestPagePtr)
	{
		if (hdr->xlp_tli < state->latestPageTLI)
		{
			char		fname[MAXFNAMELEN];

			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);

			report_invalid_record(state,
								  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, LSN %X/%X, offset %u",
								  hdr->xlp_tli,
								  state->latestPageTLI,
								  fname,
								  LSN_FORMAT_ARGS(recptr),
								  offset);
			return false;
		}
	}
	state->latestPagePtr = recptr;
	state->latestPageTLI = hdr->xlp_tli;

	return true;
}
|
|
|
|
|
/*
 * Forget about an error produced by XLogReaderValidatePageHeader().
 */
void
XLogReaderResetError(XLogReaderState *state)
{
	state->errormsg_buf[0] = '\0';
	state->errormsg_deferred = false;
}

/*
 * Find the first record with an lsn >= RecPtr.
 *
 * This is different from XLogBeginRead() in that RecPtr doesn't need to point
 * to a valid record boundary.  Useful for checking whether RecPtr is a valid
 * xlog address for reading, and to find the first valid address after some
 * address when dumping records for debugging purposes.
 *
 * This positions the reader, like XLogBeginRead(), so that the next call to
 * XLogReadRecord() will read the next valid record.
 */
XLogRecPtr
XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
{
	XLogRecPtr	tmpRecPtr;
	XLogRecPtr	found = InvalidXLogRecPtr;
	XLogPageHeader header;
	char	   *errormsg;

	Assert(!XLogRecPtrIsInvalid(RecPtr));

	/* Make sure ReadPageInternal() can't return XLREAD_WOULDBLOCK. */
	state->nonblocking = false;

	/*
	 * Skip over potential continuation data, keeping in mind that it may
	 * span multiple pages.
	 */
	tmpRecPtr = RecPtr;
	while (true)
	{
		XLogRecPtr	targetPagePtr;
		int			targetRecOff;
		uint32		pageHeaderSize;
		int			readLen;

		/*
		 * Compute targetRecOff.  It should typically be equal to or greater
		 * than the short page-header size, since a valid record can't start
		 * anywhere before that, except when the caller has explicitly
		 * specified an offset that falls somewhere in there, or when we are
		 * skipping a multi-page continuation record.  It doesn't matter
		 * though, because ReadPageInternal() is prepared to handle that and
		 * will read at least a short page-header's worth of data.
		 */
		targetRecOff = tmpRecPtr % XLOG_BLCKSZ;

		/* scroll back to page boundary */
		targetPagePtr = tmpRecPtr - targetRecOff;

		/* Read the page containing the record */
		readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
		if (readLen < 0)
			goto err;

		header = (XLogPageHeader) state->readBuf;

		pageHeaderSize = XLogPageHeaderSize(header);

		/* make sure we have enough data for the page header */
		readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
		if (readLen < 0)
			goto err;

		/* skip over potential continuation data */
		if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
		{
			/*
			 * If the length of the remaining continuation data is more than
			 * what can fit in this page, the continuation record crosses
			 * over this page.  Read the next page and try again; xlp_rem_len
			 * in the next page header will contain the remaining length of
			 * the continuation data.
			 *
			 * Note that record headers are MAXALIGN'ed.
			 */
			if (MAXALIGN(header->xlp_rem_len) >= (XLOG_BLCKSZ - pageHeaderSize))
				tmpRecPtr = targetPagePtr + XLOG_BLCKSZ;
			else
			{
				/*
				 * The previous continuation record ends in this page.  Set
				 * tmpRecPtr to point to the first valid record.
				 */
				tmpRecPtr = targetPagePtr + pageHeaderSize
					+ MAXALIGN(header->xlp_rem_len);
				break;
			}
		}
		else
		{
			tmpRecPtr = targetPagePtr + pageHeaderSize;
			break;
		}
	}

	/*
	 * We now know that tmpRecPtr is an address pointing to a valid
	 * XLogRecord, because either we're at the first record after the
	 * beginning of a page or we just jumped over the remaining data of a
	 * continuation record.
	 */
	XLogBeginRead(state, tmpRecPtr);
	while (XLogReadRecord(state, &errormsg) != NULL)
	{
		/* past the record we've found, break out */
		if (RecPtr <= state->ReadRecPtr)
		{
			/* Rewind the reader to the beginning of the last record. */
			found = state->ReadRecPtr;
			XLogBeginRead(state, found);
			return found;
		}
	}

err:
	XLogReaderInvalReadState(state);

	return InvalidXLogRecPtr;
}
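The page-boundary and continuation-skipping arithmetic above can be illustrated in isolation. This is a standalone sketch, not PostgreSQL API: `BLCKSZ`, `ALIGNOF`, and the function names are local stand-ins for `XLOG_BLCKSZ`, `MAXIMUM_ALIGNOF`, and the inline logic of `XLogFindNextRecord()`.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; PostgreSQL's defaults are XLOG_BLCKSZ = 8192
 * and MAXIMUM_ALIGNOF = 8.  Nothing here is PostgreSQL API. */
#define BLCKSZ		8192
#define ALIGNOF		8
#define ALIGN_UP(x) (((uint64_t) (x) + (ALIGNOF - 1)) & ~(uint64_t) (ALIGNOF - 1))

/* Split an arbitrary WAL pointer into its page start and in-page offset,
 * mirroring the "scroll back to page boundary" step. */
static void
page_of(uint64_t ptr, uint64_t *page_start, uint32_t *rec_off)
{
	*rec_off = (uint32_t) (ptr % BLCKSZ);
	*page_start = ptr - *rec_off;
}

/* Given a page whose first data continues a record from the previous page,
 * return the pointer of the first whole record: either just past the
 * MAXALIGN'ed continuation data, or the start of the next page when the
 * remainder fills the rest of this page. */
static uint64_t
first_record_after(uint64_t page_start, uint32_t page_header_size,
				   uint32_t xlp_rem_len)
{
	if (ALIGN_UP(xlp_rem_len) >= BLCKSZ - page_header_size)
		return page_start + BLCKSZ; /* keep skipping on the next page */
	return page_start + page_header_size + ALIGN_UP(xlp_rem_len);
}
```

For example, a pointer 100 bytes into the fourth page maps back to page start 24576, and a 100-byte remainder after a 24-byte page header puts the first record at offset 128 into the page.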

/*
 * Helper function to ease writing of XLogReaderRoutine->page_read callbacks.
 * If this function is used, the caller must supply a segment_open callback
 * in 'state', as that is used here.
 *
 * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
 * fetched from timeline 'tli'.
 *
 * Returns true if it succeeded, false if an error occurred, in which case
 * 'errinfo' receives error details.
 *
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,
		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
		WALReadError *errinfo)
{
	char	   *p;
	XLogRecPtr	recptr;
	Size		nbytes;

	p = buf;
	recptr = startptr;
	nbytes = count;

	while (nbytes > 0)
	{
		uint32		startoff;
		int			segbytes;
		int			readbytes;

		startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);

		/*
		 * If the data we want is not in a segment we have open, close what
		 * we have (if anything) and open the next one, using the caller's
		 * provided segment_open callback.
		 */
		if (state->seg.ws_file < 0 ||
			!XLByteInSeg(recptr, state->seg.ws_segno, state->segcxt.ws_segsize) ||
			tli != state->seg.ws_tli)
		{
			XLogSegNo	nextSegNo;

			if (state->seg.ws_file >= 0)
				state->routine.segment_close(state);

			XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
			state->routine.segment_open(state, nextSegNo, &tli);

			/* This shouldn't happen -- indicates a bug in segment_open */
			Assert(state->seg.ws_file >= 0);

			/* Update the current segment info. */
			state->seg.ws_tli = tli;
			state->seg.ws_segno = nextSegNo;
		}

		/* How many bytes are within this segment? */
		if (nbytes > (state->segcxt.ws_segsize - startoff))
			segbytes = state->segcxt.ws_segsize - startoff;
		else
			segbytes = nbytes;

#ifndef FRONTEND
		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
#endif

		/* Reset errno first; eases reporting non-errno-affecting errors */
		errno = 0;
		readbytes = pg_pread(state->seg.ws_file, p, segbytes, (off_t) startoff);

#ifndef FRONTEND
		pgstat_report_wait_end();
#endif

		if (readbytes <= 0)
		{
			errinfo->wre_errno = errno;
			errinfo->wre_req = segbytes;
			errinfo->wre_read = readbytes;
			errinfo->wre_off = startoff;
			errinfo->wre_seg = state->seg;
			return false;
		}

		/* Update state for read */
		recptr += readbytes;
		nbytes -= readbytes;
		p += readbytes;
	}

	return true;
}
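The loop above splits one logical read into per-segment pieces whenever it crosses a segment boundary. The same arithmetic can be shown in a self-contained sketch; `SEG_SIZE`, `ReadPiece`, and `split_read` are local illustrative names (PostgreSQL's default WAL segment size is 16 MB), not PostgreSQL API.

```c
#include <assert.h>
#include <stdint.h>

/* Tiny illustrative segment size; the real default is 16 MB. */
#define SEG_SIZE 16

typedef struct
{
	uint64_t	segno;			/* which segment file to read */
	uint32_t	offset;			/* offset within that segment */
	uint32_t	len;			/* bytes to read from that segment */
} ReadPiece;

/*
 * Split a logical read of 'count' bytes starting at logical offset 'start'
 * into per-segment pieces, using the same arithmetic WALRead() performs via
 * XLogSegmentOffset() and XLByteToSeg().
 */
static int
split_read(uint64_t start, uint64_t count, ReadPiece *out, int max_pieces)
{
	int			n = 0;

	while (count > 0 && n < max_pieces)
	{
		uint32_t	startoff = (uint32_t) (start % SEG_SIZE);
		uint64_t	segbytes = SEG_SIZE - startoff;	/* bytes left in segment */

		if (segbytes > count)
			segbytes = count;

		out[n].segno = start / SEG_SIZE;
		out[n].offset = startoff;
		out[n].len = (uint32_t) segbytes;
		n++;

		start += segbytes;
		count -= segbytes;
	}
	return n;
}
```

With 16-byte segments, a 20-byte read starting at logical offset 10 becomes two pieces: 6 bytes at offset 10 of segment 0, then 14 bytes at offset 0 of segment 1.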

/* ----------------------------------------
 * Functions for decoding the data and block references in a record.
 * ----------------------------------------
 */

/*
 * Private function to reset the state, forgetting all decoded records, if we
 * are asked to move to a new read position.
 */
static void
ResetDecoder(XLogReaderState *state)
{
	DecodedXLogRecord *r;

	/* Reset the decoded record queue, freeing any oversized records. */
	while ((r = state->decode_queue_head) != NULL)
	{
		state->decode_queue_head = r->next;
		if (r->oversized)
			pfree(r);
	}
	state->decode_queue_tail = NULL;
	state->decode_queue_head = NULL;
	state->record = NULL;

	/* Reset the decode buffer to empty. */
	state->decode_buffer_tail = state->decode_buffer;
	state->decode_buffer_head = state->decode_buffer;

	/* Clear error state. */
	state->errormsg_buf[0] = '\0';
	state->errormsg_deferred = false;
}

/*
 * Compute the maximum amount of space that could be required to decode a
 * record, given xl_tot_len from the record's header.  This is the amount of
 * output buffer space that we need to decode a record, though we might not
 * finish up using it all.
 *
 * This computation is pessimistic and assumes the maximum possible number of
 * blocks, due to lack of better information.
 */
size_t
DecodeXLogRecordRequiredSpace(size_t xl_tot_len)
{
	size_t		size = 0;

	/* Account for the fixed size part of the decoded record struct. */
	size += offsetof(DecodedXLogRecord, blocks[0]);
	/* Account for the flexible blocks array of maximum possible size. */
	size += sizeof(DecodedBkpBlock) * (XLR_MAX_BLOCK_ID + 1);
	/* Account for all the raw main and block data. */
	size += xl_tot_len;
	/* We might insert padding before main_data. */
	size += (MAXIMUM_ALIGNOF - 1);
	/* We might insert padding before each block's data. */
	size += (MAXIMUM_ALIGNOF - 1) * (XLR_MAX_BLOCK_ID + 1);
	/* We might insert padding at the end. */
	size += (MAXIMUM_ALIGNOF - 1);

	return size;
}
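The worst-case accounting above can be checked with a toy version. This sketch uses local stand-in constants: `ALIGNOF` for `MAXIMUM_ALIGNOF` (typically 8), `MAX_BLOCK_ID` for `XLR_MAX_BLOCK_ID` (32), and made-up sizes for the struct parts, so only the shape of the computation matches the real function.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; actual values come from PostgreSQL headers. */
#define ALIGNOF			8	/* MAXIMUM_ALIGNOF */
#define MAX_BLOCK_ID	32	/* XLR_MAX_BLOCK_ID */
#define FIXED_PART		64	/* stand-in: offsetof(DecodedXLogRecord, blocks[0]) */
#define BLOCK_PART		128 /* stand-in: sizeof(DecodedBkpBlock) */

/*
 * Same shape as DecodeXLogRecordRequiredSpace(): fixed struct part, the
 * worst-case blocks array, the raw payload, plus worst-case alignment
 * padding before main_data, before each block's data, and at the end.
 */
static size_t
required_space(size_t xl_tot_len)
{
	size_t		size = 0;

	size += FIXED_PART;
	size += BLOCK_PART * (MAX_BLOCK_ID + 1);
	size += xl_tot_len;
	size += (ALIGNOF - 1);						/* before main_data */
	size += (ALIGNOF - 1) * (MAX_BLOCK_ID + 1); /* before each block */
	size += (ALIGNOF - 1);						/* at the end */
	return size;
}
```

Note that the result grows by exactly one byte per byte of `xl_tot_len`: all the padding terms are fixed worst-case constants, so oversized-record handling only has to look at the record's total length.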

/*
 * Decode a record.  "decoded" must point to a MAXALIGNed memory area that
 * has space for at least DecodeXLogRecordRequiredSpace(record) bytes.  On
 * success, decoded->size contains the actual space occupied by the decoded
 * record, which may turn out to be less.
 *
 * Only decoded->oversized member must be initialized already, and will not
 * be modified.  Other members will be initialized as required.
 *
 * On error, a human-readable error message is returned in *errormsg, and
 * the return value is false.
 */
bool
DecodeXLogRecord(XLogReaderState *state,
				 DecodedXLogRecord *decoded,
				 XLogRecord *record,
				 XLogRecPtr lsn,
				 char **errormsg)
{
|
|
|
|
/*
|
|
|
|
* read next _size bytes from record buffer, but check for overrun first.
|
|
|
|
*/
|
|
|
|
#define COPY_HEADER_FIELD(_dst, _size) \
|
|
|
|
do { \
|
|
|
|
if (remaining < _size) \
|
|
|
|
goto shortdata_err; \
|
|
|
|
memcpy(_dst, ptr, _size); \
|
|
|
|
ptr += _size; \
|
|
|
|
remaining -= _size; \
|
|
|
|
} while(0)
|
|
|
|
|
|
|
|
char *ptr;
|
2022-03-18 05:45:04 +01:00
|
|
|
char *out;
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
uint32 remaining;
|
|
|
|
uint32 datatotal;
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
    /*
     * Revamp the WAL record format (2014-11-20).
     *
     * Each WAL record now carries information about the modified relation
     * and block(s) in a standardized format. That makes it easier to write
     * tools that need that information, like pg_rewind, prefetching the
     * blocks to speed up recovery, etc.
     *
     * There's a whole new API for building WAL records, replacing the
     * XLogRecData chains used previously. The new API consists of
     * XLogRegister* functions, which are called for each buffer and chunk of
     * data that is added to the record. The new API also gives more control
     * over when a full-page image is written, by passing flags to the
     * XLogRegisterBuffer function.
     *
     * This also simplifies the XLogReadBufferForRedo() calls. The function
     * can dig the relation and block number out of the WAL record, so they
     * no longer need to be passed as arguments.
     *
     * For the convenience of redo routines, XLogReader now dissects each WAL
     * record after reading it, copying the main data part and the per-block
     * data into MAXALIGNed buffers. The data chunks are not aligned within
     * the WAL record, but the redo routines can assume that the pointers
     * returned by the XLogRecGet* functions are. Redo routines are now
     * passed the XLogReaderState, which contains the record in the
     * already-dissected format, instead of the plain XLogRecord.
     *
     * The new record format also makes the fixed-size XLogRecord header
     * smaller, by removing the xl_len field. The length of the "main data"
     * portion is now stored at the end of the WAL record, and there's a
     * separate header after XLogRecord for it. The alignment padding at the
     * end of XLogRecord is also removed. This compensates for the fact that
     * the new format would otherwise be bulkier than the old one.
     *
     * Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro
     * Herrera, Fujii Masao.
     */
    RelFileLocator *rlocator = NULL;
    uint8       block_id;
    decoded->header = *record;
    decoded->lsn = lsn;
    decoded->next = NULL;
    decoded->record_origin = InvalidRepOriginId;
    decoded->toplevel_xid = InvalidTransactionId;
    decoded->main_data = NULL;
    decoded->main_data_len = 0;
    decoded->max_block_id = -1;
    ptr = (char *) record;
    ptr += SizeOfXLogRecord;
    remaining = record->xl_tot_len - SizeOfXLogRecord;

    /* Decode the headers */
    datatotal = 0;
    while (remaining > datatotal)
    {
        COPY_HEADER_FIELD(&block_id, sizeof(uint8));

        if (block_id == XLR_BLOCK_ID_DATA_SHORT)
        {
            /* XLogRecordDataHeaderShort */
            uint8       main_data_len;

            COPY_HEADER_FIELD(&main_data_len, sizeof(uint8));
            decoded->main_data_len = main_data_len;
            datatotal += main_data_len;
            break;              /* by convention, the main data fragment is
                                 * always last */
        }
        else if (block_id == XLR_BLOCK_ID_DATA_LONG)
        {
            /* XLogRecordDataHeaderLong */
            uint32      main_data_len;

            COPY_HEADER_FIELD(&main_data_len, sizeof(uint32));
            decoded->main_data_len = main_data_len;
            datatotal += main_data_len;
            break;              /* by convention, the main data fragment is
                                 * always last */
        }
        /*
         * Replication progress tracking infrastructure (2015-04-29).
         *
         * When implementing a replication solution on top of logical
         * decoding, two related problems exist:
         *
         * - How to safely keep track of replication progress
         * - How to change replication behavior based on the origin of a row,
         *   e.g. to avoid loops in bi-directional replication setups
         *
         * The solution to these problems, as implemented here, consists of
         * three parts:
         *
         * 1) 'replication origins', which identify nodes in a replication
         *    setup.
         * 2) 'replication progress tracking', which remembers, for each
         *    replication origin, how far replay has progressed in an
         *    efficient and crash-safe manner.
         * 3) The ability to filter out changes performed at the behest of a
         *    replication origin during logical decoding; this allows complex
         *    replication topologies, e.g. by filtering all replayed changes
         *    out.
         *
         * Most of this could also be implemented in "userspace", e.g. by
         * inserting additional rows containing origin information, but that
         * ends up being much less efficient and more complicated. We don't
         * want to require various replication solutions to reimplement this
         * logic independently. The infrastructure is intended to be generic
         * enough to be reusable.
         *
         * This infrastructure also replaces the 'nodeid' infrastructure of
         * commit timestamps. It is intended to provide all the former
         * capabilities, except that there are only 2^16 different origins;
         * but now they integrate with logical decoding. Additionally, more
         * functionality is accessible via SQL. Since the commit timestamp
         * infrastructure was also introduced in 9.5 (commit 73c986add),
         * changing the API is not a problem.
         *
         * For now, the number of origins for which replication progress can
         * be tracked simultaneously is determined by the
         * max_replication_slots GUC. That GUC is not a perfect match for
         * configuring this, but there doesn't seem to be sufficient reason
         * to introduce a separate new one. Bumps both catversion and WAL
         * page magic.
         *
         * Author: Andres Freund, with contributions from Petr Jelinek and
         * Craig Ringer. Reviewed-by: Heikki Linnakangas, Petr Jelinek,
         * Robert Haas, Steve Singer.
         * Discussion: 20150216002155.GI15326@awork2.anarazel.de,
         * 20140923182422.GA15776@alap3.anarazel.de,
         * 20131114172632.GE7522@alap2.anarazel.de
         */
        else if (block_id == XLR_BLOCK_ID_ORIGIN)
        {
            COPY_HEADER_FIELD(&decoded->record_origin, sizeof(RepOriginId));
        }
        else if (block_id == XLR_BLOCK_ID_TOPLEVEL_XID)
        {
            COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
        }
        else if (block_id <= XLR_MAX_BLOCK_ID)
        {
            /* XLogRecordBlockHeader */
            DecodedBkpBlock *blk;
            uint8       fork_flags;

            /* mark any intervening block IDs as not in use */
            for (int i = decoded->max_block_id + 1; i < block_id; ++i)
                decoded->blocks[i].in_use = false;

            if (block_id <= decoded->max_block_id)
            {
                report_invalid_record(state,
                                      "out-of-order block_id %u at %X/%X",
                                      block_id,
                                      LSN_FORMAT_ARGS(state->ReadRecPtr));
                goto err;
            }
            decoded->max_block_id = block_id;

            blk = &decoded->blocks[block_id];
            blk->in_use = true;
            blk->apply_image = false;

            COPY_HEADER_FIELD(&fork_flags, sizeof(uint8));
            blk->forknum = fork_flags & BKPBLOCK_FORK_MASK;
            blk->flags = fork_flags;
            blk->has_image = ((fork_flags & BKPBLOCK_HAS_IMAGE) != 0);
            blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);

            blk->prefetch_buffer = InvalidBuffer;

            COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
            /* cross-check that the HAS_DATA flag is set iff data_length > 0 */
            if (blk->has_data && blk->data_len == 0)
            {
                report_invalid_record(state,
                                      "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
                                      LSN_FORMAT_ARGS(state->ReadRecPtr));
                goto err;
            }
            if (!blk->has_data && blk->data_len != 0)
            {
                report_invalid_record(state,
                                      "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
                                      (unsigned int) blk->data_len,
                                      LSN_FORMAT_ARGS(state->ReadRecPtr));
                goto err;
            }
            datatotal += blk->data_len;

            if (blk->has_image)
            {
|
				COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16));
				COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16));
				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));

				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);

				if (BKPIMAGE_COMPRESSED(blk->bimg_info))
				{
					if (blk->bimg_info & BKPIMAGE_HAS_HOLE)
						COPY_HEADER_FIELD(&blk->hole_length, sizeof(uint16));
					else
						blk->hole_length = 0;
				}
				else
					blk->hole_length = BLCKSZ - blk->bimg_len;
				datatotal += blk->bimg_len;

				/*
				 * cross-check that hole_offset > 0, hole_length > 0 and
				 * bimg_len < BLCKSZ if the HAS_HOLE flag is set.
				 */
				if ((blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					(blk->hole_offset == 0 ||
					 blk->hole_length == 0 ||
					 blk->bimg_len == BLCKSZ))
				{
					report_invalid_record(state,
										  "BKPIMAGE_HAS_HOLE set, but hole offset %u length %u block image length %u at %X/%X",
										  (unsigned int) blk->hole_offset,
										  (unsigned int) blk->hole_length,
										  (unsigned int) blk->bimg_len,
										  LSN_FORMAT_ARGS(state->ReadRecPtr));
					goto err;
				}
				/*
				 * cross-check that hole_offset == 0 and hole_length == 0 if
				 * the HAS_HOLE flag is not set.
				 */
				if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
					(blk->hole_offset != 0 || blk->hole_length != 0))
				{
					report_invalid_record(state,
										  "BKPIMAGE_HAS_HOLE not set, but hole offset %u length %u at %X/%X",
										  (unsigned int) blk->hole_offset,
										  (unsigned int) blk->hole_length,
										  LSN_FORMAT_ARGS(state->ReadRecPtr));
					goto err;
				}
				/*
				 * Cross-check that bimg_len < BLCKSZ if it is compressed.
				 */
				if (BKPIMAGE_COMPRESSED(blk->bimg_info) &&
					blk->bimg_len == BLCKSZ)
				{
					report_invalid_record(state,
										  "BKPIMAGE_COMPRESSED set, but block image length %u at %X/%X",
										  (unsigned int) blk->bimg_len,
										  LSN_FORMAT_ARGS(state->ReadRecPtr));
					goto err;
				}
/*
|
2021-06-29 04:17:55 +02:00
|
|
|
* cross-check that bimg_len = BLCKSZ if neither HAS_HOLE is
|
|
|
|
* set nor COMPRESSED().
|
2015-03-11 07:52:24 +01:00
|
|
|
*/
|
|
|
|
if (!(blk->bimg_info & BKPIMAGE_HAS_HOLE) &&
|
2021-06-29 04:17:55 +02:00
|
|
|
!BKPIMAGE_COMPRESSED(blk->bimg_info) &&
|
2015-03-11 07:52:24 +01:00
|
|
|
blk->bimg_len != BLCKSZ)
|
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
2021-06-29 04:17:55 +02:00
|
|
|
"neither BKPIMAGE_HAS_HOLE nor BKPIMAGE_COMPRESSED set, but block image length is %u at %X/%X",
|
2015-03-11 07:52:24 +01:00
|
|
|
(unsigned int) blk->data_len,
|
2021-02-23 10:14:38 +01:00
|
|
|
LSN_FORMAT_ARGS(state->ReadRecPtr));
|
2015-03-11 07:52:24 +01:00
|
|
|
goto err;
|
|
|
|
}
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
}
|
|
|
|
if (!(fork_flags & BKPBLOCK_SAME_REL))
|
|
|
|
{
|
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
COPY_HEADER_FIELD(&blk->rlocator, sizeof(RelFileLocator));
|
|
|
|
rlocator = &blk->rlocator;
|
2014-11-20 16:56:26 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2022-07-06 17:39:09 +02:00
|
|
|
if (rlocator == NULL)
|
2014-11-20 16:56:26 +01:00
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"BKPBLOCK_SAME_REL set but no previous rel at %X/%X",
|
2021-02-23 10:14:38 +01:00
|
|
|
LSN_FORMAT_ARGS(state->ReadRecPtr));
|
2014-11-20 16:56:26 +01:00
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
2022-07-06 17:39:09 +02:00
|
|
|
blk->rlocator = *rlocator;
|
2014-11-20 16:56:26 +01:00
|
|
|
}
|
|
|
|
COPY_HEADER_FIELD(&blk->blkno, sizeof(BlockNumber));
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
report_invalid_record(state,
|
|
|
|
"invalid block_id %u at %X/%X",
|
2021-02-23 10:14:38 +01:00
|
|
|
block_id, LSN_FORMAT_ARGS(state->ReadRecPtr));
|
2014-11-20 16:56:26 +01:00
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (remaining != datatotal)
|
|
|
|
goto shortdata_err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, we've parsed the fragment headers, and verified that the total
|
|
|
|
* length of the payload in the fragments is equal to the amount of data
|
2022-03-18 05:45:04 +01:00
|
|
|
* left. Copy the data of each fragment to contiguous space after the
|
|
|
|
* blocks array, inserting alignment padding before the data fragments so
|
|
|
|
* they can be cast to struct pointers by REDO routines.
|
2014-11-20 16:56:26 +01:00
|
|
|
*/
|
2022-03-18 05:45:04 +01:00
|
|
|
out = ((char *) decoded) +
|
|
|
|
offsetof(DecodedXLogRecord, blocks) +
|
|
|
|
sizeof(decoded->blocks[0]) * (decoded->max_block_id + 1);
|
2014-11-20 16:56:26 +01:00
|
|
|
|
|
|
|
/* block data first */
|
2022-03-18 05:45:04 +01:00
|
|
|
for (block_id = 0; block_id <= decoded->max_block_id; block_id++)
	{
		DecodedBkpBlock *blk = &decoded->blocks[block_id];

		if (!blk->in_use)
			continue;

		Assert(blk->has_image || !blk->apply_image);

		if (blk->has_image)
		{
			/* no need to align image */
			blk->bkp_image = out;
			memcpy(out, ptr, blk->bimg_len);
			ptr += blk->bimg_len;
			out += blk->bimg_len;
		}
		if (blk->has_data)
		{
			out = (char *) MAXALIGN(out);
			blk->data = out;
			memcpy(blk->data, ptr, blk->data_len);
			ptr += blk->data_len;
			out += blk->data_len;
		}
	}

	/* and finally, the main data */
	if (decoded->main_data_len > 0)
	{
		out = (char *) MAXALIGN(out);
		decoded->main_data = out;
		memcpy(decoded->main_data, ptr, decoded->main_data_len);
		ptr += decoded->main_data_len;
		out += decoded->main_data_len;
	}

	/* Report the actual size we used. */
	decoded->size = MAXALIGN(out - (char *) decoded);
	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
		   decoded->size);

	return true;

shortdata_err:
	report_invalid_record(state,
						  "record with invalid length at %X/%X",
						  LSN_FORMAT_ARGS(state->ReadRecPtr));
err:
	*errormsg = state->errormsg_buf;

	return false;
}

/*
 * Returns information about the block that a block reference refers to.
 *
 * This is like XLogRecGetBlockTagExtended, except that the block reference
 * must exist and there's no access to prefetch_buffer.
 */
void
XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
				   RelFileLocator *rlocator, ForkNumber *forknum,
				   BlockNumber *blknum)
{
	if (!XLogRecGetBlockTagExtended(record, block_id, rlocator, forknum,
									blknum, NULL))
	{
#ifndef FRONTEND
		elog(ERROR, "could not locate backup block with ID %d in WAL record",
			 block_id);
#else
		pg_fatal("could not locate backup block with ID %d in WAL record",
				 block_id);
#endif
	}
}

/*
 * Returns information about the block that a block reference refers to,
 * optionally including the buffer that the block may already be in.
 *
Change internal RelFileNode references to RelFileNumber or RelFileLocator.
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
2022-07-06 17:39:09 +02:00
|
|
|
* If the WAL record contains a block reference with the given ID, *rlocator,
|
2022-04-07 09:28:40 +02:00
|
|
|
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
|
|
|
|
* returns true. Otherwise returns false.
|
|
|
|
*/
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
						   RelFileLocator *rlocator, ForkNumber *forknum,
						   BlockNumber *blknum,
						   Buffer *prefetch_buffer)
{
	DecodedBkpBlock *bkpb;

	if (!XLogRecHasBlockRef(record, block_id))
		return false;

	bkpb = &record->record->blocks[block_id];
	if (rlocator)
		*rlocator = bkpb->rlocator;
	if (forknum)
		*forknum = bkpb->forknum;
	if (blknum)
		*blknum = bkpb->blkno;
	if (prefetch_buffer)
		*prefetch_buffer = bkpb->prefetch_buffer;
	return true;
}
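
/*
 * Usage sketch (hypothetical caller, not part of this file): a redo or
 * prefetch routine can walk every block reference in a decoded record,
 * passing NULL for any output it does not need:
 *
 *		RelFileLocator rlocator;
 *		ForkNumber	forknum;
 *		BlockNumber blkno;
 *		int			block_id;
 *
 *		for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
 *		{
 *			if (XLogRecGetBlockTagExtended(record, block_id, &rlocator,
 *										   &forknum, &blkno, NULL))
 *				... process the tag for this block reference ...
 *		}
 */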

/*
 * Returns the data associated with a block reference, or NULL if there is
 * no data (e.g. because a full-page image was taken instead). The returned
 * pointer points to a MAXALIGNed buffer.
 */
char *
XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
{
	DecodedBkpBlock *bkpb;

	if (block_id > record->record->max_block_id ||
		!record->record->blocks[block_id].in_use)
		return NULL;

	bkpb = &record->record->blocks[block_id];

	if (!bkpb->has_data)
	{
		if (len)
			*len = 0;
		return NULL;
	}
	else
	{
		if (len)
			*len = bkpb->data_len;
		return bkpb->data;
	}
}
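
/*
 * Usage sketch (hypothetical caller): a redo routine typically pairs the
 * returned pointer with the length output, e.g. to copy the per-block
 * payload onto a destination page (dest is a hypothetical target buffer):
 *
 *		Size		datalen;
 *		char	   *data = XLogRecGetBlockData(record, 0, &datalen);
 *
 *		if (data != NULL)
 *			memcpy(dest, data, datalen);
 */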

/*
 * Restore a full-page image from a backup block attached to an XLOG record.
 *
 * Returns true if a full-page image is restored, and false on failure with
 * an error to be consumed by the caller.
 */
bool
RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)
{
|
|
|
|
DecodedBkpBlock *bkpb;
|
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
|
|
|
char *ptr;
|
2018-09-01 21:27:12 +02:00
|
|
|
PGAlignedBlock tmp;
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
if (block_id > record->record->max_block_id ||
|
|
|
|
!record->record->blocks[block_id].in_use)
|
2022-09-09 03:00:40 +02:00
|
|
|
{
|
|
|
|
report_invalid_record(record,
|
|
|
|
"could not restore image at %X/%X with invalid block %d specified",
|
|
|
|
LSN_FORMAT_ARGS(record->ReadRecPtr),
|
|
|
|
block_id);
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
return false;
|
2022-09-09 03:00:40 +02:00
|
|
|
}
|
2022-03-18 05:45:04 +01:00
|
|
|
if (!record->record->blocks[block_id].has_image)
|
2022-09-09 03:00:40 +02:00
|
|
|
{
|
|
|
|
report_invalid_record(record, "could not restore image at %X/%X with invalid state, block %d",
|
|
|
|
LSN_FORMAT_ARGS(record->ReadRecPtr),
|
|
|
|
block_id);
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
return false;
|
2022-09-09 03:00:40 +02:00
|
|
|
}
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
|
2022-03-18 05:45:04 +01:00
|
|
|
bkpb = &record->record->blocks[block_id];
|
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
|
|
|
ptr = bkpb->bkp_image;
|
|
|
|
|
Add support for LZ4 with compression of full-page writes in WAL
The logic is implemented so as there can be a choice in the compression
used when building a WAL record, and an extra per-record bit is used to
track down if a block is compressed with PGLZ, LZ4 or nothing.
wal_compression, the existing parameter, is changed to an enum with
support for the following backward-compatible values:
- "off", the default, to not use compression.
- "pglz" or "on", to compress FPWs with PGLZ.
- "lz4", the new mode, to compress FPWs with LZ4.
Benchmarking has showed that LZ4 outclasses easily PGLZ. ZSTD would be
also an interesting choice, but going just with LZ4 for now makes the
patch minimalistic as toast compression is already able to use LZ4, so
there is no need to worry about any build-related needs for this
implementation.
Author: Andrey Borodin, Justin Pryzby
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru
2021-06-29 04:17:55 +02:00
	if (BKPIMAGE_COMPRESSED(bkpb->bimg_info))
Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 07:52:24 +01:00
	{
		/* If a backup block image is compressed, decompress it */
		bool		decomp_success = true;

		if ((bkpb->bimg_info & BKPIMAGE_COMPRESS_PGLZ) != 0)
		{
			if (pglz_decompress(ptr, bkpb->bimg_len, tmp.data,
								BLCKSZ - bkpb->hole_length, true) < 0)
				decomp_success = false;
		}
		else if ((bkpb->bimg_info & BKPIMAGE_COMPRESS_LZ4) != 0)
		{
#ifdef USE_LZ4
			if (LZ4_decompress_safe(ptr, tmp.data,
									bkpb->bimg_len, BLCKSZ - bkpb->hole_length) <= 0)
				decomp_success = false;
#else
			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
								  LSN_FORMAT_ARGS(record->ReadRecPtr),
								  "LZ4",
								  block_id);
			return false;
Add support for zstd with compression of full-page writes in WAL
wal_compression gains a new value, "zstd", to allow the compression of
full-page images using the compression method of the same name.
Compression is done using the default level recommended by the library,
as of ZSTD_CLEVEL_DEFAULT = 3. Some benchmarking has shown that it
could make sense to use a level lower for the FPI compression, like 1 or
2, as the compression rate did not change much with a bit less CPU
consumed, but any tests done would only cover few scenarios so it is
hard to come to a clear conclusion. Anyway, there is no reason to not
use the default level instead, which is the level recommended by the
library so it should be fine for most cases.
zstd outclasses easily pglz, and is better than LZ4 where one wants to
have more compression at the cost of extra CPU but both are good enough
in their own scenarios, so the choice between one or the other of these
comes to a study of the workload patterns and the schema involved,
mainly.
This commit relies heavily on 4035cd5, that reshaped the code creating
and restoring full-page writes to be aware of the compression type,
making this integration straight-forward.
This patch borrows some early work from Andrey Borodin, though the patch
got a complete rewrite.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20220222231948.GJ9008@telsasoft.com
2022-03-11 04:18:53 +01:00
#endif
		}
		else if ((bkpb->bimg_info & BKPIMAGE_COMPRESS_ZSTD) != 0)
		{
#ifdef USE_ZSTD
			size_t		decomp_result = ZSTD_decompress(tmp.data,
														BLCKSZ - bkpb->hole_length,
														ptr, bkpb->bimg_len);

			if (ZSTD_isError(decomp_result))
				decomp_success = false;
#else
			report_invalid_record(record, "could not restore image at %X/%X compressed with %s not supported by build, block %d",
								  LSN_FORMAT_ARGS(record->ReadRecPtr),
								  "zstd",
								  block_id);
			return false;
#endif
		}
		else
		{
			report_invalid_record(record, "could not restore image at %X/%X compressed with unknown method, block %d",
								  LSN_FORMAT_ARGS(record->ReadRecPtr),
								  block_id);
			return false;
		}

		if (!decomp_success)
		{
			report_invalid_record(record, "could not decompress image at %X/%X, block %d",
								  LSN_FORMAT_ARGS(record->ReadRecPtr),
								  block_id);
			return false;
		}

		ptr = tmp.data;
	}
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00

	/* generate page, taking into account hole if necessary */
	if (bkpb->hole_length == 0)
	{
		memcpy(page, ptr, BLCKSZ);
	}
	else
	{
		memcpy(page, ptr, bkpb->hole_offset);
		/* must zero-fill the hole */
		MemSet(page + bkpb->hole_offset, 0, bkpb->hole_length);
		memcpy(page + (bkpb->hole_offset + bkpb->hole_length),
					   ptr + bkpb->hole_offset,
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.
The new record format also makes the fixed-size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be bulkier than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
					   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
	}

	return true;
}
2019-07-15 07:03:46 +02:00

#ifndef FRONTEND

/*
 * Extract the FullTransactionId from a WAL record.
 */
FullTransactionId
XLogRecGetFullXid(XLogReaderState *record)
{
	TransactionId xid,
				next_xid;
	uint32		epoch;

	/*
	 * This function is only safe during replay, because it depends on the
	 * replay state.  See AdvanceNextFullTransactionIdPastXid() for more.
	 */
	Assert(AmStartupProcess() || !IsUnderPostmaster);

	xid = XLogRecGetXid(record);
2020-08-11 20:25:23 +02:00
	next_xid = XidFromFullTransactionId(ShmemVariableCache->nextXid);
	epoch = EpochFromFullTransactionId(ShmemVariableCache->nextXid);
2019-07-15 07:03:46 +02:00

	/*
	 * If xid is numerically greater than next_xid, it has to be from the last
	 * epoch.
	 */
	if (unlikely(xid > next_xid))
		--epoch;

	return FullTransactionIdFromEpochAndXid(epoch, xid);
2019-07-31 03:29:55 +02:00
}
2019-07-15 07:03:46 +02:00

#endif