Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
|
|
|
* reorderbuffer.c
|
|
|
|
* PostgreSQL logical replay/reorder buffer management
|
|
|
|
*
|
|
|
|
*
|
2018-01-03 05:30:12 +01:00
|
|
|
* Copyright (c) 2012-2018, PostgreSQL Global Development Group
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
|
|
|
* src/backend/replication/reorderbuffer.c
|
|
|
|
*
|
|
|
|
* NOTES
|
|
|
|
* This module gets handed individual pieces of transactions in the order
|
|
|
|
* they are written to the WAL and is responsible to reassemble them into
|
|
|
|
* toplevel transaction sized pieces. When a transaction is completely
|
|
|
|
* reassembled - signalled by reading the transaction commit record - it
|
2018-01-11 14:31:11 +01:00
|
|
|
* will then call the output plugin (cf. ReorderBufferCommit()) with the
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* individual changes. The output plugins rely on snapshots built by
|
|
|
|
* snapbuild.c which hands them to us.
|
|
|
|
*
|
|
|
|
* Transactions and subtransactions/savepoints in postgres are not
|
|
|
|
* immediately linked to each other from outside the performing
|
|
|
|
* backend. Only at commit/abort (or special xact_assignment records) they
|
|
|
|
* are linked together. Which means that we will have to splice together a
|
|
|
|
* toplevel transaction from its subtransactions. To do that efficiently we
|
|
|
|
* build a binary heap indexed by the smallest current lsn of the individual
|
|
|
|
* subtransactions' changestreams. As the individual streams are inherently
|
|
|
|
* ordered by LSN - since that is where we build them from - the transaction
|
|
|
|
* can easily be reassembled by always using the subtransaction with the
|
|
|
|
* smallest current LSN from the heap.
|
|
|
|
*
|
|
|
|
* In order to cope with large transactions - which can be several times as
|
|
|
|
* big as the available memory - this module supports spooling the contents
|
|
|
|
* of a large transactions to disk. When the transaction is replayed the
|
|
|
|
* contents of individual (sub-)transactions will be read from disk in
|
|
|
|
* chunks.
|
|
|
|
*
|
|
|
|
* This module also has to deal with reassembling toast records from the
|
|
|
|
* individual chunks stored in WAL. When a new (or initial) version of a
|
|
|
|
* tuple is stored in WAL it will always be preceded by the toast chunks
|
|
|
|
* emitted for the columns stored out of line. Within a single toplevel
|
|
|
|
* transaction there will be no other data carrying records between a row's
|
|
|
|
* toast chunks and the row data itself. See ReorderBufferToast* for
|
|
|
|
* details.
|
2017-11-22 19:45:07 +01:00
|
|
|
*
|
|
|
|
* ReorderBuffer uses two special memory context types - SlabContext for
|
|
|
|
* allocations of fixed-length structures (changes and transactions), and
|
|
|
|
* GenerationContext for the variable-length transaction data (allocated
|
|
|
|
* and freed in groups with similar lifespan).
|
|
|
|
*
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* -------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
#include "postgres.h"
|
|
|
|
|
|
|
|
#include <unistd.h>
|
|
|
|
#include <sys/stat.h>
|
|
|
|
|
|
|
|
#include "access/rewriteheap.h"
|
|
|
|
#include "access/transam.h"
|
|
|
|
#include "access/tuptoaster.h"
|
|
|
|
#include "access/xact.h"
|
Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 16:56:26 +01:00
|
|
|
#include "access/xlog_internal.h"
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
#include "catalog/catalog.h"
|
|
|
|
#include "lib/binaryheap.h"
|
Rationalize common/relpath.[hc].
Commit a73018392636ce832b09b5c31f6ad1f18a4643ea created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that. To clean it up:
* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h. We won't consider this symbol part of the FE/BE API.
* Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.
* So, relfilenode.h now includes relpath.h rather than vice-versa. This
direction of dependency is fine. (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)
* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.
* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().
* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers. (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)
* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().
* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames. This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.
The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.
Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 23:30:50 +02:00
|
|
|
#include "miscadmin.h"
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
#include "pgstat.h"
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
#include "replication/logical.h"
|
|
|
|
#include "replication/reorderbuffer.h"
|
|
|
|
#include "replication/slot.h"
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
#include "replication/snapbuild.h" /* just for SnapBuildSnapDecRefcount */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
#include "storage/bufmgr.h"
|
|
|
|
#include "storage/fd.h"
|
|
|
|
#include "storage/sinval.h"
|
|
|
|
#include "utils/builtins.h"
|
|
|
|
#include "utils/combocid.h"
|
|
|
|
#include "utils/memdebug.h"
|
|
|
|
#include "utils/memutils.h"
|
2015-08-11 18:03:14 +02:00
|
|
|
#include "utils/rel.h"
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
#include "utils/relfilenodemap.h"
|
|
|
|
#include "utils/tqual.h"
|
|
|
|
|
Rationalize common/relpath.[hc].
Commit a73018392636ce832b09b5c31f6ad1f18a4643ea created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that. To clean it up:
* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h. We won't consider this symbol part of the FE/BE API.
* Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.
* So, relfilenode.h now includes relpath.h rather than vice-versa. This
direction of dependency is fine. (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)
* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.
* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().
* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers. (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)
* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().
* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames. This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.
The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.
Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 23:30:50 +02:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/* entry for a hash table we use to map from xid to our transaction state */
|
|
|
|
typedef struct ReorderBufferTXNByIdEnt
|
|
|
|
{
|
|
|
|
TransactionId xid;
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
} ReorderBufferTXNByIdEnt;
|
|
|
|
|
|
|
|
/* data structures for (relfilenode, ctid) => (cmin, cmax) mapping */
|
|
|
|
typedef struct ReorderBufferTupleCidKey
|
|
|
|
{
|
|
|
|
RelFileNode relnode;
|
|
|
|
ItemPointerData tid;
|
|
|
|
} ReorderBufferTupleCidKey;
|
|
|
|
|
|
|
|
typedef struct ReorderBufferTupleCidEnt
|
|
|
|
{
|
|
|
|
ReorderBufferTupleCidKey key;
|
|
|
|
CommandId cmin;
|
|
|
|
CommandId cmax;
|
|
|
|
CommandId combocid; /* just for debugging */
|
|
|
|
} ReorderBufferTupleCidEnt;
|
|
|
|
|
|
|
|
/* k-way in-order change iteration support structures */
|
|
|
|
typedef struct ReorderBufferIterTXNEntry
|
|
|
|
{
|
|
|
|
XLogRecPtr lsn;
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
int fd;
|
|
|
|
XLogSegNo segno;
|
|
|
|
} ReorderBufferIterTXNEntry;
|
|
|
|
|
|
|
|
typedef struct ReorderBufferIterTXNState
|
|
|
|
{
|
|
|
|
binaryheap *heap;
|
|
|
|
Size nr_txns;
|
|
|
|
dlist_head old_change;
|
|
|
|
ReorderBufferIterTXNEntry entries[FLEXIBLE_ARRAY_MEMBER];
|
|
|
|
} ReorderBufferIterTXNState;
|
|
|
|
|
|
|
|
/* toast datastructures */
|
|
|
|
typedef struct ReorderBufferToastEnt
|
|
|
|
{
|
|
|
|
Oid chunk_id; /* toast_table.chunk_id */
|
|
|
|
int32 last_chunk_seq; /* toast_table.chunk_seq of the last chunk we
|
|
|
|
* have seen */
|
|
|
|
Size num_chunks; /* number of chunks we've already seen */
|
|
|
|
Size size; /* combined size of chunks seen */
|
|
|
|
dlist_head chunks; /* linked list of chunks */
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
struct varlena *reconstructed; /* reconstructed varlena now pointed to in
|
|
|
|
* main tup */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
} ReorderBufferToastEnt;
|
|
|
|
|
|
|
|
/* Disk serialization support datastructures */
|
|
|
|
typedef struct ReorderBufferDiskChange
|
|
|
|
{
|
|
|
|
Size size;
|
|
|
|
ReorderBufferChange change;
|
|
|
|
/* data follows */
|
|
|
|
} ReorderBufferDiskChange;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Maximum number of changes kept in memory, per transaction. After that,
|
|
|
|
* changes are spooled to disk.
|
|
|
|
*
|
|
|
|
* The current value should be sufficient to decode the entire transaction
|
|
|
|
* without hitting disk in OLTP workloads, while starting to spool to disk in
|
|
|
|
* other workloads reasonably fast.
|
|
|
|
*
|
2015-05-20 18:44:46 +02:00
|
|
|
* At some point in the future it probably makes sense to have a more elaborate
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* resource management here, but it's not entirely clear what that would look
|
|
|
|
* like.
|
|
|
|
*/
|
|
|
|
static const Size max_changes_in_memory = 4096;
|
|
|
|
|
|
|
|
/* ---------------------------------------
|
|
|
|
* primary reorderbuffer support routines
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *rb);
|
|
|
|
static void ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
|
|
|
|
TransactionId xid, bool create, bool *is_new,
|
|
|
|
XLogRecPtr lsn, bool create_as_top);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
|
|
|
|
ReorderBufferTXN *subtxn);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
static void AssertTXNLsnOrder(ReorderBuffer *rb);
|
|
|
|
|
|
|
|
/* ---------------------------------------
|
|
|
|
* support functions for lsn-order iterating over the ->changes of a
|
|
|
|
* transaction and its subtransactions
|
|
|
|
*
|
|
|
|
* used for iteration over the k-way heap merge of a transaction and its
|
|
|
|
* subtransactions
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
static ReorderBufferIterTXNState *ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
2017-06-21 20:39:04 +02:00
|
|
|
static ReorderBufferChange *ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
static void ReorderBufferIterTXNFinish(ReorderBuffer *rb,
|
|
|
|
ReorderBufferIterTXNState *state);
|
|
|
|
static void ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ---------------------------------------
|
|
|
|
* Disk serialization support functions
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
static void ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
int fd, ReorderBufferChange *change);
|
|
|
|
static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
int *fd, XLogSegNo *segno);
|
|
|
|
static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
char *change);
|
|
|
|
static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
2018-03-06 20:22:03 +01:00
|
|
|
static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
|
|
|
|
static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
|
|
|
|
TransactionId xid, XLogSegNo segno);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
|
|
|
|
static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
|
|
|
|
ReorderBufferTXN *txn, CommandId cid);
|
|
|
|
|
|
|
|
/* ---------------------------------------
|
|
|
|
* toast reassembly support
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
static void ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
static void ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn);
|
|
|
|
static void ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
Relation relation, ReorderBufferChange *change);
|
|
|
|
static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
Relation relation, ReorderBufferChange *change);
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
2018-03-06 20:22:03 +01:00
|
|
|
* Allocate a new ReorderBuffer and clean out any old serialized state from
|
|
|
|
* prior ReorderBuffer instances for the same slot.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
ReorderBuffer *
|
|
|
|
ReorderBufferAllocate(void)
|
|
|
|
{
|
|
|
|
ReorderBuffer *buffer;
|
|
|
|
HASHCTL hash_ctl;
|
|
|
|
MemoryContext new_ctx;
|
|
|
|
|
2018-03-06 20:22:03 +01:00
|
|
|
Assert(MyReplicationSlot != NULL);
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/* allocate memory in own context, to have better accountability */
|
|
|
|
new_ctx = AllocSetContextCreate(CurrentMemoryContext,
|
|
|
|
"ReorderBuffer",
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
|
|
|
ALLOCSET_DEFAULT_SIZES);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
buffer =
|
|
|
|
(ReorderBuffer *) MemoryContextAlloc(new_ctx, sizeof(ReorderBuffer));
|
|
|
|
|
|
|
|
memset(&hash_ctl, 0, sizeof(hash_ctl));
|
|
|
|
|
|
|
|
buffer->context = new_ctx;
|
|
|
|
|
2017-02-27 12:41:44 +01:00
|
|
|
buffer->change_context = SlabContextCreate(new_ctx,
|
|
|
|
"Change",
|
|
|
|
SLAB_DEFAULT_BLOCK_SIZE,
|
|
|
|
sizeof(ReorderBufferChange));
|
|
|
|
|
|
|
|
buffer->txn_context = SlabContextCreate(new_ctx,
|
|
|
|
"TXN",
|
|
|
|
SLAB_DEFAULT_BLOCK_SIZE,
|
|
|
|
sizeof(ReorderBufferTXN));
|
|
|
|
|
2017-11-22 19:45:07 +01:00
|
|
|
buffer->tup_context = GenerationContextCreate(new_ctx,
|
2017-11-29 15:24:24 +01:00
|
|
|
"Tuples",
|
|
|
|
SLAB_LARGE_BLOCK_SIZE);
|
2017-11-22 19:45:07 +01:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
hash_ctl.keysize = sizeof(TransactionId);
|
|
|
|
hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
|
|
|
|
hash_ctl.hcxt = buffer->context;
|
|
|
|
|
|
|
|
buffer->by_txn = hash_create("ReorderBufferByXid", 1000, &hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
buffer->by_txn_last_xid = InvalidTransactionId;
|
|
|
|
buffer->by_txn_last_txn = NULL;
|
|
|
|
|
|
|
|
buffer->outbuf = NULL;
|
|
|
|
buffer->outbufsize = 0;
|
|
|
|
|
|
|
|
buffer->current_restart_decoding_lsn = InvalidXLogRecPtr;
|
|
|
|
|
|
|
|
dlist_init(&buffer->toplevel_by_lsn);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
dlist_init(&buffer->txns_by_base_snapshot_lsn);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2018-03-06 20:22:03 +01:00
|
|
|
/*
|
|
|
|
* Ensure there's no stale data from prior uses of this slot, in case some
|
|
|
|
* prior exit avoided calling ReorderBufferFree. Failure to do this can
|
|
|
|
* produce duplicated txns, and it's very cheap if there's nothing there.
|
|
|
|
*/
|
|
|
|
ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
return buffer;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free a ReorderBuffer
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferFree(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
MemoryContext context = rb->context;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We free separately allocated data by entirely scrapping reorderbuffer's
|
|
|
|
* memory context.
|
|
|
|
*/
|
|
|
|
MemoryContextDelete(context);
|
2018-03-06 20:22:03 +01:00
|
|
|
|
|
|
|
/* Free disk space used by unconsumed reorder buffers */
|
|
|
|
ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-05-20 15:18:11 +02:00
|
|
|
* Get an unused, possibly preallocated, ReorderBufferTXN.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
static ReorderBufferTXN *
|
|
|
|
ReorderBufferGetTXN(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
2017-02-27 12:41:44 +01:00
|
|
|
txn = (ReorderBufferTXN *)
|
|
|
|
MemoryContextAlloc(rb->txn_context, sizeof(ReorderBufferTXN));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
memset(txn, 0, sizeof(ReorderBufferTXN));
|
|
|
|
|
|
|
|
dlist_init(&txn->changes);
|
|
|
|
dlist_init(&txn->tuplecids);
|
|
|
|
dlist_init(&txn->subtxns);
|
|
|
|
|
|
|
|
return txn;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free a ReorderBufferTXN.
|
|
|
|
*/
|
2014-05-17 23:57:53 +02:00
|
|
|
static void
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
/* clean the lookup cache if we were cached (quite likely) */
|
|
|
|
if (rb->by_txn_last_xid == txn->xid)
|
|
|
|
{
|
|
|
|
rb->by_txn_last_xid = InvalidTransactionId;
|
|
|
|
rb->by_txn_last_txn = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* free data that's contained */
|
|
|
|
|
|
|
|
if (txn->tuplecid_hash != NULL)
|
|
|
|
{
|
|
|
|
hash_destroy(txn->tuplecid_hash);
|
|
|
|
txn->tuplecid_hash = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (txn->invalidations)
|
|
|
|
{
|
|
|
|
pfree(txn->invalidations);
|
|
|
|
txn->invalidations = NULL;
|
|
|
|
}
|
|
|
|
|
2017-02-27 12:41:44 +01:00
|
|
|
pfree(txn);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-06-25 21:36:33 +02:00
|
|
|
* Get an fresh ReorderBufferChange.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
ReorderBufferChange *
|
|
|
|
ReorderBufferGetChange(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
2017-02-27 12:41:44 +01:00
|
|
|
change = (ReorderBufferChange *)
|
|
|
|
MemoryContextAlloc(rb->change_context, sizeof(ReorderBufferChange));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
memset(change, 0, sizeof(ReorderBufferChange));
|
|
|
|
return change;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free an ReorderBufferChange.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
|
|
|
|
{
|
|
|
|
/* free contained data */
|
2014-03-07 23:02:48 +01:00
|
|
|
switch (change->action)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INSERT:
|
|
|
|
case REORDER_BUFFER_CHANGE_UPDATE:
|
|
|
|
case REORDER_BUFFER_CHANGE_DELETE:
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
|
2014-03-07 23:02:48 +01:00
|
|
|
if (change->data.tp.newtuple)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferReturnTupleBuf(rb, change->data.tp.newtuple);
|
|
|
|
change->data.tp.newtuple = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
if (change->data.tp.oldtuple)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferReturnTupleBuf(rb, change->data.tp.oldtuple);
|
|
|
|
change->data.tp.oldtuple = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
break;
|
2016-04-06 11:05:41 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_MESSAGE:
|
|
|
|
if (change->data.msg.prefix != NULL)
|
|
|
|
pfree(change->data.msg.prefix);
|
|
|
|
change->data.msg.prefix = NULL;
|
|
|
|
if (change->data.msg.message != NULL)
|
|
|
|
pfree(change->data.msg.message);
|
|
|
|
change->data.msg.message = NULL;
|
|
|
|
break;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
|
2014-03-07 23:02:48 +01:00
|
|
|
if (change->data.snapshot)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferFreeSnap(rb, change->data.snapshot);
|
|
|
|
change->data.snapshot = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
break;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/* no data in addition to the struct itself */
|
Fix memory leak in TRUNCATE decoding
When decoding a TRUNCATE record, the relids array was being allocated in
the main ReorderBuffer memory context, but not released with the change
resulting in a memory leak.
The array was also ignored when serializing/deserializing the change,
assuming all the information is stored in the change itself. So when
spilling the change to disk, we've only we have serialized only the
pointer to the relids array. Thanks to never releasing the array,
the pointer however remained valid even after loading the change back
to memory, preventing an actual crash.
This fixes both the memory leak and (de)serialization. The relids array
is still allocated in the main ReorderBuffer memory context (none of the
existing ones seems like a good match, and adding an extra context seems
like an overkill). The allocation is wrapped in a new ReorderBuffer API
functions, to keep the details within reorderbuffer.c, just like the
other ReorderBufferGet methods do.
Author: Tomas Vondra
Discussion: https://www.postgresql.org/message-id/flat/66175a41-9342-2845-652f-1bd4c3ee50aa%402ndquadrant.com
Backpatch: 11, where decoding of TRUNCATE was introduced
2018-09-03 02:10:24 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_TRUNCATE:
|
|
|
|
if (change->data.truncate.relids != NULL)
|
|
|
|
{
|
|
|
|
ReorderBufferReturnRelids(rb, change->data.truncate.relids);
|
|
|
|
change->data.truncate.relids = NULL;
|
|
|
|
}
|
|
|
|
break;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2017-02-27 12:41:44 +01:00
|
|
|
pfree(change);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-06-25 21:36:33 +02:00
|
|
|
* Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
|
|
|
|
* tuple_len (excluding header overhead).
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
ReorderBufferTupleBuf *
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
ReorderBufferTupleBuf *tuple;
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
Size alloc_len;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
alloc_len = tuple_len + SizeofHeapTupleHeader;
|
|
|
|
|
2017-11-22 19:45:07 +01:00
|
|
|
tuple = (ReorderBufferTupleBuf *)
|
|
|
|
MemoryContextAlloc(rb->tup_context,
|
|
|
|
sizeof(ReorderBufferTupleBuf) +
|
|
|
|
MAXIMUM_ALIGNOF + alloc_len);
|
|
|
|
tuple->alloc_tuple_size = alloc_len;
|
|
|
|
tuple->tuple.t_data = ReorderBufferTupleBufData(tuple);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
return tuple;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free an ReorderBufferTupleBuf.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferReturnTupleBuf(ReorderBuffer *rb, ReorderBufferTupleBuf *tuple)
|
|
|
|
{
|
2017-11-22 19:45:07 +01:00
|
|
|
pfree(tuple);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
Fix memory leak in TRUNCATE decoding
When decoding a TRUNCATE record, the relids array was being allocated in
the main ReorderBuffer memory context, but not released with the change
resulting in a memory leak.
The array was also ignored when serializing/deserializing the change,
assuming all the information is stored in the change itself. So when
spilling the change to disk, we've only we have serialized only the
pointer to the relids array. Thanks to never releasing the array,
the pointer however remained valid even after loading the change back
to memory, preventing an actual crash.
This fixes both the memory leak and (de)serialization. The relids array
is still allocated in the main ReorderBuffer memory context (none of the
existing ones seems like a good match, and adding an extra context seems
like an overkill). The allocation is wrapped in a new ReorderBuffer API
functions, to keep the details within reorderbuffer.c, just like the
other ReorderBufferGet methods do.
Author: Tomas Vondra
Discussion: https://www.postgresql.org/message-id/flat/66175a41-9342-2845-652f-1bd4c3ee50aa%402ndquadrant.com
Backpatch: 11, where decoding of TRUNCATE was introduced
2018-09-03 02:10:24 +02:00
|
|
|
/*
|
|
|
|
* Get an array for relids of truncated relations.
|
|
|
|
*
|
|
|
|
* We use the global memory context (for the whole reorder buffer), because
|
|
|
|
* none of the existing ones seems like a good match (some are SLAB, so we
|
|
|
|
* can't use those, and tup_context is meant for tuple data, not relids). We
|
|
|
|
* could add yet another context, but it seems like an overkill - TRUNCATE is
|
|
|
|
* not particularly common operation, so it does not seem worth it.
|
|
|
|
*/
|
|
|
|
Oid *
|
|
|
|
ReorderBufferGetRelids(ReorderBuffer *rb, int nrelids)
|
|
|
|
{
|
|
|
|
Oid *relids;
|
|
|
|
Size alloc_len;
|
|
|
|
|
|
|
|
alloc_len = sizeof(Oid) * nrelids;
|
|
|
|
|
|
|
|
relids = (Oid *) MemoryContextAlloc(rb->context, alloc_len);
|
|
|
|
|
|
|
|
return relids;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free an array of relids.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferReturnRelids(ReorderBuffer *rb, Oid *relids)
|
|
|
|
{
|
|
|
|
pfree(relids);
|
|
|
|
}
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/*
|
|
|
|
* Return the ReorderBufferTXN from the given buffer, specified by Xid.
|
|
|
|
* If create is true, and a transaction doesn't already exist, create it
|
|
|
|
* (with the given LSN, and as top transaction if that's specified);
|
|
|
|
* when this happens, is_new is set to true.
|
|
|
|
*/
|
|
|
|
static ReorderBufferTXN *
|
|
|
|
ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
|
|
|
|
bool *is_new, XLogRecPtr lsn, bool create_as_top)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
ReorderBufferTXNByIdEnt *ent;
|
|
|
|
bool found;
|
|
|
|
|
|
|
|
Assert(TransactionIdIsValid(xid));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check the one-entry lookup cache first
|
|
|
|
*/
|
|
|
|
if (TransactionIdIsValid(rb->by_txn_last_xid) &&
|
|
|
|
rb->by_txn_last_xid == xid)
|
|
|
|
{
|
|
|
|
txn = rb->by_txn_last_txn;
|
|
|
|
|
|
|
|
if (txn != NULL)
|
|
|
|
{
|
|
|
|
/* found it, and it's valid */
|
|
|
|
if (is_new)
|
|
|
|
*is_new = false;
|
|
|
|
return txn;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2016-06-03 16:13:36 +02:00
|
|
|
* cached as non-existent, and asked not to create? Then nothing else
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* to do.
|
|
|
|
*/
|
|
|
|
if (!create)
|
|
|
|
return NULL;
|
|
|
|
/* otherwise fall through to create it */
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the cache wasn't hit or it yielded an "does-not-exist" and we want
|
|
|
|
* to create an entry.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* search the lookup table */
|
|
|
|
ent = (ReorderBufferTXNByIdEnt *)
|
|
|
|
hash_search(rb->by_txn,
|
|
|
|
(void *) &xid,
|
|
|
|
create ? HASH_ENTER : HASH_FIND,
|
|
|
|
&found);
|
|
|
|
if (found)
|
|
|
|
txn = ent->txn;
|
|
|
|
else if (create)
|
|
|
|
{
|
|
|
|
/* initialize the new entry, if creation was requested */
|
|
|
|
Assert(ent != NULL);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
Assert(lsn != InvalidXLogRecPtr);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
ent->txn = ReorderBufferGetTXN(rb);
|
|
|
|
ent->txn->xid = xid;
|
|
|
|
txn = ent->txn;
|
|
|
|
txn->first_lsn = lsn;
|
|
|
|
txn->restart_decoding_lsn = rb->current_restart_decoding_lsn;
|
|
|
|
|
|
|
|
if (create_as_top)
|
|
|
|
{
|
|
|
|
dlist_push_tail(&rb->toplevel_by_lsn, &txn->node);
|
|
|
|
AssertTXNLsnOrder(rb);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
txn = NULL; /* not found and not asked to create */
|
|
|
|
|
|
|
|
/* update cache */
|
|
|
|
rb->by_txn_last_xid = xid;
|
|
|
|
rb->by_txn_last_txn = txn;
|
|
|
|
|
|
|
|
if (is_new)
|
|
|
|
*is_new = !found;
|
|
|
|
|
2016-03-27 17:17:00 +02:00
|
|
|
Assert(!create || txn != NULL);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
return txn;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Queue a change into a transaction so it can be replayed upon commit.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
|
2014-05-06 18:12:18 +02:00
|
|
|
ReorderBufferChange *change)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
|
|
|
|
|
|
|
change->lsn = lsn;
|
|
|
|
Assert(InvalidXLogRecPtr != lsn);
|
|
|
|
dlist_push_tail(&txn->changes, &change->node);
|
|
|
|
txn->nentries++;
|
|
|
|
txn->nentries_mem++;
|
|
|
|
|
|
|
|
ReorderBufferCheckSerializeTXN(rb, txn);
|
|
|
|
}
|
|
|
|
|
2016-04-06 11:05:41 +02:00
|
|
|
/*
|
|
|
|
* Queue message into a transaction so it can be processed upon commit.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
Snapshot snapshot, XLogRecPtr lsn,
|
|
|
|
bool transactional, const char *prefix,
|
|
|
|
Size message_size, const char *message)
|
|
|
|
{
|
|
|
|
if (transactional)
|
|
|
|
{
|
|
|
|
MemoryContext oldcontext;
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
Assert(xid != InvalidTransactionId);
|
|
|
|
|
|
|
|
oldcontext = MemoryContextSwitchTo(rb->context);
|
|
|
|
|
|
|
|
change = ReorderBufferGetChange(rb);
|
|
|
|
change->action = REORDER_BUFFER_CHANGE_MESSAGE;
|
|
|
|
change->data.msg.prefix = pstrdup(prefix);
|
|
|
|
change->data.msg.message_size = message_size;
|
|
|
|
change->data.msg.message = palloc(message_size);
|
|
|
|
memcpy(change->data.msg.message, message, message_size);
|
|
|
|
|
|
|
|
ReorderBufferQueueChange(rb, xid, lsn, change);
|
|
|
|
|
|
|
|
MemoryContextSwitchTo(oldcontext);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2016-06-10 00:02:36 +02:00
|
|
|
ReorderBufferTXN *txn = NULL;
|
|
|
|
volatile Snapshot snapshot_now = snapshot;
|
2016-04-06 11:05:41 +02:00
|
|
|
|
|
|
|
if (xid != InvalidTransactionId)
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
|
|
|
|
|
|
|
/* setup snapshot to allow catalog access */
|
|
|
|
SetupHistoricSnapshot(snapshot_now, NULL);
|
|
|
|
PG_TRY();
|
|
|
|
{
|
|
|
|
rb->message(rb, txn, lsn, false, prefix, message_size, message);
|
|
|
|
|
|
|
|
TeardownHistoricSnapshot(false);
|
|
|
|
}
|
|
|
|
PG_CATCH();
|
|
|
|
{
|
|
|
|
TeardownHistoricSnapshot(true);
|
|
|
|
PG_RE_THROW();
|
|
|
|
}
|
|
|
|
PG_END_TRY();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* AssertTXNLsnOrder
|
|
|
|
* Verify LSN ordering of transaction lists in the reorderbuffer
|
|
|
|
*
|
|
|
|
* Other LSN-related invariants are checked too.
|
|
|
|
*
|
|
|
|
* No-op if assertions are not in use.
|
|
|
|
*/
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
static void
|
|
|
|
AssertTXNLsnOrder(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
#ifdef USE_ASSERT_CHECKING
|
|
|
|
dlist_iter iter;
|
|
|
|
XLogRecPtr prev_first_lsn = InvalidXLogRecPtr;
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
XLogRecPtr prev_base_snap_lsn = InvalidXLogRecPtr;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
dlist_foreach(iter, &rb->toplevel_by_lsn)
|
|
|
|
{
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
|
|
|
|
iter.cur);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* start LSN must be set */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(cur_txn->first_lsn != InvalidXLogRecPtr);
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* If there is an end LSN, it must be higher than start LSN */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (cur_txn->end_lsn != InvalidXLogRecPtr)
|
|
|
|
Assert(cur_txn->first_lsn <= cur_txn->end_lsn);
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* Current initial LSN must be strictly higher than previous */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (prev_first_lsn != InvalidXLogRecPtr)
|
|
|
|
Assert(prev_first_lsn < cur_txn->first_lsn);
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* known-as-subtxn txns must not be listed */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(!cur_txn->is_known_as_subxact);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
prev_first_lsn = cur_txn->first_lsn;
|
|
|
|
}
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
|
|
|
|
dlist_foreach(iter, &rb->txns_by_base_snapshot_lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN,
|
|
|
|
base_snapshot_node,
|
|
|
|
iter.cur);
|
|
|
|
|
|
|
|
/* base snapshot (and its LSN) must be set */
|
|
|
|
Assert(cur_txn->base_snapshot != NULL);
|
|
|
|
Assert(cur_txn->base_snapshot_lsn != InvalidXLogRecPtr);
|
|
|
|
|
|
|
|
/* current LSN must be strictly higher than previous */
|
|
|
|
if (prev_base_snap_lsn != InvalidXLogRecPtr)
|
|
|
|
Assert(prev_base_snap_lsn < cur_txn->base_snapshot_lsn);
|
|
|
|
|
|
|
|
/* known-as-subtxn txns must not be listed */
|
|
|
|
Assert(!cur_txn->is_known_as_subxact);
|
|
|
|
|
|
|
|
prev_base_snap_lsn = cur_txn->base_snapshot_lsn;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* ReorderBufferGetOldestTXN
|
|
|
|
* Return oldest transaction in reorderbuffer
|
|
|
|
*/
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferTXN *
|
|
|
|
ReorderBufferGetOldestTXN(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
AssertTXNLsnOrder(rb);
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (dlist_is_empty(&rb->toplevel_by_lsn))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
txn = dlist_head_element(ReorderBufferTXN, node, &rb->toplevel_by_lsn);
|
|
|
|
|
|
|
|
Assert(!txn->is_known_as_subxact);
|
|
|
|
Assert(txn->first_lsn != InvalidXLogRecPtr);
|
|
|
|
return txn;
|
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* ReorderBufferGetOldestXmin
|
|
|
|
* Return oldest Xmin in reorderbuffer
|
|
|
|
*
|
|
|
|
* Returns oldest possibly running Xid from the point of view of snapshots
|
|
|
|
* used in the transactions kept by reorderbuffer, or InvalidTransactionId if
|
|
|
|
* there are none.
|
|
|
|
*
|
|
|
|
* Since snapshots are assigned monotonically, this equals the Xmin of the
|
|
|
|
* base snapshot with minimal base_snapshot_lsn.
|
|
|
|
*/
|
|
|
|
TransactionId
|
|
|
|
ReorderBufferGetOldestXmin(ReorderBuffer *rb)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
AssertTXNLsnOrder(rb);
|
|
|
|
|
|
|
|
if (dlist_is_empty(&rb->txns_by_base_snapshot_lsn))
|
|
|
|
return InvalidTransactionId;
|
|
|
|
|
|
|
|
txn = dlist_head_element(ReorderBufferTXN, base_snapshot_node,
|
|
|
|
&rb->txns_by_base_snapshot_lsn);
|
|
|
|
return txn->base_snapshot->xmin;
|
|
|
|
}
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
void
|
|
|
|
ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr)
|
|
|
|
{
|
|
|
|
rb->current_restart_decoding_lsn = ptr;
|
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* ReorderBufferAssignChild
|
|
|
|
*
|
|
|
|
* Make note that we know that subxid is a subtransaction of xid, seen as of
|
|
|
|
* the given lsn.
|
|
|
|
*/
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
void
|
|
|
|
ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
TransactionId subxid, XLogRecPtr lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
ReorderBufferTXN *subtxn;
|
|
|
|
bool new_top;
|
|
|
|
bool new_sub;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
|
|
|
|
subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
if (new_top && !new_sub)
|
|
|
|
elog(ERROR, "subtransaction logged without previous top-level txn record");
|
|
|
|
|
|
|
|
if (!new_sub)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
if (subtxn->is_known_as_subxact)
|
|
|
|
{
|
|
|
|
/* already associated, nothing to do */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
2018-06-30 18:25:49 +02:00
|
|
|
* We already saw this transaction, but initially added it to the
|
|
|
|
* list of top-level txns. Now that we know it's not top-level,
|
|
|
|
* remove it from there.
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
*/
|
|
|
|
dlist_delete(&subtxn->node);
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
subtxn->is_known_as_subxact = true;
|
|
|
|
subtxn->toplevel_xid = xid;
|
|
|
|
Assert(subtxn->nsubtxns == 0);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* add to subtransaction list */
|
|
|
|
dlist_push_tail(&txn->subtxns, &subtxn->node);
|
|
|
|
txn->nsubtxns++;
|
|
|
|
|
|
|
|
/* Possibly transfer the subtxn's snapshot to its top-level txn. */
|
|
|
|
ReorderBufferTransferSnapToParent(txn, subtxn);
|
|
|
|
|
|
|
|
/* Verify LSN-ordering invariant */
|
|
|
|
AssertTXNLsnOrder(rb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ReorderBufferTransferSnapToParent
|
|
|
|
* Transfer base snapshot from subtxn to top-level txn, if needed
|
|
|
|
*
|
|
|
|
* This is done if the top-level txn doesn't have a base snapshot, or if the
|
|
|
|
* subtxn's base snapshot has an earlier LSN than the top-level txn's base
|
|
|
|
* snapshot's LSN. This can happen if there are no changes in the toplevel
|
|
|
|
* txn but there are some in the subtxn, or the first change in subtxn has
|
|
|
|
* earlier LSN than first change in the top-level txn and we learned about
|
|
|
|
* their kinship only now.
|
|
|
|
*
|
|
|
|
* The subtransaction's snapshot is cleared regardless of the transfer
|
|
|
|
* happening, since it's not needed anymore in either case.
|
|
|
|
*
|
|
|
|
* We do this as soon as we become aware of their kinship, to avoid queueing
|
|
|
|
* extra snapshots to txns known-as-subtxns -- only top-level txns will
|
|
|
|
* receive further snapshots.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
|
|
|
|
ReorderBufferTXN *subtxn)
|
|
|
|
{
|
|
|
|
Assert(subtxn->toplevel_xid == txn->xid);
|
|
|
|
|
|
|
|
if (subtxn->base_snapshot != NULL)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
if (txn->base_snapshot == NULL ||
|
|
|
|
subtxn->base_snapshot_lsn < txn->base_snapshot_lsn)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the toplevel transaction already has a base snapshot but
|
|
|
|
* it's newer than the subxact's, purge it.
|
|
|
|
*/
|
|
|
|
if (txn->base_snapshot != NULL)
|
|
|
|
{
|
|
|
|
SnapBuildSnapDecRefcount(txn->base_snapshot);
|
|
|
|
dlist_delete(&txn->base_snapshot_node);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The snapshot is now the top transaction's; transfer it, and
|
|
|
|
* adjust the list position of the top transaction in the list by
|
|
|
|
* moving it to where the subtransaction is.
|
|
|
|
*/
|
|
|
|
txn->base_snapshot = subtxn->base_snapshot;
|
|
|
|
txn->base_snapshot_lsn = subtxn->base_snapshot_lsn;
|
|
|
|
dlist_insert_before(&subtxn->base_snapshot_node,
|
|
|
|
&txn->base_snapshot_node);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The subtransaction doesn't have a snapshot anymore (so it
|
|
|
|
* mustn't be in the list.)
|
|
|
|
*/
|
|
|
|
subtxn->base_snapshot = NULL;
|
|
|
|
subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
|
|
|
|
dlist_delete(&subtxn->base_snapshot_node);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Base snap of toplevel is fine, so subxact's is not needed */
|
|
|
|
SnapBuildSnapDecRefcount(subtxn->base_snapshot);
|
|
|
|
dlist_delete(&subtxn->base_snapshot_node);
|
|
|
|
subtxn->base_snapshot = NULL;
|
|
|
|
subtxn->base_snapshot_lsn = InvalidXLogRecPtr;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Associate a subtransaction with its toplevel transaction at commit
|
|
|
|
* time. There may be no further changes added after this.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
TransactionId subxid, XLogRecPtr commit_lsn,
|
|
|
|
XLogRecPtr end_lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *subtxn;
|
|
|
|
|
|
|
|
subtxn = ReorderBufferTXNByXid(rb, subxid, false, NULL,
|
|
|
|
InvalidXLogRecPtr, false);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No need to do anything if that subtxn didn't contain any changes
|
|
|
|
*/
|
|
|
|
if (!subtxn)
|
|
|
|
return;
|
|
|
|
|
|
|
|
subtxn->final_lsn = commit_lsn;
|
|
|
|
subtxn->end_lsn = end_lsn;
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* Assign this subxact as a child of the toplevel xact (no-op if already
|
|
|
|
* done.)
|
|
|
|
*/
|
|
|
|
ReorderBufferAssignChild(rb, xid, subxid, InvalidXLogRecPtr);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Support for efficiently iterating over a transaction's and its
|
|
|
|
* subtransactions' changes.
|
|
|
|
*
|
|
|
|
* We do by doing a k-way merge between transactions/subtransactions. For that
|
|
|
|
* we model the current heads of the different transactions as a binary heap
|
|
|
|
* so we easily know which (sub-)transaction has the change with the smallest
|
|
|
|
* lsn next.
|
|
|
|
*
|
|
|
|
* We assume the changes in individual transactions are already sorted by LSN.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Binary heap comparison function.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ReorderBufferIterCompare(Datum a, Datum b, void *arg)
|
|
|
|
{
|
|
|
|
ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
|
|
|
|
XLogRecPtr pos_a = state->entries[DatumGetInt32(a)].lsn;
|
|
|
|
XLogRecPtr pos_b = state->entries[DatumGetInt32(b)].lsn;
|
|
|
|
|
|
|
|
if (pos_a < pos_b)
|
|
|
|
return 1;
|
|
|
|
else if (pos_a == pos_b)
|
|
|
|
return 0;
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate & initialize an iterator which iterates in lsn order over a
|
|
|
|
* transaction and all its subtransactions.
|
|
|
|
*/
|
|
|
|
static ReorderBufferIterTXNState *
|
|
|
|
ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
Size nr_txns = 0;
|
|
|
|
ReorderBufferIterTXNState *state;
|
|
|
|
dlist_iter cur_txn_i;
|
|
|
|
int32 off;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the size of our heap: one element for every transaction that
|
|
|
|
* contains changes. (Besides the transactions already in the reorder
|
|
|
|
* buffer, we count the one we were directly passed.)
|
|
|
|
*/
|
|
|
|
if (txn->nentries > 0)
|
|
|
|
nr_txns++;
|
|
|
|
|
|
|
|
dlist_foreach(cur_txn_i, &txn->subtxns)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *cur_txn;
|
|
|
|
|
|
|
|
cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
|
|
|
|
|
|
|
|
if (cur_txn->nentries > 0)
|
|
|
|
nr_txns++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* TODO: Consider adding fastpath for the rather common nr_txns=1 case, no
|
|
|
|
* need to allocate/build a heap then.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* allocate iteration state */
|
|
|
|
state = (ReorderBufferIterTXNState *)
|
|
|
|
MemoryContextAllocZero(rb->context,
|
|
|
|
sizeof(ReorderBufferIterTXNState) +
|
|
|
|
sizeof(ReorderBufferIterTXNEntry) * nr_txns);
|
|
|
|
|
|
|
|
state->nr_txns = nr_txns;
|
|
|
|
dlist_init(&state->old_change);
|
|
|
|
|
|
|
|
for (off = 0; off < state->nr_txns; off++)
|
|
|
|
{
|
|
|
|
state->entries[off].fd = -1;
|
|
|
|
state->entries[off].segno = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* allocate heap */
|
|
|
|
state->heap = binaryheap_allocate(state->nr_txns,
|
|
|
|
ReorderBufferIterCompare,
|
|
|
|
state);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now insert items into the binary heap, in an unordered fashion. (We
|
|
|
|
* will run a heap assembly step at the end; this is more efficient.)
|
|
|
|
*/
|
|
|
|
|
|
|
|
off = 0;
|
|
|
|
|
|
|
|
/* add toplevel transaction if it contains changes */
|
|
|
|
if (txn->nentries > 0)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *cur_change;
|
|
|
|
|
2017-06-19 03:48:22 +02:00
|
|
|
if (txn->serialized)
|
2016-10-04 07:11:36 +02:00
|
|
|
{
|
|
|
|
/* serialize remaining changes */
|
|
|
|
ReorderBufferSerializeTXN(rb, txn);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferRestoreChanges(rb, txn, &state->entries[off].fd,
|
|
|
|
&state->entries[off].segno);
|
2016-10-04 07:11:36 +02:00
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
cur_change = dlist_head_element(ReorderBufferChange, node,
|
|
|
|
&txn->changes);
|
|
|
|
|
|
|
|
state->entries[off].lsn = cur_change->lsn;
|
|
|
|
state->entries[off].change = cur_change;
|
|
|
|
state->entries[off].txn = txn;
|
|
|
|
|
|
|
|
binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* add subtransactions if they contain changes */
|
|
|
|
dlist_foreach(cur_txn_i, &txn->subtxns)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *cur_txn;
|
|
|
|
|
|
|
|
cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
|
|
|
|
|
|
|
|
if (cur_txn->nentries > 0)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *cur_change;
|
|
|
|
|
2017-06-19 03:48:22 +02:00
|
|
|
if (cur_txn->serialized)
|
2016-10-04 07:11:36 +02:00
|
|
|
{
|
|
|
|
/* serialize remaining changes */
|
|
|
|
ReorderBufferSerializeTXN(rb, cur_txn);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferRestoreChanges(rb, cur_txn,
|
|
|
|
&state->entries[off].fd,
|
|
|
|
&state->entries[off].segno);
|
2016-10-04 07:11:36 +02:00
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
cur_change = dlist_head_element(ReorderBufferChange, node,
|
|
|
|
&cur_txn->changes);
|
|
|
|
|
|
|
|
state->entries[off].lsn = cur_change->lsn;
|
|
|
|
state->entries[off].change = cur_change;
|
|
|
|
state->entries[off].txn = cur_txn;
|
|
|
|
|
|
|
|
binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* assemble a valid binary heap */
|
|
|
|
binaryheap_build(state->heap);
|
|
|
|
|
|
|
|
return state;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the next change when iterating over a transaction and its
|
|
|
|
* subtransactions.
|
|
|
|
*
|
|
|
|
* Returns NULL when no further changes exist.
|
|
|
|
*/
|
|
|
|
static ReorderBufferChange *
|
|
|
|
ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
ReorderBufferIterTXNEntry *entry;
|
|
|
|
int32 off;
|
|
|
|
|
|
|
|
/* nothing there anymore */
|
|
|
|
if (state->heap->bh_size == 0)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
off = DatumGetInt32(binaryheap_first(state->heap));
|
|
|
|
entry = &state->entries[off];
|
|
|
|
|
|
|
|
/* free memory we might have "leaked" in the previous *Next call */
|
|
|
|
if (!dlist_is_empty(&state->old_change))
|
|
|
|
{
|
|
|
|
change = dlist_container(ReorderBufferChange, node,
|
|
|
|
dlist_pop_head_node(&state->old_change));
|
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
Assert(dlist_is_empty(&state->old_change));
|
|
|
|
}
|
|
|
|
|
|
|
|
change = entry->change;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* update heap with information about which transaction has the next
|
|
|
|
* relevant change in LSN order
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* there are in-memory changes */
|
|
|
|
if (dlist_has_next(&entry->txn->changes, &entry->change->node))
|
|
|
|
{
|
|
|
|
dlist_node *next = dlist_next_node(&entry->txn->changes, &change->node);
|
|
|
|
ReorderBufferChange *next_change =
|
|
|
|
dlist_container(ReorderBufferChange, node, next);
|
|
|
|
|
|
|
|
/* txn stays the same */
|
|
|
|
state->entries[off].lsn = next_change->lsn;
|
|
|
|
state->entries[off].change = next_change;
|
|
|
|
|
|
|
|
binaryheap_replace_first(state->heap, Int32GetDatum(off));
|
|
|
|
return change;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* try to load changes from disk */
|
|
|
|
if (entry->txn->nentries != entry->txn->nentries_mem)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Ugly: restoring changes will reuse *Change records, thus delete the
|
|
|
|
* current one from the per-tx list and only free in the next call.
|
|
|
|
*/
|
|
|
|
dlist_delete(&change->node);
|
|
|
|
dlist_push_tail(&state->old_change, &change->node);
|
|
|
|
|
|
|
|
if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->fd,
|
|
|
|
&state->entries[off].segno))
|
|
|
|
{
|
|
|
|
/* successfully restored changes from disk */
|
|
|
|
ReorderBufferChange *next_change =
|
|
|
|
dlist_head_element(ReorderBufferChange, node,
|
|
|
|
&entry->txn->changes);
|
|
|
|
|
|
|
|
elog(DEBUG2, "restored %u/%u changes from disk",
|
|
|
|
(uint32) entry->txn->nentries_mem,
|
|
|
|
(uint32) entry->txn->nentries);
|
|
|
|
|
|
|
|
Assert(entry->txn->nentries_mem);
|
|
|
|
/* txn stays the same */
|
|
|
|
state->entries[off].lsn = next_change->lsn;
|
|
|
|
state->entries[off].change = next_change;
|
|
|
|
binaryheap_replace_first(state->heap, Int32GetDatum(off));
|
|
|
|
|
|
|
|
return change;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ok, no changes there anymore, remove */
|
|
|
|
binaryheap_remove_first(state->heap);
|
|
|
|
|
|
|
|
return change;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Deallocate the iterator
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferIterTXNFinish(ReorderBuffer *rb,
|
|
|
|
ReorderBufferIterTXNState *state)
|
|
|
|
{
|
|
|
|
int32 off;
|
|
|
|
|
|
|
|
for (off = 0; off < state->nr_txns; off++)
|
|
|
|
{
|
|
|
|
if (state->entries[off].fd != -1)
|
|
|
|
CloseTransientFile(state->entries[off].fd);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* free memory we might have "leaked" in the last *Next call */
|
|
|
|
if (!dlist_is_empty(&state->old_change))
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
change = dlist_container(ReorderBufferChange, node,
|
|
|
|
dlist_pop_head_node(&state->old_change));
|
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
Assert(dlist_is_empty(&state->old_change));
|
|
|
|
}
|
|
|
|
|
|
|
|
binaryheap_free(state->heap);
|
|
|
|
pfree(state);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cleanup the contents of a transaction, usually after the transaction
|
|
|
|
* committed or aborted.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
bool found;
|
|
|
|
dlist_mutable_iter iter;
|
|
|
|
|
|
|
|
/* cleanup subtransactions & their changes */
|
|
|
|
dlist_foreach_modify(iter, &txn->subtxns)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *subtxn;
|
|
|
|
|
|
|
|
subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Subtransactions are always associated to the toplevel TXN, even if
|
|
|
|
* they originally were happening inside another subtxn, so we won't
|
|
|
|
* ever recurse more than one level deep here.
|
|
|
|
*/
|
|
|
|
Assert(subtxn->is_known_as_subxact);
|
|
|
|
Assert(subtxn->nsubtxns == 0);
|
|
|
|
|
|
|
|
ReorderBufferCleanupTXN(rb, subtxn);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* cleanup changes in the toplevel txn */
|
|
|
|
dlist_foreach_modify(iter, &txn->changes)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
change = dlist_container(ReorderBufferChange, node, iter.cur);
|
|
|
|
|
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Cleanup the tuplecids we stored for decoding catalog snapshot access.
|
|
|
|
* They are always stored in the toplevel transaction.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
dlist_foreach_modify(iter, &txn->tuplecids)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
change = dlist_container(ReorderBufferChange, node, iter.cur);
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
}
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/*
|
|
|
|
* Cleanup the base snapshot, if set.
|
|
|
|
*/
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (txn->base_snapshot != NULL)
|
|
|
|
{
|
|
|
|
SnapBuildSnapDecRefcount(txn->base_snapshot);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
dlist_delete(&txn->base_snapshot_node);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
2016-09-05 02:49:44 +02:00
|
|
|
/*
|
|
|
|
* Remove TXN from its containing list.
|
|
|
|
*
|
|
|
|
* Note: if txn->is_known_as_subxact, we are deleting the TXN from its
|
|
|
|
* parent's list of known subxacts; this leaves the parent's nsubxacts
|
|
|
|
* count too high, but we don't care. Otherwise, we are deleting the TXN
|
|
|
|
* from the LSN-ordered list of toplevel TXNs.
|
|
|
|
*/
|
|
|
|
dlist_delete(&txn->node);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* now remove reference from buffer */
|
|
|
|
hash_search(rb->by_txn,
|
|
|
|
(void *) &txn->xid,
|
|
|
|
HASH_REMOVE,
|
|
|
|
&found);
|
|
|
|
Assert(found);
|
|
|
|
|
|
|
|
/* remove entries spilled to disk */
|
2017-06-19 03:48:22 +02:00
|
|
|
if (txn->serialized)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferRestoreCleanup(rb, txn);
|
|
|
|
|
|
|
|
/* deallocate */
|
|
|
|
ReorderBufferReturnTXN(rb, txn);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build a hash with a (relfilenode, ctid) -> (cmin, cmax) mapping for use by
|
|
|
|
* tqual.c's HeapTupleSatisfiesHistoricMVCC.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
dlist_iter iter;
|
|
|
|
HASHCTL hash_ctl;
|
|
|
|
|
|
|
|
if (!txn->has_catalog_changes || dlist_is_empty(&txn->tuplecids))
|
|
|
|
return;
|
|
|
|
|
|
|
|
memset(&hash_ctl, 0, sizeof(hash_ctl));
|
|
|
|
|
|
|
|
hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
|
|
|
|
hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
|
|
|
|
hash_ctl.hcxt = rb->context;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* create the hash with the exact number of to-be-stored tuplecids from
|
|
|
|
* the start
|
|
|
|
*/
|
|
|
|
txn->tuplecid_hash =
|
|
|
|
hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
dlist_foreach(iter, &txn->tuplecids)
|
|
|
|
{
|
|
|
|
ReorderBufferTupleCidKey key;
|
|
|
|
ReorderBufferTupleCidEnt *ent;
|
|
|
|
bool found;
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
change = dlist_container(ReorderBufferChange, node, iter.cur);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* be careful about padding */
|
|
|
|
memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
key.relnode = change->data.tuplecid.node;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
ItemPointerCopy(&change->data.tuplecid.tid,
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
&key.tid);
|
|
|
|
|
|
|
|
ent = (ReorderBufferTupleCidEnt *)
|
|
|
|
hash_search(txn->tuplecid_hash,
|
|
|
|
(void *) &key,
|
|
|
|
HASH_ENTER | HASH_FIND,
|
|
|
|
&found);
|
|
|
|
if (!found)
|
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
ent->cmin = change->data.tuplecid.cmin;
|
|
|
|
ent->cmax = change->data.tuplecid.cmax;
|
|
|
|
ent->combocid = change->data.tuplecid.combocid;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(ent->cmin == change->data.tuplecid.cmin);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(ent->cmax == InvalidCommandId ||
|
2014-03-07 23:02:48 +01:00
|
|
|
ent->cmax == change->data.tuplecid.cmax);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* if the tuple got valid in this transaction and now got deleted
|
|
|
|
* we already have a valid cmin stored. The cmax will be
|
|
|
|
* InvalidCommandId though.
|
|
|
|
*/
|
2014-03-07 23:02:48 +01:00
|
|
|
ent->cmax = change->data.tuplecid.cmax;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy a provided snapshot so we can modify it privately. This is needed so
|
|
|
|
* that catalog modifying transactions can look into intermediate catalog
|
|
|
|
* states.
|
|
|
|
*/
|
|
|
|
static Snapshot
|
|
|
|
ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
|
|
|
|
ReorderBufferTXN *txn, CommandId cid)
|
|
|
|
{
|
|
|
|
Snapshot snap;
|
|
|
|
dlist_iter iter;
|
|
|
|
int i = 0;
|
|
|
|
Size size;
|
|
|
|
|
|
|
|
size = sizeof(SnapshotData) +
|
|
|
|
sizeof(TransactionId) * orig_snap->xcnt +
|
|
|
|
sizeof(TransactionId) * (txn->nsubtxns + 1);
|
|
|
|
|
|
|
|
snap = MemoryContextAllocZero(rb->context, size);
|
|
|
|
memcpy(snap, orig_snap, sizeof(SnapshotData));
|
|
|
|
|
|
|
|
snap->copied = true;
|
2015-04-16 20:00:55 +02:00
|
|
|
snap->active_count = 1; /* mark as active so nobody frees it */
|
|
|
|
snap->regd_count = 0;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
snap->xip = (TransactionId *) (snap + 1);
|
|
|
|
|
|
|
|
memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* snap->subxip contains all txids that belong to our transaction which we
|
2017-05-15 21:41:15 +02:00
|
|
|
* need to check via cmin/cmax. That's why we store the toplevel
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* transaction in there as well.
|
|
|
|
*/
|
|
|
|
snap->subxip = snap->xip + snap->xcnt;
|
|
|
|
snap->subxip[i++] = txn->xid;
|
|
|
|
|
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* nsubxcnt isn't decreased when subtransactions abort, so count manually.
|
|
|
|
* Since it's an upper boundary it is safe to use it for the allocation
|
|
|
|
* above.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
snap->subxcnt = 1;
|
|
|
|
|
|
|
|
dlist_foreach(iter, &txn->subtxns)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *sub_txn;
|
|
|
|
|
|
|
|
sub_txn = dlist_container(ReorderBufferTXN, node, iter.cur);
|
|
|
|
snap->subxip[i++] = sub_txn->xid;
|
|
|
|
snap->subxcnt++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* sort so we can bsearch() later */
|
|
|
|
qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
|
|
|
|
|
|
|
|
/* store the specified current CommandId */
|
|
|
|
snap->curcid = cid;
|
|
|
|
|
|
|
|
return snap;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free a previously ReorderBufferCopySnap'ed snapshot
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap)
|
|
|
|
{
|
|
|
|
if (snap->copied)
|
|
|
|
pfree(snap);
|
|
|
|
else
|
|
|
|
SnapBuildSnapDecRefcount(snap);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* Perform the replay of a transaction and its non-aborted subtransactions.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*
|
|
|
|
* Subtransactions previously have to be processed by
|
|
|
|
* ReorderBufferCommitChild(), even if previously assigned to the toplevel
|
|
|
|
* transaction with ReorderBufferAssignChild.
|
|
|
|
*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* We currently can only decode a transaction's contents when its commit
|
|
|
|
* record is read because that's the only place where we know about cache
|
|
|
|
* invalidations. Thus, once a toplevel commit is read, we iterate over the top
|
|
|
|
* and subtransactions (using a k-way merge) and replay the changes in lsn
|
|
|
|
* order.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
|
Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
|
|
|
TimestampTz commit_time,
|
|
|
|
RepOriginId origin_id, XLogRecPtr origin_lsn)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
2015-01-26 04:49:56 +01:00
|
|
|
volatile Snapshot snapshot_now;
|
2014-05-06 18:12:18 +02:00
|
|
|
volatile CommandId command_id = FirstCommandId;
|
2015-01-26 04:49:56 +01:00
|
|
|
bool using_subtxn;
|
|
|
|
ReorderBufferIterTXNState *volatile iterstate = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
|
|
|
|
false);
|
|
|
|
|
|
|
|
/* unknown transaction, nothing to replay */
|
|
|
|
if (txn == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
txn->final_lsn = commit_lsn;
|
|
|
|
txn->end_lsn = end_lsn;
|
|
|
|
txn->commit_time = commit_time;
|
Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
e.g. to avoid loops in bi-directional replication setups
The solution to these problems, as implemented here, consist out of
three parts:
1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
replication origin, how far replay has progressed in a efficient and
crash safe manner.
3) The ability to filter out changes performed on the behest of a
replication origin during logical decoding; this allows complex
replication topologies. E.g. by filtering all replayed changes out.
Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated. We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.
This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL. Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.
For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.
Bumps both catversion and wal page magic.
Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
20140923182422.GA15776@alap3.anarazel.de,
20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
|
|
|
txn->origin_id = origin_id;
|
|
|
|
txn->origin_lsn = origin_lsn;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* If this transaction has no snapshot, it didn't make any changes to the
|
|
|
|
* database, so there's nothing to decode. Note that
|
|
|
|
* ReorderBufferCommitChild will have transferred any snapshots from
|
|
|
|
* subtransactions if there were any.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
if (txn->base_snapshot == NULL)
|
|
|
|
{
|
|
|
|
Assert(txn->ninvalidations == 0);
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
snapshot_now = txn->base_snapshot;
|
|
|
|
|
|
|
|
/* build data to be able to lookup the CommandIds of catalog tuples */
|
|
|
|
ReorderBufferBuildTupleCidHash(rb, txn);
|
|
|
|
|
|
|
|
/* setup the initial snapshot */
|
|
|
|
SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
|
|
|
|
|
2015-01-26 04:49:56 +01:00
|
|
|
/*
|
|
|
|
* Decoding needs access to syscaches et al., which in turn use
|
|
|
|
* heavyweight locks and such. Thus we need to have enough state around to
|
|
|
|
* keep track of those. The easiest way is to simply use a transaction
|
|
|
|
* internally. That also allows us to easily enforce that nothing writes
|
|
|
|
* to the database by checking for xid assignments.
|
|
|
|
*
|
|
|
|
* When we're called via the SQL SRF there's already a transaction
|
|
|
|
* started, so start an explicit subtransaction there.
|
|
|
|
*/
|
|
|
|
using_subtxn = IsTransactionOrTransactionBlock();
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
PG_TRY();
|
|
|
|
{
|
2015-01-26 04:49:56 +01:00
|
|
|
ReorderBufferChange *change;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
ReorderBufferChange *specinsert = NULL;
|
2014-11-13 19:06:43 +01:00
|
|
|
|
|
|
|
if (using_subtxn)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
BeginInternalSubTransaction("replay");
|
|
|
|
else
|
|
|
|
StartTransactionCommand();
|
|
|
|
|
|
|
|
rb->begin(rb, txn);
|
|
|
|
|
|
|
|
iterstate = ReorderBufferIterTXNInit(rb, txn);
|
2015-01-26 04:49:56 +01:00
|
|
|
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
Relation relation = NULL;
|
|
|
|
Oid reloid;
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
switch (change->action)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
|
2015-05-24 03:35:49 +02:00
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
|
|
|
* Confirmation for speculative insertion arrived. Simply
|
|
|
|
* use as a normal record. It'll be cleaned up at the end
|
|
|
|
* of INSERT processing.
|
|
|
|
*/
|
2018-07-05 23:42:37 +02:00
|
|
|
if (specinsert == NULL)
|
|
|
|
elog(ERROR, "invalid ordering of speculative insertion changes");
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
Assert(specinsert->data.tp.oldtuple == NULL);
|
|
|
|
change = specinsert;
|
|
|
|
change->action = REORDER_BUFFER_CHANGE_INSERT;
|
|
|
|
|
|
|
|
/* intentionally fall through */
|
2014-03-07 23:02:48 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INSERT:
|
|
|
|
case REORDER_BUFFER_CHANGE_UPDATE:
|
|
|
|
case REORDER_BUFFER_CHANGE_DELETE:
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(snapshot_now);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
change->data.tp.relnode.relNode);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
Fix logical decoding error when system table w/ toast is repeatedly rewritten.
Repeatedly rewriting a mapped catalog table with VACUUM FULL or
CLUSTER could cause logical decoding to fail with:
ERROR, "could not map filenode \"%s\" to relation OID"
To trigger the problem the rewritten catalog had to have live tuples
with toasted columns.
The problem was triggered as during catalog table rewrites the
heap_insert() check that prevents logical decoding information to be
emitted for system catalogs, failed to treat the new heap's toast table
as a system catalog (because the new heap is not recognized as a
catalog table via RelationIsLogicallyLogged()). The relmapper, in
contrast to the normal catalog contents, does not contain historical
information. After a single rewrite of a mapped table the new relation
is known to the relmapper, but if the table is rewritten twice before
logical decoding occurs, the relfilenode cannot be mapped to a
relation anymore. Which then leads us to error out. This only
happens for toast tables, because the main table contents aren't
re-inserted with heap_insert().
The fix is simple, add a new heap_insert() flag that prevents logical
decoding information from being emitted, and accept during decoding
that there might not be tuple data for toast tables.
Unfortunately that does not fix pre-existing logical decoding
errors. Doing so would require not throwing an error when a filenode
cannot be mapped to a relation during decoding, and that seems too
likely to hide bugs. If it's crucial to fix decoding for an existing
slot, temporarily changing the ERROR in ReorderBufferCommit() to a
WARNING appears to be the best fix.
Author: Andres Freund
Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Backpatch: 9.4-, where logical decoding was introduced
2018-10-10 22:53:02 +02:00
|
|
|
* Mapped catalog tuple without data, emitted while
|
|
|
|
* catalog table was in the process of being rewritten. We
|
|
|
|
* can fail to look up the relfilenode, because the the
|
|
|
|
* relmapper has no "historic" view, in contrast to normal
|
|
|
|
* the normal catalog during decoding. Thus repeated
|
|
|
|
* rewrites can cause a lookup failure. That's OK because
|
|
|
|
* we do not decode catalog changes anyway. Normally such
|
|
|
|
* tuples would be skipped over below, but we can't
|
|
|
|
* identify whether the table should be logically logged
|
|
|
|
* without mapping the relfilenode to the oid.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
if (reloid == InvalidOid &&
|
2014-03-07 23:02:48 +01:00
|
|
|
change->data.tp.newtuple == NULL &&
|
|
|
|
change->data.tp.oldtuple == NULL)
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
goto change_done;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
else if (reloid == InvalidOid)
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(ERROR, "could not map filenode \"%s\" to relation OID",
|
|
|
|
relpathperm(change->data.tp.relnode,
|
|
|
|
MAIN_FORKNUM));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
relation = RelationIdGetRelation(reloid);
|
|
|
|
|
|
|
|
if (relation == NULL)
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(ERROR, "could not open relation with OID %u (for filenode \"%s\")",
|
|
|
|
reloid,
|
|
|
|
relpathperm(change->data.tp.relnode,
|
|
|
|
MAIN_FORKNUM));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
if (!RelationIsLogicallyLogged(relation))
|
|
|
|
goto change_done;
|
|
|
|
|
2018-03-21 14:13:24 +01:00
|
|
|
/*
|
|
|
|
* Ignore temporary heaps created during DDL unless the
|
|
|
|
* plugin has asked for them.
|
|
|
|
*/
|
|
|
|
if (relation->rd_rel->relrewrite && !rb->output_rewrites)
|
|
|
|
goto change_done;
|
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
2015-05-24 03:35:49 +02:00
|
|
|
* For now ignore sequence changes entirely. Most of the
|
|
|
|
* time they don't log changes using records we
|
|
|
|
* understand, so it doesn't make sense to handle the few
|
|
|
|
* cases we do.
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
*/
|
|
|
|
if (relation->rd_rel->relkind == RELKIND_SEQUENCE)
|
|
|
|
goto change_done;
|
|
|
|
|
|
|
|
/* user-triggered change */
|
|
|
|
if (!IsToastRelation(relation))
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
ReorderBufferToastReplace(rb, txn, relation, change);
|
|
|
|
rb->apply_change(rb, txn, relation, change);
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/*
|
2015-05-24 03:35:49 +02:00
|
|
|
* Only clear reassembled toast chunks if we're sure
|
|
|
|
* they're not required anymore. The creator of the
|
|
|
|
* tuple tells us.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
if (change->data.tp.clear_toast_afterwards)
|
|
|
|
ReorderBufferToastReset(rb, txn);
|
|
|
|
}
|
|
|
|
/* we're not interested in toast deletions */
|
|
|
|
else if (change->action == REORDER_BUFFER_CHANGE_INSERT)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Need to reassemble the full toasted Datum in
|
|
|
|
* memory, to ensure the chunks don't get reused till
|
|
|
|
* we're done remove it from the list of this
|
|
|
|
* transaction's changes. Otherwise it will get
|
|
|
|
* freed/reused while restoring spooled data from
|
|
|
|
* disk.
|
|
|
|
*/
|
Do not decode TOAST data for table rewrites
During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged
using XLOG / FPI records, and thus (correctly) ignored in decoding.
But the associated TOAST table is WAL-logged as plain INSERT records,
and so was logically decoded and passed to reorder buffer.
That has severe consequences with TOAST tables of non-trivial size.
Firstly, reorder buffer has to keep all those changes, possibly spilling
them to a file, incurring I/O costs and disk space.
Secondly, ReoderBufferCommit() was stashing all those TOAST chunks into
a hash table, which got discarded only after processing the row from the
main heap. But as the main heap is not decoded for rewrites, this never
happened, so all the TOAST data accumulated in memory, resulting either
in excessive memory consumption or OOM.
The fix is simple, as commit e9edc1ba already introduced infrastructure
(namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST
tables, but it only applied it to system tables. So simply use it for
all TOAST data in raw_heap_insert().
That would however solve only the memory consumption issue - the TOAST
changes would still be decoded and added to the reorder buffer, and
spilled to disk (although without TOAST tuple data, so much smaller).
But we can solve that by tweaking DecodeInsert() to just ignore such
INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag,
instead of skipping them later in ReorderBufferCommit().
Review: Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com
Backpatch: 9.4-, where logical decoding was introduced
2018-11-28 01:11:15 +01:00
|
|
|
Assert(change->data.tp.newtuple != NULL);
|
|
|
|
|
|
|
|
dlist_delete(&change->node);
|
|
|
|
ReorderBufferToastAppendChunk(rb, txn, relation,
|
|
|
|
change);
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
}
|
|
|
|
|
2015-05-24 03:35:49 +02:00
|
|
|
change_done:
|
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
|
|
|
* Either speculative insertion was confirmed, or it was
|
|
|
|
* unsuccessful and the record isn't needed anymore.
|
|
|
|
*/
|
|
|
|
if (specinsert != NULL)
|
|
|
|
{
|
|
|
|
ReorderBufferReturnChange(rb, specinsert);
|
|
|
|
specinsert = NULL;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
if (relation != NULL)
|
|
|
|
{
|
|
|
|
RelationClose(relation);
|
|
|
|
relation = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
break;
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
|
2015-05-24 03:35:49 +02:00
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
|
|
|
* Speculative insertions are dealt with by delaying the
|
|
|
|
* processing of the insert until the confirmation record
|
|
|
|
* arrives. For that we simply unlink the record from the
|
|
|
|
* chain, so it does not get freed/reused while restoring
|
|
|
|
* spooled data from disk.
|
|
|
|
*
|
|
|
|
* This is safe in the face of concurrent catalog changes
|
|
|
|
* because the relevant relation can't be changed between
|
|
|
|
* speculative insertion and confirmation due to
|
|
|
|
* CheckTableNotInUse() and locking.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* clear out a pending (and thus failed) speculation */
|
|
|
|
if (specinsert != NULL)
|
|
|
|
{
|
|
|
|
ReorderBufferReturnChange(rb, specinsert);
|
|
|
|
specinsert = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* and memorize the pending insertion */
|
|
|
|
dlist_delete(&change->node);
|
|
|
|
specinsert = change;
|
|
|
|
break;
|
|
|
|
|
2018-04-07 17:17:56 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_TRUNCATE:
|
|
|
|
{
|
2018-04-26 20:47:16 +02:00
|
|
|
int i;
|
|
|
|
int nrelids = change->data.truncate.nrelids;
|
|
|
|
int nrelations = 0;
|
|
|
|
Relation *relations;
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
relations = palloc0(nrelids * sizeof(Relation));
|
|
|
|
for (i = 0; i < nrelids; i++)
|
|
|
|
{
|
|
|
|
Oid relid = change->data.truncate.relids[i];
|
|
|
|
Relation relation;
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
relation = RelationIdGetRelation(relid);
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
if (relation == NULL)
|
|
|
|
elog(ERROR, "could not open relation with OID %u", relid);
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
if (!RelationIsLogicallyLogged(relation))
|
|
|
|
continue;
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
relations[nrelations++] = relation;
|
|
|
|
}
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
rb->apply_truncate(rb, txn, nrelations, relations, change);
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2018-04-26 20:47:16 +02:00
|
|
|
for (i = 0; i < nrelations; i++)
|
|
|
|
RelationClose(relations[i]);
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
2018-04-07 17:17:56 +02:00
|
|
|
|
2016-04-06 11:05:41 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_MESSAGE:
|
|
|
|
rb->message(rb, txn, change->lsn, true,
|
|
|
|
change->data.msg.prefix,
|
|
|
|
change->data.msg.message_size,
|
|
|
|
change->data.msg.message);
|
|
|
|
break;
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
|
|
|
|
/* get rid of the old */
|
|
|
|
TeardownHistoricSnapshot(false);
|
|
|
|
|
|
|
|
if (snapshot_now->copied)
|
|
|
|
{
|
|
|
|
ReorderBufferFreeSnap(rb, snapshot_now);
|
|
|
|
snapshot_now =
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferCopySnap(rb, change->data.snapshot,
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
txn, command_id);
|
|
|
|
}
|
2014-05-06 18:12:18 +02:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/*
|
|
|
|
* Restored from disk, need to be careful not to double
|
|
|
|
* free. We could introduce refcounting for that, but for
|
|
|
|
* now this seems infrequent enough not to care.
|
|
|
|
*/
|
2014-03-07 23:02:48 +01:00
|
|
|
else if (change->data.snapshot->copied)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
snapshot_now =
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferCopySnap(rb, change->data.snapshot,
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
txn, command_id);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
snapshot_now = change->data.snapshot;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* and continue with the new one */
|
|
|
|
SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(change->data.command_id != InvalidCommandId);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
if (command_id < change->data.command_id)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
command_id = change->data.command_id;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (!snapshot_now->copied)
|
|
|
|
{
|
|
|
|
/* we don't use the global one anymore */
|
|
|
|
snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
txn, command_id);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
snapshot_now->curcid = command_id;
|
|
|
|
|
|
|
|
TeardownHistoricSnapshot(false);
|
|
|
|
SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Every time the CommandId is incremented, we could
|
|
|
|
* see new catalog contents, so execute all
|
|
|
|
* invalidations.
|
|
|
|
*/
|
|
|
|
ReorderBufferExecuteInvalidations(rb, txn);
|
|
|
|
}
|
|
|
|
|
|
|
|
break;
|
|
|
|
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
|
|
|
|
elog(ERROR, "tuplecid value in changequeue");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
/*
|
2015-05-20 15:18:11 +02:00
|
|
|
* There's a speculative insertion remaining, just clean in up, it
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
* can't have been successful, otherwise we'd gotten a confirmation
|
|
|
|
* record.
|
|
|
|
*/
|
|
|
|
if (specinsert)
|
|
|
|
{
|
|
|
|
ReorderBufferReturnChange(rb, specinsert);
|
|
|
|
specinsert = NULL;
|
|
|
|
}
|
|
|
|
|
2015-01-24 19:25:19 +01:00
|
|
|
/* clean up the iterator */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferIterTXNFinish(rb, iterstate);
|
2015-01-24 19:25:19 +01:00
|
|
|
iterstate = NULL;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* call commit callback */
|
|
|
|
rb->commit(rb, txn, commit_lsn);
|
|
|
|
|
|
|
|
/* this is just a sanity check against bad output plugin behaviour */
|
|
|
|
if (GetCurrentTransactionIdIfAny() != InvalidTransactionId)
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(ERROR, "output plugin used XID %u",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
GetCurrentTransactionId());
|
|
|
|
|
|
|
|
/* cleanup */
|
|
|
|
TeardownHistoricSnapshot(false);
|
|
|
|
|
|
|
|
/*
|
2014-11-13 19:06:43 +01:00
|
|
|
* Aborting the current (sub-)transaction as a whole has the right
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* semantics. We want all locks acquired in here to be released, not
|
|
|
|
* reassigned to the parent and we do not want any database access
|
|
|
|
* have persistent effects.
|
|
|
|
*/
|
2014-11-13 19:06:43 +01:00
|
|
|
AbortCurrentTransaction();
|
|
|
|
|
|
|
|
/* make sure there's no cache pollution */
|
|
|
|
ReorderBufferExecuteInvalidations(rb, txn);
|
|
|
|
|
|
|
|
if (using_subtxn)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
RollbackAndReleaseCurrentSubTransaction();
|
|
|
|
|
|
|
|
if (snapshot_now->copied)
|
|
|
|
ReorderBufferFreeSnap(rb, snapshot_now);
|
|
|
|
|
|
|
|
/* remove potential on-disk data, and deallocate */
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
}
|
|
|
|
PG_CATCH();
|
|
|
|
{
|
|
|
|
/* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
|
|
|
|
if (iterstate)
|
|
|
|
ReorderBufferIterTXNFinish(rb, iterstate);
|
|
|
|
|
|
|
|
TeardownHistoricSnapshot(true);
|
|
|
|
|
|
|
|
/*
|
2014-11-13 19:06:43 +01:00
|
|
|
* Force cache invalidation to happen outside of a valid transaction
|
|
|
|
* to prevent catalog access as we just caught an error.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
2014-11-13 19:06:43 +01:00
|
|
|
AbortCurrentTransaction();
|
|
|
|
|
|
|
|
/* make sure there's no cache pollution */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferExecuteInvalidations(rb, txn);
|
|
|
|
|
2014-11-13 19:06:43 +01:00
|
|
|
if (using_subtxn)
|
|
|
|
RollbackAndReleaseCurrentSubTransaction();
|
|
|
|
|
|
|
|
if (snapshot_now->copied)
|
|
|
|
ReorderBufferFreeSnap(rb, snapshot_now);
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/* remove potential on-disk data, and deallocate */
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
|
|
|
|
PG_RE_THROW();
|
|
|
|
}
|
|
|
|
PG_END_TRY();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Abort a transaction that possibly has previous changes. Needs to be first
|
|
|
|
* called for subtransactions and then for the toplevel xid.
|
|
|
|
*
|
|
|
|
* NB: Transactions handled here have to have actively aborted (i.e. have
|
|
|
|
* produced an abort record). Implicitly aborted transactions are handled via
|
2017-02-06 10:33:58 +01:00
|
|
|
* ReorderBufferAbortOld(); transactions we're just not interested in, but
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* which have committed are handled in ReorderBufferForget().
|
|
|
|
*
|
|
|
|
* This function purges this transaction and its contents from memory and
|
|
|
|
* disk.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
|
|
|
|
false);
|
|
|
|
|
|
|
|
/* unknown, nothing to remove */
|
|
|
|
if (txn == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* cosmetic... */
|
|
|
|
txn->final_lsn = lsn;
|
|
|
|
|
|
|
|
/* remove potential on-disk data, and deallocate */
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Abort all transactions that aren't actually running anymore because the
|
|
|
|
* server restarted.
|
|
|
|
*
|
|
|
|
* NB: These really have to be transactions that have aborted due to a server
|
|
|
|
* crash/immediate restart, as we don't deal with invalidations here.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAbortOld(ReorderBuffer *rb, TransactionId oldestRunningXid)
|
|
|
|
{
|
|
|
|
dlist_mutable_iter it;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate through all (potential) toplevel TXNs and abort all that are
|
|
|
|
* older than what possibly can be running. Once we've found the first
|
|
|
|
* that is alive we stop, there might be some that acquired an xid earlier
|
2018-01-05 16:17:10 +01:00
|
|
|
* but started writing later, but it's unlikely and they will be cleaned
|
|
|
|
* up in a later call to this function.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
dlist_foreach_modify(it, &rb->toplevel_by_lsn)
|
|
|
|
{
|
2014-05-06 18:12:18 +02:00
|
|
|
ReorderBufferTXN *txn;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
txn = dlist_container(ReorderBufferTXN, node, it.cur);
|
|
|
|
|
|
|
|
if (TransactionIdPrecedes(txn->xid, oldestRunningXid))
|
|
|
|
{
|
2018-01-05 16:17:10 +01:00
|
|
|
/*
|
|
|
|
* We set final_lsn on a transaction when we decode its commit or
|
|
|
|
* abort record, but we never see those records for crashed
|
|
|
|
* transactions. To ensure cleanup of these transactions, set
|
|
|
|
* final_lsn to that of their last change; this causes
|
|
|
|
* ReorderBufferRestoreCleanup to do the right thing.
|
|
|
|
*/
|
|
|
|
if (txn->serialized && txn->final_lsn == 0)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *last =
|
2018-04-26 20:47:16 +02:00
|
|
|
dlist_tail_element(ReorderBufferChange, node, &txn->changes);
|
2018-01-05 16:17:10 +01:00
|
|
|
|
|
|
|
txn->final_lsn = last->lsn;
|
|
|
|
}
|
|
|
|
|
Fix race condition leading to hanging logical slot creation.
The snapshot assembly during the creation of logical slots relied
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xact record listing transactions as
ready, that have already WAL-logged an commit/abort record, as the
locking just prevents the ProcArray to be adjusted, and the commit
record has to be logged first.
That lead to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. That hang resolved only if a xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.
It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created when
the all transaction perceived as running based on commit/abort
records, now we have to wait for the next xl_running_xacts record.
To address that, trigger logging new xl_running_xacts record from
within snapbuild.c exactly when necessary.
Unfortunately snabuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
2017-05-13 23:21:00 +02:00
|
|
|
elog(DEBUG2, "aborting old transaction %u", txn->xid);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* remove potential on-disk data, and deallocate this tx */
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Forget the contents of a transaction if we aren't interested in it's
|
|
|
|
* contents. Needs to be first called for subtransactions and then for the
|
|
|
|
* toplevel xid.
|
|
|
|
*
|
|
|
|
* This is significantly different to ReorderBufferAbort() because
|
2017-02-06 10:33:58 +01:00
|
|
|
* transactions that have committed need to be treated differently from aborted
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* ones since they may have modified the catalog.
|
|
|
|
*
|
|
|
|
* Note that this is only allowed to be called in the moment a transaction
|
|
|
|
* commit has just been read, not earlier; otherwise later records referring
|
|
|
|
* to this xid might re-create the transaction incompletely.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferForget(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
|
|
|
|
false);
|
|
|
|
|
|
|
|
/* unknown, nothing to forget */
|
|
|
|
if (txn == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* cosmetic... */
|
|
|
|
txn->final_lsn = lsn;
|
|
|
|
|
|
|
|
/*
|
2015-05-24 03:35:49 +02:00
|
|
|
* Process cache invalidation messages if there are any. Even if we're not
|
|
|
|
* interested in the transaction's contents, it could have manipulated the
|
|
|
|
* catalog and we need to update the caches according to that.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
|
Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-24 04:18:00 +02:00
|
|
|
ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
|
|
|
|
txn->invalidations);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
else
|
|
|
|
Assert(txn->ninvalidations == 0);
|
|
|
|
|
|
|
|
/* remove potential on-disk data, and deallocate */
|
|
|
|
ReorderBufferCleanupTXN(rb, txn);
|
|
|
|
}
|
|
|
|
|
Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-24 04:18:00 +02:00
|
|
|
/*
|
|
|
|
* Execute invalidations happening outside the context of a decoded
|
|
|
|
* transaction. That currently happens either for xid-less commits
|
2018-01-11 14:31:11 +01:00
|
|
|
* (cf. RecordTransactionCommit()) or for invalidations in uninteresting
|
Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-24 04:18:00 +02:00
|
|
|
* transactions (via ReorderBufferForget()).
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations,
|
|
|
|
SharedInvalidationMessage *invalidations)
|
|
|
|
{
|
|
|
|
bool use_subtxn = IsTransactionOrTransactionBlock();
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (use_subtxn)
|
|
|
|
BeginInternalSubTransaction("replay");
|
|
|
|
|
|
|
|
/*
|
2016-06-10 00:02:36 +02:00
|
|
|
* Force invalidations to happen outside of a valid transaction - that way
|
|
|
|
* entries will just be marked as invalid without accessing the catalog.
|
|
|
|
* That's advantageous because we don't need to setup the full state
|
|
|
|
* necessary for catalog access.
|
Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
XXX:
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-24 04:18:00 +02:00
|
|
|
*/
|
|
|
|
if (use_subtxn)
|
|
|
|
AbortCurrentTransaction();
|
|
|
|
|
|
|
|
for (i = 0; i < ninvalidations; i++)
|
|
|
|
LocalExecuteInvalidationMessage(&invalidations[i]);
|
|
|
|
|
|
|
|
if (use_subtxn)
|
|
|
|
RollbackAndReleaseCurrentSubTransaction();
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
2016-03-06 03:02:20 +01:00
|
|
|
* Tell reorderbuffer about an xid seen in the WAL stream. Has to be called at
|
|
|
|
* least once for every xid in XLogRecord->xl_xid (other places in records
|
|
|
|
* may, but do not have to be passed through here).
|
|
|
|
*
|
|
|
|
* Reorderbuffer keeps some datastructures about transactions in LSN order,
|
|
|
|
* for efficiency. To do that it has to know about when transactions are seen
|
|
|
|
* first in the WAL. As many types of records are not actually interesting for
|
|
|
|
* logical decoding, they do not necessarily pass though here.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
2016-03-06 03:02:20 +01:00
|
|
|
void
|
|
|
|
ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2016-03-06 03:02:20 +01:00
|
|
|
/* many records won't have an xid assigned, centralize check here */
|
|
|
|
if (xid != InvalidTransactionId)
|
|
|
|
ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a new snapshot to this transaction that may only used after lsn 'lsn'
|
|
|
|
* because the previous snapshot doesn't describe the catalog correctly for
|
|
|
|
* following rows.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAddSnapshot(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn, Snapshot snap)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change = ReorderBufferGetChange(rb);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
change->data.snapshot = snap;
|
|
|
|
change->action = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
ReorderBufferQueueChange(rb, xid, lsn, change);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* Set up the transaction's base snapshot.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* If we know that xid is a subtransaction, set the base snapshot on the
|
|
|
|
* top-level transaction instead.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferSetBaseSnapshot(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn, Snapshot snap)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
bool is_new;
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
AssertArg(snap != NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fetch the transaction to operate on. If we know it's a subtransaction,
|
|
|
|
* operate on its top-level transaction instead.
|
|
|
|
*/
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, &is_new, lsn, true);
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
if (txn->is_known_as_subxact)
|
|
|
|
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
|
|
|
|
NULL, InvalidXLogRecPtr, false);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(txn->base_snapshot == NULL);
|
|
|
|
|
|
|
|
txn->base_snapshot = snap;
|
|
|
|
txn->base_snapshot_lsn = lsn;
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
dlist_push_tail(&rb->txns_by_base_snapshot_lsn, &txn->base_snapshot_node);
|
|
|
|
|
|
|
|
AssertTXNLsnOrder(rb);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Access the catalog with this CommandId at this point in the changestream.
|
|
|
|
*
|
|
|
|
* May only be called for command ids > 1
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn, CommandId cid)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change = ReorderBufferGetChange(rb);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
change->data.command_id = cid;
|
|
|
|
change->action = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
ReorderBufferQueueChange(rb, xid, lsn, change);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add new (relfilenode, tid) -> (cmin, cmax) mappings.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn, RelFileNode node,
|
|
|
|
ItemPointerData tid, CommandId cmin,
|
|
|
|
CommandId cmax, CommandId combocid)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change = ReorderBufferGetChange(rb);
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
change->data.tuplecid.node = node;
|
|
|
|
change->data.tuplecid.tid = tid;
|
|
|
|
change->data.tuplecid.cmin = cmin;
|
|
|
|
change->data.tuplecid.cmax = cmax;
|
|
|
|
change->data.tuplecid.combocid = combocid;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
change->lsn = lsn;
|
2014-03-07 23:02:48 +01:00
|
|
|
change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
dlist_push_tail(&txn->tuplecids, &change->node);
|
|
|
|
txn->ntuplecids++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Setup the invalidation of the toplevel transaction.
|
|
|
|
*
|
|
|
|
* This needs to be done before ReorderBufferCommit is called!
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn, Size nmsgs,
|
|
|
|
SharedInvalidationMessage *msgs)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
|
|
|
|
|
|
|
if (txn->ninvalidations != 0)
|
|
|
|
elog(ERROR, "only ever add one set of invalidations");
|
|
|
|
|
|
|
|
Assert(nmsgs > 0);
|
|
|
|
|
|
|
|
txn->ninvalidations = nmsgs;
|
|
|
|
txn->invalidations = (SharedInvalidationMessage *)
|
|
|
|
MemoryContextAlloc(rb->context,
|
|
|
|
sizeof(SharedInvalidationMessage) * nmsgs);
|
|
|
|
memcpy(txn->invalidations, msgs,
|
|
|
|
sizeof(SharedInvalidationMessage) * nmsgs);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply all invalidations we know. Possibly we only need parts at this point
|
|
|
|
* in the changestream but we don't know which those are.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferExecuteInvalidations(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < txn->ninvalidations; i++)
|
|
|
|
LocalExecuteInvalidationMessage(&txn->invalidations[i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark a transaction as containing catalog changes
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
ReorderBufferXidSetCatalogChanges(ReorderBuffer *rb, TransactionId xid,
|
|
|
|
XLogRecPtr lsn)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
|
|
|
|
|
|
|
|
txn->has_catalog_changes = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Query whether a transaction is already *known* to contain catalog
|
|
|
|
* changes. This can be wrong until directly before the commit!
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
ReorderBufferXidHasCatalogChanges(ReorderBuffer *rb, TransactionId xid)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
|
|
|
|
false);
|
|
|
|
if (txn == NULL)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return txn->has_catalog_changes;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
* ReorderBufferXidHasBaseSnapshot
|
|
|
|
* Have we already set the base snapshot for the given txn/subtxn?
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
bool
|
|
|
|
ReorderBufferXidHasBaseSnapshot(ReorderBuffer *rb, TransactionId xid)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *txn;
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
txn = ReorderBufferTXNByXid(rb, xid, false,
|
|
|
|
NULL, InvalidXLogRecPtr, false);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* transaction isn't known yet, ergo no snapshot */
|
|
|
|
if (txn == NULL)
|
|
|
|
return false;
|
|
|
|
|
Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed. First, xmin of logical slots was
advanced too early. During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them. The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.
To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN. This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment). Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.
The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.
Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.
Liberally sprinkle code comments, and rewrite a few existing ones. This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 22:38:34 +02:00
|
|
|
/* a known subtxn? operate on top-level txn instead */
|
|
|
|
if (txn->is_known_as_subxact)
|
|
|
|
txn = ReorderBufferTXNByXid(rb, txn->toplevel_xid, false,
|
|
|
|
NULL, InvalidXLogRecPtr, false);
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
return txn->base_snapshot != NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ---------------------------------------
|
|
|
|
* Disk serialization support
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure the IO buffer is >= sz.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
|
|
|
|
{
|
|
|
|
if (!rb->outbufsize)
|
|
|
|
{
|
|
|
|
rb->outbuf = MemoryContextAlloc(rb->context, sz);
|
|
|
|
rb->outbufsize = sz;
|
|
|
|
}
|
|
|
|
else if (rb->outbufsize < sz)
|
|
|
|
{
|
|
|
|
rb->outbuf = repalloc(rb->outbuf, sz);
|
|
|
|
rb->outbufsize = sz;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the transaction tx should spill its data to disk.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* TODO: improve accounting so we cheaply can take subtransactions into
|
|
|
|
* account here.
|
|
|
|
*/
|
|
|
|
if (txn->nentries_mem >= max_changes_in_memory)
|
|
|
|
{
|
|
|
|
ReorderBufferSerializeTXN(rb, txn);
|
|
|
|
Assert(txn->nentries_mem == 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Spill data of a large transaction (and its subtransactions) to disk.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
dlist_iter subtxn_i;
|
|
|
|
dlist_mutable_iter change_i;
|
|
|
|
int fd = -1;
|
|
|
|
XLogSegNo curOpenSegNo = 0;
|
|
|
|
Size spilled = 0;
|
|
|
|
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(DEBUG2, "spill %u changes in XID %u to disk",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
(uint32) txn->nentries_mem, txn->xid);
|
|
|
|
|
|
|
|
/* do the same to all child TXs */
|
|
|
|
dlist_foreach(subtxn_i, &txn->subtxns)
|
|
|
|
{
|
|
|
|
ReorderBufferTXN *subtxn;
|
|
|
|
|
|
|
|
subtxn = dlist_container(ReorderBufferTXN, node, subtxn_i.cur);
|
|
|
|
ReorderBufferSerializeTXN(rb, subtxn);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* serialize changestream */
|
|
|
|
dlist_foreach_modify(change_i, &txn->changes)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
change = dlist_container(ReorderBufferChange, node, change_i.cur);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* store in segment in which it belongs by start lsn, don't split over
|
|
|
|
* multiple segments tho
|
|
|
|
*/
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
if (fd == -1 ||
|
|
|
|
!XLByteInSeg(change->lsn, curOpenSegNo, wal_segment_size))
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2018-03-06 20:22:03 +01:00
|
|
|
char path[MAXPGPATH];
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (fd != -1)
|
|
|
|
CloseTransientFile(fd);
|
|
|
|
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
XLByteToSeg(change->lsn, curOpenSegNo, wal_segment_size);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* No need to care about TLIs here, only used during a single run,
|
|
|
|
* so each LSN only maps to a specific WAL record.
|
|
|
|
*/
|
2018-03-06 20:22:03 +01:00
|
|
|
ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
|
|
|
|
curOpenSegNo);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* open segment, create it if necessary */
|
|
|
|
fd = OpenTransientFile(path,
|
2017-09-23 15:49:22 +02:00
|
|
|
O_CREAT | O_WRONLY | O_APPEND | PG_BINARY);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (fd < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2018-03-06 20:22:03 +01:00
|
|
|
errmsg("could not open file \"%s\": %m", path)));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
ReorderBufferSerializeChange(rb, txn, fd, change);
|
|
|
|
dlist_delete(&change->node);
|
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
|
|
|
|
spilled++;
|
|
|
|
}
|
|
|
|
|
|
|
|
Assert(spilled == txn->nentries_mem);
|
|
|
|
Assert(dlist_is_empty(&txn->changes));
|
|
|
|
txn->nentries_mem = 0;
|
2017-06-19 03:48:22 +02:00
|
|
|
txn->serialized = true;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (fd != -1)
|
|
|
|
CloseTransientFile(fd);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Serialize individual change to disk.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
int fd, ReorderBufferChange *change)
|
|
|
|
{
|
|
|
|
ReorderBufferDiskChange *ondisk;
|
|
|
|
Size sz = sizeof(ReorderBufferDiskChange);
|
|
|
|
|
|
|
|
ReorderBufferSerializeReserve(rb, sz);
|
|
|
|
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
memcpy(&ondisk->change, change, sizeof(ReorderBufferChange));
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
switch (change->action)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2015-05-24 03:35:49 +02:00
|
|
|
/* fall through these, they're all similar enough */
|
2014-03-07 23:02:48 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INSERT:
|
|
|
|
case REORDER_BUFFER_CHANGE_UPDATE:
|
|
|
|
case REORDER_BUFFER_CHANGE_DELETE:
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
char *data;
|
2014-05-06 18:12:18 +02:00
|
|
|
ReorderBufferTupleBuf *oldtup,
|
|
|
|
*newtup;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Size oldlen = 0;
|
|
|
|
Size newlen = 0;
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
oldtup = change->data.tp.oldtuple;
|
|
|
|
newtup = change->data.tp.newtuple;
|
|
|
|
|
|
|
|
if (oldtup)
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
{
|
|
|
|
sz += sizeof(HeapTupleData);
|
|
|
|
oldlen = oldtup->tuple.t_len;
|
|
|
|
sz += oldlen;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
if (newtup)
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
{
|
|
|
|
sz += sizeof(HeapTupleData);
|
|
|
|
newlen = newtup->tuple.t_len;
|
|
|
|
sz += newlen;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* make sure we have enough space */
|
|
|
|
ReorderBufferSerializeReserve(rb, sz);
|
|
|
|
|
|
|
|
data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
|
|
|
|
/* might have been reallocated above */
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
|
|
|
if (oldlen)
|
|
|
|
{
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
memcpy(data, &oldtup->tuple, sizeof(HeapTupleData));
|
|
|
|
data += sizeof(HeapTupleData);
|
|
|
|
|
|
|
|
memcpy(data, oldtup->tuple.t_data, oldlen);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
data += oldlen;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (newlen)
|
|
|
|
{
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
memcpy(data, &newtup->tuple, sizeof(HeapTupleData));
|
|
|
|
data += sizeof(HeapTupleData);
|
|
|
|
|
|
|
|
memcpy(data, newtup->tuple.t_data, newlen);
|
2016-03-07 23:24:03 +01:00
|
|
|
data += newlen;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
2016-04-06 11:05:41 +02:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
case REORDER_BUFFER_CHANGE_MESSAGE:
|
|
|
|
{
|
|
|
|
char *data;
|
|
|
|
Size prefix_size = strlen(change->data.msg.prefix) + 1;
|
|
|
|
|
|
|
|
sz += prefix_size + change->data.msg.message_size +
|
|
|
|
sizeof(Size) + sizeof(Size);
|
|
|
|
ReorderBufferSerializeReserve(rb, sz);
|
|
|
|
|
|
|
|
data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
|
|
|
|
|
2016-09-28 17:19:46 +02:00
|
|
|
/* might have been reallocated above */
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
2016-04-06 11:05:41 +02:00
|
|
|
/* write the prefix including the size */
|
|
|
|
memcpy(data, &prefix_size, sizeof(Size));
|
|
|
|
data += sizeof(Size);
|
|
|
|
memcpy(data, change->data.msg.prefix,
|
|
|
|
prefix_size);
|
|
|
|
data += prefix_size;
|
|
|
|
|
|
|
|
/* write the message including the size */
|
|
|
|
memcpy(data, &change->data.msg.message_size, sizeof(Size));
|
|
|
|
data += sizeof(Size);
|
|
|
|
memcpy(data, change->data.msg.message,
|
|
|
|
change->data.msg.message_size);
|
|
|
|
data += change->data.msg.message_size;
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
|
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
Snapshot snap;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
char *data;
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
snap = change->data.snapshot;
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
sz += sizeof(SnapshotData) +
|
2014-03-07 23:02:48 +01:00
|
|
|
sizeof(TransactionId) * snap->xcnt +
|
|
|
|
sizeof(TransactionId) * snap->subxcnt
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
;
|
|
|
|
|
|
|
|
/* make sure we have enough space */
|
|
|
|
ReorderBufferSerializeReserve(rb, sz);
|
|
|
|
data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
|
|
|
|
/* might have been reallocated above */
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
memcpy(data, snap, sizeof(SnapshotData));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
data += sizeof(SnapshotData);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
if (snap->xcnt)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
memcpy(data, snap->xip,
|
2014-05-09 16:44:04 +02:00
|
|
|
sizeof(TransactionId) * snap->xcnt);
|
|
|
|
data += sizeof(TransactionId) * snap->xcnt;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
if (snap->subxcnt)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
memcpy(data, snap->subxip,
|
2014-05-09 16:44:04 +02:00
|
|
|
sizeof(TransactionId) * snap->subxcnt);
|
|
|
|
data += sizeof(TransactionId) * snap->subxcnt;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
2018-04-07 17:17:56 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_TRUNCATE:
|
Fix memory leak in TRUNCATE decoding
When decoding a TRUNCATE record, the relids array was being allocated in
the main ReorderBuffer memory context, but not released with the change
resulting in a memory leak.
The array was also ignored when serializing/deserializing the change,
assuming all the information is stored in the change itself. So when
spilling the change to disk, we've only we have serialized only the
pointer to the relids array. Thanks to never releasing the array,
the pointer however remained valid even after loading the change back
to memory, preventing an actual crash.
This fixes both the memory leak and (de)serialization. The relids array
is still allocated in the main ReorderBuffer memory context (none of the
existing ones seems like a good match, and adding an extra context seems
like an overkill). The allocation is wrapped in a new ReorderBuffer API
functions, to keep the details within reorderbuffer.c, just like the
other ReorderBufferGet methods do.
Author: Tomas Vondra
Discussion: https://www.postgresql.org/message-id/flat/66175a41-9342-2845-652f-1bd4c3ee50aa%402ndquadrant.com
Backpatch: 11, where decoding of TRUNCATE was introduced
2018-09-03 02:10:24 +02:00
|
|
|
{
|
|
|
|
Size size;
|
|
|
|
char *data;
|
|
|
|
|
|
|
|
/* account for the OIDs of truncated relations */
|
|
|
|
size = sizeof(Oid) * change->data.truncate.nrelids;
|
|
|
|
sz += size;
|
|
|
|
|
|
|
|
/* make sure we have enough space */
|
|
|
|
ReorderBufferSerializeReserve(rb, sz);
|
|
|
|
|
|
|
|
data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
|
|
|
|
/* might have been reallocated above */
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
|
|
|
memcpy(data, change->data.truncate.relids, size);
|
|
|
|
data += size;
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
|
|
|
|
/* ReorderBufferChange contains everything important */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
ondisk->size = sz;
|
|
|
|
|
2018-08-04 22:31:18 +02:00
|
|
|
errno = 0;
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
|
|
|
|
{
|
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
|
|
|
int save_errno = errno;
|
2016-08-19 19:38:55 +02:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
CloseTransientFile(fd);
|
2018-06-25 04:19:05 +02:00
|
|
|
|
|
|
|
/* if write didn't set errno, assume problem is no disk space */
|
|
|
|
errno = save_errno ? save_errno : ENOSPC;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2014-05-01 00:16:53 +02:00
|
|
|
errmsg("could not write to data file for XID %u: %m",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
txn->xid)));
|
|
|
|
}
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(ondisk->change.action == change->action);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Restore a number of changes spilled to disk back into memory.
|
|
|
|
*/
|
|
|
|
static Size
|
|
|
|
ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
int *fd, XLogSegNo *segno)
|
|
|
|
{
|
|
|
|
Size restored = 0;
|
|
|
|
XLogSegNo last_segno;
|
|
|
|
dlist_mutable_iter cleanup_iter;
|
|
|
|
|
|
|
|
Assert(txn->first_lsn != InvalidXLogRecPtr);
|
|
|
|
Assert(txn->final_lsn != InvalidXLogRecPtr);
|
|
|
|
|
|
|
|
/* free current entries, so we have memory for more */
|
|
|
|
dlist_foreach_modify(cleanup_iter, &txn->changes)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *cleanup =
|
|
|
|
dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
|
|
|
|
|
|
|
|
dlist_delete(&cleanup->node);
|
|
|
|
ReorderBufferReturnChange(rb, cleanup);
|
|
|
|
}
|
|
|
|
txn->nentries_mem = 0;
|
|
|
|
Assert(dlist_is_empty(&txn->changes));
|
|
|
|
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
XLByteToSeg(txn->final_lsn, last_segno, wal_segment_size);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
while (restored < max_changes_in_memory && *segno <= last_segno)
|
|
|
|
{
|
|
|
|
int readBytes;
|
|
|
|
ReorderBufferDiskChange *ondisk;
|
|
|
|
|
|
|
|
if (*fd == -1)
|
|
|
|
{
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
|
|
|
|
/* first time in */
|
|
|
|
if (*segno == 0)
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
XLByteToSeg(txn->first_lsn, *segno, wal_segment_size);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
Assert(*segno != 0 || dlist_is_empty(&txn->changes));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No need to care about TLIs here, only used during a single run,
|
|
|
|
* so each LSN only maps to a specific WAL record.
|
|
|
|
*/
|
2018-03-06 20:22:03 +01:00
|
|
|
ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid,
|
|
|
|
*segno);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2017-09-23 15:49:22 +02:00
|
|
|
*fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (*fd < 0 && errno == ENOENT)
|
|
|
|
{
|
|
|
|
*fd = -1;
|
|
|
|
(*segno)++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
else if (*fd < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open file \"%s\": %m",
|
|
|
|
path)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the statically sized part of a change which has information
|
|
|
|
* about the total size. If we couldn't read a record, we're at the
|
|
|
|
* end of this file.
|
|
|
|
*/
|
2014-05-09 16:44:04 +02:00
|
|
|
ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange));
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
readBytes = read(*fd, rb->outbuf, sizeof(ReorderBufferDiskChange));
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* eof */
|
|
|
|
if (readBytes == 0)
|
|
|
|
{
|
|
|
|
CloseTransientFile(*fd);
|
|
|
|
*fd = -1;
|
|
|
|
(*segno)++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
else if (readBytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("could not read from reorderbuffer spill file: %m")));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
else if (readBytes != sizeof(ReorderBufferDiskChange))
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2014-11-12 02:00:58 +01:00
|
|
|
errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
readBytes,
|
|
|
|
(uint32) sizeof(ReorderBufferDiskChange))));
|
|
|
|
|
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
|
|
|
ReorderBufferSerializeReserve(rb,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
sizeof(ReorderBufferDiskChange) + ondisk->size);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ondisk = (ReorderBufferDiskChange *) rb->outbuf;
|
|
|
|
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
readBytes = read(*fd, rb->outbuf + sizeof(ReorderBufferDiskChange),
|
|
|
|
ondisk->size - sizeof(ReorderBufferDiskChange));
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (readBytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
errmsg("could not read from reorderbuffer spill file: %m")));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
else if (readBytes != ondisk->size - sizeof(ReorderBufferDiskChange))
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2014-05-01 00:16:53 +02:00
|
|
|
errmsg("could not read from reorderbuffer spill file: read %d instead of %u bytes",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
readBytes,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(uint32) (ondisk->size - sizeof(ReorderBufferDiskChange)))));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* ok, read a full change from disk, now restore it into proper
|
|
|
|
* in-memory format
|
|
|
|
*/
|
|
|
|
ReorderBufferRestoreChange(rb, txn, rb->outbuf);
|
|
|
|
restored++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return restored;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Convert change from its on-disk format to in-memory format and queue it onto
|
|
|
|
* the TXN's ->changes list.
|
Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
2016-04-15 01:42:21 +02:00
|
|
|
*
|
|
|
|
* Note: although "data" is declared char*, at entry it points to a
|
|
|
|
* maxalign'd buffer, making it safe in most of this function to assume
|
|
|
|
* that the pointed-to data is suitably aligned for direct access.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
char *data)
|
|
|
|
{
|
|
|
|
ReorderBufferDiskChange *ondisk;
|
|
|
|
ReorderBufferChange *change;
|
|
|
|
|
|
|
|
ondisk = (ReorderBufferDiskChange *) data;
|
|
|
|
|
|
|
|
change = ReorderBufferGetChange(rb);
|
|
|
|
|
|
|
|
/* copy static part */
|
|
|
|
memcpy(change, &ondisk->change, sizeof(ReorderBufferChange));
|
|
|
|
|
|
|
|
data += sizeof(ReorderBufferDiskChange);
|
|
|
|
|
|
|
|
/* restore individual stuff */
|
2014-03-07 23:02:48 +01:00
|
|
|
switch (change->action)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2015-05-24 03:35:49 +02:00
|
|
|
/* fall through these, they're all similar enough */
|
2014-03-07 23:02:48 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INSERT:
|
|
|
|
case REORDER_BUFFER_CHANGE_UPDATE:
|
|
|
|
case REORDER_BUFFER_CHANGE_DELETE:
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
|
2016-03-06 03:02:20 +01:00
|
|
|
if (change->data.tp.oldtuple)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
2016-04-15 01:42:21 +02:00
|
|
|
uint32 tuplelen = ((HeapTuple) data)->t_len;
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
|
|
|
|
change->data.tp.oldtuple =
|
|
|
|
ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
/* restore ->tuple */
|
|
|
|
memcpy(&change->data.tp.oldtuple->tuple, data,
|
|
|
|
sizeof(HeapTupleData));
|
|
|
|
data += sizeof(HeapTupleData);
|
|
|
|
|
|
|
|
/* reset t_data pointer into the new tuplebuf */
|
2016-03-06 03:02:20 +01:00
|
|
|
change->data.tp.oldtuple->tuple.t_data =
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
ReorderBufferTupleBufData(change->data.tp.oldtuple);
|
|
|
|
|
|
|
|
/* restore tuple data itself */
|
|
|
|
memcpy(change->data.tp.oldtuple->tuple.t_data, data, tuplelen);
|
|
|
|
data += tuplelen;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
2016-03-06 03:02:20 +01:00
|
|
|
if (change->data.tp.newtuple)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
2016-04-15 01:42:21 +02:00
|
|
|
/* here, data might not be suitably aligned! */
|
|
|
|
uint32 tuplelen;
|
|
|
|
|
|
|
|
memcpy(&tuplelen, data + offsetof(HeapTupleData, t_len),
|
|
|
|
sizeof(uint32));
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
|
|
|
|
change->data.tp.newtuple =
|
|
|
|
ReorderBufferGetTupleBuf(rb, tuplelen - SizeofHeapTupleHeader);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
/* restore ->tuple */
|
|
|
|
memcpy(&change->data.tp.newtuple->tuple, data,
|
|
|
|
sizeof(HeapTupleData));
|
|
|
|
data += sizeof(HeapTupleData);
|
|
|
|
|
|
|
|
/* reset t_data pointer into the new tuplebuf */
|
2016-03-06 03:02:20 +01:00
|
|
|
change->data.tp.newtuple->tuple.t_data =
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
ReorderBufferTupleBufData(change->data.tp.newtuple);
|
|
|
|
|
|
|
|
/* restore tuple data itself */
|
|
|
|
memcpy(change->data.tp.newtuple->tuple.t_data, data, tuplelen);
|
|
|
|
data += tuplelen;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
break;
|
2016-04-06 11:05:41 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_MESSAGE:
|
|
|
|
{
|
|
|
|
Size prefix_size;
|
|
|
|
|
|
|
|
/* read prefix */
|
|
|
|
memcpy(&prefix_size, data, sizeof(Size));
|
|
|
|
data += sizeof(Size);
|
|
|
|
change->data.msg.prefix = MemoryContextAlloc(rb->context,
|
|
|
|
prefix_size);
|
|
|
|
memcpy(change->data.msg.prefix, data, prefix_size);
|
2016-06-10 00:02:36 +02:00
|
|
|
Assert(change->data.msg.prefix[prefix_size - 1] == '\0');
|
2016-04-06 11:05:41 +02:00
|
|
|
data += prefix_size;
|
|
|
|
|
2016-10-26 10:12:31 +02:00
|
|
|
/* read the message */
|
2016-04-06 11:05:41 +02:00
|
|
|
memcpy(&change->data.msg.message_size, data, sizeof(Size));
|
|
|
|
data += sizeof(Size);
|
|
|
|
change->data.msg.message = MemoryContextAlloc(rb->context,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
change->data.msg.message_size);
|
2016-04-06 11:05:41 +02:00
|
|
|
memcpy(change->data.msg.message, data,
|
|
|
|
change->data.msg.message_size);
|
|
|
|
data += change->data.msg.message_size;
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
|
|
|
|
{
|
2014-03-07 23:02:48 +01:00
|
|
|
Snapshot oldsnap;
|
|
|
|
Snapshot newsnap;
|
|
|
|
Size size;
|
|
|
|
|
|
|
|
oldsnap = (Snapshot) data;
|
|
|
|
|
|
|
|
size = sizeof(SnapshotData) +
|
|
|
|
sizeof(TransactionId) * oldsnap->xcnt +
|
|
|
|
sizeof(TransactionId) * (oldsnap->subxcnt + 0);
|
|
|
|
|
|
|
|
change->data.snapshot = MemoryContextAllocZero(rb->context, size);
|
|
|
|
|
|
|
|
newsnap = change->data.snapshot;
|
|
|
|
|
|
|
|
memcpy(newsnap, data, size);
|
|
|
|
newsnap->xip = (TransactionId *)
|
|
|
|
(((char *) newsnap) + sizeof(SnapshotData));
|
|
|
|
newsnap->subxip = newsnap->xip + newsnap->xcnt;
|
|
|
|
newsnap->copied = true;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
/* the base struct contains all the data, easy peasy */
|
2018-04-07 17:17:56 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_TRUNCATE:
|
Fix memory leak in TRUNCATE decoding
When decoding a TRUNCATE record, the relids array was being allocated in
the main ReorderBuffer memory context, but not released with the change
resulting in a memory leak.
The array was also ignored when serializing/deserializing the change,
assuming all the information is stored in the change itself. So when
spilling the change to disk, we've only we have serialized only the
pointer to the relids array. Thanks to never releasing the array,
the pointer however remained valid even after loading the change back
to memory, preventing an actual crash.
This fixes both the memory leak and (de)serialization. The relids array
is still allocated in the main ReorderBuffer memory context (none of the
existing ones seems like a good match, and adding an extra context seems
like an overkill). The allocation is wrapped in a new ReorderBuffer API
functions, to keep the details within reorderbuffer.c, just like the
other ReorderBufferGet methods do.
Author: Tomas Vondra
Discussion: https://www.postgresql.org/message-id/flat/66175a41-9342-2845-652f-1bd4c3ee50aa%402ndquadrant.com
Backpatch: 11, where decoding of TRUNCATE was introduced
2018-09-03 02:10:24 +02:00
|
|
|
{
|
|
|
|
Oid *relids;
|
|
|
|
|
|
|
|
relids = ReorderBufferGetRelids(rb,
|
|
|
|
change->data.truncate.nrelids);
|
|
|
|
memcpy(relids, data, change->data.truncate.nrelids * sizeof(Oid));
|
|
|
|
change->data.truncate.relids = relids;
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint. DO NOTHING avoids the
constraint violation, without touching the pre-existing row. DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed. The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.
This feature is often referred to as upsert.
This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert. If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made. If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.
To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.
Bumps catversion as stored rules change.
Author: Peter Geoghegan, with significant contributions from Heikki
Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:31:36 +02:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
|
|
|
|
case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
dlist_push_tail(&txn->changes, &change->node);
|
|
|
|
txn->nentries_mem++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove all on-disk stored for the passed in transaction.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
XLogSegNo first;
|
|
|
|
XLogSegNo cur;
|
|
|
|
XLogSegNo last;
|
|
|
|
|
|
|
|
Assert(txn->first_lsn != InvalidXLogRecPtr);
|
|
|
|
Assert(txn->final_lsn != InvalidXLogRecPtr);
|
|
|
|
|
Make WAL segment size configurable at initdb time.
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: Firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL, secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time large segment size are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting to a
particularly load. Therefore change it to a initdb time setting.
This includes a breaking changes to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Hass, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
2017-09-20 07:03:48 +02:00
|
|
|
XLByteToSeg(txn->first_lsn, first, wal_segment_size);
|
|
|
|
XLByteToSeg(txn->final_lsn, last, wal_segment_size);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* iterate over all possible filenames, and delete them */
|
|
|
|
for (cur = first; cur <= last; cur++)
|
|
|
|
{
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
|
2018-03-06 20:22:03 +01:00
|
|
|
ReorderBufferSerializedPath(path, MyReplicationSlot, txn->xid, cur);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (unlink(path) != 0 && errno != ENOENT)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2014-09-05 07:20:33 +02:00
|
|
|
errmsg("could not remove file \"%s\": %m", path)));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-06 20:22:03 +01:00
|
|
|
/*
|
|
|
|
* Remove any leftover serialized reorder buffers from a slot directory after a
|
|
|
|
* prior crash or decoding session exit.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferCleanupSerializedTXNs(const char *slotname)
|
|
|
|
{
|
|
|
|
DIR *spill_dir;
|
|
|
|
struct dirent *spill_de;
|
|
|
|
struct stat statbuf;
|
|
|
|
char path[MAXPGPATH * 2 + 12];
|
|
|
|
|
|
|
|
sprintf(path, "pg_replslot/%s", slotname);
|
|
|
|
|
|
|
|
/* we're only handling directories here, skip if it's not ours */
|
|
|
|
if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
|
|
|
|
return;
|
|
|
|
|
|
|
|
spill_dir = AllocateDir(path);
|
|
|
|
while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
|
|
|
|
{
|
|
|
|
/* only look at names that can be ours */
|
|
|
|
if (strncmp(spill_de->d_name, "xid", 3) == 0)
|
|
|
|
{
|
|
|
|
snprintf(path, sizeof(path),
|
|
|
|
"pg_replslot/%s/%s", slotname,
|
|
|
|
spill_de->d_name);
|
|
|
|
|
|
|
|
if (unlink(path) != 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2018-08-25 18:19:21 +02:00
|
|
|
errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/xid*: %m",
|
2018-03-06 20:22:03 +01:00
|
|
|
path, slotname)));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
FreeDir(spill_dir);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Given a replication slot, transaction ID and segment number, fill in the
|
|
|
|
* corresponding spill file into 'path', which is a caller-owned buffer of size
|
|
|
|
* at least MAXPGPATH.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferSerializedPath(char *path, ReplicationSlot *slot, TransactionId xid,
|
|
|
|
XLogSegNo segno)
|
|
|
|
{
|
|
|
|
XLogRecPtr recptr;
|
|
|
|
|
2018-07-09 20:28:21 +02:00
|
|
|
XLogSegNoOffsetToRecPtr(segno, 0, wal_segment_size, recptr);
|
2018-03-06 20:22:03 +01:00
|
|
|
|
2018-08-26 07:45:59 +02:00
|
|
|
snprintf(path, MAXPGPATH, "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
|
2018-04-26 20:47:16 +02:00
|
|
|
NameStr(MyReplicationSlot->data.name),
|
|
|
|
xid,
|
|
|
|
(uint32) (recptr >> 32), (uint32) recptr);
|
2018-03-06 20:22:03 +01:00
|
|
|
}
|
|
|
|
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
/*
|
|
|
|
* Delete all data spilled to disk after we've restarted/crashed. It will be
|
|
|
|
* recreated when the respective slots are reused.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
StartupReorderBuffer(void)
|
|
|
|
{
|
|
|
|
DIR *logical_dir;
|
|
|
|
struct dirent *logical_de;
|
|
|
|
|
|
|
|
logical_dir = AllocateDir("pg_replslot");
|
|
|
|
while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
|
|
|
|
{
|
|
|
|
if (strcmp(logical_de->d_name, ".") == 0 ||
|
|
|
|
strcmp(logical_de->d_name, "..") == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* if it cannot be a slot, skip the directory */
|
|
|
|
if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ok, has to be a surviving logical slot, iterate and delete
|
2017-02-06 10:33:58 +01:00
|
|
|
* everything starting with xid-*
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
2018-03-06 20:22:03 +01:00
|
|
|
ReorderBufferCleanupSerializedTXNs(logical_de->d_name);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
FreeDir(logical_dir);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ---------------------------------------
|
|
|
|
* toast reassembly support
|
|
|
|
* ---------------------------------------
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize per tuple toast reconstruction support.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
HASHCTL hash_ctl;
|
|
|
|
|
|
|
|
Assert(txn->toast_hash == NULL);
|
|
|
|
|
|
|
|
memset(&hash_ctl, 0, sizeof(hash_ctl));
|
|
|
|
hash_ctl.keysize = sizeof(Oid);
|
|
|
|
hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
|
|
|
|
hash_ctl.hcxt = rb->context;
|
|
|
|
txn->toast_hash = hash_create("ReorderBufferToastHash", 5, &hash_ctl,
|
Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
2014-12-18 19:36:29 +01:00
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Per toast-chunk handling for toast reconstruction
|
|
|
|
*
|
|
|
|
* Appends a toast chunk so we can reconstruct it when the tuple "owning" the
|
|
|
|
* toasted Datum comes along.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
Relation relation, ReorderBufferChange *change)
|
|
|
|
{
|
|
|
|
ReorderBufferToastEnt *ent;
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferTupleBuf *newtup;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
bool found;
|
|
|
|
int32 chunksize;
|
|
|
|
bool isnull;
|
|
|
|
Pointer chunk;
|
|
|
|
TupleDesc desc = RelationGetDescr(relation);
|
|
|
|
Oid chunk_id;
|
2014-12-11 21:41:15 +01:00
|
|
|
int32 chunk_seq;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (txn->toast_hash == NULL)
|
|
|
|
ReorderBufferToastInitHash(rb, txn);
|
|
|
|
|
|
|
|
Assert(IsToastRelation(relation));
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
newtup = change->data.tp.newtuple;
|
|
|
|
chunk_id = DatumGetObjectId(fastgetattr(&newtup->tuple, 1, desc, &isnull));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(!isnull);
|
2014-03-07 23:02:48 +01:00
|
|
|
chunk_seq = DatumGetInt32(fastgetattr(&newtup->tuple, 2, desc, &isnull));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(!isnull);
|
|
|
|
|
|
|
|
ent = (ReorderBufferToastEnt *)
|
|
|
|
hash_search(txn->toast_hash,
|
|
|
|
(void *) &chunk_id,
|
|
|
|
HASH_ENTER,
|
|
|
|
&found);
|
|
|
|
|
|
|
|
if (!found)
|
|
|
|
{
|
|
|
|
Assert(ent->chunk_id == chunk_id);
|
|
|
|
ent->num_chunks = 0;
|
|
|
|
ent->last_chunk_seq = 0;
|
|
|
|
ent->size = 0;
|
|
|
|
ent->reconstructed = NULL;
|
|
|
|
dlist_init(&ent->chunks);
|
|
|
|
|
|
|
|
if (chunk_seq != 0)
|
|
|
|
elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq 0",
|
|
|
|
chunk_seq, chunk_id);
|
|
|
|
}
|
|
|
|
else if (found && chunk_seq != ent->last_chunk_seq + 1)
|
|
|
|
elog(ERROR, "got sequence entry %d for toast chunk %u instead of seq %d",
|
|
|
|
chunk_seq, chunk_id, ent->last_chunk_seq + 1);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
chunk = DatumGetPointer(fastgetattr(&newtup->tuple, 3, desc, &isnull));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Assert(!isnull);
|
|
|
|
|
|
|
|
/* calculate size so we can allocate the right size at once later */
|
|
|
|
if (!VARATT_IS_EXTENDED(chunk))
|
|
|
|
chunksize = VARSIZE(chunk) - VARHDRSZ;
|
|
|
|
else if (VARATT_IS_SHORT(chunk))
|
|
|
|
/* could happen due to heap_form_tuple doing its thing */
|
|
|
|
chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
|
|
|
|
else
|
|
|
|
elog(ERROR, "unexpected type of toast chunk");
|
|
|
|
|
|
|
|
ent->size += chunksize;
|
|
|
|
ent->last_chunk_seq = chunk_seq;
|
|
|
|
ent->num_chunks++;
|
|
|
|
dlist_push_tail(&ent->chunks, &change->node);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Rejigger change->newtuple to point to in-memory toast tuples instead to
|
|
|
|
* on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
|
|
|
|
*
|
|
|
|
* We cannot replace unchanged toast tuples though, so those will still point
|
|
|
|
* to on-disk toast data.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
|
|
|
|
Relation relation, ReorderBufferChange *change)
|
|
|
|
{
|
|
|
|
TupleDesc desc;
|
|
|
|
int natt;
|
|
|
|
Datum *attrs;
|
|
|
|
bool *isnull;
|
|
|
|
bool *free;
|
2014-03-07 23:02:48 +01:00
|
|
|
HeapTuple tmphtup;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
Relation toast_rel;
|
|
|
|
TupleDesc toast_desc;
|
|
|
|
MemoryContext oldcontext;
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferTupleBuf *newtup;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* no toast tuples changed */
|
|
|
|
if (txn->toast_hash == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
oldcontext = MemoryContextSwitchTo(rb->context);
|
|
|
|
|
|
|
|
/* we should only have toast tuples in an INSERT or UPDATE */
|
2014-03-07 23:02:48 +01:00
|
|
|
Assert(change->data.tp.newtuple);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
desc = RelationGetDescr(relation);
|
|
|
|
|
|
|
|
toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
|
|
|
|
toast_desc = RelationGetDescr(toast_rel);
|
|
|
|
|
|
|
|
/* should we allocate from stack instead? */
|
|
|
|
attrs = palloc0(sizeof(Datum) * desc->natts);
|
|
|
|
isnull = palloc0(sizeof(bool) * desc->natts);
|
|
|
|
free = palloc0(sizeof(bool) * desc->natts);
|
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
newtup = change->data.tp.newtuple;
|
|
|
|
|
|
|
|
heap_deform_tuple(&newtup->tuple, desc, attrs, isnull);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
for (natt = 0; natt < desc->natts; natt++)
|
|
|
|
{
|
2017-08-20 20:19:07 +02:00
|
|
|
Form_pg_attribute attr = TupleDescAttr(desc, natt);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ReorderBufferToastEnt *ent;
|
|
|
|
struct varlena *varlena;
|
|
|
|
|
|
|
|
/* va_rawsize is the size of the original datum -- including header */
|
|
|
|
struct varatt_external toast_pointer;
|
|
|
|
struct varatt_indirect redirect_pointer;
|
|
|
|
struct varlena *new_datum = NULL;
|
|
|
|
struct varlena *reconstructed;
|
|
|
|
dlist_iter it;
|
|
|
|
Size data_done = 0;
|
|
|
|
|
|
|
|
/* system columns aren't toasted */
|
|
|
|
if (attr->attnum < 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (attr->attisdropped)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* not a varlena datatype */
|
|
|
|
if (attr->attlen != -1)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* no data */
|
|
|
|
if (isnull[natt])
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* ok, we know we have a toast datum */
|
|
|
|
varlena = (struct varlena *) DatumGetPointer(attrs[natt]);
|
|
|
|
|
|
|
|
/* no need to do anything if the tuple isn't external */
|
|
|
|
if (!VARATT_IS_EXTERNAL(varlena))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
VARATT_EXTERNAL_GET_POINTER(toast_pointer, varlena);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check whether the toast tuple changed, replace if so.
|
|
|
|
*/
|
|
|
|
ent = (ReorderBufferToastEnt *)
|
|
|
|
hash_search(txn->toast_hash,
|
|
|
|
(void *) &toast_pointer.va_valueid,
|
|
|
|
HASH_FIND,
|
|
|
|
NULL);
|
|
|
|
if (ent == NULL)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
new_datum =
|
|
|
|
(struct varlena *) palloc0(INDIRECT_POINTER_SIZE);
|
|
|
|
|
|
|
|
free[natt] = true;
|
|
|
|
|
|
|
|
reconstructed = palloc0(toast_pointer.va_rawsize);
|
|
|
|
|
|
|
|
ent->reconstructed = reconstructed;
|
|
|
|
|
|
|
|
/* stitch toast tuple back together from its parts */
|
|
|
|
dlist_foreach(it, &ent->chunks)
|
|
|
|
{
|
|
|
|
bool isnull;
|
2014-03-07 23:02:48 +01:00
|
|
|
ReorderBufferChange *cchange;
|
|
|
|
ReorderBufferTupleBuf *ctup;
|
|
|
|
Pointer chunk;
|
|
|
|
|
|
|
|
cchange = dlist_container(ReorderBufferChange, node, it.cur);
|
|
|
|
ctup = cchange->data.tp.newtuple;
|
|
|
|
chunk = DatumGetPointer(
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
fastgetattr(&ctup->tuple, 3, toast_desc, &isnull));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
Assert(!isnull);
|
|
|
|
Assert(!VARATT_IS_EXTERNAL(chunk));
|
|
|
|
Assert(!VARATT_IS_SHORT(chunk));
|
|
|
|
|
|
|
|
memcpy(VARDATA(reconstructed) + data_done,
|
|
|
|
VARDATA(chunk),
|
|
|
|
VARSIZE(chunk) - VARHDRSZ);
|
|
|
|
data_done += VARSIZE(chunk) - VARHDRSZ;
|
|
|
|
}
|
|
|
|
Assert(data_done == toast_pointer.va_extsize);
|
|
|
|
|
|
|
|
/* make sure its marked as compressed or not */
|
|
|
|
if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
|
|
|
|
SET_VARSIZE_COMPRESSED(reconstructed, data_done + VARHDRSZ);
|
|
|
|
else
|
|
|
|
SET_VARSIZE(reconstructed, data_done + VARHDRSZ);
|
|
|
|
|
|
|
|
memset(&redirect_pointer, 0, sizeof(redirect_pointer));
|
|
|
|
redirect_pointer.pointer = reconstructed;
|
|
|
|
|
|
|
|
SET_VARTAG_EXTERNAL(new_datum, VARTAG_INDIRECT);
|
|
|
|
memcpy(VARDATA_EXTERNAL(new_datum), &redirect_pointer,
|
|
|
|
sizeof(redirect_pointer));
|
|
|
|
|
|
|
|
attrs[natt] = PointerGetDatum(new_datum);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build tuple in separate memory & copy tuple back into the tuplebuf
|
|
|
|
* passed to the output plugin. We can't directly heap_fill_tuple() into
|
|
|
|
* the tuplebuf because attrs[] will point back into the current content.
|
|
|
|
*/
|
2014-03-07 23:02:48 +01:00
|
|
|
tmphtup = heap_form_tuple(desc, attrs, isnull);
|
|
|
|
Assert(newtup->tuple.t_len <= MaxHeapTupleSize);
|
logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-06 03:02:20 +01:00
|
|
|
Assert(ReorderBufferTupleBufData(newtup) == newtup->tuple.t_data);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
2014-03-07 23:02:48 +01:00
|
|
|
memcpy(newtup->tuple.t_data, tmphtup->t_data, tmphtup->t_len);
|
|
|
|
newtup->tuple.t_len = tmphtup->t_len;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* free resources we won't further need, more persistent stuff will be
|
|
|
|
* free'd in ReorderBufferToastReset().
|
|
|
|
*/
|
|
|
|
RelationClose(toast_rel);
|
2014-03-07 23:02:48 +01:00
|
|
|
pfree(tmphtup);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
for (natt = 0; natt < desc->natts; natt++)
|
|
|
|
{
|
|
|
|
if (free[natt])
|
|
|
|
pfree(DatumGetPointer(attrs[natt]));
|
|
|
|
}
|
|
|
|
pfree(attrs);
|
|
|
|
pfree(free);
|
|
|
|
pfree(isnull);
|
|
|
|
|
|
|
|
MemoryContextSwitchTo(oldcontext);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free all resources allocated for toast reconstruction.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
|
|
|
|
{
|
|
|
|
HASH_SEQ_STATUS hstat;
|
|
|
|
ReorderBufferToastEnt *ent;
|
|
|
|
|
|
|
|
if (txn->toast_hash == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* sequentially walk over the hash and free everything */
|
|
|
|
hash_seq_init(&hstat, txn->toast_hash);
|
|
|
|
while ((ent = (ReorderBufferToastEnt *) hash_seq_search(&hstat)) != NULL)
|
|
|
|
{
|
|
|
|
dlist_mutable_iter it;
|
|
|
|
|
|
|
|
if (ent->reconstructed != NULL)
|
|
|
|
pfree(ent->reconstructed);
|
|
|
|
|
|
|
|
dlist_foreach_modify(it, &ent->chunks)
|
|
|
|
{
|
|
|
|
ReorderBufferChange *change =
|
|
|
|
dlist_container(ReorderBufferChange, node, it.cur);
|
|
|
|
|
|
|
|
dlist_delete(&change->node);
|
|
|
|
ReorderBufferReturnChange(rb, change);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
hash_destroy(txn->toast_hash);
|
|
|
|
txn->toast_hash = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* ---------------------------------------
|
|
|
|
* Visibility support for logical decoding
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* Lookup actual cmin/cmax values when using decoding snapshot. We can't
|
|
|
|
* always rely on stored cmin/cmax values because of two scenarios:
|
|
|
|
*
|
|
|
|
* * A tuple got changed multiple times during a single transaction and thus
|
|
|
|
* has got a combocid. Combocid's are only valid for the duration of a
|
|
|
|
* single transaction.
|
|
|
|
* * A tuple with a cmin but no cmax (and thus no combocid) got
|
|
|
|
* deleted/updated in another transaction than the one which created it
|
|
|
|
* which we are looking at right now. As only one of cmin, cmax or combocid
|
2014-03-05 19:00:22 +01:00
|
|
|
* is actually stored in the heap we don't have access to the value we
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
* need anymore.
|
|
|
|
*
|
|
|
|
* To resolve those problems we have a per-transaction hash of (cmin,
|
|
|
|
* cmax) tuples keyed by (relfilenode, ctid) which contains the actual
|
|
|
|
* (cmin, cmax) values. That also takes care of combocids by simply
|
|
|
|
* not caring about them at all. As we have the real cmin/cmax values
|
|
|
|
* combocids aren't interesting.
|
|
|
|
*
|
|
|
|
* As we only care about catalog tuples here the overhead of this
|
|
|
|
* hashtable should be acceptable.
|
|
|
|
*
|
|
|
|
* Heap rewrites complicate this a bit, check rewriteheap.c for
|
|
|
|
* details.
|
|
|
|
* -------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* struct for qsort()ing mapping files by lsn somewhat efficiently */
|
|
|
|
typedef struct RewriteMappingFile
|
|
|
|
{
|
|
|
|
XLogRecPtr lsn;
|
|
|
|
char fname[MAXPGPATH];
|
|
|
|
} RewriteMappingFile;
|
|
|
|
|
|
|
|
#if NOT_USED
|
|
|
|
static void
|
|
|
|
DisplayMapping(HTAB *tuplecid_data)
|
|
|
|
{
|
|
|
|
HASH_SEQ_STATUS hstat;
|
|
|
|
ReorderBufferTupleCidEnt *ent;
|
|
|
|
|
|
|
|
hash_seq_init(&hstat, tuplecid_data);
|
|
|
|
while ((ent = (ReorderBufferTupleCidEnt *) hash_seq_search(&hstat)) != NULL)
|
|
|
|
{
|
|
|
|
elog(DEBUG3, "mapping: node: %u/%u/%u tid: %u/%u cmin: %u, cmax: %u",
|
|
|
|
ent->key.relnode.dbNode,
|
|
|
|
ent->key.relnode.spcNode,
|
|
|
|
ent->key.relnode.relNode,
|
2017-03-28 17:52:55 +02:00
|
|
|
ItemPointerGetBlockNumber(&ent->key.tid),
|
|
|
|
ItemPointerGetOffsetNumber(&ent->key.tid),
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ent->cmin,
|
|
|
|
ent->cmax
|
|
|
|
);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply a single mapping file to tuplecid_data.
|
|
|
|
*
|
|
|
|
* The mapping file has to have been verified to be a) committed b) for our
|
|
|
|
* transaction c) applied in LSN order.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
|
|
|
|
{
|
|
|
|
char path[MAXPGPATH];
|
|
|
|
int fd;
|
|
|
|
int readBytes;
|
|
|
|
LogicalRewriteMappingData map;
|
|
|
|
|
2014-07-02 21:07:47 +02:00
|
|
|
sprintf(path, "pg_logical/mappings/%s", fname);
|
2017-09-23 15:49:22 +02:00
|
|
|
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (fd < 0)
|
|
|
|
ereport(ERROR,
|
2016-08-19 19:38:55 +02:00
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not open file \"%s\": %m", path)));
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
while (true)
|
|
|
|
{
|
|
|
|
ReorderBufferTupleCidKey key;
|
|
|
|
ReorderBufferTupleCidEnt *ent;
|
|
|
|
ReorderBufferTupleCidEnt *new_ent;
|
2014-05-06 18:12:18 +02:00
|
|
|
bool found;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* be careful about padding */
|
|
|
|
memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
|
|
|
|
|
|
|
|
/* read all mappings till the end of the file */
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_start(WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
readBytes = read(fd, &map, sizeof(LogicalRewriteMappingData));
|
Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add7064ca5db3cd884302dfc3268d884e and
6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O. Add wait events for those operations, too.
Rushabh Lathia, with further hacking by me. Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.
Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 12:43:01 +01:00
|
|
|
pgstat_report_wait_end();
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (readBytes < 0)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
|
|
|
errmsg("could not read file \"%s\": %m",
|
|
|
|
path)));
|
2014-05-06 18:12:18 +02:00
|
|
|
else if (readBytes == 0) /* EOF */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
break;
|
|
|
|
else if (readBytes != sizeof(LogicalRewriteMappingData))
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode_for_file_access(),
|
2014-05-01 00:16:53 +02:00
|
|
|
errmsg("could not read from file \"%s\": read %d instead of %d bytes",
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
path, readBytes,
|
|
|
|
(int32) sizeof(LogicalRewriteMappingData))));
|
|
|
|
|
|
|
|
key.relnode = map.old_node;
|
|
|
|
ItemPointerCopy(&map.old_tid,
|
|
|
|
&key.tid);
|
|
|
|
|
|
|
|
|
|
|
|
ent = (ReorderBufferTupleCidEnt *)
|
|
|
|
hash_search(tuplecid_data,
|
|
|
|
(void *) &key,
|
|
|
|
HASH_FIND,
|
|
|
|
NULL);
|
|
|
|
|
|
|
|
/* no existing mapping, no need to update */
|
|
|
|
if (!ent)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
key.relnode = map.new_node;
|
|
|
|
ItemPointerCopy(&map.new_tid,
|
|
|
|
&key.tid);
|
|
|
|
|
|
|
|
new_ent = (ReorderBufferTupleCidEnt *)
|
|
|
|
hash_search(tuplecid_data,
|
|
|
|
(void *) &key,
|
|
|
|
HASH_ENTER,
|
|
|
|
&found);
|
|
|
|
|
|
|
|
if (found)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Make sure the existing mapping makes sense. We sometime update
|
|
|
|
* old records that did not yet have a cmax (e.g. pg_class' own
|
|
|
|
* entry while rewriting it) during rewrites, so allow that.
|
|
|
|
*/
|
|
|
|
Assert(ent->cmin == InvalidCommandId || ent->cmin == new_ent->cmin);
|
|
|
|
Assert(ent->cmax == InvalidCommandId || ent->cmax == new_ent->cmax);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* update mapping */
|
|
|
|
new_ent->cmin = ent->cmin;
|
|
|
|
new_ent->cmax = ent->cmax;
|
|
|
|
new_ent->combocid = ent->combocid;
|
|
|
|
}
|
|
|
|
}
|
2018-08-16 16:49:10 +02:00
|
|
|
|
|
|
|
CloseTransientFile(fd);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
2018-08-16 16:49:10 +02:00
|
|
|
* Check whether the TransactionOid 'xid' is in the pre-sorted array 'xip'.
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
|
|
|
|
{
|
|
|
|
return bsearch(&xid, xip, num,
|
|
|
|
sizeof(TransactionId), xidComparator) != NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* qsort() comparator for sorting RewriteMappingFiles in LSN order.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
file_sort_by_lsn(const void *a_p, const void *b_p)
|
|
|
|
{
|
2014-05-06 18:12:18 +02:00
|
|
|
RewriteMappingFile *a = *(RewriteMappingFile **) a_p;
|
|
|
|
RewriteMappingFile *b = *(RewriteMappingFile **) b_p;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
if (a->lsn < b->lsn)
|
|
|
|
return -1;
|
|
|
|
else if (a->lsn > b->lsn)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply any existing logical remapping files if there are any targeted at our
|
|
|
|
* transaction for relid.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
UpdateLogicalMappings(HTAB *tuplecid_data, Oid relid, Snapshot snapshot)
|
|
|
|
{
|
|
|
|
DIR *mapping_dir;
|
|
|
|
struct dirent *mapping_de;
|
|
|
|
List *files = NIL;
|
|
|
|
ListCell *file;
|
|
|
|
RewriteMappingFile **files_a;
|
|
|
|
size_t off;
|
|
|
|
Oid dboid = IsSharedRelation(relid) ? InvalidOid : MyDatabaseId;
|
|
|
|
|
2014-07-02 21:07:47 +02:00
|
|
|
mapping_dir = AllocateDir("pg_logical/mappings");
|
|
|
|
while ((mapping_de = ReadDir(mapping_dir, "pg_logical/mappings")) != NULL)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
2014-05-06 18:12:18 +02:00
|
|
|
Oid f_dboid;
|
|
|
|
Oid f_relid;
|
|
|
|
TransactionId f_mapped_xid;
|
|
|
|
TransactionId f_create_xid;
|
|
|
|
XLogRecPtr f_lsn;
|
|
|
|
uint32 f_hi,
|
|
|
|
f_lo;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
RewriteMappingFile *f;
|
|
|
|
|
|
|
|
if (strcmp(mapping_de->d_name, ".") == 0 ||
|
|
|
|
strcmp(mapping_de->d_name, "..") == 0)
|
|
|
|
continue;
|
|
|
|
|
2014-05-06 18:12:18 +02:00
|
|
|
/* Ignore files that aren't ours */
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
if (strncmp(mapping_de->d_name, "map-", 4) != 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
|
|
|
|
&f_dboid, &f_relid, &f_hi, &f_lo,
|
|
|
|
&f_mapped_xid, &f_create_xid) != 6)
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
f_lsn = ((uint64) f_hi) << 32 | f_lo;
|
|
|
|
|
|
|
|
/* mapping for another database */
|
|
|
|
if (f_dboid != dboid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* mapping for another relation */
|
|
|
|
if (f_relid != relid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* did the creating transaction abort? */
|
|
|
|
if (!TransactionIdDidCommit(f_create_xid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* not for our transaction */
|
|
|
|
if (!TransactionIdInArray(f_mapped_xid, snapshot->subxip, snapshot->subxcnt))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* ok, relevant, queue for apply */
|
|
|
|
f = palloc(sizeof(RewriteMappingFile));
|
|
|
|
f->lsn = f_lsn;
|
|
|
|
strcpy(f->fname, mapping_de->d_name);
|
|
|
|
files = lappend(files, f);
|
|
|
|
}
|
|
|
|
FreeDir(mapping_dir);
|
|
|
|
|
|
|
|
/* build array we can easily sort */
|
|
|
|
files_a = palloc(list_length(files) * sizeof(RewriteMappingFile *));
|
|
|
|
off = 0;
|
|
|
|
foreach(file, files)
|
|
|
|
{
|
|
|
|
files_a[off++] = lfirst(file);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* sort files so we apply them in LSN order */
|
|
|
|
qsort(files_a, list_length(files), sizeof(RewriteMappingFile *),
|
|
|
|
file_sort_by_lsn);
|
|
|
|
|
2014-05-06 18:12:18 +02:00
|
|
|
for (off = 0; off < list_length(files); off++)
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
{
|
|
|
|
RewriteMappingFile *f = files_a[off];
|
2014-05-06 18:12:18 +02:00
|
|
|
|
2014-05-01 00:16:53 +02:00
|
|
|
elog(DEBUG1, "applying mapping: \"%s\" in %u", f->fname,
|
2014-05-06 18:12:18 +02:00
|
|
|
snapshot->subxip[0]);
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
ApplyLogicalMappingFile(tuplecid_data, relid, f->fname);
|
|
|
|
pfree(f);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup cmin/cmax of a tuple, during logical decoding where we can't rely on
|
|
|
|
* combocids.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
|
|
|
|
Snapshot snapshot,
|
|
|
|
HeapTuple htup, Buffer buffer,
|
|
|
|
CommandId *cmin, CommandId *cmax)
|
|
|
|
{
|
|
|
|
ReorderBufferTupleCidKey key;
|
|
|
|
ReorderBufferTupleCidEnt *ent;
|
|
|
|
ForkNumber forkno;
|
|
|
|
BlockNumber blockno;
|
2014-05-06 18:12:18 +02:00
|
|
|
bool updated_mapping = false;
|
Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 22:32:18 +01:00
|
|
|
|
|
|
|
/* be careful about padding */
|
|
|
|
memset(&key, 0, sizeof(key));
|
|
|
|
|
|
|
|
Assert(!BufferIsLocal(buffer));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* get relfilenode from the buffer, no convenient way to access it other
|
|
|
|
* than that.
|
|
|
|
*/
|
|
|
|
BufferGetTag(buffer, &key.relnode, &forkno, &blockno);
|
|
|
|
|
|
|
|
/* tuples can only be in the main fork */
|
|
|
|
Assert(forkno == MAIN_FORKNUM);
|
|
|
|
Assert(blockno == ItemPointerGetBlockNumber(&htup->t_self));
|
|
|
|
|
|
|
|
ItemPointerCopy(&htup->t_self,
|
|
|
|
&key.tid);
|
|
|
|
|
|
|
|
restart:
|
|
|
|
ent = (ReorderBufferTupleCidEnt *)
|
|
|
|
hash_search(tuplecid_data,
|
|
|
|
(void *) &key,
|
|
|
|
HASH_FIND,
|
|
|
|
NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* failed to find a mapping, check whether the table was rewritten and
|
|
|
|
* apply mapping if so, but only do that once - there can be no new
|
|
|
|
* mappings while we are in here since we have to hold a lock on the
|
|
|
|
* relation.
|
|
|
|
*/
|
|
|
|
if (ent == NULL && !updated_mapping)
|
|
|
|
{
|
|
|
|
UpdateLogicalMappings(tuplecid_data, htup->t_tableOid, snapshot);
|
|
|
|
/* now check but don't update for a mapping again */
|
|
|
|
updated_mapping = true;
|
|
|
|
goto restart;
|
|
|
|
}
|
|
|
|
else if (ent == NULL)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (cmin)
|
|
|
|
*cmin = ent->cmin;
|
|
|
|
if (cmax)
|
|
|
|
*cmax = ent->cmax;
|
|
|
|
return true;
|
|
|
|
}
|