Commit Graph

137 Commits

Author SHA1 Message Date
Tom Lane 9e3755ecb2 Remove useless duplicate inclusions of system header files.
c.h #includes a number of core libc header files, such as <stdio.h>.
There's no point in re-including these after having read postgres.h,
postgres_fe.h, or c.h; so remove code that did so.

While at it, also fix some places that were ignoring our standard pattern
of "include postgres[_fe].h, then system header files, then other Postgres
header files".  While there's not any great magic in doing it that way
rather than system headers last, it's silly to have just a few files
deviating from the general pattern.  (But I didn't attempt to enforce this
globally, only in files I was touching anyway.)

I'd be the first to say that this is mostly compulsive neatnik-ism,
but over time it might save enough compile cycles to be useful.
2017-02-25 16:12:55 -05:00
Tom Lane c29aff959d Consistently declare timestamp variables as TimestampTz.
Twiddle the replication-related code so that its timestamp variables
are declared TimestampTz, rather than the uninformative "int64" that
was previously used for meant-to-be-always-integer timestamps.
This resolves the int64-vs-TimestampTz declaration inconsistencies
introduced by commit 7c030783a, though in the opposite direction to
what was originally suggested.

This required including datatype/timestamp.h in a couple more places
than before.  I decided it would be a good idea to slim down that
header by not having it pull in <float.h> etc, as those headers are
no longer at all relevant to its purpose.  Unsurprisingly, a small number
of .c files turn out to have been depending on those inclusions, so add
them back in the .c files as needed.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
Discussion: https://postgr.es/m/27694.1487456324@sss.pgh.pa.us
2017-02-23 15:57:08 -05:00
Peter Eisentraut c3368f9173 Fix logical replication with different encodings
reported by Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>; partial
patch by Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
2017-02-23 11:29:12 -05:00
Fujii Masao d36537008a Remove confusing comment about unsupported feature.
The initial table synchronization feature has not been supported yet,
but there was the confusing header comment about it in logical/worker.c.
2017-02-22 02:49:42 +09:00
Robert Haas 85c11324ca Rename user-facing tools with "xlog" in the name to say "wal".
This means pg_receivexlog because pg_receivewal, pg_resetxlog
becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
2017-02-09 16:23:46 -05:00
Heikki Linnakangas d02d985349 Fix typo in variable name.
Masahiko Sawada
2017-02-06 11:45:08 +02:00
Heikki Linnakangas 181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Tom Lane ab02896510 Provide CatalogTupleDelete() as a wrapper around simple_heap_delete().
This extends the work done in commit 2f5c9d9c9 to provide a more nearly
complete abstraction layer hiding the details of index updating for catalog
changes.  That commit only invented abstractions for catalog inserts and
updates, leaving nearby code for catalog deletes still calling the
heap-level routines directly.  That seems rather ugly from here, and it
does little to help if we ever want to shift to a storage system in which
indexing work is needed at delete time.

Hence, create a wrapper function CatalogTupleDelete(), and replace calls
of simple_heap_delete() on catalog tuples with it.  There are now very
few direct calls of [simple_]heap_delete remaining in the tree.

Discussion: https://postgr.es/m/462.1485902736@sss.pgh.pa.us
2017-02-01 16:13:30 -05:00
Alvaro Herrera 2f5c9d9c9c Tweak catalog indexing abstraction for upcoming WARM
Split the existing CatalogUpdateIndexes into two different routines,
CatalogTupleInsert and CatalogTupleUpdate, which do both the heap
insert/update plus the index update.  This removes over 300 lines of
boilerplate code all over src/backend/catalog/ and src/backend/commands.
The resulting code is much more pleasing to the eye.

Also, by encapsulating what happens in detail during an UPDATE, this
facilitates the upcoming WARM patch, which is going to add a few more
lines to the update case making the boilerplate even more boring.

The original CatalogUpdateIndexes is removed; there was only one use
left, and since it's just three lines, we can as well expand it in place
there.  We could keep it, but WARM is going to break all the UPDATE
out-of-core callsites anyway, so there seems to be no benefit in doing
so.

Author: Pavan Deolasee
Discussion: https://www.postgr.es/m/CABOikdOcFYSZ4vA2gYfs=M2cdXzXX4qGHeEiW3fu9PCfkHLa2A@mail.gmail.com
2017-01-31 18:42:24 -03:00
Peter Eisentraut 3d9e73ea5f Update copyright years in some recently added files 2017-01-25 12:32:05 -05:00
Fujii Masao dc82f5a640 Be sure to release the lock on failure to launch logical replication worker.
Petr Jelinek
2017-01-24 12:41:09 +09:00
Peter Eisentraut 70c56a014e Fix NULL pointer access in logical replication workers
From: Petr Jelinek <pjmodos@pjmodos.net>
2017-01-23 12:33:27 -05:00
Peter Eisentraut f21a563d25 Move some things from builtins.h to new header files
This avoids that builtins.h has to include additional header files.
2017-01-20 20:29:53 -05:00
Peter Eisentraut 665d1fad99 Logical replication
- Add PUBLICATION catalogs and DDL
- Add SUBSCRIPTION catalog and DDL
- Define logical replication protocol and output plugin
- Add logical replication workers

From: Petr Jelinek <petr@2ndquadrant.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Erik Rijkers <er@xs4all.nl>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
2017-01-20 09:04:49 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Andres Freund 6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Robert Haas 3761fe3c20 Simplify LWLock tranche machinery by removing array_base/array_stride.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche.  So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached.  For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset.  This is arguably more useful,
and certainly a lot cheaper.  Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.

The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".

The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query.  There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously.  So if we didn't get rid of this mechanism we'd
have to make it even more complicated.  By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.

Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
2016-12-16 11:29:23 -05:00
Heikki Linnakangas 56f39009c5 Fix typos in comments.
Vinayak Pokale
2016-10-26 11:12:31 +03:00
Andres Freund 61633f7904 Correct logical decoding restore behaviour for subtransactions.
Before initializing iteration over a subtransaction's changes, the last
few changes were not spilled to disk. That's correct if the transaction
didn't spill to disk, but otherwise... This bug can lead to missed or
misorderd subtransaction contents when they were spilled to disk.

Move spilling of the remaining in-memory changes to
ReorderBufferIterTXNInit(), where it can easily be applied to the top
transaction and, if present, subtransactions.

Since this code had too many bugs already, noticeably increase test
coverage.

Fixes: #14319
Reported-By: Huan Ruan
Discussion: <20160909012610.20024.58169@wrigleys.postgresql.org>
Backport: 9,4-, where logical decoding was added
2016-10-03 22:11:36 -07:00
Robert Haas 308985b0b4 Fix dangling pointer problem in ReorderBufferSerializeChange.
Commit 3fe3511d05 introduced a new
case into this function, but neglected to ensure that the "ondisk"
pointer got updated after a possible reallocation as the code does
in other cases.

Stas Kelvich, per diagnosis by Konstantin Knizhnik.
2016-09-28 11:19:46 -04:00
Peter Eisentraut 49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Simon Riggs d851bef2d6 Dirty replication slots when using sql interface
When pg_logical_slot_get_changes(...) sets confirmed_flush_lsn to the point at
which replay stopped, it doesn't dirty the replication slot.  So if the replay
didn't cause restart_lsn or catalog_xmin to change as well, this change will
not get written out to disk. Even on a clean shutdown.

If Pg crashes or restarts, a subsequent pg_logical_slot_get_changes(...) call
will see the same changes already replayed since it uses the slot's
confirmed_flush_lsn as the start point for fetching changes. The caller can't
specify a start LSN when using the SQL interface.

Mark the slot as dirty after reading changes using the SQL interface so that
users won't see repeated changes after a clean shutdown. Repeated changes still
occur when using the walsender interface or after an unclean shutdown.

Craig Ringer
2016-09-05 09:44:38 +01:00
Tom Lane b6182081be Remove duplicate code from ReorderBufferCleanupTXN().
Andres is apparently the only hacker who thinks this code is better as-is.
I (tgl) follow some of his logic, but the fact that it's setting off
warnings from static code analyzers seems like a sufficient reason to
put the complexity into a comment rather than the code.

Aleksander Alekseev

Discussion: <20160404190345.54d84ee8@fujitsu>
2016-09-04 20:49:44 -04:00
Tom Lane ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Alvaro Herrera 6f79ae7fe5 reorderbuffer: preserve errno while reporting error
Clobbering errno during cleanup after an error is an oft-repeated, easy
to make mistake.  Deal with it here as everywhere else, by saving it
aside and restoring after cleanup, before ereport'ing.

In passing, add a missing errcode declaration in another ereport() call
in the same file, which I noticed while skimming the file looking for
similar problems.

Backpatch to 9.4, where this code was introduced.
2016-08-19 14:38:55 -03:00
Tom Lane bcbecbce2f Don't propagate a null subtransaction snapshot up to parent transaction.
This oversight could cause logical decoding to fail to decode an outer
transaction containing changes, if a subtransaction had an XID but no
actual changes.  Per bug #14279 from Marko Tiikkaja.  Patch by Marko
based on analysis by Andrew Gierth.

Discussion: <20160804191757.1430.39011@wrigleys.postgresql.org>
2016-08-07 13:15:55 -04:00
Magnus Hagander 55d57359f2 Fix typos in comments and debug message
Antonin Houska
2016-07-18 18:46:57 +02:00
Tom Lane f8c58554db Fix typo in ReorderBufferIterTXNInit().
This looks like it would cause changes from subtransactions to be missed
by the iterator being constructed, if those changes had been spilled to
disk previously.  This implies that large subtransactions might be lost
(in whole or in part) by logical replication.  Found and fixed by
Petru-Florin Mihancea, per bug #14208.

Report: <20160622144830.5791.22512@wrigleys.postgresql.org>
2016-06-30 12:37:02 -04:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Greg Stark e1623c3959 Fix various common mispellings.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.

Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
2016-06-03 16:08:45 +01:00
Alvaro Herrera c1543a81a7 Revert timeline following in replication slots
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.
2016-05-04 17:32:22 -03:00
Alvaro Herrera 234a266066 Fix code comments regarding logical decoding
Back in 3b02ea4f07 I added some comments in various places to explain
how logical decoding and other things worked.  Not all of the changes
were welcome, because they were misleading or wrong.  This changes them
a little bit to make them more accurate.

Some other comments are also changed to be more accurate.  Also, fix a
bunch of typos.

Author: Álvaro Herrera, Craig Ringer

Andres Freund reviewed some parts of this.
2016-05-02 16:04:29 -03:00
Andres Freund c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Andres Freund 7b16781228 Fix trivial typo. 2016-04-14 19:25:16 -07:00
Tom Lane 6a3d3965d6 Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't.  We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe.  For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?

The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.

Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.

Discussion: <20160413094117.GC21485@msg.credativ.de>
2016-04-14 19:42:21 -04:00
Tom Lane 994f112573 Adjust datatype of ReplicationState.acquired_by.
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int".  This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms.  There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int.  If we want to cope with PIDs that are wider
than int, this is not the place to start.

Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?

Per buildfarm.  Back-patch to 9.5 which has same issue.
2016-04-14 12:18:09 -04:00
Andres Freund be65eddd80 Add required database and origin filtering for logical messages.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.

Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.

Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.

Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
2016-04-13 17:38:54 -07:00
Simon Riggs 3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Teodor Sigaev 65578341af Add Generic WAL interface
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
	minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Alvaro Herrera 3501f71c21 Fix broken variable declaration
Author: Konstantin Knizhnik
2016-03-30 23:39:15 -03:00
Alvaro Herrera 24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera 3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Andres Freund 1d4a0ab19a Avoid unlikely data-loss scenarios due to rename() without fsync.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.

Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability.  The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Andres Freund 606e0f9841 Introduce durable_rename() and durable_link_or_rename().
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes; especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory.  This sequence is not
generally adhered to currently; which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.

Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason extend several copies of the
same logic, so centralize the link() callers.

This commit does not yet make use of the new functions; they're used in
a followup commit.

Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Andres Freund b63bea5fd3 Further improvements to c8f621c43.
Coverity and inspection for the issue addressed in fd45d16f found some
questionable code.

Specifically coverity noticed that the wrong length was added in
ReorderBufferSerializeChange() - without immediate negative consequences
as the variable isn't used afterwards.  During code-review and testing I
noticed that a bit of space was wasted when allocating tuple bufs in
several places.  Thirdly, the debug memset()s in
ReorderBufferGetTupleBuf() reduce the error checking valgrind can do.

Backpatch: 9.4, like c8f621c43.
2016-03-07 14:24:03 -08:00
Andres Freund fd45d16f62 Fix wrong allocation size in c8f621c43.
In c8f621c43 I forgot to account for MAXALIGN when allocating a new
tuplebuf in ReorderBufferGetTupleBuf(). That happens to currently not
cause active problems on a number of platforms because the affected
pointer is already aligned, but others, like ppc and hppa, trigger this
in the regression test, due to a debug memset clearing memory.

Fix that.

Backpatch: 9.4, like the previous commit.
2016-03-06 16:27:20 -08:00
Andres Freund c8f621c43a logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds.  Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.

Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.

This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size.  As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.

Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-05 18:02:20 -08:00
Andres Freund 0bda14d54c logical decoding: old/newtuple in spooled UPDATE changes was switched around.
Somehow I managed to flip the order of restoring old & new tuples when
de-spooling a change in a large transaction from disk. This happens to
only take effect when a change is spooled to disk which has old/new
versions of the tuple. That only is the case for UPDATEs where he
primary key changed or where replica identity is changed to FULL.

The tests didn't catch this because either spooled updates, or updates
that changed primary keys, were tested; not both at the same time.

Found while adding tests for the following commit.

Backpatch: 9.4, where logical decoding was added
2016-03-05 18:02:20 -08:00
Andres Freund d9e903f3cb logical decoding: Tell reorderbuffer about all xids.
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that's efficiently possible upper-level
xids are forced to be logged before nested subtransaction xids.  That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.

This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".

It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.

Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
2016-03-05 18:02:20 -08:00