623 Commits

Author SHA1 Message Date
Alvaro Herrera 234a266066 Fix code comments regarding logical decoding
Back in 3b02ea4f07 I added some comments in various places to explain
how logical decoding and other things worked.  Not all of the changes
were welcome, because they were misleading or wrong.  This changes them
a little bit to make them more accurate.

Some other comments are also changed to be more accurate.  Also, fix a
bunch of typos.

Author: Álvaro Herrera, Craig Ringer

Andres Freund reviewed some parts of this.
2016-05-02 16:04:29 -03:00
Tom Lane 17d5db352c Remove warning about num_sync being too large in synchronous_standby_names.
If we're not going to reject such setups entirely, throwing a WARNING in
check_synchronous_standby_names() is unhelpful, because it will cause the
warning to be logged again every time the postmaster receives SIGHUP.
Per discussion, just remove the warning.

In passing, improve the documentation for synchronous_commit, which had not
gotten the word that now there can be more than one synchronous standby.
2016-04-30 10:54:45 -04:00
Andres Freund 59455018a8 Remember asking for feedback during walsender shutdown.
Since 5a991ef8 we're explicitly asking for feedback from the receiving
side when shutting down walsender, if there's data that has not yet been
replicated.

Unfortunately we didn't remember (i.e. set waiting_for_ping_response to
true) having asked for feedback, leading to scenarios in which replies
were requested at a high frequency.

I can't reproduce this problem on my laptop; I think that's because the
problem requires a significant TCP window to manifest due to the
!pq_is_send_pending() condition. But since this clearly is a bug, let's
fix it.  There's quite possibly more wrong than just this though.

While fiddling with WalSndDone(), I rewrote a hard to understand comment
about looking at the flush vs. the write position.

Reported-By: Nick Cleaton, Magnus Hagander
Author: Nick Cleaton
Discussion: CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.com
    CABUevExsjROqDcD0A2rnJ6HK6FuKGyewJr3PL12pw85BHFGS2Q@mail.gmail.com
Backpatch: 9.4, where 5a991ef8 introduced the use of feedback messages
    during shutdown.
2016-04-28 22:11:18 -07:00
Tom Lane 4c804fbdfb Clean up parsing of synchronous_standby_names GUC variable.
Commit 989be0810d added a flex/bison lexer/parser to interpret
synchronous_standby_names.  It was done in a pretty crufty way, though,
making assorted end-use sites responsible for calling the parser at the
right times.  That was not only vulnerable to errors of omission, but made
it possible that lexer/parser errors occur at very undesirable times,
and created memory leakages even if there was no error.

Instead, perform the parsing once during check_synchronous_standby_names
and let guc.c manage the resulting data.  To do that, we have to flatten
the parsed representation into a single hunk of malloc'd memory, but that
is not very hard.

While at it, work a little harder on making useful error reports for
parsing problems; the previous code felt that "synchronous_standby_names
parser returned 1" was an appropriate user-facing error message.  (To
be fair, it did also log a syntax error message, but separately from the
GUC problem report, which is at best confusing.)  It had some outright
bugs in the face of invalid input, too.

I (tgl) also concluded that we need to restrict unquoted names in
synchronous_standby_names to be just SQL identifiers.  The previous coding
would accept darn near anything, which (1) makes the quoting convention
both nearly-unnecessary and formally ambiguous, (2) makes it very hard to
understand what is a syntax error and what is a creative interpretation of
the input as a standby name, and (3) makes it impossible to further extend
the syntax in future without a compatibility break.  I presume that we're
intending future extensions of the syntax, else this parsing infrastructure
is massive overkill, so (3) is an important objection.  Since we've taken
a compatibility hit for non-identifier names with this change anyway, we
might as well lock things down now and insist that users use double quotes
for standby names that aren't identifiers.

Kyotaro Horiguchi and Tom Lane
2016-04-27 17:55:25 -04:00
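The quoting rule from 4c804fbdfb (unquoted standby names must be SQL identifiers, anything else needs double quotes) can be illustrated with a small helper. This is only a sketch: the function name `quote_if_needed` is hypothetical, the identifier pattern is a simplified ASCII-only assumption rather than the actual lexer rule, doubling embedded quotes is assumed to follow the usual SQL convention, and the special `*` entry is not treated specially.

```python
import re

# Simplified, ASCII-only notion of an SQL identifier (an assumption for
# illustration; the real lexer may accept a wider character set).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def quote_if_needed(name):
    """Return name as it could be written in synchronous_standby_names."""
    if _IDENTIFIER.fullmatch(name):
        return name
    # Not an identifier: wrap in double quotes, doubling embedded quotes.
    return '"' + name.replace('"', '""') + '"'
```

A name such as `node_a` passes through unchanged, while `node-a` or `10.0.0.1` would be written with double quotes.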
Andres Freund c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered force-assigning an xid instead, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Andres Freund 7b16781228 Fix trivial typo. 2016-04-14 19:25:16 -07:00
Tom Lane 6a3d3965d6 Fix core dump in ReorderBufferRestoreChange on alignment-picky platforms.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't.  We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe.  For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?

The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.

Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.

Discussion: <20160413094117.GC21485@msg.credativ.de>
2016-04-14 19:42:21 -04:00
Tom Lane c2dc194bdb Adjust signature of walrcv_receive hook.
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);

But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
2016-04-14 13:49:37 -04:00
Tom Lane 994f112573 Adjust datatype of ReplicationState.acquired_by.
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int".  This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms.  There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int.  If we want to cope with PIDs that are wider
than int, this is not the place to start.

Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?

Per buildfarm.  Back-patch to 9.5 which has same issue.
2016-04-14 12:18:09 -04:00
Andres Freund be65eddd80 Add required database and origin filtering for logical messages.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.

Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.

Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.

Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
2016-04-13 17:38:54 -07:00
Fujii Masao 46d73e0d65 Remove unused function GetOldestWALSendPointer from walsender code.
That unused function was introduced as a sample because synchronous
replication or replication monitoring tools might need it in the future.
Recently commit 989be08 added the function SyncRepGetOldestSyncRecPtr
which provides almost the same functionality for multiple synchronous
standbys feature. So it's time to remove that unused sample function.
This commit does that.
2016-04-13 04:36:29 +09:00
Fujii Masao 0038c1e218 Use ereport(ERROR) instead of Assert() to emit syncrep_parser error.
The existing code would either Assert or generate an invalid
SyncRepConfig variable, neither of which is desirable. A regular
error should be thrown instead.

This commit silences a compiler warning in non-assertion-enabled builds.

Per report from Jeff Janes.
Suggested fix by Tom Lane.
2016-04-11 15:52:27 +09:00
Fujii Masao ead9963c47 Use proper format specifier %X/%X for LSN, again.
Commit cee31f5 fixed this problem, but commit 989be08 accidentally
reverted the fix.

Thomas Munro
2016-04-06 22:20:52 +09:00
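For readers unfamiliar with the %X/%X convention mentioned in ead9963c47 and cee31f5: an LSN is a 64-bit WAL position printed as two hexadecimal halves, the high 32 bits and the low 32 bits, separated by a slash. A minimal sketch of that formatting follows; the helper names `format_lsn` and `parse_lsn` are hypothetical and not part of PostgreSQL.

```python
def format_lsn(lsn):
    """Format a 64-bit WAL position as 'HIGH/LOW' in uppercase hex."""
    if not 0 <= lsn < 2 ** 64:
        raise ValueError("LSN must fit in an unsigned 64-bit integer")
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def parse_lsn(text):
    """Inverse of format_lsn: turn 'HIGH/LOW' back into an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)
```

Printing the two halves with a single integer format specifier is the kind of mistake these commits guard against, since it drops or misplaces the upper 32 bits.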
Simon Riggs 3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Fujii Masao 989be0810d Support multiple synchronous standby servers.
Previously synchronous replication offered only the ability to confirm
that all changes made by a transaction had been transferred to at most
one synchronous standby server.

This commit extends synchronous replication so that it supports multiple
synchronous standby servers. It enables users to consider one or more
standby servers as synchronous, and increase the level of transaction
durability by ensuring that transaction commits wait for replies from
all of those synchronous standbys.

Multiple synchronous standby servers are configured in
synchronous_standby_names which is extended to support new syntax of
'num_sync ( standby_name [ , ... ] )', where num_sync specifies
the number of synchronous standbys that transaction commits need to
wait for replies from and standby_name is the name of a standby
server.

The syntax of 'standby_name [ , ... ]' which was used in 9.5 or before
is also still supported. It's the same as new syntax with num_sync=1.

This commit doesn't include "quorum commit" feature which was discussed
in pgsql-hackers. Synchronous standbys are chosen based on their priorities.
synchronous_standby_names determines the priority of each standby for
being chosen as a synchronous standby. The standbys whose names appear
earlier in the list are given higher priority and will be considered as
synchronous. Other standby servers appearing later in this list
represent potential synchronous standbys.

The regression test for multiple synchronous standbys is not included
in this commit. It should come later.

Authors: Sawada Masahiko, Beena Emerson, Michael Paquier, Fujii Masao
Reviewed-By: Kyotaro Horiguchi, Amit Kapila, Robert Haas, Simon Riggs,
Amit Langote, Thomas Munro, Sameer Thakur, Suraj Kharage, Abhijit Menon-Sen,
Rajeev Rastogi

Many thanks to the various individuals who were involved in
discussing and developing this feature.
2016-04-06 17:18:25 +09:00
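The priority-based selection described in 989be0810d can be sketched as follows. This is an illustration under simplifying assumptions, not the server's implementation: the function names are hypothetical, quoted names are not handled, and "connected" is modeled as a plain set of standby names.

```python
import re


def parse_standby_names(value):
    """Split a setting into (num_sync, names in priority order).

    Accepts 'num_sync ( name [, ...] )' and the older 'name [, ...]'
    form, which is treated as num_sync = 1.
    """
    match = re.fullmatch(r"\s*(\d+)\s*\((.*)\)\s*", value)
    if match:
        num_sync, body = int(match.group(1)), match.group(2)
    else:
        num_sync, body = 1, value
    names = [part.strip() for part in body.split(",") if part.strip()]
    return num_sync, names


def choose_sync_standbys(value, connected):
    """Return (synchronous, potential) standbys among the connected ones."""
    num_sync, names = parse_standby_names(value)
    # Earlier names have higher priority; skip standbys that are not connected.
    active = [name for name in names if name in connected]
    return active[:num_sync], active[num_sync:]
```

With the setting `2 (s1, s2, s3)` and only s2 and s3 connected, both become synchronous; once s1 connects, it takes priority and s3 drops back to being a potential synchronous standby.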
Magnus Hagander 7117685461 Implement backup API functions for non-exclusive backups
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remains and works exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila
2016-04-05 20:03:49 +02:00
Peter Eisentraut 4dcd4da98c Fix error message from wal_level value renaming
found by Ian Barwick
2016-04-04 21:17:54 -04:00
Teodor Sigaev 65578341af Add Generic WAL interface
This interface is designed to give extensions access to WAL, for example
to implement a new access method. Previously this was impossible, because
restoring from custom WAL would need to access the system catalog to find
a custom redo function. This patch suggests a generic way to describe
changes to pages with the standard layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with the help of Petr Jelinek, Markus Nullmeier, and
	minor editing by me
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Alvaro Herrera 3501f71c21 Fix broken variable declaration
Author: Konstantin Knizhnik
2016-03-30 23:39:15 -03:00
Fujii Masao cee31f5fee Use proper format specifier %X/%X for LSN. 2016-03-31 11:03:40 +09:00
Alvaro Herrera 24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera 3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness' sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Robert Haas 314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Andres Freund 1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Peter Eisentraut b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Robert Haas 53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Andres Freund 1d4a0ab19a Avoid unlikely data-loss scenarios due to rename() without fsync.
Renaming a file using rename(2) is not guaranteed to be durable in the face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.

Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability.  The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Andres Freund 606e0f9841 Introduce durable_rename() and durable_link_or_rename().
Renaming a file using rename(2) is not guaranteed to be durable in the face
of crashes, especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory.  This sequence is not
generally adhered to currently, which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.

Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason to extend several copies of the
same logic, so centralize the link() callers.

This commit does not yet make use of the new functions; they're used in
a followup commit.

Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
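The sequence that 606e0f9841 describes (fsync the old name, rename, fsync the new name, fsync the containing directory) can be sketched in a few lines. This is a POSIX-only illustration, not the C function added by the commit; the helper `fsync_path` is hypothetical, and error handling is left to Python's exceptions.

```python
import os


def fsync_path(path, is_dir=False):
    """Open path, flush it to stable storage, and close it again."""
    # Directories can only be opened read-only; regular files are opened
    # read-write here, since some systems want a writable descriptor.
    fd = os.open(path, os.O_RDONLY if is_dir else os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_rename(old, new):
    """Rename old to new so that the result survives a crash."""
    fsync_path(old)                      # 1. contents of the old file
    os.rename(old, new)                  # 2. the rename itself
    fsync_path(new)                      # 3. the file under its new name
    parent = os.path.dirname(os.path.abspath(new))
    fsync_path(parent, is_dir=True)      # 4. the directory entry
```

Skipping the final directory fsync is the case the follow-up commit 1d4a0ab19a worries about: the file's contents are on disk, but the name change may not be.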
Peter Eisentraut a40814d7aa Handle invalid libpq sockets in more places
Also, make error messages consistent.

From: Michael Paquier <michael.paquier@gmail.com>
2016-03-08 21:10:33 -05:00
Andres Freund b63bea5fd3 Further improvements to c8f621c43.
Coverity and inspection for the issue addressed in fd45d16f found some
questionable code.

Specifically coverity noticed that the wrong length was added in
ReorderBufferSerializeChange() - without immediate negative consequences
as the variable isn't used afterwards.  During code-review and testing I
noticed that a bit of space was wasted when allocating tuple bufs in
several places.  Thirdly, the debug memset()s in
ReorderBufferGetTupleBuf() reduce the error checking valgrind can do.

Backpatch: 9.4, like c8f621c43.
2016-03-07 14:24:03 -08:00
Andres Freund fd45d16f62 Fix wrong allocation size in c8f621c43.
In c8f621c43 I forgot to account for MAXALIGN when allocating a new
tuplebuf in ReorderBufferGetTupleBuf(). That happens to currently not
cause active problems on a number of platforms because the affected
pointer is already aligned, but others, like ppc and hppa, trigger this
in the regression test, due to a debug memset clearing memory.

Fix that.

Backpatch: 9.4, like the previous commit.
2016-03-06 16:27:20 -08:00
Andres Freund c8f621c43a logical decoding: Fix handling of large old tuples with replica identity full.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds.  Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.

Change the tuple buffer infrastructure to separately allocate over-large
tuples, instead of always going through the slab cache.

This unfortunately requires changing the ReorderBufferTupleBuf
definition, because we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
room for the allocated size.  As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.

Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
2016-03-05 18:02:20 -08:00
Andres Freund 0bda14d54c logical decoding: old/newtuple in spooled UPDATE changes was switched around.
Somehow I managed to flip the order of restoring old & new tuples when
de-spooling a change in a large transaction from disk. This happens to
only take effect when a change is spooled to disk which has old/new
versions of the tuple. That only is the case for UPDATEs where the
primary key changed or where replica identity is changed to FULL.

The tests didn't catch this because either spooled updates, or updates
that changed primary keys, were tested; not both at the same time.

Found while adding tests for the following commit.

Backpatch: 9.4, where logical decoding was added
2016-03-05 18:02:20 -08:00
Andres Freund d9e903f3cb logical decoding: Tell reorderbuffer about all xids.
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that efficiently possible, upper-level
xids are forced to be logged before nested subtransaction xids.  That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.

This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".

It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.

Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
2016-03-05 18:02:20 -08:00
Andres Freund 7c17aac69d logical decoding: fix decoding of a commit's commit time.
When adding replication origins in 5aa235042, I somehow managed to set
the timestamp of decoded transactions to InvalidXLogRecPtr when decoding
one made without a replication origin. Fix that, and the wrong type of
the new commit_time variable.

This didn't trigger a regression test failure because we explicitly
don't show commit timestamps in the regression tests, as they obviously
are variable. Add a test that checks that a decoded commit's timestamp
is within minutes of NOW() from before the commit.

Reported-By: Weiping Qu
Diagnosed-By: Artur Zakirov
Discussion: 56D4197E.9050706@informatik.uni-kl.de,
    56D42918.1010108@postgrespro.ru
Backpatch: 9.5, where 5aa235042 originates.
2016-03-02 23:42:21 -08:00
Alvaro Herrera 10b4852215 Fix typos
Author: Amit Langote
2016-02-29 18:11:58 -03:00
Peter Eisentraut 18777c38e9 Improve error message about active replication slot
The old phrasing was awkward if a replication slot is activated and
deactivated repeatedly.
2016-02-17 21:23:28 -05:00
Robert Haas 63461a63f9 Make builtin lwlock tranche names consistent.
Previously, we had a mix of styles.

Amit Kapila
2016-02-12 08:07:11 -05:00
Robert Haas e98fd78607 Fix typo in comment.
Michael Paquier
2016-02-05 08:11:00 -05:00
Robert Haas 7191ce8bea Make all built-in lwlock tranche IDs fixed.
This makes the values more stable, which seems like a good thing for
anybody who needs to look at them.

Alexander Korotkov and Amit Kapila
2016-02-02 06:45:55 -05:00
Magnus Hagander e51ab85cd9 Fix typos in comments
Author: Michael Paquier
2016-02-01 11:43:48 +01:00
Robert Haas 2251179e6a Migrate replication slot I/O locks into a separate tranche.
This is following in a long train of similar changes and for the same
reasons - see b319356f0e and
fe702a7b3f inter alia.

Author: Amit Kapila
Reviewed-by: Alexander Korotkov, Robert Haas
2016-01-29 09:45:38 -05:00
Simon Riggs 422a55a687 Refactor to create generic WAL page read callback
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.

As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-20 17:18:58 -08:00
Tom Lane 26d538dc93 Clean up some lack-of-STRICT issues in the core code, too.
A scan for missed proisstrict markings in the core code turned up
these functions:

brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot

The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions.  The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.

In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.

Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc.  Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue.  Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.

In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block.  As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.

The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
2016-01-09 16:58:32 -05:00
Alvaro Herrera b1a9bad9e7 pgstat: add WAL receiver status view & SRF
This new view provides insight into the state of a running WAL receiver
in a hot standby node.
The information returned includes the PID of the WAL receiver process,
its status (stopped, starting, streaming, etc), start LSN and TLI, last
received LSN and TLI, timestamp of last message send and receipt, latest
end-of-WAL LSN and time, and the name of the slot (if any).

Access to the detailed data is only granted to superusers; others only
get the PID.

Author: Michael Paquier
Reviewer: Haribabu Kommi
2016-01-07 16:21:19 -03:00
Bruce Momjian ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Robert Haas 6e7b335930 Remove duplicate word.
Kyotaro Horiguchi
2015-12-18 12:43:52 -05:00
Robert Haas 4496226782 Fix copy-and-paste error in logical decoding callback.
This could result in the error context misidentifying where the error
actually occurred.

Craig Ringer
2015-12-18 12:17:35 -05:00
Magnus Hagander a91bdf67c4 Consistently set all fields in pg_stat_replication to null instead of 0
Previously the "sent" field would be set to 0 and all other xlog
pointers be set to NULL if there were no valid values (such as when
in a backup sending walsender).
2015-12-13 16:53:38 +01:00
Magnus Hagander 263c19572b Properly initialize write, flush and replay locations in walsender slots
These would leak random xlog positions if a walsender used for backup
reused a walsender slot previously used by a replication walsender.

In passing also fix a couple of cases where the xlog pointer is directly
compared to zero instead of using XLogRecPtrIsInvalid, noted by
Michael Paquier.
2015-12-13 16:46:56 +01:00
Tom Lane 00cdd83521 Adopt the GNU convention for handling tar-archive members exceeding 8GB.
The POSIX standard for tar headers requires archive member sizes to be
printed in octal with at most 11 digits, limiting the representable file
size to 8GB.  However, GNU tar and apparently most other modern tars
support a convention in which oversized values can be stored in base-256,
allowing any practical file to be a tar member.  Adopt this convention
to remove two limitations:
* pg_dump with -Ft output format failed if the contents of any one table
exceeded 8GB.
* pg_basebackup failed if the data directory contained any file exceeding
8GB.  (This would be a fatal problem for installations configured with a
table segment size of 8GB or more, and it has also been seen to fail when
large core dump files exist in the data directory.)

File sizes under 8GB are still printed in octal, so that no compatibility
issues are created except in cases that would have failed entirely before.
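
The size-field convention described above can be sketched roughly as follows. This is an illustration in Python rather than the actual tar.c code: the function names are hypothetical, and the 12-byte field width and the trailing-space terminator are assumptions about the header layout.

```python
# Sketch of the tar size-field convention: octal below 8GB, base-256 above.
# encode_tar_size / decode_tar_size are hypothetical names, not PostgreSQL APIs.

FIELD_LEN = 12          # assumed width of the size field in a tar header
OCTAL_LIMIT = 8 ** 11   # 11 octal digits cover sizes strictly below 8GB


def encode_tar_size(size: int) -> bytes:
    """Return the 12-byte size field for a tar member of the given size."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size < OCTAL_LIMIT:
        # Traditional form: 11 zero-padded octal digits plus a terminator
        # (a space here, which is an assumption; NUL is also seen).
        return format(size, "011o").encode("ascii") + b" "
    # Oversized form: first byte has the high bit set, the remaining
    # 11 bytes hold the value in big-endian base-256.
    return b"\x80" + size.to_bytes(FIELD_LEN - 1, "big")


def decode_tar_size(field: bytes) -> int:
    """Inverse of encode_tar_size, accepting either representation."""
    if len(field) != FIELD_LEN:
        raise ValueError("size field must be 12 bytes")
    if field[0] & 0x80:
        return int.from_bytes(field[1:], "big")
    return int(field.rstrip(b" \0").decode("ascii") or "0", 8)
```

Because small sizes keep the octal form, a reader that only understands octal still handles every archive that could have been written before the change.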

In addition, this patch fixes several bugs in the same area:

* In 9.3 and later, we'd defined tarCreateHeader's file-size argument as
size_t, which meant that on 32-bit machines it would write a corrupt tar
header for file sizes between 4GB and 8GB, even though no error was raised.
This broke both "pg_dump -Ft" and pg_basebackup for such cases.

* pg_restore from a tar archive would fail on tables of size between 4GB
and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits.
This happened even with an archive file not affected by the previous bug.

* pg_basebackup would fail if there were files of size between 4GB and 8GB,
even on 64-bit machines.

* In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size,
on 64-bit big-endian machines.

In view of these potential data-loss bugs, back-patch to all supported
branches, even though removal of the documented 8GB limit might otherwise
be considered a new feature rather than a bug fix.
2015-11-21 20:21:31 -05:00
Andres Freund f3a764b0da Set replication origin when decoding commit records.
By accident the replication origin was not set properly in
DecodeCommit(). That's bad because the origin is passed to the output
plugin's origin filter, and is accessible from the output plugin via
ReorderBufferTXN->origin_id.  Accessing the origin of individual changes
worked before the fix, which is why this wasn't noticed earlier.

Reported-By: Craig Ringer
Author: Craig Ringer
Discussion: CAMsr+YFhBJLp=qfSz3-J+0P1zLkE8zNXM2otycn20QRMx380gw@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced
2015-11-09 00:03:35 +01:00
Peter Eisentraut a8d585c091 Message style improvements
Message style, plurals, quoting, spelling, consistency with similar
messages
2015-10-28 20:38:36 -04:00
Alvaro Herrera 0cd836a4e8 Measure string lengths only once
Bernd Helmle complained that CreateReplicationSlot() was assigning the
same value to the same variable twice, so we could remove one of them.
Code inspection reveals that we can actually remove both assignments:
according to the author the assignment was there for beauty of the
strlen line only, and another possible fix to that is to put the strlen
in its own line, so do that.

To be consistent within the file, refactor all duplicated strlen()
calls, which is what we do elsewhere in the backend anyway.  In
basebackup.c, snprintf already returns the right length; no need for
strlen afterwards.

Backpatch to 9.4, where replication slots were introduced, to keep code
identical.  Some of this is older, but the patch doesn't apply cleanly
and it's only of cosmetic value anyway.

Discussion: http://www.postgresql.org/message-id/BE2FD71DEA35A2287EA5F018@eje.credativ.lan
2015-10-27 13:20:40 -03:00
Robert Haas 8f6bb851bd Remove more volatile qualifiers.
Prior to commit 0709b7ee72, access to
variables within a spinlock-protected critical section had to be done
through a volatile pointer, but that should no longer be necessary.
This continues work begun in df4077cda2
and 6ba4ecbf47.

Thomas Munro and Michael Paquier
2015-10-06 15:45:02 -04:00
Andres Freund 920218cbc0 Improve errhint() about replication slot naming restrictions.
The existing hint talked about "may only contain letters", but the
actual requirement is more strict: only lower case letters are allowed.
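
A naming check along these lines might look like the sketch below. It is written in Python for illustration only; the helper name is hypothetical, and the exact character set (lower case letters, digits, underscore) and the 63-character limit are assumptions rather than something this commit message states.

```python
import re

# Hypothetical validator for replication slot names.
# Assumed rule: 1 to 63 characters drawn from lower case letters,
# digits, and the underscore. Upper case letters are rejected.
_SLOT_NAME_RE = re.compile(r"[a-z0-9_]{1,63}")


def is_valid_slot_name(name: str) -> bool:
    """Return True if name satisfies the assumed slot naming rule."""
    return _SLOT_NAME_RE.fullmatch(name) is not None
```

Under these assumptions `is_valid_slot_name("standby_1")` is accepted while `is_valid_slot_name("Standby_1")` is rejected, which is the case the old hint did not make obvious.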

Reported-By: Rushabh Lathia
Author: Rushabh Lathia
Discussion: AGPqQf2x50qcwbYOBKzb4x75sO_V3g81ZsA8+Ji9iN5t_khFhQ@mail.gmail.com
Backpatch: 9.4-, where replication slots were added
2015-10-03 15:29:08 +02:00
Alvaro Herrera 17f5831c81 Fix "sesssion" typo
It was introduced alongside replication origins, by commit
5aa2350426, so backpatch to 9.5.

Pointed out by Fujii Masao
2015-09-28 19:13:42 -03:00
Andres Freund c314ead5be Add ability to reserve WAL upon slot creation via replication protocol.
Since 6fcd885 it is possible to immediately reserve WAL when creating a
slot via pg_create_physical_replication_slot(). Extend the replication
protocol to allow that as well.

Although, in contrast to the SQL interface, it is possible to update the
reserved location via the replication interface, it is still useful to be
able to reserve upon creation there. Otherwise the logic in
ReplicationSlotReserveWal() has to be repeated in clients that use
slots.

Author: Michael Paquier
Discussion: CAB7nPqT0Wc1W5mdYGeJ_wbutbwNN+3qgrFR64avXaQCiJMGaYA@mail.gmail.com
2015-09-06 13:30:57 +02:00
Heikki Linnakangas c80b5f66c6 Fix misc typos.
Oskari Saarenmaa. Backpatch to stable branches where applicable.
2015-09-05 11:35:49 +03:00
Andres Freund e95126cf04 Don't use function definitions looking like old-style ones.
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other function definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
2015-08-15 17:25:00 +02:00
Andres Freund a4b059fdde Remove duplicated assignment in pg_create_physical_replication_slot.
Reported-By: Gurjeet Singh
2015-08-12 17:35:50 +02:00
Alvaro Herrera 4901b2f495 Don't include rel.h when relcache.h is sufficient
Trivial change to reduce exposure of rel.h.
2015-08-11 13:03:14 -03:00
Andres Freund 6fcd88511f Allow pg_create_physical_replication_slot() to reserve WAL.
When creating a physical slot it's often useful to immediately reserve
the current WAL position instead of only doing so after the first feedback
message arrives. That e.g. allows slots to guarantee that all the WAL
for a base backup will be available afterwards.

Logical slots already have to reserve WAL during creation, so generalize
that logic into being usable for both physical and logical slots.

Catversion bump because of the new parameter.

Author: Gurjeet Singh
Reviewed-By: Andres Freund
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
2015-08-11 12:34:31 +02:00
Andres Freund 093d0c83c1 Introduce macros determining if a replication slot is physical or logical.
These make the code a bit easier to read, and make it easier to add a
more explicit notion of a slot's type at some point in the future.

Author: Gurjeet Singh
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
2015-08-11 12:32:48 +02:00
Andres Freund 3b425b7c02 Minor cleanups in slot related code.
Fix a bunch of typos, and remove two superfluous includes.

Author: Gurjeet Singh
Discussion: CABwTF4Wh_dBCzTU=49pFXR6coR4NW1ynb+vBqT+Po=7fuq5iCw@mail.gmail.com
Backpatch: 9.4
2015-08-11 12:32:48 +02:00
Andres Freund 3f811c2d6f Add confirmed_flush column to pg_replication_slots.
There's no reason not to expose both restart_lsn and confirmed_flush
since they have rather distinct meanings. The former is the oldest WAL
still required and valid for both physical and logical slots, whereas
the latter is the location up to which a logical slot's consumer has
confirmed receiving data. Most of the time a slot will require older
WAL (i.e. restart_lsn) than the confirmed
position (i.e. confirmed_flush_lsn).
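
To make the relationship between the two columns concrete, here is a small Python sketch that compares two LSNs. It assumes the usual textual LSN form of two hexadecimal numbers separated by a slash; the function names are hypothetical and the sample values are made up.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/16B3748' to a 64-bit integer position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_between(restart_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL between the oldest required position and the position
    the consumer has confirmed (hypothetical helper for illustration)."""
    return parse_lsn(confirmed_flush_lsn) - parse_lsn(restart_lsn)
```

A positive result from `wal_between` corresponds to the common case described above, where the slot still requires WAL older than the confirmed position.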

Author: Marko Tiikkaja, editorialized by me
Discussion: 559D110B.1020109@joh.to
2015-08-10 13:28:18 +02:00
Andres Freund 5c4b25acce Fix copy & paste mistake in pg_get_replication_slots().
XLogRecPtr was compared with InvalidTransactionId instead of
InvalidXLogRecPtr. As both are defined to the same value this doesn't
cause any actual problems, but it's still wrong.

Backpatch: 9.4-master, bug was introduced in 9.4
2015-08-10 13:28:18 +02:00
Andres Freund 18e8613564 Address points made in post-commit review of replication origins.
Amit reviewed the replication origins patch and made some good
points. Address them. This fixes typos in error messages, docs and
comments and adds a missing error check (although in a
should-never-happen scenario).

Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced.
2015-08-07 15:09:05 +02:00
Andres Freund a855118be3 Fix debug message output when connecting to a logical slot.
Previously the message erroneously printed the same LSN twice as the
assignment to the start_lsn variable was before the message. Correct
that.

Reported-By: Marko Tiikkaja
Author: Marko Tiikkaja
Backpatch: 9.5, where logical decoding was introduced
2015-08-05 13:26:01 +02:00
Andres Freund b2f6f749c7 Fix logical decoding bug leading to inefficient reopening of files.
When spilling transaction data to disk a simple typo caused the output
file to be closed and reopened for every serialized change. That happens
to not have a huge impact on Linux, which is why it probably wasn't
noticed so far, but on Windows that appears to trigger actual disk
writes after every change. Not fun.

The bug fortunately does not have any impact besides speed. A change
could end up being in the wrong segment (last instead of next), but
since we read all files to the end, that's just ugly, not really
problematic. It's not a problem to upgrade, since transaction spill
files do not persist across restarts.

Bug: #13484
Reported-By: Olivier Gosseaume
Discussion: 20150703090217.1190.63940@wrigleys.postgresql.org

Backpatch to 9.4, where logical decoding was added.
2015-07-07 13:12:46 +02:00
Peter Eisentraut 385522c7dc Fix typo 2015-06-10 21:30:17 -04:00
Tom Lane 32f628be74 Fix assorted inconsistencies in our calls of readlink().
Ensure that we null-terminate the result string (one place in pg_rewind).
Be paranoid about out-of-range results from readlink() (should not happen,
but there is no good reason for some call sites to be careful about it and
others not).  Consistently use the whole buffer, not sometimes one byte
less.  Ensure we emit an appropriate errcode() in all cases.  Spell the
error messages the same way.

The only serious bug here is the missing null-termination in pg_rewind,
which is new code, so no need for a back-patch.

Abhijit Menon-Sen and Tom Lane
2015-05-28 12:17:22 -04:00
Bruce Momjian 807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Heikki Linnakangas fa60fb63e5 Fix more typos in comments.
Patch by CharSyam, plus a few more I spotted with grep.
2015-05-20 19:45:43 +03:00
Heikki Linnakangas 4fc72cc7bb Collection of typo fixes.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.

Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.

Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
2015-05-20 16:56:22 +03:00
Magnus Hagander 3b075e9d7b Fix typos in comments
Dmitriy Olshevskiy
2015-05-17 14:58:04 +02:00
Heikki Linnakangas ffd37740ee Add archive_mode='always' option.
In 'always' mode, the standby independently archives all files it receives
from the primary.

Original patch by Fujii Masao, docs and review by me.
2015-05-15 18:55:24 +03:00
Andrew Dunstan 72d422a522 Map basebackup tablespaces using a tablespace_map file
Windows can't reliably restore symbolic links from a tar format, so
instead during backup start we create a tablespace_map file, which is
used by the restoring postgres to create the correct links in pg_tblspc.
The backup protocol also now has an option to request this file to be
included in the backup stream, and this is used by pg_basebackup when
operating in tar mode.

This is done on all platforms, not just Windows.

This means that pg_basebackup will not work in tar mode against 9.4
and older servers, as this protocol option isn't implemented there.

Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.
2015-05-12 09:29:10 -04:00
Heikki Linnakangas 179cdd0981 Add macros to check if a filename is a WAL segment or other such file.
We had many instances of the strlen + strspn combination to check for that.
This makes the code a bit easier to read.
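
The idea behind such checks (a length test combined with a character-set test) can be sketched in Python as below. The function names are hypothetical, and the assumed layouts (24 upper case hexadecimal characters for a segment, 8 hexadecimal characters plus ".history" for a timeline history file) are stated here as assumptions rather than taken from this commit message.

```python
HEX_UPPER = set("0123456789ABCDEF")
SEGMENT_NAME_LEN = 24  # assumed length of a WAL segment file name


def is_wal_segment_name(fname: str) -> bool:
    """Length check plus 'all characters are hex digits' check."""
    return len(fname) == SEGMENT_NAME_LEN and all(c in HEX_UPPER for c in fname)


def is_timeline_history_name(fname: str) -> bool:
    """Assumed form: 8 hex digits followed by the '.history' suffix."""
    stem, dot, suffix = fname.partition(".")
    return (
        len(stem) == 8
        and all(c in HEX_UPPER for c in stem)
        and dot == "."
        and suffix == "history"
    )
```

Wrapping the two conditions in one named predicate is what makes call sites easier to read than repeating the length and character tests inline.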
2015-05-08 21:58:57 +03:00
Andres Freund 168d5805e4 Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows specifying an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using an
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.

This feature is often referred to as upsert.
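
The semantics described above can be tried out with the sketch below. Note the assumptions: it uses Python's bundled sqlite3 module as a stand-in, because SQLite (3.24 or later) accepts a similar ON CONFLICT syntax; it does not exercise PostgreSQL itself, and the table and function names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, hits INTEGER NOT NULL)")


def add_hits(name: str, hits: int) -> None:
    """DO UPDATE: combine the existing row with the row proposed for insertion."""
    conn.execute(
        """
        INSERT INTO counters (name, hits) VALUES (?, ?)
        ON CONFLICT (name) DO UPDATE
            SET hits = counters.hits + excluded.hits
            WHERE excluded.hits > 0
        """,
        (name, hits),
    )


def insert_if_absent(name: str, hits: int) -> None:
    """DO NOTHING: leave a pre-existing row untouched."""
    conn.execute(
        "INSERT INTO counters (name, hits) VALUES (?, ?) "
        "ON CONFLICT (name) DO NOTHING",
        (name, hits),
    )


add_hits("home", 1)            # no conflict: plain insert
add_hits("home", 2)            # conflict: existing 1 plus excluded 2
add_hits("home", 0)            # conflict, WHERE is false: no update
insert_if_absent("home", 99)   # conflict: row is left alone
insert_if_absent("about", 5)   # no conflict: inserted

rows = conn.execute("SELECT name, hits FROM counters ORDER BY name").fetchall()
```

Here `excluded` names the row proposed for insertion and `counters` names the pre-existing row, mirroring the roles of the EXCLUDED alias and the table name described above.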

This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.

To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.

Bumps catversion as stored rules change.

Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:43:10 +02:00
Peter Eisentraut ad8d6d064c Fix typos
Author: Erik Rijkers <er@xs4all.nl>
2015-05-04 20:40:19 -04:00
Andres Freund 2b22795b32 Copy editing of the replication origins patch.
Michael Paquier and myself.
2015-05-01 12:22:13 +02:00
Andres Freund e0f26fc765 Correct replication origin's use of UINT16_MAX to PG_UINT16_MAX.
We can't rely on UINT16_MAX being present, which is why we introduced
PG_UINT16_MAX...

Buildfarm animal bowerbird via Andrew Gierth.
2015-04-30 00:19:36 +02:00
Andres Freund 5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution on top of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consists of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in an efficient and
   crash-safe manner.
3) The ability to filter out changes performed at the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.
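
As a rough illustration of how these three parts fit together, the toy Python model below tags each change with an origin, tracks per-origin progress, and filters replayed changes during decoding. Every name in it is hypothetical, the use of 0 for "originated locally" is an assumption, and crash safety is not modelled at all.

```python
from dataclasses import dataclass

LOCAL_ORIGIN = 0  # assumed id meaning "change originated on this node"


@dataclass(frozen=True)
class Change:
    origin_id: int   # which node the change originally came from
    lsn: int         # position of the change in the remote or local log
    payload: str


class ProgressTracker:
    """Remembers, per origin, how far replay has progressed (in memory only)."""

    def __init__(self) -> None:
        self._progress: dict[int, int] = {}

    def advance(self, origin_id: int, lsn: int) -> None:
        # Never move backwards, so a repeated delivery is harmless.
        self._progress[origin_id] = max(self._progress.get(origin_id, 0), lsn)

    def get(self, origin_id: int) -> int:
        return self._progress.get(origin_id, 0)


def skip_replayed(change: Change) -> bool:
    """Origin filter: True means the change is dropped during decoding."""
    return change.origin_id != LOCAL_ORIGIN


def decode(changes: list[Change]) -> list[Change]:
    """Forward only changes that were not themselves replayed from elsewhere."""
    return [c for c in changes if not skip_replayed(c)]
```

Filtering out everything with a non-local origin is the simplest policy that avoids loops in a two-node bi-directional setup; more complex topologies would need a more selective filter.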

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows containing origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
Andres Freund dfbaed4597 Use a fd opened for read/write when syncing slots during startup.
Some operating systems, including the reporter's Windows, return EBADFD
or similar when fsync() is invoked on a O_RDONLY file descriptor.
Unfortunately RestoreSlotFromDisk() does exactly that; which causes
failures after restarts in at least some scenarios.

If you hit the bug the error message will be something like
ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor

Simply use O_RDWR instead of O_RDONLY when opening the relevant file
descriptor to fix the bug.  Unfortunately I have no way of verifying the
fix, but we've seen similar problems in the past.

This bug goes back to 9.4 where slots were introduced. Backpatch
accordingly.

Reported-By: Patrice Drolet
Bug: #13143
Discussion: 20150424101006.2556.60897@wrigleys.postgresql.org
2015-04-28 00:17:43 +02:00
Andres Freund 6aab1f45ac Fix various typos and grammar errors in comments.
Author: Dmitriy Olshevskiy
Discussion: 553D00A6.4090205@bk.ru
2015-04-26 18:42:31 +02:00
Andres Freund d811c037ce Add 'active_in' column to pg_replication_slots.
Right now it is visible whether a replication slot is active in any
session, but not in which.  Adding the active_in column, containing the
pid of the backend having acquired the slot, makes it much easier to
associate pg_replication_slots entries with the corresponding
pg_stat_replication/pg_stat_activity row.

This should have been done from the start, but I (Andres) dropped the
ball there somehow.

Author: Craig Ringer, revised by me
Discussion: CAMsr+YFKgZca5_7_ouaMWxA5PneJC9LNViPzpDHusaPhU9pA7g@mail.gmail.com
2015-04-21 11:51:06 +02:00
Heikki Linnakangas e2999abcd1 Fix assertion failure in logical decoding.
Logical decoding set SnapshotData's regd_count field to prevent the
snapshot manager from prematurely freeing snapshots that are generated
by the decoding system. That was always an abuse of the field, as it was
never supposed to be used outside the snapshot manager. Commit 94028691
made snapshot manager's tracking of the snapshots smarter, and that scheme
fell apart. The snapshot manager got confused and hit the assertion, when
a snapshot that was marked with regd_count==1 was not found in the heap,
where the snapshot manager tracks the registered snapshots.

To fix, don't abuse the regd_count field like that. Logical decoding still
abuses the active_count field for similar purposes, but that's currently
harmless.

The assertion failure was first reported by Michael Paquier
2015-04-16 21:50:07 +03:00
Heikki Linnakangas 4f700bcd20 Reorganize our CRC source files again.
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.

Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
2015-04-14 17:03:42 +03:00
Heikki Linnakangas a10589a512 Remove duplicated words in comments.
David Rowley
2015-04-12 10:46:17 +03:00
Fujii Masao 6e4bf4ecd3 Fix error handling of XLogReaderAllocate in case of OOM
Similarly to previous fix 9b8d478, commit 2c03216 has switched
XLogReaderAllocate() to use a set of palloc calls instead of malloc,
causing any callers of this function to fail with an error instead of
receiving a NULL pointer in case of out-of-memory error. Fix this by
using palloc_extended with MCXT_ALLOC_NO_OOM that will safely return
NULL in case of an OOM.

Michael Paquier, slightly modified by me.
2015-04-03 21:55:37 +09:00
Simon Riggs 7dae3cf68c Correct comment to use RS_EPHEMERAL 2015-04-02 07:45:19 -04:00
Heikki Linnakangas 1d0db8de04 Remove spurious semicolons.
Petr Jelinek
2015-03-31 15:12:27 +03:00
Tom Lane 785941cdc3 Tweak __attribute__-wrapping macros for better pgindent results.
This improves on commit bbfd7edae5 by
making two simple changes:

* pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn().
Likewise pg_attribute_unused(), pg_attribute_packed().  This reduces
pgindent's tendency to misformat declarations involving them.

* attributes are now always attached to function declarations, not
definitions.  Previously some places were taking creative shortcuts,
which were not merely candidates for bad misformatting by pgindent
but often were outright wrong anyway.  (It does little good to put a
noreturn annotation where callers can't see it.)  In any case, if
we would like to believe that these macros can be used with non-gcc
compilers, we should avoid gratuitous variance in usage patterns.

I also went through and manually improved the formatting of a lot of
declarations, and got rid of excessively repetitive (and now obsolete
anyway) comments informing the reader what pg_attribute_printf is for.
2015-03-26 14:03:25 -04:00
Tatsuo Ishii 656ea810e5 Make SyncRepWakeQueue a static function
It is only used in src/backend/replication/syncrep.c.

Back-patch to all supported branches except 9.1 which declares the
function as static.
2015-03-26 10:34:08 +09:00
Andres Freund 4f1b890b13 Merge the various forms of transaction commit & abort records.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.

That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recent additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.

Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.

Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by bits in the 'xinfo' flag.  That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.
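
The layout idea in this paragraph (optional pieces flagged by bits, with the flag word itself only present when needed) can be sketched as below. This is a toy Python encoding, not the real WAL record format: all constants, field choices, and field widths are hypothetical.

```python
import struct

# Hypothetical bit assignments, for illustration only.
INFO_HAS_XINFO = 0x80      # bit in the one-byte record info field
XINFO_HAS_DBINFO = 0x01    # optional database id follows
XINFO_HAS_TWOPHASE = 0x02  # optional original (prepared) xid follows

_FIXED = struct.Struct("<Bq")  # info byte + commit timestamp
_U32 = struct.Struct("<I")


def encode_commit(xact_time, db_id=None, twophase_xid=None) -> bytes:
    """Serialize a toy commit record; xinfo is omitted when no bit is set."""
    xinfo = 0
    if db_id is not None:
        xinfo |= XINFO_HAS_DBINFO
    if twophase_xid is not None:
        xinfo |= XINFO_HAS_TWOPHASE
    out = _FIXED.pack(INFO_HAS_XINFO if xinfo else 0, xact_time)
    if xinfo:
        out += _U32.pack(xinfo)
    if db_id is not None:
        out += _U32.pack(db_id)
    if twophase_xid is not None:
        out += _U32.pack(twophase_xid)
    return out


def decode_commit(buf: bytes) -> dict:
    """Parse a toy commit record produced by encode_commit."""
    info, xact_time = _FIXED.unpack_from(buf, 0)
    pos = _FIXED.size
    rec = {"xact_time": xact_time, "db_id": None, "twophase_xid": None}
    xinfo = 0
    if info & INFO_HAS_XINFO:
        (xinfo,) = _U32.unpack_from(buf, pos)
        pos += _U32.size
    if xinfo & XINFO_HAS_DBINFO:
        (rec["db_id"],) = _U32.unpack_from(buf, pos)
        pos += _U32.size
    if xinfo & XINFO_HAS_TWOPHASE:
        (rec["twophase_xid"],) = _U32.unpack_from(buf, pos)
    return rec
```

In this toy format the common case costs only the fixed part, and adding a new optional piece means adding one bit and one conditional field, which is the extensibility argument made in the message.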

Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.

This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank.  The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.

The motivation for this work is not the space savings and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.

Discussion: 20150220152150.GD4149@awork2.anarazel.de,
    235610.92468.qm%40web29004.mail.ird.yahoo.com

Reviewed-By: Heikki Linnakangas, Simon Riggs
2015-03-15 17:37:07 +01:00
Tatsuo Ishii 364c006c1f Fix integer overflow in debug message of walreceiver
The message tries to report the replication apply delay, which fails if
the first WAL record has not been applied yet. The fix is to show "N/A",
indicating that the delay data is not yet available, instead of an
overflowed negative number. Problem reported by me and patch by
Fabrízio de Royes Mello.

Back patched to 9.4, 9.3 and 9.2 stable branches (9.1 and 9.0 do not
have the debug message).
2015-03-14 08:16:50 +09:00
Andres Freund bbfd7edae5 Add macros wrapping all usage of gcc's __attribute__.
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability.  It's also just
generally dubious to do so in a header as widely included as c.h.

Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_aligned and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.

This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurrences of that and it's hard to work around...

Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
2015-03-11 14:30:01 +01:00
Fujii Masao f8b031bca8 Fix an obsolete reference to SnapshotNow in comment.
Peter Geoghegan
2015-03-04 12:25:48 +09:00
Peter Eisentraut 23a78352c0 Error when creating names too long for tar format
The tar format (at least the version we are using), does not support
file names or symlink targets longer than 99 bytes.  Until now, the tar
creation code would silently truncate any names that are too long.  (Its
original application was pg_dump, where this never happens.)  This
creates problems when running base backups over the replication
protocol.

The most important problem is when a tablespace path is longer than 99
bytes, which will result in a truncated tablespace path being backed up.
Less importantly, the basebackup protocol also promises to back up any
other files it happens to find in the data directory, which would also
lead to file name truncation if someone put a file with a long name in
there.

Now both of these cases result in an error during the backup.
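
The change in behavior (raise an error instead of silently truncating) can be sketched as below. This is Python for illustration, not the server code: the function and exception names are hypothetical, and encoding names as UTF-8 to count bytes is an assumption.

```python
MAX_TAR_NAME_BYTES = 99  # limit for file names and symlink targets, per the text above


class TarNameTooLong(ValueError):
    """Raised instead of silently truncating an over-long name."""


def check_tar_name(name: str, kind: str = "file name") -> bytes:
    """Return the encoded name, or raise if it cannot fit in the header."""
    encoded = name.encode("utf-8")  # byte length matters, not character count
    if len(encoded) > MAX_TAR_NAME_BYTES:
        raise TarNameTooLong(
            f"{kind} too long for tar format: {len(encoded)} bytes "
            f"(maximum {MAX_TAR_NAME_BYTES})"
        )
    return encoded
```

A caller would run the same check for both the member name and, for symbolic links, the link target, since the message above lists both as affected.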

Add tests that fail when a too-long file name or symlink is attempted to
be backed up.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
2015-02-24 13:41:07 -05:00
Heikki Linnakangas dd58c6098f Fix typo in README.
Kyotaro Horiguchi
2015-02-24 14:33:26 +02:00
Tom Lane e1a11d9311 Use FLEXIBLE_ARRAY_MEMBER for HeapTupleHeaderData.t_bits[].
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.

Michael Paquier, some adjustments by me
2015-02-21 15:13:06 -05:00
Tom Lane 4f38a281a3 Fix missing PQclear() in libpqrcv_endstreaming().
This omission leaked one PGresult per WAL streaming cycle, which possibly
would never be enough to notice in the real world, but it's still a leak.

Per Coverity.  Back-patch to 9.3 where the error was introduced.
2015-02-11 19:20:49 -05:00
Heikki Linnakangas ff16b40f8c Report WAL flush, not insert, position in replication IDENTIFY_SYSTEM
When beginning streaming replication, the client usually issues the
IDENTIFY_SYSTEM command, which used to return the current WAL insert
position. That's not suitable for the intended purpose of that field,
however. pg_receivexlog uses it to start replication from the reported
point, but if it hasn't been flushed to disk yet, it will fail. Change
IDENTIFY_SYSTEM to report the flush position instead.

Backpatch to 9.1 and above. 9.0 doesn't report any WAL position.
2015-02-06 11:26:50 +02:00
Heikki Linnakangas 2b3a8b20c2 Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.

We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.

To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.

To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used to hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.

Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.

Security: CVE-2015-0244
2015-02-02 17:09:53 +02:00
Tom Lane c70f9e8988 Further cleanup of ReorderBufferCommit().
On closer inspection, we can remove the "volatile" qualifier on
"using_subtxn" so long as we initialize that before the PG_TRY block,
which there's no particularly good reason not to do.
Also, push the "change" variable inside the PG_TRY so as to remove
all question of whether it needs "volatile", and remove useless
early initializations of "snapshot_now" and "using_subtxn".
2015-01-25 22:49:56 -05:00
Tom Lane f8a4dd2e14 Fix unsafe coding in ReorderBufferCommit().
"iterstate" must be marked volatile since it's changed inside the PG_TRY
block and then used in the PG_CATCH stanza.  Noted by Mark Wilding of
Salesforce.  (We really need to see if we can't get the C compiler to warn
about this.)

Also, reset iterstate to NULL after the mainline ReorderBufferIterTXNFinish
call, to ensure the PG_CATCH block doesn't try to do that a second time.
2015-01-24 13:25:19 -05:00
Tom Lane 586dd5d6a5 Replace a bunch more uses of strncpy() with safer coding.
strncpy() has a well-deserved reputation for being unsafe, so make an
effort to get rid of nearly all occurrences in HEAD.

A large fraction of the remaining uses were passing length less than or
equal to the known strlen() of the source, in which case no null-padding
can occur and the behavior is equivalent to memcpy(), though doubtless
slower and certainly harder to reason about.  So just use memcpy() in
these cases.

In other cases, use either StrNCpy() or strlcpy() as appropriate (depending
on whether padding to the full length of the destination buffer seems
useful).

I left a few strncpy() calls alone in the src/timezone/ code, to keep it
in sync with upstream (the IANA tzcode distribution).  There are also a
few such calls in ecpg that could possibly do with more analysis.

AFAICT, none of these changes are more than cosmetic, except for the four
occurrences in fe-secure-openssl.c, which are in fact buggy: an overlength
source leads to a non-null-terminated destination buffer and ensuing
misbehavior.  These don't seem like security issues, first because no stack
clobber is possible and second because if your values of sslcert etc are
coming from untrusted sources then you've got problems way worse than this.
Still, it's undesirable to have unpredictable behavior for overlength
inputs, so back-patch those four changes to all active branches.
2015-01-24 13:05:42 -05:00
Andres Freund ff44fba46c Replace walsender's latch with the general shared latch.
Relying on the normal shared latch simplifies interrupt/signal
handling because we can rely on all signal handlers setting the proc
latch. That in turn allows us to avoid the use of
ImmediateInterruptOK, which arguably isn't correct because
WaitLatchOrSocket isn't declared to be immediately interruptible.

Also change sections that wait on the walsender's latch to notice
interrupts quicker/more reliably and make them more consistent with
each other.

This is part of a larger "get rid of ImmediateInterruptOK" series.

Discussion: 20150115020335.GZ5245@awork2.anarazel.de
2015-01-17 13:00:42 +01:00
Andres Freund 59f71a0d0b Add a default local latch for use in signal handlers.
To do so, move InitializeLatchSupport() into the new common process
initialization functions, and add a new global variable MyLatch.

MyLatch is usable as soon as InitPostmasterChild() has been called
(i.e. very early during startup). Initially it points to a process
local latch that exists in all processes. InitProcess/InitAuxiliaryProcess
then replaces that local latch with PGPROC->procLatch. During shutdown
the reverse happens.

This is primarily advantageous for two reasons: For one it simplifies
dealing with the shared process latch, especially in signal handlers,
because instead of having to check for MyProc, MyLatch can be used
unconditionally. For another, a later patch that makes FE/BE
communication use latches, now can rely on the existence of a latch,
even before having gone through InitProcess.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-01-14 18:45:22 +01:00
Andres Freund 31c453165b Commonalize process startup code.
Move common code, that was duplicated in every postmaster child/every
standalone process, into two functions in miscinit.c.  Not only does
that already result in a fair amount of net code reduction but it also
makes it much easier to remove more duplication in the future. The
prime motivation wasn't code deduplication though, but easier addition
of new common code.
2015-01-14 00:33:14 +01:00
Andres Freund 14e8803f10 Add barriers to the latch code.
Since their introduction latches have required barriers in SetLatch
and ResetLatch - but when they were introduced there wasn't any
barrier abstraction. Instead latches were documented to rely on the
callsites to provide barrier semantics.

Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.

Also remove a now superfluous lock acquisition from syncrep.c and a
superfluous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.

We might want to backpatch this at some later point, but right now the
barrier infrastructure in the backbranches isn't totally on par with
master.

Discussion: 20150112154026.GB2092@awork2.anarazel.de
2015-01-13 12:58:43 +01:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Andres Freund 2c0a485896 Prevent WAL files created by pg_basebackup -x/X from being archived again.
WAL (and timeline history) files created by pg_basebackup did not
maintain the new base backup's archive status. That's currently not a
problem if the new node is used as a standby - but if that node is
promoted, all still existing files can get archived again.  With a high
wal_keep_segments setting that can happen a significant time later -
which is quite confusing.

Change both the backend (for the -x/-X fetch case) and pg_basebackup
(for -X stream) itself to always mark WAL/timeline files included in
the base backup as .done. That's in line with walreceiver.c doing so.

The verbosity of the pg_basebackup changes shows pretty clearly that it
needs some refactoring, but that'd result in changes that are not
backpatchable.

Backpatch to 9.1 where pg_basebackup was introduced.

Discussion: 20141205002854.GE21964@awork2.anarazel.de
2015-01-03 20:54:12 +01:00
Andres Freund ccb161b66a Add pg_string_endswith as the start of a string helper library in src/common.
Backpatch to 9.3 where src/common was introduced, because a bugfix that
needs to be backpatched requires the function. Earlier branches will
have to duplicate the code.
2015-01-03 20:54:12 +01:00
Fujii Masao 4a5593197b Remove duplicate include of slot.h.
Back-patch to 9.4, where this problem was added.
2014-12-25 22:47:53 +09:00
Alvaro Herrera a609d96778 Revert "Use a bitmask to represent role attributes"
This reverts commit 1826987a46.

The overall design was deemed unacceptable, in discussion following the
previous commit message; we might find some parts of it still
salvageable, but I don't want to be on the hook for fixing it, so let's
wait until we have a new patch.
2014-12-23 15:35:49 -03:00
Alvaro Herrera 1826987a46 Use a bitmask to represent role attributes
The previous representation using a boolean column for each attribute
would not scale as well as we want to add further attributes.

Extra auxiliary functions are added to go along with this change, to
make up for the lost convenience of access of the old representation.

Catalog version bumped due to change in catalogs and the new functions.

Author: Adam Brightwell, minor tweaks by Álvaro
Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
2014-12-23 10:22:09 -03:00
Tom Lane 4a14f13a0a Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create().  Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate.  Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions.  hash_create() itself will
take care of optimizing when the key size is four bytes.

This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.

In future we could look into offering a similar optimized hashing function
for 8-byte keys.  Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.

For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules.  Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.

Teodor Sigaev and Tom Lane
2014-12-18 13:36:36 -05:00
Heikki Linnakangas 4d65e16a6f Misc comment typo fixes.
Backpatch the applicable parts, just to make backpatching future patches
easier.
2014-12-16 16:37:46 +02:00
Heikki Linnakangas b1332e98c4 Put the logic to decide which synchronous standby is active into a function.
This avoids duplicating the code.

Michael Paquier, reviewed by Simon Riggs and me
2014-12-12 14:26:42 +02:00
Tom Lane 06d5803ffa Fix assorted confusion between Oid and int32.
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.

Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).

Mark Dilger identified and patched the type violations; the message
rewordings are mine.
2014-12-11 15:41:15 -05:00
Alvaro Herrera 73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Andres Freund 0fd38e1370 Don't skip SQL backends in logical decoding for visibility computation.
The logical decoding patchset introduced the PROC_IN_LOGICAL_DECODING
PGXACT flag, which allows such backends to be skipped when computing
the xmin horizon/snapshots. That's fine and sensible for walsenders
streaming out logical changes, but not at all fine for SQL backends
doing logical decoding. If the latter set that flag any change they
have performed outside of logical decoding will not be regarded as
visible - which e.g. can lead to that change being vacuumed away.

Note that not setting the flag for SQL backends isn't particularly
bothersome - the SQL backend doesn't do streaming, so it only runs for
a limited amount of time.

Per buildfarm member 'tick' and Alvaro.

Backpatch to 9.4, where logical decoding was introduced.
2014-12-02 23:47:08 +01:00
Heikki Linnakangas b3fc6727ce Allow using connection URI in primary_conninfo.
The old method of appending options to the connection string didn't work if
the primary_conninfo was a postgres:// style URI, instead of a traditional
connection string. Use PQconnectdbParams instead.

Alex Shulgin
2014-11-25 18:26:05 +02:00
Heikki Linnakangas 0bd624d63b Distinguish XLOG_FPI records generated for hint-bit updates.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
2014-11-24 11:09:08 +02:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Peter Eisentraut a15d387c22 Improve logical decoding log messages
suggestions from Robert Haas
2014-11-13 20:44:34 -05:00
Andres Freund 89fd41b390 Fix and improve cache invalidation logic for logical decoding.
There are basically three situations in which logical decoding needs
to perform cache invalidation. During/After replaying a transaction
with catalog changes, when skipping an uninteresting transaction that
performed catalog changes and when erroring out while replaying a
transaction. Unfortunately these three cases were all done slightly
differently - partially because 8de3e410fa, which greatly simplifies
matters, got committed in the midst of the development of logical
decoding.

The actually problematic case was when logical decoding skipped
transaction commits (and thus processed invalidations). When used via
the SQL interface cache invalidation could access the catalog - bad,
because we didn't set up enough state to allow that correctly. It'd
not be hard to set up sufficient state, but the simpler solution is to
always perform cache invalidation outside a valid transaction.

Also make the different cache invalidation cases look as similar as
possible, to ease code review.

This fixes the assertion failure reported by Antonin Houska in
53EE02D9.7040702@gmail.com. The presented testcase has been expanded
into a regression test.

Backpatch to 9.4, where logical decoding was introduced.
2014-11-13 20:34:31 +01:00
Andres Freund 5a2c184058 Fix xmin/xmax horizon computation during logical decoding initialization.
When building the initial historic catalog snapshot there were
scenarios where snapbuild.c would use incorrect xmin/xmax values when
starting from a xl_running_xacts record. The values used were always a
bit suspect, but happened to be correct in the easy to test
cases. Notably the values used when the initial snapshot was
computed while no other transactions were running were correct.

This is likely to be the cause of the occasional buildfarm failures on
animals markhor and tick; but it's quite possible to reproduce
problems without CLOBBER_CACHE_ALWAYS.

Backpatch to 9.4, where logical decoding was introduced.
2014-11-13 20:34:30 +01:00
Andres Freund ec5896aed3 Fix several weaknesses in slot and logical replication on-disk serialization.
Heikki noticed in 544E23C0.8090605@vmware.com that slot.c and
snapbuild.c were missing the FIN_CRC32 call when computing/checking
checksums of on disk files. That doesn't lower the error detection
capabilities of the checksum, but is inconsistent with other usages.

In a followup mail Heikki also noticed that, contrary to a comment,
the 'version' and 'length' struct fields of replication slot's on disk
data were not covered by the checksum. That's not likely to lead to
actually missed corruption as those fields are cross checked with the
expected version and the actual file length. But it's wrong
nonetheless.

As fixing these issues makes existing on disk files unreadable, bump
the expected versions of on disk files for both slots and logical
decoding historic catalog snapshots.  This means that loading old
files will fail with
ERROR: "replication slot file ... has unsupported version 1"
and
ERROR: "snapbuild state file ... has unsupported version 1 instead of
2" respectively. Given the low likelihood of anybody already using
these new features in a production setup that seems acceptable.

Fixing these issues made me notice that there's no regression test
covering the loading of historic snapshot from disk - so add one.

Backpatch to 9.4 where these features were introduced.
2014-11-12 18:52:49 +01:00
Peter Eisentraut 8339f33d68 Message improvements 2014-11-11 20:02:30 -05:00
Alvaro Herrera 7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Heikki Linnakangas 5028f22f6e Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, it seems
safer to use a well-known algorithm.

Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-detecting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.

The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 11:39:48 +02:00
Robert Haas bc279c92f0 Fix typos.
David Rowley
2014-10-20 10:33:16 -04:00
Andres Freund 0ef3c29a4b Improve documentation about binary/textual output mode for output plugins.
Also improve related error message as it contributed to the confusion.

Discussion: CAB7nPqQrqFzjqCjxu4GZzTrD9kpj6HMn9G5aOOMwt1WZ8NfqeA@mail.gmail.com,
    CAB7nPqQXc_+g95zWnqaa=mVQ4d3BVRs6T41frcEYi2ocUrR3+A@mail.gmail.com

Per discussion between Michael Paquier, Robert Haas and Andres Freund

Backpatch to 9.4 where logical decoding was introduced.
2014-10-01 13:22:17 +02:00
Andres Freund 56a312aac8 Fix VPATH builds of the replication parser from git for some !gcc compilers.
Some compilers don't automatically search the current directory for
included files. 9cc2c182fc fixed that for builds from tarballs by
adding an include to the source directory. But that doesn't work when
the scanner is generated in the VPATH directory. Use the same search
path as the other parsers in the tree.

One compiler that definitely was affected is solaris' sun cc.

Backpatch to 9.1 which introduced using an actual parser for
replication commands.
2014-09-25 15:22:26 +02:00
Fujii Masao 4ad2a54805 Add GUC to enable logging of replication commands.
Previously replication commands like IDENTIFY_SYSTEM were not logged
even when log_statement is set to all. Some users who want to audit
all types of statements were not satisfied with this situation. To
address the problem, this commit adds a new GUC, log_replication_commands.
If it's enabled, all replication commands are logged in the server log.

There are many ways to allow us to enable that logging. For example,
we can extend log_statement so that replication commands are logged
when it's set to all. But per discussion in the community, we reached
the consensus to add a separate GUC for that.

Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.
2014-09-13 02:55:45 +09:00
Peter Eisentraut 303f4d1012 Assorted message fixes and improvements 2014-09-05 01:25:27 -04:00
Andres Freund 5a64cb740d Fix s/pluggins/plugins/ typo in two comments.
Michael Paquier
2014-09-01 12:01:29 +02:00
Andres Freund 8fff977e29 Declare two variables in snapbuild.c as static.
Neither is accessed externally, I just seem to have missed the static
when writing the code.
2014-08-31 23:53:12 +02:00
Fujii Masao c476288653 Revert "Fix bug in checking of IDENTIFY_SYSTEM result."
This reverts commit 083d29c65b.

The commit changed the code so that it raises an error when
IDENTIFY_SYSTEM returns three columns. But that prevents us
from using the replication-related utilities against servers
of older versions. This is not what we want. For that
compatibility, we allow the utilities to receive three columns
as the result of IDENTIFY_SYSTEM even though it actually returns
four columns in 9.4 or later.

Pointed out by Andres Freund.
2014-08-19 18:30:38 +09:00
Fujii Masao 083d29c65b Fix bug in checking of IDENTIFY_SYSTEM result.
5a991ef869 added a new column to
the result of the IDENTIFY_SYSTEM command. But that was not reflected in
several pieces of code checking that result. Specifically, though the number of
columns in the result was increased to 4, it was still compared with 3
in some replication code.

Back-patch to 9.4 where the number of columns in IDENTIFY_SYSTEM
result was increased.

Report from Michael Paquier
2014-08-19 17:26:07 +09:00
Andres Freund 41d5f8ad73 Be less aggressive in asking for feedback of logical walsender clients.
When doing logical decoding using START_LOGICAL_REPLICATION in a
walsender process the walsender sometimes was sending out keepalive
messages too frequently, asking for feedback every time.

WalSndWaitForWal() sends out keepalive messages when it's waiting for
new WAL to be generated locally when it sees that the remote side
hasn't yet flushed WAL up to the local position. That generally is
good but causes problems if the remote side only writes but doesn't
flush changes yet. So check for both remote write and flush position.

Additionally we've asked for feedback to the keepalive message which
isn't warranted when waiting for WAL in contrast to preventing
timeouts because of wal_sender_timeout.

Complaint and patch by Steve Singer.
2014-08-12 11:04:50 +02:00
Heikki Linnakangas 54685338e3 Move log_newpage and log_newpage_buffer to xlog.c.
log_newpage is used by many indexams, in addition to heap, but for
historical reasons it's always been part of the heapam rmgr. Starting with
9.3, we have another WAL record type for logging an image of a page,
XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to
xlog.c, and switch to using the XLOG_FPI record type.

Bump the WAL version number because the code to replay the old HEAP_NEWPAGE
records is removed.
2014-07-31 16:48:55 +03:00
Andres Freund 93a028f569 Properly remove ephemeral replication slots after a crash restart.
Ephemeral slots - slots that shouldn't survive database restarts -
weren't properly cleaned up after an immediate/crash restart. They were
ignored in the sense that they weren't restored into memory and thus
didn't cause unwanted resource retention; but they prevented a new
slot with the same name from being created.

Now ephemeral slots are fully removed during startup.

Backpatch to 9.4 where replication slots were added.
2014-07-24 14:44:45 +02:00
Noah Misch 0ffc201a51 Add file version information to most installed Windows binaries.
Prominent binaries already had this metadata.  A handful of minor
binaries, such as pg_regress.exe, still lack it; efforts to eliminate
such exceptions are welcome.

Michael Paquier, reviewed by MauMau.
2014-07-14 14:07:52 -04:00
Andres Freund 626bfad6cc Fix decoding of consecutive MULTI_INSERTs emitted by one heap_multi_insert().
Commit 1b86c81d2d fixed the decoding of toasted columns for the rows
contained in one xl_heap_multi_insert record. But that's not actually
enough, because heap_multi_insert() will actually first toast all
passed in rows and then emit several *_multi_insert records; one for
each page it fills with tuples.

Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in
xl_heap_multi_insert->flag denoting that this multi_insert record is
the last emitted by one heap_multi_insert() call. Then use that flag
in decode.c to only set clear_toast_afterwards in the right situation.

Expand the number of rows inserted via COPY in the corresponding
regression test to make sure that more than one heap page is filled
with tuples by one heap_multi_insert() call.

Backpatch to 9.4 like the previous commit.
2014-07-12 14:28:19 +02:00
Andres Freund 1b86c81d2d Fix decoding of MULTI_INSERTs when rows other than the last are toasted.
When decoding the results of a HEAP2_MULTI_INSERT (currently only
generated by COPY FROM) toast columns for all but the last tuple
weren't replaced by their actual contents before being handed to the
output plugin. The reassembled toast datums were disregarded after
every REORDER_BUFFER_CHANGE_(INSERT|UPDATE|DELETE) which is correct
for plain inserts, updates, deletes, but not multi inserts - there we
generate several REORDER_BUFFER_CHANGE_INSERTs for a single
xl_heap_multi_insert record.

To solve the problem add a clear_toast_afterwards boolean to
ReorderBufferChange's union member that's used by modifications. All
row changes but multi_inserts always set that to true, but
multi_insert sets it only for the last change generated.

Add a regression test covering decoding of multi_inserts - there was
none at all before.

Backpatch to 9.4 where logical decoding was introduced.

Bug found by Petr Jelinek.
2014-07-06 15:58:01 +02:00
Andres Freund a36a8fa376 Rename logical decoding's pg_llog directory to pg_logical.
The old name wasn't very descriptive of the actual contents of the
directory, which are historical snapshots in the snapshots/
subdirectory and mapping data for rewritten tuples in
mappings/. There's been a fair amount of discussion what would be a
good name. I'm settling for pg_logical because it's likely that
further data around logical decoding and replication will need saving
in the future.

Also add the missing entry for the directory into storage.sgml's list
of PGDATA contents.

Bumps catversion as the data directories won't be compatible.
2014-07-02 21:07:47 +02:00
Andres Freund 1cbc948010 Check interrupts during logical decoding more frequently.
When reading large amounts of preexisting WAL during logical decoding
using the SQL interface we possibly could fail to check interrupts in
due time. Similarly the same could happen on systems with a very high
WAL volume while creating a new logical replication slot, independent
of the used interface.

Previously these checks were only performed in xlogreader's read_page
callbacks, while waiting for new WAL to be produced. That's not
sufficient though, if there's never a need to wait.  Walsender's send
loop already contains an interrupt check.

Backpatch to 9.4 where the logical decoding feature was introduced.
2014-06-30 10:49:39 +02:00
Andres Freund e04a9ccd2c Consistency improvements for slot and decoding code.
Change the order of checks in similar functions to be the same; remove
a parameter that's not needed anymore; rename a memory context and
expand a couple of comments.

Per review comments from Amit Kapila
2014-06-12 13:33:27 +02:00
Fujii Masao a26ae56f51 Fix typos in comments. 2014-06-11 20:54:06 +09:00
Andres Freund fe7337f2dc Fix off-by-one in decoding causing one-record events to be skipped.
A ReorderBufferTransaction's end_lsn, the sentPtr advertised by
walsender keepalive messages, and the end location remembered by the
decoding get_*changes* SQL functions all use the location of the last
read record + 1. I.e. the LSN points to the beginning of the next
record. That cannot realistically be changed without changing the
replication protocol because that's how keepalive messages have worked
since 9.0.
The bug is that the logic inside the snapshot builder, which decides
whether a transaction's contents should be decoded, assumed the start
location would point towards the last byte of the last record. The
reason this didn't actually cause visible problems is that currently
that decision is only made for commit records. Since interesting
transactions always have at least one additional record - containing
actual data - we'd never skip a transaction.
But if there ever were transactions, or other events, with just one
record containing important information, we'd skip them after stopping
and restarting logical decoding.
2014-06-05 18:27:11 +02:00
Tom Lane 71ed8b3ca7 Revert "Fix bogus %name-prefix option syntax in all our Bison files."
This reverts commit 45b7abe59e.

It turns out that the %name-prefix syntax without "=" does not work
at all in pre-2.4 Bison.  We are not prepared to make such a large
jump in minimum required Bison version just to suppress a warning
message in a version hardly any developers are using yet.
When 3.0 gets more popular, we'll figure out a way to deal with this.
In the meantime, BISONFLAGS=-Wno-deprecated is recommendable for
anyone using 3.0 who doesn't want to see the warning.
2014-05-28 19:21:01 -04:00
Andres Freund 21d48d66c8 Don't pay heed to wal_sender_timeout while creating a decoding slot.
Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for
further WAL using WalSndWaitForWal(). That used to always respect
wal_sender_timeout and kill the session when waiting long enough
because no feedback/ping messages can be sent while the slot is still
being created.
Introduce the notion that last_reply_timestamp = 0 means that the
walsender currently doesn't need timeout processing to avoid that
problem. Use that notion for CREATE_REPLICATION_SLOT ... LOGICAL.
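
A rough sketch of the sentinel convention, assuming millisecond timestamps and a hypothetical helper named `timeout_expired` (the real walsender code is structured differently). Treating a non-positive timeout as "disabled" is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampMs;    /* milliseconds; illustrative type only */

/*
 * Returns true when the connection should be terminated for inactivity.
 * A last_reply of 0 means "timeout processing is currently not needed",
 * which is the state used while a logical slot is still being created.
 */
bool
timeout_expired(TimestampMs now, TimestampMs last_reply, int timeout_ms)
{
    if (last_reply == 0 || timeout_ms <= 0)
        return false;
    return now >= last_reply + timeout_ms;
}
```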

Bugreport and initial patch by Steve Singer, revised by me.
2014-05-29 00:32:09 +02:00
Tom Lane 45b7abe59e Fix bogus %name-prefix option syntax in all our Bison files.
%name-prefix doesn't use an "=" sign according to the Bison docs, but it
silently accepted one anyway, until Bison 3.0.  This was originally a
typo of mine in commit 012abebab1, and we
seem to have slavishly copied the error into all the other grammar files.

Per report from Vik Fearing; analysis by Peter Eisentraut.

Back-patch to all active branches, since somebody might try to build
a back branch with up-to-date tools.
2014-05-28 15:41:53 -04:00
Magnus Hagander 8232d6df4c Ensure cleanup in case of early errors in streaming base backups
Move the code that sends the initial status information as well as the
calculation of paths inside the ENSURE_ERROR_CLEANUP block. If this code
failed, we would "leak" a counter of number of concurrent backups, thereby
making the system always believe it was in backup mode. This could happen
if the sending failed (which it probably never did given that the small
amount of data to send would never cause a flush) or if the psprintf calls
ran out of memory. Both are very low risk, but all operations after
do_pg_start_backup should be protected.
2014-05-28 12:43:29 +02:00
Heikki Linnakangas 57b7e83b0d Fix misc typos in comments. 2014-05-23 08:16:21 -04:00
Fujii Masao 19a683f69f Fix typos in comments. 2014-05-22 12:43:50 +09:00
Tom Lane c1907f0cc4 Fix a bunch of functions that were declared static then defined not-static.
Per testing with a compiler that whines about this.
2014-05-17 17:57:53 -04:00
Tom Lane 6c42b2b10a Fix unaligned accesses in DecodeUpdate().
The xl_heap_header_len structures in an XLOG_HEAP_UPDATE record aren't
necessarily aligned adequately.  The regular replay function for these
records is aware of that, but decode.c didn't get the memo.  I'm not
sure why the buildfarm failed to catch this; the test_decoding test
certainly blows up real good on my old HPPA box.

Also, I'm pretty sure that the address arithmetic was wrong for the
case of XLOG_HEAP_CONTAINS_OLD and not XLOG_HEAP_CONTAINS_NEW_TUPLE,
though this apparently can't happen when logical decoding is active.
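
The usual remedy for this class of bug is to copy the header into an aligned local variable before reading its fields. The sketch below uses a hypothetical `TupleHeaderLen` struct and `read_header` helper; it is not the layout of xl_heap_header_len nor the code in decode.c.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header; stands in for a length-prefixed tuple header. */
typedef struct TupleHeaderLen
{
    uint16_t    t_len;
    uint16_t    t_infomask;
} TupleHeaderLen;

/*
 * Read a header that may start at any byte offset inside a record.
 * Casting (buf + off) to TupleHeaderLen * and dereferencing it is
 * undefined behavior when the address is misaligned, and it traps on
 * strict-alignment hardware; memcpy into an aligned local is safe.
 */
TupleHeaderLen
read_header(const char *buf, size_t off)
{
    TupleHeaderLen hdr;

    memcpy(&hdr, buf + off, sizeof(hdr));
    return hdr;
}
```

On x86 the misaligned dereference usually works anyway, which is one plausible reason such bugs survive testing on common hardware.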
2014-05-17 15:53:21 -04:00
Heikki Linnakangas 03e2b1017c Fix thinko in logical decoding of commit-prepared records.
The decoding of prepared transaction commits accidentally used the XID of
the transaction performing the COMMIT PREPARED, not the XID of the prepared
transaction. Before bb38fb0d43 that led to those transactions not being
decoded, afterwards to an assertion failure.
2014-05-16 10:53:10 +03:00
Heikki Linnakangas ff810b4928 Misc message style and doc fixes.
Euler Taveira
2014-05-15 14:49:11 +03:00
Robert Haas f1d8dd3647 Code review for logical decoding patch.
Post-commit review identified a number of places where addition was
used instead of multiplication or memory wasn't zeroed where it should
have been.  This commit also fixes one case where a structure member
was mis-initialized, and moves another memory allocation closer to
the place where the allocated storage is used for clarity.

Andres Freund
2014-05-09 10:44:04 -04:00
Bruce Momjian 0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not affect backpatching.
2014-05-06 12:12:18 -04:00
Heikki Linnakangas 3a8e9e977f Fix use of free in walsender error handling after a sysid mismatch.
Found via valgrind. The bug exists since the introduction of the walsender,
so backpatch to 9.0.

Andres Freund
2014-05-06 15:17:41 +03:00
Heikki Linnakangas 377790fbd7 Pass sensible value to memset() when randomizing reorderbuffer's tuple slab.
This is entirely harmless, but still wrong. Noticed by coverity.

Andres Freund
2014-05-05 16:22:15 +03:00
Heikki Linnakangas c834576839 Use Size instead of uint32 to store result of sizeof()
Silences coverity and is more consistent with other functions in the
same file.

Andres Freund
2014-05-05 16:17:16 +03:00
Tom Lane 203b0d132f Improve error messages in reorderbuffer.c.
Be more clear about failure cases in relfilenode->relation lookup,
and fix some other places that were inconsistent or not per our
message style guidelines.

Andres Freund and Tom Lane
2014-04-30 18:16:53 -04:00
Tom Lane 2d00190495 Rationalize common/relpath.[hc].
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that.  To clean it up:

* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.

* Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.

* So, relfilenode.h now includes relpath.h rather than vice-versa.  This
direction of dependency is fine.  (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)

* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.

* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().

* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers.  (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)

* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().

* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames.  This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.

The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.

Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 17:30:50 -04:00
Heikki Linnakangas 150a9df528 Fix a few more misc typos in comments. 2014-04-10 00:53:55 +03:00
Heikki Linnakangas 5b075ae893 Fix misc typos in comments. 2014-04-09 23:16:35 +03:00
Tom Lane 8120c7452a Fix bogus time printout in walreceiver's debug log messages.
The displayed sendtime and receipttime were always exactly equal, because
somebody forgot that timestamptz_to_str returns a static buffer (thereby
simplifying life for most callers, at the cost of complicating it for those
who need two results concurrently).  Apply the same pstrdup solution used
by the other call sites with this issue.  Back-patch to 9.2 where the
faulty code was introduced.  Per bug #9849 from Haruka Takatsuka, though
this is not exactly his patch.

Possibly we should change timestamptz_to_str's API, but I wouldn't want
to do so in the back branches.
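
The static-buffer hazard and the copy-first remedy can be sketched as follows. `ts_to_str` and `format_pair` are hypothetical stand-ins; the sketch uses malloc/free where backend code would use pstrdup, and it does not reproduce the real timestamptz_to_str signature.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical formatter that returns a pointer to a static buffer. */
const char *
ts_to_str(long ts)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "ts=%ld", ts);
    return buf;
}

/*
 * Format two timestamps for one log line.  The first result must be
 * copied before the second call overwrites the shared buffer.
 * Returns 0 on success, -1 if the copy could not be allocated.
 */
int
format_pair(char *out, size_t outlen, long sendtime, long receipttime)
{
    const char *s = ts_to_str(sendtime);
    size_t      n = strlen(s) + 1;
    char       *send_copy = malloc(n);

    if (send_copy == NULL)
        return -1;
    memcpy(send_copy, s, n);

    snprintf(out, outlen, "sendtime %s receipttime %s",
             send_copy, ts_to_str(receipttime));
    free(send_copy);
    return 0;
}
```

Passing both `ts_to_str()` results directly to one format call would print the same string twice, because both arguments point at the same buffer.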
2014-04-04 11:44:04 -04:00
Robert Haas 3f0e4be453 Fix thinko in logical decoding code.
Andres Freund
2014-03-31 13:03:18 -04:00
Alvaro Herrera f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Fujii Masao 2bccced110 Fix typos in comments.
Thom Brown
2014-03-17 20:47:28 +09:00
Fujii Masao 5c6d9fc4b2 Fix bug in clean shutdown of walsender that pg_receivexlog is connecting to.
On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and then exits. It determined whether replication had completed by checking
whether its sent location was equal to the standby's flush location.
Unfortunately this condition never becomes true when the connected client,
such as pg_receivexlog, always reports an invalid flush location, and so
walsender waits forever.

This commit changes walsender so that it checks the standby's write
location instead if the flush location is invalid.

Back-patch to 9.1 where enough infrastructure for this exists.
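
A minimal sketch of the caught-up test after this change, assuming that an invalid location is represented by 0. `Lsn`, `INVALID_LSN` and `receiver_caught_up` are illustrative names, not the walsender's own.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;
#define INVALID_LSN ((Lsn) 0)   /* assumed representation of "invalid" */

/*
 * Decide whether the receiver has caught up with everything sent.
 * Receivers that never report a flush position are judged by their
 * write position instead.
 */
bool
receiver_caught_up(Lsn sent, Lsn write, Lsn flush)
{
    Lsn         replicated = (flush == INVALID_LSN) ? write : flush;

    return sent == replicated;
}
```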
2014-03-17 20:37:50 +09:00
Robert Haas 890194f14d Comment fixes related to logical decoding.
Andres Freund, per complaints by Peter Eisentraut.
2014-03-12 14:03:09 -04:00
Robert Haas 5a991ef869 Allow logical decoding via the walsender interface.
In order for this to work, walsenders need the optional ability to
connect to a database, so the "replication" keyword now allows true
or false, for backward-compatibility, and the new value "database"
(which causes the "dbname" parameter to be respected).

walsender needs to loop not only when idle but also when sending
decoded data to the user and when waiting for more xlog data to decode.
This means that there are now three separate loops inside walsender.c;
although some refactoring has been done here, this is still a bit ugly.

Andres Freund, with contributions from Álvaro Herrera, and further
review by me.
2014-03-10 13:50:28 -04:00
Tom Lane ea177a3ba7 Remove unportable use of anonymous unions from reorderbuffer.h.
In b89e151054 I had assumed it was ok to use anonymous unions as
struct members, but while a longstanding extension in many compilers,
it's only been standardized in C11.

To fix, remove one of the anonymous unions which tried to hide some
implementation specific enum values and give the other a name. The
latter unfortunately requires changes in output plugins, but since the
feature has only been added a few days ago...

Andres Freund
2014-03-07 17:03:26 -05:00
Bruce Momjian 91d9de9751 fix ReplicationSlotsCountDBSlots for dropping unrelated databases
YAMAMOTO Takashi
2014-03-07 11:42:18 -05:00
Heikki Linnakangas 94ae6ba74d Send keepalives from walsender even when busy sending WAL.
If walsender doesn't hear from the client for the time specified by
wal_sender_timeout, it will conclude the connection or client is dead, and
disconnect. When half of wal_sender_timeout has elapsed, it sends a ping
to the client, leaving it the remaining half of wal_sender_timeout to
respond. However, it only checked if half of wal_sender_timeout had elapsed
when it was about to sleep, so if it was busy sending WAL to the client for
long enough, it would not send the ping request in time. Then the client
would not know it needs to send a reply, and the walsender will disconnect
even though the client is still alive. Fix that.

Andres Freund, reviewed by Robert Haas, and some further changes by me.
Backpatch to 9.3. Earlier versions relied on the client to send the
keepalives on its own, and hence didn't have this problem.
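
The half-timeout rule can be sketched as a check that is cheap enough to run on every loop iteration rather than only before sleeping. `PingState` and `ping_due` are hypothetical names, and millisecond timestamps are an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampMs;    /* milliseconds; illustrative type only */

typedef struct PingState
{
    TimestampMs last_reply;     /* when the client was last heard from */
    bool        waiting_for_ping_response;
} PingState;

/*
 * Call on every iteration of the send loop, not only before sleeping.
 * Returns true when a ping request should be sent now.
 */
bool
ping_due(PingState *st, TimestampMs now, int timeout_ms)
{
    if (timeout_ms <= 0 || st->waiting_for_ping_response)
        return false;
    if (now < st->last_reply + timeout_ms / 2)
        return false;
    st->waiting_for_ping_response = true;   /* remember that we asked */
    return true;
}
```

Recording that a ping is outstanding keeps the loop from requesting replies repeatedly while it waits for the client to answer.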
2014-03-06 21:38:51 +02:00
Robert Haas 406a1a9ef0 Fix some typos introduced by the logical decoding patch.
Erik Rijkers
2014-03-05 13:00:22 -05:00
Robert Haas 7e8db2dc42 Minor corrections to logical decoding patch. 2014-03-04 11:07:54 -05:00
Heikki Linnakangas 7558cc95d3 Error out on send failure in walsender loop.
I changed the loop in 9.3 to use "goto send_failure" instead of "break" on
errors, but I missed this one case. It was a relatively harmless bug: if
the flush fails once it will most likely fail again as soon as we try to
flush the output again. But it's a bug nevertheless.

Report and fix by Andres Freund.
2014-03-04 15:36:05 +02:00
Robert Haas b89e151054 Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the affected tables.  The output format is controlled by a
so-called "output plugin"; an example is included.  To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.

Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.

Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Geoghegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao,
Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.
2014-03-03 16:32:18 -05:00
Robert Haas 34c6d9611d Use a longer buffer in libpqrcv_startstreaming.
Because of the new SLOT clause in the START_REPLICATION command, it's
possible for the command to end up too long for the old maximum buffer
length.

Andres Freund
2014-03-03 07:24:52 -05:00
Robert Haas d83ee62231 Corrections to replication slots code and documentation.
Andres Freund, per a report from Vik Fearing
2014-03-03 07:16:54 -05:00
Stephen Frost 5592ebac55 Another round of Coverity fixes
Additional non-security issues/improvements spotted by Coverity.

In backend/libpq, no sense trying to protect against port->hba being
NULL after we've already dereferenced it in the switch() statement.

Protect against possible overflow due to 32-bit arithmetic in
basebackup throttling (not yet released, so no security concern).

Remove nonsensical check of array pointer against NULL in procarray.c,
looks to be a holdover from 9.1 and earlier when there were pointers
being used but now it's just an array.

Remove pointer check-against-NULL in tsearch/spell.c as we had already
dereferenced it above (in the strcmp()).

Remove dead code from adt/orderedsetaggs.c, isnull is checked
immediately after each tuplesort_getdatum() call and if true we return,
so no point checking it again down at the bottom.

Remove recently added minor error-condition memory leak in pg_regress.
2014-03-03 03:18:51 -05:00
Alvaro Herrera ef5856fd9b Allow BASE_BACKUP to be throttled
A new MAX_RATE option allows imposing a limit to the network transfer
rate from the server side.  This is useful to limit the stress that
taking a base backup has on the server.

pg_basebackup is now able to specify a value to the server, too.

Author: Antonin Houska

Patch reviewed by Stefan Radomski, Andres Freund, Zoltán Böszörményi,
Fujii Masao, and Álvaro Herrera.
2014-02-27 18:55:57 -03:00
Robert Haas 278c94209b pg_basebackup: Skip only the *contents* of pg_replslot.
Include the directory itself.

Fujii Masao
2014-02-25 11:23:45 -05:00
Robert Haas 6615e77439 Use pg_lsn data type in pg_stat_replication, too.
Michael Paquier, per a suggestion from Andres Freund
2014-02-24 10:38:45 -05:00
Robert Haas 6f289c2b7d Switch various builtin functions to use pg_lsn instead of text.
The functions in slotfuncs.c don't exist in any released version,
but the changes to xlogfuncs.c represent backward-incompatibilities.
Per discussion, we're hoping that the queries using these functions
are few enough and simple enough that this won't cause too much
breakage for users.

Michael Paquier, reviewed by Andres Freund and further modified
by me.
2014-02-19 11:37:43 -05:00
Fujii Masao 3e8554a54a Make pg_basebackup skip temporary statistics files.
The temporary statistics files don't need to be included in the backup
because they are always reset at the beginning of the archive recovery.
This patch changes pg_basebackup so that it skips all files located in
$PGDATA/pg_stat_tmp or the directory specified by stats_temp_directory
parameter.
2014-02-03 23:19:49 +09:00
Tom Lane 46825d4978 Clean up some sloppy coding in repl_gram.y.
Remove unused copy-and-pasted macro definitions, and improve formatting
of recently-added productions.

I got interested in this because buildfarm member protosciurus has been
crashing in "bison repl_gram.y" since commit 858ec11.  It's a long shot
that this will fix that, though maybe the missing trailing semicolon
has something to do with it?  In any case, there's no need to approve
of dead code, nor of code whose formatting isn't even self-consistent
let alone consistent with what's around it.
2014-02-02 12:51:14 -05:00
Fujii Masao 63be3b78f6 Fix typos in docs and comments.
Thom Brown
2014-02-02 10:28:18 +09:00
Tom Lane 214c7a4f0b Fix some more bugs in signal handlers and process shutdown logic.
WalSndKill was doing things exactly backwards: it should first clear
MyWalSnd (to stop signal handlers from touching MyWalSnd->latch),
then disown the latch, and only then mark the WalSnd struct unused by
clearing its pid field.

Also, WalRcvSigUsr1Handler and worker_spi_sighup failed to preserve
errno, which is surely a requirement for any signal handler.

Per discussion of recent buildfarm failures.  Back-patch as far
as the relevant code exists.
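
The errno requirement amounts to a save-and-restore around the handler body. The following is a generic sketch with hypothetical names (`got_sigusr1`, `wake_main_loop`, `sigusr1_handler`), not the code of WalRcvSigUsr1Handler or worker_spi_sighup.

```c
#include <errno.h>
#include <signal.h>

volatile sig_atomic_t got_sigusr1 = 0;

/* Stand-in for work that may clobber errno, e.g. a write() to a pipe. */
void
wake_main_loop(void)
{
    errno = EAGAIN;
}

/*
 * Signal handler that leaves errno as it found it, so the interrupted
 * code still sees the errno of its own last system call.
 */
void
sigusr1_handler(int signo)
{
    int         save_errno = errno;

    (void) signo;
    got_sigusr1 = 1;
    wake_main_loop();

    errno = save_errno;
}
```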
2014-02-01 16:21:23 -05:00
Robert Haas 858ec11858 Introduce replication slots.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts.  Slots have some advantages over existing
techniques, as explained in the documentation.

In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.

Andres Freund and Robert Haas
2014-01-31 22:45:36 -05:00
Fujii Masao dd515d4082 Change the suffix of auto conf temporary file from "temp" to "tmp".
Michael Paquier
2014-01-27 12:39:11 +09:00
Heikki Linnakangas a472ae1e4e Fix Hot Standby feedback sending when streaming busily.
Commit 6f60fdd701 accidentally removed a
call to XLogWalRcvSendHSFeedback() after flushing received WAL to disk.
The consequence is that when walsender is busy streaming WAL, walreceiver
doesn't send HS feedback messages. One is sent if nothing is received from
the master for 100ms, but if there's a steady stream of WAL, it never happens.

Backpatch to 9.3.

Andres Freund and Amit Kapila
2014-01-16 23:15:41 +02:00
Bruce Momjian 7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Magnus Hagander b168c5ef27 Avoid including tablespaces inside PGDATA twice in base backups
If a tablespace was created inside PGDATA it was backed up both as part
of the PGDATA backup and as the backup of the tablespace. Avoid this
by skipping any directory inside PGDATA that contains one of the active
tablespaces.

Dimitri Fontaine and Magnus Hagander
2014-01-07 17:11:32 +01:00
Tatsuo Ishii 65d6e4cb5c Add ALTER SYSTEM command to edit the server configuration file.
Patch contributed by Amit Kapila. Reviewed by Hari Babu, Masao Fujii,
Boszormenyi Zoltan, Andres Freund, Greg Smith and others.
2013-12-18 23:42:44 +09:00
Heikki Linnakangas dde6282500 Fix more instances of "the the" in comments.
Plus one instance of "to to" in the docs.
2013-12-13 20:02:01 +02:00
Heikki Linnakangas a93bdfc711 Fix typo in comment.
Also line-wrap an over-wide line in a comment that's ignored by pgindent.
2013-09-03 13:17:09 +03:00
Magnus Hagander db4ef73760 Don't crash when pg_xlog is empty and pg_basebackup -x is used
The backup will not work (without a log archive, and that's the whole
point of -x) in this case; this patch just changes it to throw an
error instead of crashing when this happens.

Noticed and diagnosed by TAKATSUKA Haruka
2013-08-24 17:13:49 +02:00
Peter Eisentraut 229fb58d4f Treat timeline IDs as unsigned in replication parser
Timeline IDs are unsigned ints everywhere, except the replication parser
treated them as signed ints.
2013-08-14 23:18:49 -04:00
Peter Eisentraut 626092a2e1 Message style improvements 2013-07-28 07:01:13 -04:00
Fujii Masao 985bd7d497 Support clean switchover.
In replication, when we shut down the master, walsender tries to send
all the outstanding WAL records to the standby, and then to exit. This
basically means that all the WAL records are fully synced between
two servers after the clean shutdown of the master. So, after
promoting the standby to new master, we can restart the stopped
master as new standby without the need for a fresh backup from
new master.

But there was one problem so far: though walsender tries to send all
the outstanding WAL records, it doesn't wait for them to be replicated
to the standby. Then, before receiving all the WAL records,
walreceiver can detect the closure of connection and exit. We cannot
guarantee that there is no missing WAL in the standby after clean
shutdown of the master. In this case, backup from new master is
required when restarting the stopped master as new standby.

This patch fixes this problem. It just changes walsender so that it
waits for all the outstanding WAL records to be replicated to the
standby before closing the replication connection.

Per discussion, this is a fix that needs to get backpatched rather than
new feature. So, back-patch to 9.1 where enough infrastructure for
this exists.

Patch by me, reviewed by Andres Freund.
2013-06-26 02:14:37 +09:00
Peter Eisentraut d7eb6f46de Minor spell checking 2013-05-30 20:56:58 -04:00
Bruce Momjian 9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Heikki Linnakangas 2ffa66f497 Fix walsender failure at promotion.
If a standby server has a cascading standby server connected to it, it's
possible that WAL has already been sent up to the next WAL page boundary,
splitting a WAL record in the middle, when the first standby server is
promoted. Don't throw an assertion failure or error in walsender if that
happens.

Also, fix a variant of the same bug in pg_receivexlog: if it had already
received WAL on previous timeline up to a segment boundary, when the
upstream standby server is promoted so that the timeline switch record falls
on the previous segment, pg_receivexlog would miss the segment containing
the timeline switch. To fix that, have walsender send the position of the
timeline switch at end-of-streaming, in addition to the next timeline's ID.
It was previously assumed that the switch happened exactly where the
streaming stopped.

Note: this is an incompatible change in the streaming protocol. You might
get an error if you try to stream over timeline switches, if the client is
running 9.3beta1 and the server is more recent. It should be fine after a
reconnect, however.

Reported by Fujii Masao.
2013-05-08 20:30:17 +03:00
Heikki Linnakangas 28ba260906 In base backup, only include our own tablespace version directory.
If you have clusters of different versions pointing to the same tablespace
location, we would incorrectly include all the data belonging to the other
versions, too.

Fixes bug #7986, reported by Sergey Burladyan.
2013-03-25 20:19:22 +02:00
Tom Lane da5aeccf64 Move pqsignal() to libpgport.
We had two copies of this function in the backend and libpq, which was
already pretty bogus, but it turns out that we need it in some other
programs that don't use libpq (such as pg_test_fsync).  So put it where
it probably should have been all along.  The signal-mask-initialization
support in src/backend/libpq/pqsignal.c stays where it is, though, since
we only need that in the backend.
2013-03-17 12:06:42 -04:00
Heikki Linnakangas 3a9e64aa0d Cannot use WL_SOCKET_WRITEABLE without WL_SOCKET_READABLE.
In copy-out mode, the frontend should not send any messages until the
backend has finished streaming, by sending a CopyDone message. I'm not sure
if it would be legal for the client to send a new query before receiving the
CopyDone message from the backend, but trying to support that would require
bigger changes to the backend code structure.

Fixes an assertion failure reported by Fujii Masao.
2013-02-27 19:28:51 +02:00
Peter Eisentraut 4f36292669 Add quotes to messages 2013-02-22 23:33:07 -05:00
Simon Riggs c2f79ba269 Force archive_status of .done for xlogs created by dearchival/replication.
This is a forward-patch of commit 6f4b8a4f4f,
applied to 9.2 back in August. The plan was to do something else in master,
but it looks like it's not going to happen, so let's just apply the 9.2
solution to master as well.

Fujii Masao
2013-02-15 19:28:06 +02:00
Peter Eisentraut 0cb1fac3b1 Add noreturn attributes to some error reporting functions 2013-02-12 07:13:22 -05:00
Simon Riggs bd56e74127 Reset master xmin when hot_standby_feedback disabled.
If walsender has xmin of standby then ensure we
reset the value to 0 when we change from hot_standby_feedback=on
to hot_standby_feedback=off.
2013-02-04 10:29:22 +00:00
Heikki Linnakangas 990fe3c4ed Fix more issues with cascading replication and timeline switches.
When a standby server follows the master using WAL archive, and it chooses
a new timeline (recovery_target_timeline='latest'), it only fetches the
timeline history file for the chosen target timeline, not any other history
files that might be missing from pg_xlog. For example, if the current
timeline is 2, and we choose 4 as the new recovery target timeline, the
history file for timeline 3 is not fetched, even if it's part of this
server's history. That's enough for the standby itself - the history file
for timeline 4 includes timeline 3 as well - but if a cascading standby
server wants to recover to timeline 3, it needs the history file. To fix,
when a new recovery target timeline is chosen, try to copy any missing
history files from the archive to pg_xlog between the old and new target
timeline.

A second similar issue was with the WAL files. When a standby recovers from
archive, and it reaches a segment that contains a switch to a new timeline,
recovery fetches only the WAL file labelled with the new timeline's ID. The
file from the new timeline contains a copy of the WAL from the old timeline
up to the point where the switch happened, and recovery recovers it from the
new file. But in streaming replication, walsender only tries to read it
from the old timeline's file. To fix, change walsender to read it from the
new file, so that it behaves the same as recovery in that sense, and doesn't
try to open the possibly nonexistent file with the old timeline's ID.
2013-01-23 10:19:20 +02:00
Heikki Linnakangas 6f7cddc7ae Now that START_REPLICATION returns the next timeline's ID after reaching end
of timeline, take advantage of that in walreceiver.

Startup process is still in control of choosing the target timeline, by
scanning the timeline history files present in pg_xlog, but walreceiver now
uses the next timeline's ID to fetch its history file immediately after it
has finished streaming the old timeline. Before, the standby would first try
to restart streaming on the old timeline, which fetches the missing timeline
history file as a side-effect, and only then restart from the new timeline.
This patch eliminates the extra iteration, which speeds up the timeline
switch and reduces the noise in the log caused by the extra restart on the
old timeline.
2013-01-18 11:59:34 +02:00
Heikki Linnakangas 3684a534ef I added a result set to START_REPLICATION command, but neglected walreceiver.
The patch to allow pg_receivexlog to switch timeline added a result set
after copy has ended in START_REPLICATION command, to return the next
timeline's ID to the client. But walreceiver didn't get the memo, and threw
an error on the unexpected result set. Fix.
2013-01-17 23:45:45 +02:00
Heikki Linnakangas 0b6329130e Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_REPLICATION command.

When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.

This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding to the positions.
2013-01-17 20:23:00 +02:00
Heikki Linnakangas 3f4b1749a8 Return value of lseek() can be negative on failure.
Because the return value of lseek() was assigned to an unsigned size_t
variable, we'd fail to notice an error return code -1. Compiler gave a
warning about this.

Andres Freund
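
The signedness problem can be shown with a small POSIX sketch. `file_size` is a hypothetical helper, not the function that was patched.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the size of the file behind fd, or -1 on failure.
 * The result of lseek() is kept in a signed off_t: stored in an
 * unsigned size_t, the error value -1 would become a huge positive
 * number and a "< 0" check could never fire.
 */
off_t
file_size(int fd)
{
    off_t       end = lseek(fd, 0, SEEK_END);

    if (end < 0)
        return -1;
    return end;
}
```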
2013-01-15 00:42:37 +02:00
Tom Lane b853eb9718 Improve handling of ereport(ERROR) and elog(ERROR).
In commit 71450d7fd6, we added code to inform
suitably-intelligent compilers that ereport() doesn't return if the elevel
is ERROR or higher.  This patch extends that to elog(), and also fixes a
double-evaluation hazard that the previous commit created in ereport(),
as well as reducing the emitted code size.

The elog() improvement requires the compiler to support __VA_ARGS__, which
should be available in just about anything nowadays since it's required by
C99.  But our minimum language baseline is still C89, so add a configure
test for that.

The previous commit assumed that ereport's elevel could be evaluated twice,
which isn't terribly safe --- there are already counterexamples in xlog.c.
On compilers that have __builtin_constant_p, we can use that to protect the
second test, since there's no possible optimization gain if the compiler
doesn't know the value of elevel.  Otherwise, use a local variable inside
the macros to prevent double evaluation.  The local-variable solution is
inferior because (a) it leads to useless code being emitted when elevel
isn't constant, and (b) it increases the optimization level needed for the
compiler to recognize that subsequent code is unreachable.  But it seems
better than not teaching non-gcc compilers about unreachability at all.

Lastly, if the compiler has __builtin_unreachable(), we can use that
instead of abort(), resulting in a noticeable code savings since no
function call is actually emitted.  However, it seems wise to do this only
in non-assert builds.  In an assert build, continue to use abort(), so that
the behavior will be predictable and debuggable if the "impossible"
happens.

These changes involve making the ereport and elog macros emit do-while
statement blocks not just expressions, which forces small changes in
a few call sites.

Andres Freund, Tom Lane, Heikki Linnakangas
2013-01-13 18:40:09 -05:00
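The double-evaluation fix described in b853eb9718 can be illustrated with a minimal sketch. This is not the actual PostgreSQL macro: `DEMO_EREPORT`, `demo_report` and the level values are hypothetical names invented here. In the real macros the reporting routine does not return for ERROR, so abort() only marks unreachable code; in this sketch `demo_report` does return, so passing ERROR really would abort the process.

```c
#include <stdio.h>
#include <stdlib.h>

#define LOG   15                /* hypothetical level values for this sketch */
#define ERROR 20

static int demo_report_calls = 0;

/* Stand-in for the real reporting routine: counts calls and prints. */
static void
demo_report(int elevel, const char *msg)
{
    demo_report_calls++;
    fprintf(stderr, "level %d: %s\n", elevel, msg);
}

#if defined(__GNUC__)
/*
 * With __builtin_constant_p, the second test of elevel is reached only when
 * the compiler already knows its value, so a non-constant argument (possibly
 * with side effects) is evaluated exactly once.
 */
#define DEMO_EREPORT(elevel, msg) \
    do { \
        demo_report((elevel), (msg)); \
        if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
            abort(); \
    } while (0)
#else
/* Fallback: copy elevel into a local variable so it is evaluated once. */
#define DEMO_EREPORT(elevel, msg) \
    do { \
        const int elevel_ = (elevel); \
        demo_report(elevel_, (msg)); \
        if (elevel_ >= ERROR) \
            abort(); \
    } while (0)
#endif
```

Because the macro expands to a do-while block rather than an expression, it can only be used as a statement, which is the kind of call-site adjustment the commit message mentions.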
Heikki Linnakangas b0daba57bb Tolerate timeline switches while "pg_basebackup -X fetch" is running.
If you take a base backup from a standby server with "pg_basebackup -X
fetch", and the timeline switches while the backup is being taken, the
backup used to fail with an error "requested WAL segment %s has already
been removed". This is because the server-side code that sends over the
required WAL files would not construct the WAL filename with the correct
timeline after a switch.

Fix that by using readdir() to scan pg_xlog for all the WAL segments in the
range, regardless of timeline.

Also, include all timeline history files in the backup, if taken with
"-X fetch". That fixes another related bug: If a timeline switch happened
just before the backup was initiated in a standby, the WAL segment
containing the initial checkpoint record contains WAL from the older
timeline too. Recovery will not accept that without a timeline history file
that lists the older timeline.

Backpatch to 9.2. Versions prior to that were not affected as you could not
take a base backup from a standby before 9.2.
2013-01-03 19:51:00 +02:00
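As a rough illustration of the timeline-agnostic scan in b0daba57bb: a WAL segment file name is 24 hex digits, of which the first 8 encode the timeline, so comparing only the remaining 16 selects every segment in the range regardless of timeline. The helper below is a hypothetical sketch under that assumption, not the server's code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: is "name" a WAL segment file whose log/segment part
 * lies between those of "first" and "last" (inclusive)?  "first" and "last"
 * are assumed to be well-formed 24-character segment names.  The leading
 * 8 hex digits (the timeline ID) are deliberately skipped.
 */
static bool
demo_segment_in_range(const char *name, const char *first, const char *last)
{
    /* Reject anything that is not exactly 24 upper-case hex digits. */
    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return false;

    return strcmp(name + 8, first + 8) >= 0 &&
           strcmp(name + 8, last + 8) <= 0;
}
```

A caller would apply such a predicate to each entry returned by readdir() on the WAL directory and then sort the matches before sending them.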
Heikki Linnakangas ee994272ca Delay reading timeline history file until it's fetched from master.
Streaming replication can fetch any missing timeline history files from the
master, but recovery would read the timeline history file for the target
timeline before reading the checkpoint record, and before walreceiver has
had a chance to fetch it from the master. Delay reading it, and the sanity
checks involving timeline history, until after reading the checkpoint
record.

There is at least one scenario where this makes a difference: if you take
a base backup from a standby server right after a timeline switch, the
WAL segment containing the initial checkpoint record will begin with an
older timeline ID. Without the timeline history file, recovering that file
will fail as the older timeline ID is not recognized to be an ancestor of
the target timeline. If you try to recover from such a backup, using only
streaming replication to fetch the WAL, this patch is required for that to
work.
2013-01-03 10:41:58 +02:00
Magnus Hagander 794397ae1d Move tar function headers to pgtar.h
This makes it possible to include them only where they are used, so
we can avoid the conflict of the uid_t and gid_t datatypes that happened
in plperl (since plperl doesn't need the tar functions)
2013-01-02 20:34:08 +01:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Magnus Hagander f5d4bdd3a5 Unify some tar functionality across different parts
Move some of the tar functionality that existed mostly duplicated
in both pg_dump and the walsender basebackup functionality into
port/tar.c instead, so it can be used from both. It will also be
used by pg_basebackup in the future, which would otherwise have needed a
third copy of it.

Zoltan Boszormenyi and Magnus Hagander
2013-01-01 18:15:57 +01:00
Alvaro Herrera 5ab3af46dd Remove obsolete XLogRecPtr macros
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter.  The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway.  The clarity of simpler expressions
makes the change worthwhile.

Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).

Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
2012-12-28 13:06:15 -03:00
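To show why the macros removed in 5ab3af46dd had become clutter, here is a small sketch contrasting a comparison on the old split xlogid/xrecoff representation with the same test on a flat 64-bit value. All type and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old-style split pointer, as described in the commit message. */
typedef struct
{
    uint32_t    xlogid;         /* high 32 bits */
    uint32_t    xrecoff;        /* low 32 bits */
} DemoSplitRecPtr;

/* What a "less than" helper had to do for the split representation. */
static bool
demo_split_lt(DemoSplitRecPtr a, DemoSplitRecPtr b)
{
    return a.xlogid < b.xlogid ||
           (a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/* With a flat 64-bit pointer, the same test is simply "a < b". */
typedef uint64_t DemoRecPtr;

static DemoRecPtr
demo_flatten(DemoSplitRecPtr p)
{
    return ((uint64_t) p.xlogid << 32) | p.xrecoff;
}
```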
Alvaro Herrera 24eca7977e Assign InvalidXLogRecPtr instead of MemSet(0)
For consistency.

Author: Andres Freund
2012-12-27 18:33:03 -03:00
Heikki Linnakangas 1ff92eea14 Fix sloppiness in the timeline switch over streaming replication patch.
Here's another attempt at fixing the logic that decides how far the WAL can
be streamed, which was still broken if the timeline changed while streaming.
You would get an assertion failure. The way the logic is now written is more
readable, too.

Thom Brown reported the assertion failure.
2012-12-21 20:08:12 +02:00
Heikki Linnakangas 36e4456d78 Fix race condition if a file is removed while pg_basebackup is running.
If a relation file was removed when the server-side counterpart of
pg_basebackup was just about to open it to send it to the client, you'd
get a "could not open file" error. Fix that.

Backpatch to 9.1; the bug goes back to when pg_basebackup was introduced.
2012-12-21 15:34:15 +02:00
Heikki Linnakangas af275a12df Follow TLI of last replayed record, not recovery target TLI, in walsenders.
Most of the time, the last replayed record comes from the recovery target
timeline, but there is a corner case where it makes a difference. When
the startup process scans for a new timeline, and decides to change recovery
target timeline, there is a window where the recovery target TLI has already
been bumped, but there are no WAL segments from the new timeline in pg_xlog
yet. For example, if we have just replayed up to point 0/30002D8, on
timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog
that contains the WAL up to that point. When recovery switches recovery
target timeline to 2, a walsender can immediately try to read WAL from
0/30002D8, from timeline 2, so it will try to open WAL file
000000020000000000000003. However, that doesn't exist yet - the startup
process hasn't copied that file from the archive yet nor has the walreceiver
streamed it yet, so walsender fails with error "requested WAL segment
000000020000000000000003 has already been removed". That's harmless, in that
the standby will try to reconnect later and by that time the segment is
already created, but error messages that should be ignored are not good.

To fix that, have walsender track the TLI of the last replayed record,
instead of the recovery target timeline. That way walsender will not try to
read anything from timeline 2, until the WAL segment has been created and at
least one record has been replayed from it. The recovery target timeline is
now xlog.c's internal affair, it doesn't need to be exposed in shared memory
anymore.

This fixes the error reported by Thom Brown. depesz reported the same error
message, but I'm not sure if this fixes his scenario.
2012-12-20 14:39:04 +02:00
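The file names in the af275a12df example follow from the timeline ID and the WAL position. A minimal sketch of that mapping, assuming the default 16 MB segment size (the function and macro names are hypothetical, not the server's own):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption for this sketch: default 16 MB WAL segments. */
#define DEMO_SEG_SIZE        UINT64_C(0x1000000)
#define DEMO_SEGS_PER_XLOGID (UINT64_C(0x100000000) / DEMO_SEG_SIZE)

/*
 * Build the 24-hex-digit segment file name that holds WAL position "lsn"
 * on timeline "tli".  "buf" must have room for at least 25 bytes.
 */
static void
demo_wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t lsn)
{
    uint64_t    segno = lsn / DEMO_SEG_SIZE;

    snprintf(buf, len, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / DEMO_SEGS_PER_XLOGID),
             (unsigned int) (segno % DEMO_SEGS_PER_XLOGID));
}
```

For position 0/30002D8 this yields 000000010000000000000003 on timeline 1 and 000000020000000000000003 on timeline 2, which is why a walsender that switched to the new timeline too early asked for a file that did not exist yet.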
Heikki Linnakangas abfd192b1b Allow a streaming replication standby to follow a timeline switch.
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary doesn't exactly match the standby. The
situation where it doesn't match is when you have a master, and two
standbys, and you promote one of the standbys to become new master.
Promoting bumps up the timeline ID, and after that bump, the other standby
would refuse to continue.

There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.

START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of timeline reached, putting the walsender
back into accepting a replication command.

Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
2012-12-13 19:17:32 +02:00
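The two commands mentioned in abfd192b1b might be assembled by a client roughly as follows. The command keywords come from the commit message; the exact argument layout shown here (start position as hex/hex followed by the TIMELINE keyword) is an assumption that should be checked against the replication protocol documentation, and the helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a START_REPLICATION command with a timeline. */
static int
demo_start_replication_cmd(char *buf, size_t len,
                           uint64_t startpoint, uint32_t tli)
{
    return snprintf(buf, len, "START_REPLICATION %X/%X TIMELINE %u",
                    (unsigned int) (startpoint >> 32),
                    (unsigned int) startpoint,
                    (unsigned int) tli);
}

/* Hypothetical helper: request the history file of one timeline. */
static int
demo_timeline_history_cmd(char *buf, size_t len, uint32_t tli)
{
    return snprintf(buf, len, "TIMELINE_HISTORY %u", (unsigned int) tli);
}
```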
Heikki Linnakangas add6c3179a Make the streaming replication protocol messages architecture-independent.
We used to send structs wrapped in CopyData messages, which works as long as
the client and server agree on things like endianness, timestamp format and
alignment. That's good enough for running a standby server, which has to run
on the same platform anyway, but it's useful for tools like pg_receivexlog
to work across platforms.

This breaks protocol compatibility of streaming replication, but we never
promised that to be compatible across versions, anyway.
2012-11-07 19:09:13 +02:00
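The idea behind add6c3179a, replacing raw structs with explicitly encoded fields, can be sketched as below. The commit message does not spell out the wire layout; this sketch assumes 64-bit fields are sent most significant byte first (network byte order), and the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Write a 64-bit value into buf[0..7], most significant byte first.
 * The result is the same on big- and little-endian hosts and does not
 * depend on struct padding or alignment.
 */
static void
demo_put_int64(unsigned char *buf, uint64_t value)
{
    for (int i = 7; i >= 0; i--)
    {
        buf[i] = (unsigned char) (value & 0xFF);
        value >>= 8;
    }
}

/* Read back a value written by demo_put_int64(). */
static uint64_t
demo_get_int64(const unsigned char *buf)
{
    uint64_t    value = 0;

    for (int i = 0; i < 8; i++)
        value = (value << 8) | buf[i];
    return value;
}
```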
Heikki Linnakangas 7d3ed5ae78 Fix typo in comment.
Fujii Masao
2012-10-15 13:01:31 +03:00
Heikki Linnakangas 6f60fdd701 Improve replication connection timeouts.
Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval to
a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the other end.

Amit Kapila, heavily edited by me.
2012-10-11 17:48:08 +03:00
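The half-timeout ping rule from 6f60fdd701 could look roughly like this on either side of the connection. It is a sketch only: the function names are hypothetical, times are plain milliseconds, a timeout of zero is treated as "disabled", and the flag that suppresses repeated pings is an assumption added for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Should we ask the other end for a reply?  True once more than half of
 * the configured timeout has passed without receiving anything, unless a
 * ping is already outstanding.
 */
static bool
demo_should_ping(int64_t now_ms, int64_t last_receipt_ms,
                 int64_t timeout_ms, bool ping_outstanding)
{
    if (timeout_ms <= 0)
        return false;           /* timeout disabled */
    if (ping_outstanding)
        return false;           /* already asked; wait for the reply */
    return now_ms - last_receipt_ms > timeout_ms / 2;
}

/* Has the full timeout elapsed with no message from the other end? */
static bool
demo_timed_out(int64_t now_ms, int64_t last_receipt_ms, int64_t timeout_ms)
{
    return timeout_ms > 0 && now_ms - last_receipt_ms >= timeout_ms;
}
```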
Peter Eisentraut 8521d13194 Refactor flex and bison make rules
Numerous flex and bison make rules have appeared in the source tree
over time, and they are all virtually identical, so we can replace
them by pattern rules with some variables for customization.

Users of pgxs will also be able to benefit from this.
2012-10-11 06:57:04 -04:00
Heikki Linnakangas 0b77aebabf Remove stray newline in comment. 2012-10-09 13:06:48 +03:00
Peter Eisentraut b6d4522296 Remove generation of repl_gram.h
It was apparently never necessary.
2012-10-08 20:36:46 -04:00
Heikki Linnakangas 9c0e2b9182 Fix walsender handling of postmaster shutdown, to not go into endless loop.
This bug was introduced by my patch to use the regular die/quickdie signal
handlers in walsender processes. I tried to make walsender exit at next
CHECK_FOR_INTERRUPTS() by setting ProcDiePending, but that's not enough, you
need to set InterruptPending too. On second thought, it was not a very good
way to make walsender exit anyway, so use proc_exit(0) instead.

Also, send a CommandComplete message before exiting; that's what we did
before, and you get a nicer error message in the standby that way.

Reported by Thom Brown.
2012-10-08 13:32:14 +03:00
Heikki Linnakangas fd5942c18f Use the regular main processing loop also in walsenders.
The regular backend's main loop handles signal handling and error recovery
better than the current WAL sender command loop does. For example, if the
client hangs and a SIGTERM is received before starting streaming, the
walsender will now terminate immediately, rather than hang until the
connection times out.
2012-10-05 17:21:12 +03:00
Tom Lane 05b555d12b Fix tar files emitted by pg_dump and pg_basebackup to be POSIX conformant.
Both programs got the "magic" string wrong, causing standard-conforming tar
implementations to believe the output was just legacy tar format without
any POSIX extensions.  This doesn't actually matter that much, especially
since pg_dump failed to fill the POSIX fields anyway, but still there is
little point in emitting tar format if we can't be compliant with the
standard.  In addition, pg_dump failed to write the EOF marker correctly
(there should be 2 blocks of zeroes not just one), pg_basebackup put the
numeric group ID in the wrong place, and both programs had a pretty
brain-dead idea of how to compute the checksum.  Fix all that and improve
the comments a bit.

pg_restore is modified to accept either the correct POSIX-compliant "magic"
string or the previous value.  This part of the change will need to be
back-patched to avoid an unnecessary compatibility break when a previous
version tries to read tar-format output from 9.3 pg_dump.

Brian Weaver and Tom Lane
2012-09-28 15:19:15 -04:00
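For reference, the header details that 05b555d12b touches can be sketched as follows. The field offsets are those of the POSIX ustar header; the function names are hypothetical and this is not the code from pg_dump or pg_basebackup. The end-of-archive marker (two 512-byte blocks of zeroes) is not shown.

```c
#include <stdio.h>
#include <string.h>

/*
 * ustar header checksum: the unsigned sum of all 512 header bytes, with
 * the 8-byte checksum field (offset 148) counted as if it held spaces.
 */
static unsigned int
demo_tar_checksum(const unsigned char *header)
{
    unsigned int sum = 0;

    for (int i = 0; i < 512; i++)
        sum += (i >= 148 && i < 156) ? ' ' : header[i];
    return sum;
}

/* Fill in magic, version and checksum of an otherwise complete header. */
static void
demo_tar_finish_header(unsigned char *header)
{
    memcpy(header + 257, "ustar", 6);   /* magic: "ustar" plus a NUL */
    memcpy(header + 263, "00", 2);      /* version: two digits, no NUL */

    /* Checksum last: six octal digits, a NUL, then a space. */
    snprintf((char *) header + 148, 8, "%06o", demo_tar_checksum(header));
    header[155] = ' ';
}
```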
Heikki Linnakangas c4c227477b Fix bugs in cascading replication with recovery_target_timeline='latest'
The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, i.e. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.

This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, i.e. has the right size, but only the first N bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.
2012-09-04 19:33:21 -07:00
Heikki Linnakangas fe811ae810 Fix typos in README. 2012-08-31 11:30:11 +03:00
Simon Riggs da4efa13d8 Turn off WalSender keepalives by default, users can enable if desired 2012-08-09 17:07:03 +01:00
Simon Riggs 87d8bd7c9f Ensure all replication message info is available and correct via WalRcv 2012-08-09 17:03:59 +01:00