Commit Graph

235 Commits

Author SHA1 Message Date
Bruce Momjian
97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Noah Misch
1db439ad49 Raise some timeouts to 180s, in test code.
Slow runs of buildfarm members chipmunk, hornet and mandrill saw the
shorter timeouts expire.  The 180s timeout in poll_query_until has been
trouble-free since 2a0f89cd71 introduced
it two years ago, so use 180s more widely.  Back-patch to 9.6, where the
first of these timeouts was introduced.

Reviewed by Michael Paquier.

Discussion: https://postgr.es/m/20181209001601.GC2973271@rfd.leadboat.com
2018-12-10 20:15:42 -08:00
Peter Eisentraut
f2cbffc7a6 Only allow one recovery target setting
The previous recovery.conf regime accepted multiple recovery_target*
settings and used the last one.  This does not translate well to the
general GUC system.  Specifically, under EXEC_BACKEND, the settings
are written out not in any particular order, so the order in which
they were originally set is not available to new processes.

Rather than redesign the GUC system, it was decided to abandon the old
behavior and only allow one recovery target setting.  A second setting
will cause an error.  However, it is allowed to set the same parameter
multiple times or unset a parameter and set a different one.

Discussion: https://www.postgresql.org/message-id/flat/27802171543235530%40iva2-6ec8f0a6115e.qloud-c.yandex.net#701a59c837ad0bf8c244344aaf3ef5a4
2018-11-28 13:55:54 +01:00
Peter Eisentraut
2dedf4d9a8 Integrate recovery.conf into postgresql.conf
recovery.conf settings are now set in postgresql.conf (or other GUC
sources).  Currently, all the affected settings are PGC_POSTMASTER;
this could be refined in the future case by case.

Recovery is now initiated by a file recovery.signal.  Standby mode is
initiated by a file standby.signal.  The standby_mode setting is
gone.  If a recovery.conf file is found, an error is issued.

The trigger_file setting has been renamed to promote_trigger_file as
part of the move.

The documentation chapter "Recovery Configuration" has been integrated
into "Server Configuration".

pg_basebackup -R now appends settings to postgresql.auto.conf and
creates a standby.signal file.

Author: Fujii Masao <masao.fujii@gmail.com>
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Abhijit Menon-Sen <ams@2ndquadrant.com>
Author: Sergei Kornilov <sk@zsrv.org>
Discussion: https://www.postgresql.org/message-id/flat/607741529606767@web3g.yandex.ru/
2018-11-25 16:33:40 +01:00
Andres Freund
691d79a079 Disallow starting server with insufficient wal_level for existing slot.
Previously it was possible to create a slot, change wal_level, and
restart, even if the new wal_level was insufficient for the
slot. That's a problem for both logical and physical slots, because
the necessary WAL records are not generated.

This removes a few tests in newer versions that, somewhat
inexplicably, whether restarting with a too low wal_level worked (a
buggy behaviour!).

Reported-By: Joshua D. Drake
Author: Andres Freund
Discussion: https://postgr.es/m/20181029191304.lbsmhshkyymhw22w@alap3.anarazel.de
Backpatch: 9.4-, where replication slots where introduced
2018-10-31 15:46:39 -07:00
Michael Paquier
10074651e3 Add pg_promote function
This function is able to promote a standby with this new SQL-callable
function.  Execution access can be granted to non-superusers so that
failover tools can observe the principle of least privilege.

Catalog version is bumped.

Author: Laurenz Albe
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
2018-10-25 09:46:00 +09:00
Tom Lane
d11eae09e4 Fix bogus loop logic in 013_crash_restart test's pump_until subroutine.
The pump_nb() step might've already received the desired data, so we must
check for that at the top of the loop not the bottom.  Otherwise, the
call to pump() will sit with nothing to do until the timeout elapses.
pump_until then falls out with apparent success ... but the timeout has
been used up, causing the next call of pump_until to report a timeout
failure.  I believe this explains the intermittent timeout failures
we've seen in the buildfarm ever since this test went in.  I was able
to reproduce the problem on gaur semi-repeatably, and this appears to
fix it.

In passing, remove a duplicate assignment, fix one stdin-assignment to
look like the rest, and document the test's dependency on test_decoding.
2018-08-12 18:05:49 -04:00
Tom Lane
a360f952ff Remove race-prone hot_standby_feedback test cases in 001_stream_rep.pl.
This script supposed that if it turned hot_standby_feedback on and then
shut down the standby server, at least one feedback message would be
guaranteed to be sent before the standby stops.  But there is no such
guarantee, if the standby's walreceiver process is slow enough --- and
we've seen multiple failures in the buildfarm showing that that does
happen in practice.  While we could rearrange the walreceiver logic to
make it less likely, it seems probably impossible to create a really
bulletproof guarantee of that sort; and if we tried, we might create
situations where the walreceiver wouldn't react in a timely manner to
shutdown commands.  It seems better instead to remove the script's
assumption that feedback will occur before shutdown.

But once we do that, these last few tests seem quite redundant with
the earlier tests in the script.  So let's just drop them altogether
and save some buildfarm cycles.

Backpatch to v10 where these tests were added.

Discussion: https://postgr.es/m/1922.1531592205@sss.pgh.pa.us
2018-07-18 17:39:27 -04:00
Michael Paquier
3c64dcb1e3 Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed.  This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed.  When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed.  Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.

Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me.  The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore.  The test gets easily supported down to 10, still it
has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05 10:46:18 +09:00
Tom Lane
c992dca26e Clarify the README files for the various separate TAP-based test suites.
Explain the difference between "make check" and "make installcheck".
Mention the need for --enable-tap-tests (only some of these did so
before).  Standardize their wording about how to run the tests.
2018-06-19 19:30:50 -04:00
Andrew Dunstan
3a7cc727c7 Don't fall off the end of perl functions
This complies with the perlcritic policy
Subroutines::RequireFinalReturn, which is a severity 4 policy. Since we
only currently check at severity level 5, the policy is raised to that
level until we move to level 4 or lower, so that any new infringements
will be caught.

A small cosmetic piece of tidying of the pgperlcritic script is
included.

Mike Blackwell

Discussion: https://postgr.es/m/CAESHdJpfFm_9wQnQ3koY3c91FoRQsO-fh02za9R3OEMndOn84A@mail.gmail.com
2018-05-27 09:08:42 -04:00
Andrew Dunstan
35361ee788 Restrict vertical tightness to parentheses in Perl code
The vertical tightness settings collapse vertical whitespace between
opening and closing brackets (parentheses, square brakets and braces).
This can make data structures in particular harder to read, and is not
very consistent with our style in non-Perl code. This patch restricts
that setting to parentheses only, and reformats all the perl code
accordingly. Not applying this to parentheses has some unfortunate
effects, so the consensus is to keep the setting for parentheses and not
for the others.

The diff for this patch does highlight some places where structures
should have trailing commas. They can be added manually, as there is no
automatic tool to do so.

Discussion: https://postgr.es/m/a2f2b87c-56be-c070-bfc0-36288b4b41c1@2ndQuadrant.com
2018-05-09 10:14:46 -04:00
Peter Eisentraut
76ece16974 perltidy: Add option --nooutdent-long-comments 2018-04-27 11:37:43 -04:00
Peter Eisentraut
d4f16d5071 perltidy: Add option --nooutdent-long-quotes 2018-04-27 11:37:43 -04:00
Tom Lane
f04d4ac919 Reindent Perl files with perltidy version 20170521.
Discussion: https://postgr.es/m/CABUevEzK3cNiHZQ18f5tK0guoT+cN_jWeVzhYYxY=r+1Q3SmoA@mail.gmail.com
2018-04-25 14:00:19 -04:00
Andrew Dunstan
9ad21a6957 Don't use an Msys virtual path to create a tablespace
The new unlogged_reinit recovery tests create a new tablespace using
TestLib.pm's tempdir. However, on msys that function returns a virtual
path that isn't understood by Postgres. Here we add a new function to
TestLib.pm to turn such a path into a real path on the underlying file
system, and use it in the new test to create the tablespace. The new
function is essentially a NOOP everywhere but msys.
2018-03-20 08:25:36 +10:30
Peter Eisentraut
df411e7c66 Add tests for reinit.c
This tests how the different forks of unlogged tables behave in crash
recovery.

Author: David Steele <david@pgmasters.net>
2018-03-14 09:03:44 -04:00
Peter Eisentraut
f5da5683a8 Add installcheck support to more test suites
Several of the test suites under src/test/ were missing an installcheck
target.
2018-01-23 07:11:38 -05:00
Bruce Momjian
9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Noah Misch
6078770c1a In tests, await an LSN no later than the recovery target.
Otherwise, the test fails with "Timed out while waiting for standby to
catch up".  This happened rarely, perhaps only when autovacuum wrote WAL
between our choosing the recovery target and choosing the LSN to await.
Commit b26f7fa6ae fixed one case of this.
Fix two more.  Back-patch to 9.6, which introduced the affected test.

Discussion: https://postgr.es/m/20180101055227.GA2952815@rfd.leadboat.com
2017-12-31 21:58:29 -08:00
Robert Haas
575cead991 Remove redundant line from Makefile.
Masahiko Sawada, reviewed by Michael Paquier

Discussion: http://postgr.es/m/CAD21AoDFes_Mgye-1K89rmTgeU3RxYF3zgTjzCJVq2KzzcpC4A@mail.gmail.com
2017-11-16 15:30:56 -05:00
Andres Freund
784905795f Try to make crash restart test work on windows.
Author: Andres Freund
Tested-By: Andrew Dunstan
Discussion: https://postgr.es/m/20170930224424.ud5ilchmclbl5y5n@alap3.anarazel.de
2017-10-01 15:24:58 -07:00
Tom Lane
01c7d3ef85 Ten-second timeout in 013_crash_restart.pl is not enough, let's try 60.
Per buildfarm member topminnow.
2017-09-23 12:56:31 -04:00
Andres Freund
8d926029e8 Expand expected output for recovery test even further.
I'd assumed that the backend being killed should be able to get out an
error message - but it turns out it's not guaranteed that it's not
still sending a ready-for-query.  Really need to do something about
getting these error message to the client.

Reported-By: Thomas Munro, Tom Lane
Discussion:
	https://postgr.es/m/CAEepm=0TE90nded+bNthP45_PEvGAAr=3gxhHJObL4xmOLtX0w@mail.gmail.com
	https://postgr.es/m/14968.1506101414@sss.pgh.pa.us
2017-09-22 11:35:26 -07:00
Andres Freund
5ada1fcd0c Accept that server might not be able to send error in crash recovery test.
As it turns out we can't rely that the script's monitoring session is
terminated with a proper error by the server, because the session
might be terminated while already trying to send data.

Also improve robustness and error reporting facilities of the test,
developed while debugging this issue.

Discussion: https://postgr.es/m/20170920020038.kllxgilo7xzwmtto@alap3.anarazel.de
2017-09-19 21:37:24 -07:00
Andres Freund
1910353675 Make new crash restart test a bit more robust.
Add timeouts in case psql doesn't deliver the expected output, and try
to cause the monitoring psql to be fully connected to a backend.  This
isn't necessarily everything needed, but at least the timeouts should
reduce the pain for buildfarm owners.

Author: Andres Freund
Reported-By: Tom Lane, BF animals prairiedog and calliphoridae
Discussion: https://postgr.es/m/E1du6ZT-00043I-91@gemulon.postgresql.org
2017-09-19 10:39:52 -07:00
Andres Freund
a1924a4ea2 Add test for postmaster crash restarts.
Given that I managed to break this...  We probably should extend the
tests to also cover other sub-processes dying, but that's something
for later.

Author: Andres Freund
Discussion: https://postgr.es/m/20170917080752.rcmihzfmgbeuqjk2@alap3.anarazel.de
2017-09-18 17:25:53 -07:00
Tom Lane
e183530550 Fix RecursiveCopy.pm to cope with disappearing files.
When copying from an active database tree, it's possible for files to be
deleted after we see them in a readdir() scan but before we can open them.
(Once we've got a file open, we don't expect any further errors from it
getting unlinked, though.)  Tweak RecursiveCopy so it can cope with this
case, so as to avoid irreproducible test failures.

Back-patch to 9.6 where this code was added.  In v10 and HEAD, also
remove unused "use RecursiveCopy" in one recovery test script.

Michael Paquier and Tom Lane

Discussion: https://postgr.es/m/24621.1504924323@sss.pgh.pa.us
2017-09-11 22:02:58 -04:00
Alvaro Herrera
89c59b742a Fix two-phase commit test for recovery mode
The original code had a race condition because it never ensured the
standby was caught up before proceeding; add a wait similar to every
other place that does this.

Author: Michaël Paquier
Discussion: https://postgr.es/m/CAB7nPqTm9p+LCm1mVJYvgpwagRK+uibT-pKq0O2-paOWxT62jw@mail.gmail.com
2017-09-01 16:51:55 +02:00
Simon Riggs
4f27c674fd Avoid race condition in logical replication test
Wait for slot to become inactive before continuing.

Author: Petr Jelinek
2017-09-01 14:55:44 +01:00
Tom Lane
21d304dfed Final pgindent + perltidy run for v10. 2017-08-14 17:29:33 -04:00
Tom Lane
3043c1ddd1 Simplify fetch-slot-xmins logic in recovery TAP tests.
Merge wait_slot_xmins() into get_slot_xmins().  At this point the only
place that wasn't doing a wait was the initial-state test, and a wait
there seems pretty harmless.

Michael Paquier

Discussion: https://postgr.es/m/CAB7nPqSp_SLQb2uU7am+sn4V3g1UKv8j3yZU385oAG1cG_BN9Q@mail.gmail.com
2017-08-12 12:08:54 -04:00
Peter Eisentraut
a1ef920e27 Remove uses of "slave" in replication contexts
This affects mostly code comments, some documentation, and tests.
Official APIs already used "standby".
2017-08-10 22:55:41 -04:00
Tom Lane
4576a69354 Fix yet another race condition in recovery/t/001_stream_rep.pl.
In commit 5c77690f6, we added polling in front of most of the
get_slot_xmins calls in 001_stream_rep.pl, but today's results from
buildfarm member nightjar show that at least one more poll loop
is needed.

Proactively add a poll loop before the next-to-last get_slot_xmins call
as well.  It may be that there is no race condition there because the
standby_2 server is shut down at that point, but I'm quite tired of
fighting with this test script.  The empirical evidence that it's safe,
from the buildfarm, is no stronger than the evidence for the other
call that nightjar just proved unsafe.

The only remaining get_slot_xmins calls without wait_slot_xmins
protection are the first two, which should be OK since nothing has
happened at that point.  It's tempting to ignore that special case
and merge get_slot_xmins and wait_slot_xmins into a single function.
I didn't go that far though.

Discussion: https://postgr.es/m/18436.1502228036@sss.pgh.pa.us
2017-08-08 18:03:30 -04:00
Tom Lane
ec86af9175 Fix another race-condition-ish issue in recovery/t/001_stream_rep.pl.
Buildfarm members hornet and sungazer have shown multiple instances of
"Failed test 'xmin of non-cascaded slot with hs feedback has changed'".
The reason seems to be that the test is checking the current xmin of the
master server's replication slot against a past xmin of the first slave
server's replication slot.  Even though the latter slot is downstream of
the former, it's possible for its reported xmin to be ahead of the former's
reported xmin, because those numbers are updated whenever the respective
downstream walreceiver feels like it (see logic in WalReceiverMain).
Instrumenting this test shows that indeed the slave slot's xmin does often
advance before the master's does, especially if an autovacuum transaction
manages to occur during the relevant window.  If we happen to capture such
an advanced xmin as $xmin, then the subsequent wait_slot_xmins call can
fall through before the master's xmin has advanced at all, and then if it
advances before the get_slot_xmins call, we can get the observed failure.
Yeah, that's a bit of a long chain of deduction, but it's hard to explain
any other way how the test can get past an "xmin <> '$xmin'" check only
to have the next query find that xmin does equal $xmin.

Fix by keeping separate images of the master and slave slots' xmins
and testing their has-xmin-advanced conditions independently.
2017-07-05 23:59:20 -04:00
Peter Eisentraut
6deb52b202 Remove unnecessary pg_is_in_recovery calls in tests
Since pg_ctl promote already waits for recovery to end, these calls are
obsolete.
2017-07-05 13:37:08 -04:00
Tom Lane
647675228f Fix race condition in recovery/t/009_twophase.pl test.
Since reducing pg_ctl's reaction time in commit c61559ec3, some
slower buildfarm members have shown erratic failures in this test.
The reason turns out to be that the test assumes synchronous
replication (because it does not provide any lag time for a commit
to replicate before shutting down the servers), but it had only
enabled sync rep in one direction.  The observed symptoms correspond
to failure to replicate the last committed transaction in the other
direction, which can be expected to happen if the shutdown command
is issued soon enough and we are providing no synchronous-commit
guarantees.

Fix that, and add a bit more paranoid state checking at the bottom
of the script.

Michael Paquier and myself

Discussion: https://postgr.es/m/908.1498965681@sss.pgh.pa.us
2017-07-02 22:01:19 -04:00
Tom Lane
4e15387d2d Try to improve readability of recovery/t/009_twophase.pl test.
The original coding here was very confusing, because it named the
two servers it set up "master" and "slave" even though it swapped
their replication roles multiple times.  At any given point in the
script it was very unobvious whether "$node_master" actually referred
to the server named "master" or the other one.  Instead, pick arbitrary
names for the two servers --- I used "london" and "paris" --- and
distinguish those permanent names from the nonce references $cur_master
and $cur_slave.  Add logging to help distinguish which is which at
any given point.  Also, use distinct data and transaction names to
make all the prepared transactions easily distinguishable in the
postmaster logs.  (There was one place where we intentionally tested
that the server could cope with re-use of a transaction name, but
it seems like one place is sufficient for that purpose.)

Also, add checks at the end to make sure that all the transactions
that were supposed to be committed did survive.

Discussion: https://postgr.es/m/28238.1499010855@sss.pgh.pa.us
2017-07-02 15:27:29 -04:00
Tom Lane
de3de0afd7 Improve TAP test function PostgresNode::poll_query_until().
Add an optional "expected" argument to override the default assumption
that we're waiting for the query to return "t".  This allows replacing
a handwritten polling loop in recovery/t/007_sync_rep.pl with use of
poll_query_until(); AFAICS that's the only remaining ad-hoc polling
loop in our TAP tests.

Change poll_query_until() to probe ten times per second not once per
second.  Like some similar changes I've been making recently, the
one-second interval seems to be rooted in ancient traditions rather
than the actual likely wait duration on modern machines.  I'd consider
reducing it further if there were a convenient way to spawn just one
psql for the whole loop rather than one per probe attempt.

Discussion: https://postgr.es/m/12486.1498938782@sss.pgh.pa.us
2017-07-02 14:03:41 -04:00
Tom Lane
b0f069d931 Clean up misuse and nonuse of poll_query_until().
Several callers of PostgresNode::poll_query_until() neglected to check
for failure; I do not think that's optional.  Also, rewrite one place
that had reinvented poll_query_until() for no very good reason.
2017-07-01 14:25:09 -04:00
Tom Lane
08aed6604d Eat XIDs more efficiently in recovery TAP test.
The point of this loop is to insert 1000 rows into the test table
and consume 1000 XIDs.  I can't see any good reason why it's useful
to launch 1000 psqls and 1000 backend processes to accomplish that.
Pushing the looping into a plpgsql DO block shaves about 10 seconds
off the runtime of the src/test/recovery TAP tests on my machine;
that's over 10% of the runtime of that test suite.

It is, in fact, sufficiently more efficient that we now demonstrably
need wait_slot_xmins() afterwards, or the slaves' xmins may not have
moved yet.
2017-06-28 22:11:12 -04:00
Tom Lane
5c77690f6f Improve wait logic in TAP tests for streaming replication.
Remove hard-wired sleep(2) delays in 001_stream_rep.pl in favor of using
poll_query_until to check for the desired state to appear.  In addition,
add such a wait before the last test in the script, as it's possible
to demonstrate failures there after upcoming improvements in pg_ctl.

(We might end up adding polling before each of the get_slot_xmins calls in
this script, but I feel no great need to do that until shown necessary.)

In passing, clarify the description strings for some of the test cases.

Michael Paquier and Craig Ringer, pursuant to a complaint from me

Discussion: https://postgr.es/m/8962.1498425057@sss.pgh.pa.us
2017-06-26 14:39:55 -04:00
Bruce Momjian
ce55481032 Post-PG 10 beta1 pgperltidy run 2017-05-17 19:01:23 -04:00
Andrew Dunstan
734cb4c2e7 Avoid tests which crash the calling process on Windows
Certain recovery tests use the Perl IPC::Run module's start/kill_kill
method of processing. On at least some versions of perl this causes the
whole process and its caller to crash. If we ever find a better way of
doing these tests they can be re-enabled on this platform. This does not
affect Mingw or Cygwin builds, which use a different perl and a
different shell and so are not affected.
2017-05-12 06:41:23 -04:00
Tom Lane
d10c626de4 Rename WAL-related functions and views to use "lsn" not "location".
Per discussion, "location" is a rather vague term that could refer to
multiple concepts.  "LSN" is an unambiguous term for WAL locations and
should be preferred.  Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position".  Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.

David Rowley, minor additional docs hacking by me

Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
2017-05-11 11:49:59 -04:00
Simon Riggs
0352c15e5a Additional tests for subtransactions in recovery
Tests for normal and prepared transactions

Author: Nikhil Sontakke, placed in new test file by me
2017-04-27 14:26:57 +02:00
Fujii Masao
346199dcab Set the priorities of all quorum synchronous standbys to 1.
In quorum-based synchronous replication, all the standbys listed in
synchronous_standby_names equally have chances to be chosen
as synchronous standbys. So they should have the same priority.
However, previously, quorum standbys whose names appear earlier
in the list were given higher priority values though the difference of
those priority values didn't affect the selection of synchronous standbys.
Users could see those "meaningless" priority values in pg_stat_replication
and this was confusing.

This commit gives all the quorum synchronous standbys the same
highest priority, i.e., 1, in order to remove such confusion.

Author: Fujii Masao
Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi
Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com
2017-04-26 01:07:13 +09:00
Tom Lane
8a19c1a373 Make PostgresNode::append_conf append a newline automatically.
Although the documentation for append_conf said clearly that it didn't
add a newline, many test authors seem to have forgotten that ... or maybe
they just consulted the example at the top of the POD documentation,
which clearly shows adding a config entry without bothering to add a
trailing newline.  The worst part of that is that it works, as long as
you don't do it more than once, since the backend isn't picky about
whether config files end with newlines.  So there's not a strong forcing
function reminding test authors not to do it like that.  Upshot is that
this is a terribly fragile way to go about things, and there's at least
one existing test case that is demonstrably broken and not testing what
it thinks it is.

Let's just make append_conf append a newline, instead; that is clearly
way safer than the old definition.

I also cleaned up a few call sites that were unnecessarily ugly.
(I left things alone in places where it's plausible that additional
config lines would need to be added someday.)

Back-patch the change in append_conf itself to 9.6 where it was added,
as having a definitional inconsistency between branches would obviously
be pretty hazardous for back-patching TAP tests.  The other changes are
just cosmetic and don't need to be back-patched.

Discussion: https://postgr.es/m/19751.1492892376@sss.pgh.pa.us
2017-04-22 16:58:15 -04:00
Peter Eisentraut
2e74e636bd Change 'diag' to 'note' in TAP tests
Reduce noise from TAP tests by changing 'diag' to 'note', so output only
goes to the test's log file not stdout, unless in verbose mode.  This
also removes the junk on screen when running the TAP tests in parallel.

Author: Craig Ringer <craig@2ndquadrant.com>
2017-03-28 20:38:06 -04:00
Simon Riggs
ff539da316 Cleanup slots during drop database
Automatically drop all logical replication slots associated with a
database when the database is dropped. Previously we threw an ERROR
if a slot existed. Now we throw ERROR only if a slot is active in
the database being dropped.

Craig Ringer
2017-03-28 10:05:21 -04:00
Simon Riggs
5737c12df0 Report catalog_xmin separately in hot_standby_feedback
If the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.

Author: Craig Ringer
2017-03-25 14:07:27 +00:00
Peter Eisentraut
cd07f73d32 Fix recovery test hang
The test would hang if a sufficient ~/.psqlrc was present.  Fix by using
psql -X.
2017-03-25 00:10:57 -04:00
Robert Haas
857ee8e391 Add a txid_status function.
If your connection to the database server is lost while a COMMIT is
in progress, it may be difficult to figure out whether the COMMIT was
successful or not.  This function will tell you, provided that you
don't wait too long to ask.  It may be useful in other situations,
too.

Craig Ringer, reviewed by Simon Riggs and by me

Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
2017-03-24 12:00:53 -04:00
Simon Riggs
1148e22a82 Teach xlogreader to follow timeline switches
Uses page-based mechanism to ensure we’re using the correct timeline.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with later features.

Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-22 07:05:12 +00:00
Peter Eisentraut
9ca2dd578d Avoid Perl warning
Perl versions before 5.12 would warn "Use of implicit split to @_ is
deprecated".

Author: Jeff Janes <jeff.janes@gmail.com>
2017-03-22 00:18:49 -04:00
Simon Riggs
eb2a6131be Add a pg_recvlogical wrapper to PostgresNode
Allows testing of logical decoding using SQL interface and/or pg_recvlogical
Most logical decoding tests are in contrib/test_decoding. This module
is for work that doesn't fit well there, like where server restarts
are required.

Craig Ringer
2017-03-21 14:04:49 +00:00
Robert Haas
caa6c1f193 TAP tests for target_session_attrs connection parameter.
Michael Paquier
2017-02-26 23:41:23 +05:30
Alvaro Herrera
30820982b2 Add tests for two-phase commit
There's some ongoing performance work on this area, so let's make sure
we don't break things.

Extracted from a larger patch originally by Stas Kelvich.

Authors: Stas Kelvich, Nikhil Sontakke, Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxfsuLLOg=h5cTg3g77Jjk-UGnt=RW7zK57zBSoFsapiWA@mail.gmail.com
2017-02-21 19:00:45 -03:00
Robert Haas
806091c96f Remove all references to "xlog" from SQL-callable functions in pg_proc.
Commit f82ec32ac3 renamed the pg_xlog
directory to pg_wal.  To make things consistent, and because "xlog" is
terrible terminology for either "transaction log" or "write-ahead log"
rename all SQL-callable functions that contain "xlog" in the name to
instead contain "wal".  (Note that this may pose an upgrade hazard for
some users.)

Similarly, rename the xlog_position argument of the functions that
create slots to be called wal_position.

Discussion: https://www.postgresql.org/message-id/CA+Tgmob=YmA=H3DbW1YuOXnFVgBheRmyDkWcD9M8f=5bGWYEoQ@mail.gmail.com
2017-02-09 15:10:09 -05:00
Simon Riggs
ec4b975016 Reset hot standby xmin on master after restart
Hot_standby_feedback could be reset by reload and worked correctly, but if
the server was restarted rather than reloaded the xmin was not reset.
Force reset always if hot_standby_feedback is enabled at startup.

Ants Aasma, Craig Ringer

Reported-by: Ants Aasma
2017-01-26 18:14:02 +00:00
Magnus Hagander
f6d6d2920d Change default values for backup and replication parameters
This changes the default values of the following parameters:

wal_level = replica
max_wal_senders = 10
max_replication_slots = 10

in order to make it possible to make a backup and set up simple
replication on the default settings, without requiring a system restart.

Discussion: https://postgr.es/m/CABUevEy4PR_EAvZEzsbF5s+V0eEvw7shJ2t-AUwbHOjT+yRb3A@mail.gmail.com

Reviewed by Peter Eisentraut. Benchmark help from Tomas Vondra.
2017-01-14 17:14:56 +01:00
Simon Riggs
0813216cb4 Add 18 new recovery TAP tests
Add new tests for physical repl slots and hot standby feedback.

Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs
2017-01-04 16:54:28 +00:00
Simon Riggs
fb093e4cb3 Allow PostgresNode.pm tests to wait for catchup
Add methods to the core test framework PostgresNode.pm to allow us to
test that standby nodes have caught up with the master, as well as
basic LSN handling.  Used in tests recovery/t/001_stream_rep.pl and
recovery/t/004_timeline_switch.pl

Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs
2017-01-04 16:50:23 +00:00
Bruce Momjian
1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Fujii Masao
3901fd70cc Support quorum-based synchronous replication.
This feature is also known as "quorum commit" especially in discussion
on pgsql-hackers.

This commit adds the following new syntaxes into synchronous_standby_names
GUC. By using FIRST and ANY keywords, users can specify the method to
choose synchronous standbys from the listed servers.

  FIRST num_sync (standby_name [, ...])
  ANY num_sync (standby_name [, ...])

The keyword FIRST specifies a priority-based synchronous replication
which was available also in 9.6 or before. This method makes transaction
commits wait until their WAL records are replicated to num_sync
synchronous standbys chosen based on their priorities.

The keyword ANY specifies a quorum-based synchronous replication
and makes transaction commits wait until their WAL records are
replicated to *at least* num_sync listed standbys. In this method,
the values of sync_state.pg_stat_replication for the listed standbys
are reported as "quorum". The priority is still assigned to each standby,
but not used in this method.

The existing syntaxes having neither FIRST nor ANY keyword are still
supported. They are the same as new syntax with FIRST keyword, i.e.,
a priorirty-based synchronous replication.

Author: Masahiko Sawada
Reviewed-By: Michael Paquier, Amit Kapila and me
Discussion: <CAD21AoAACi9NeC_ecm+Vahm+MMA6nYh=Kqs3KB3np+MBOS_gZg@mail.gmail.com>

Many thanks to the various individuals who were involved in
discussing and developing this feature.
2016-12-19 21:15:30 +09:00
Robert Haas
f267c1c244 Fix possible pg_basebackup failure on standby with "include WAL".
If a restartpoint flushed no dirty buffers, it could fail to update
the minimum recovery point, leading to a minimum recovery point prior
to the starting REDO location.  perform_base_backup() would interpret
that as meaning that no WAL files at all needed to be included in the
backup, failing an internal sanity check.  To fix, have restartpoints
always update the minimum recovery point to just after the checkpoint
record itself, so that the file (or files) containing the checkpoint
record will always be included in the backup.

Code by Amit Kapila, per a design suggestion by me, with some
additional work on the code comment by me.  Test case by Michael
Paquier.  Report by Kyotaro Horiguchi.
2016-10-27 11:19:51 -04:00
Peter Eisentraut
e5a9bcb529 Use pg_ctl promote -w in TAP tests
Switch TAP tests to use the new wait mode of pg_ctl promote.  This
allows avoiding extra logic with poll_query_until() to be sure that a
promoted standby is ready for read-write queries.

From: Michael Paquier <michael.paquier@gmail.com>
2016-10-19 09:18:50 -04:00
Heikki Linnakangas
917dc7d239 Fix WAL-logging of FSM and VM truncation.
When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.

To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.

Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.

Analysis by Pavan Deolasee.

Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
2016-10-19 14:26:05 +03:00
Simon Riggs
d851bef2d6 Dirty replication slots when using sql interface
When pg_logical_slot_get_changes(...) sets confirmed_flush_lsn to the point at
which replay stopped, it doesn't dirty the replication slot.  So if the replay
didn't cause restart_lsn or catalog_xmin to change as well, this change will
not get written out to disk. Even on a clean shutdown.

If Pg crashes or restarts, a subsequent pg_logical_slot_get_changes(...) call
will see the same changes already replayed since it uses the slot's
confirmed_flush_lsn as the start point for fetching changes. The caller can't
specify a start LSN when using the SQL interface.

Mark the slot as dirty after reading changes using the SQL interface so that
users won't see repeated changes after a clean shutdown. Repeated changes still
occur when using the walsender interface or after an unclean shutdown.

Craig Ringer
2016-09-05 09:44:38 +01:00
Simon Riggs
35250b6ad7 New recovery target recovery_target_lsn
Michael Paquier
2016-09-03 17:48:01 +01:00
Tom Lane
b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Alvaro Herrera
b26f7fa6ae Fix assorted problems in recovery tests
In test 001_stream_rep we're using pg_stat_replication.write_location to
determine catch-up status, but we care about xlog having been applied
not just received, so change that to apply_location.

In test 003_recovery_targets, we query the database for a recovery
target specification and later for the xlog position supposedly
corresponding to that recovery specification.  If for whatever reason
more WAL is written between the two queries, the recovery specification
is earlier than the xlog position used by the query in the test harness,
so we wait forever, leading to test failures.  Deal with this by using a
single query to extract both items.  In 2a0f89cd71 we tried to deal
with it by giving them more tests to run, but in hindsight that was
obviously doomed to failure (no revert of that, though).

Per hamster buildfarm failures.

Author: Michaël Paquier
2016-08-03 13:21:23 -04:00
Peter Eisentraut
f6ced51f91 Consistently capitalize names of recovery tests 2016-08-02 10:47:03 -04:00
Noah Misch
3be0a62ffe Finish pgindent run for 9.6: Perl files. 2016-06-12 04:19:56 -04:00
Alvaro Herrera
c1543a81a7 Revert timeline following in replication slots
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.
2016-05-04 17:32:22 -03:00
Fujii Masao
36c1c91604 Make regression test for multiple synchronous standbys more stable.
The regression test checks whether the output of pg_stat_replication is
expected or not after changing synchronous_standby_names and reloading
the configuration file. Regarding this test logic, previously there was
a timing issue which made the test result unstable. That is,
pg_stat_replication could return unexpected result during small window
after the configuration file was reloaded before new setting value
took effect, and which made the test fail.

This commit changes the test logic so that it uses a loop with a timeout
to give some room for the test to pass. Now the test fails only when
pg_stat_replication keeps returning unexpected result for 30 seconds.

Michael Paquier
2016-04-15 13:58:14 +09:00
Fujii Masao
196b72fb9a Add regression tests for multiple synchronous standbys.
Authors: Suraj Kharage, Michael Paquier, Masahiko Sawada, refactored by me
Reviewed-By: Kyotaro Horiguchi
2016-04-08 16:48:53 +09:00
Alvaro Herrera
82c83b3372 Fix logical_decoding_timelines test crashes
In the test_slot_timelines test module, we were abusing passing NULL
values which was received as zeroes in x86, but this breaks in ARM
(buildfarm member hamster) by crashing instead.  Fix the breakage by
marking these functions as STRICT; the InvalidXid value that was
previously implicit in NULL values (on x86 at least) can now be passed
as 0.  Failing to follow the fmgr protocol to check for NULLs beforehand
was causing ARM to fail, as evidenced by segmentation faults in
buildfarm member hamster.

In order to use the new functionality in the test script, use COALESCE
in the right spot to avoid forwarding NULL values.

This was diagnosed from the hamster crash by Craig Ringer, who also
proposed a different patch (checking for NULL values explicitely in the
C function code, and keeping the non-strictness in the C functions).
I decided to go with this approach instead.
2016-04-01 16:47:00 -03:00
Alvaro Herrera
61608d3836 Fix recovery_min_apply_delay test
Previously this test was relying too much on WAL replay to occur in the
exact configured interval, which was unreliable on slow or overly busy
servers.  Use a custom loop instead of poll_query_until, which is
hopefully more reliable.

Per continued failures on buildfarm member hamster (which is probably
the only one running this test suite)

Author: Michaël Paquier
2016-03-31 16:06:32 -03:00
Alvaro Herrera
24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera
2c83f435a3 Rework PostgresNode's psql method
This makes the psql() method much more capable: it captures both stdout
and stderr; it now returns the psql exit code rather than stdout; a
timeout can now be specified, as can ON_ERROR_STOP behavior; it gained a
new "on_error_die" (defaulting to off) parameter to raise an exception
if there's any problem.  Finally, additional parameters to psql can be
passed if there's need for further tweaking.

For convenience, a new safe_psql() method retains much of the old
behavior of psql(), except that it uses on_error_die on, so that
problems like syntax errors in SQL commands can be detected more easily.

Many existing TAP test files now use safe_psql, which is what is really
wanted.  A couple of ->psql() calls are now added in the commit_ts
tests, which verify that the right thing is happening on certain errors.
Some ->command_fails() calls in recovery tests that were verifying that
psql failed also became ->psql() calls now.

Author: Craig Ringer. Some tweaks by Álvaro Herrera
Reviewed-By: Michaël Paquier
2016-03-03 17:58:30 -03:00
Alvaro Herrera
5bec1ad464 Fix mistakes in recovery tests
One test was relying on method remove_tree that isn't implemented in the
oldest Perl we support; fix it by using the older rmtree instead.

Another test had a typo in a SQL command, which isn't noticed because
the PostgresNode->psql() method doesn't check that queries return
correctly.  That's undesirable and will also be fixed later on, but for
now let's make the test actually work.

Author: Craig Ringer
2016-03-03 12:51:47 -03:00
Alvaro Herrera
5847397dec Minor tweaks for new src/test/recovery
Author: Michael Paquier
2016-02-29 18:16:59 -03:00
Alvaro Herrera
74d58425c7 Apply last revision of recovery patch
I applied the previous-to-last revision of Michaël's submitted patch
instead of the last; these two tweaks pointed out by Craig were left out
of the previous commit by accident.
2016-02-26 16:22:53 -03:00
Alvaro Herrera
49148645f7 Add a test framework for recovery
This long-awaited framework is an expansion of the existing PostgresNode
stuff to support additional features for recovery testing; the recovery
tests included in this commit are a starting point that cover some of
the recovery features we have.  More scripts are expected to be added
later.

Author: Michaël Paquier, a bit of help from Amir Rohan
Reviewed by: Amir Rohan, Stas Kelvich, Kyotaro Horiguchi, Victor Wagner,
Craig Ringer, Álvaro Herrera
Discussion: http://www.postgresql.org/message-id/CAB7nPqTf7V6rswrFa=q_rrWeETUWagP=h8LX8XAov2Jcxw0DRg@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/trinity-b4a8035d-59af-4c42-a37e-258f0f28e44a-1443795007012@3capp-mailcom-lxa08
2016-02-26 16:13:30 -03:00