Commit Graph

77 Commits

Author SHA1 Message Date
Michael Paquier 3c64dcb1e3 Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed.  This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed.  When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed.  Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.

Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me.  The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore.  The test gets easily supported down to 10, still it
has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05 10:46:18 +09:00
Tom Lane c992dca26e Clarify the README files for the various separate TAP-based test suites.
Explain the difference between "make check" and "make installcheck".
Mention the need for --enable-tap-tests (only some of these did so
before).  Standardize their wording about how to run the tests.
2018-06-19 19:30:50 -04:00
Andrew Dunstan 3a7cc727c7 Don't fall off the end of perl functions
This complies with the perlcritic policy
Subroutines::RequireFinalReturn, which is a severity 4 policy. Since we
only currently check at severity level 5, the policy is raised to that
level until we move to level 4 or lower, so that any new infringements
will be caught.

A small cosmetic piece of tidying of the pgperlcritic script is
included.

Mike Blackwell

Discussion: https://postgr.es/m/CAESHdJpfFm_9wQnQ3koY3c91FoRQsO-fh02za9R3OEMndOn84A@mail.gmail.com
2018-05-27 09:08:42 -04:00
Andrew Dunstan 35361ee788 Restrict vertical tightness to parentheses in Perl code
The vertical tightness settings collapse vertical whitespace between
opening and closing brackets (parentheses, square brakets and braces).
This can make data structures in particular harder to read, and is not
very consistent with our style in non-Perl code. This patch restricts
that setting to parentheses only, and reformats all the perl code
accordingly. Not applying this to parentheses has some unfortunate
effects, so the consensus is to keep the setting for parentheses and not
for the others.

The diff for this patch does highlight some places where structures
should have trailing commas. They can be added manually, as there is no
automatic tool to do so.

Discussion: https://postgr.es/m/a2f2b87c-56be-c070-bfc0-36288b4b41c1@2ndQuadrant.com
2018-05-09 10:14:46 -04:00
Peter Eisentraut 76ece16974 perltidy: Add option --nooutdent-long-comments 2018-04-27 11:37:43 -04:00
Peter Eisentraut d4f16d5071 perltidy: Add option --nooutdent-long-quotes 2018-04-27 11:37:43 -04:00
Tom Lane f04d4ac919 Reindent Perl files with perltidy version 20170521.
Discussion: https://postgr.es/m/CABUevEzK3cNiHZQ18f5tK0guoT+cN_jWeVzhYYxY=r+1Q3SmoA@mail.gmail.com
2018-04-25 14:00:19 -04:00
Andrew Dunstan 9ad21a6957 Don't use an Msys virtual path to create a tablespace
The new unlogged_reinit recovery tests create a new tablespace using
TestLib.pm's tempdir. However, on msys that function returns a virtual
path that isn't understood by Postgres. Here we add a new function to
TestLib.pm to turn such a path into a real path on the underlying file
system, and use it in the new test to create the tablespace. The new
function is essentially a NOOP everywhere but msys.
2018-03-20 08:25:36 +10:30
Peter Eisentraut df411e7c66 Add tests for reinit.c
This tests how the different forks of unlogged tables behave in crash
recovery.

Author: David Steele <david@pgmasters.net>
2018-03-14 09:03:44 -04:00
Peter Eisentraut f5da5683a8 Add installcheck support to more test suites
Several of the test suites under src/test/ were missing an installcheck
target.
2018-01-23 07:11:38 -05:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Noah Misch 6078770c1a In tests, await an LSN no later than the recovery target.
Otherwise, the test fails with "Timed out while waiting for standby to
catch up".  This happened rarely, perhaps only when autovacuum wrote WAL
between our choosing the recovery target and choosing the LSN to await.
Commit b26f7fa6ae fixed one case of this.
Fix two more.  Back-patch to 9.6, which introduced the affected test.

Discussion: https://postgr.es/m/20180101055227.GA2952815@rfd.leadboat.com
2017-12-31 21:58:29 -08:00
Robert Haas 575cead991 Remove redundant line from Makefile.
Masahiko Sawada, reviewed by Michael Paquier

Discussion: http://postgr.es/m/CAD21AoDFes_Mgye-1K89rmTgeU3RxYF3zgTjzCJVq2KzzcpC4A@mail.gmail.com
2017-11-16 15:30:56 -05:00
Andres Freund 784905795f Try to make crash restart test work on windows.
Author: Andres Freund
Tested-By: Andrew Dunstan
Discussion: https://postgr.es/m/20170930224424.ud5ilchmclbl5y5n@alap3.anarazel.de
2017-10-01 15:24:58 -07:00
Tom Lane 01c7d3ef85 Ten-second timeout in 013_crash_restart.pl is not enough, let's try 60.
Per buildfarm member topminnow.
2017-09-23 12:56:31 -04:00
Andres Freund 8d926029e8 Expand expected output for recovery test even further.
I'd assumed that the backend being killed should be able to get out an
error message - but it turns out it's not guaranteed that it's not
still sending a ready-for-query.  Really need to do something about
getting these error message to the client.

Reported-By: Thomas Munro, Tom Lane
Discussion:
	https://postgr.es/m/CAEepm=0TE90nded+bNthP45_PEvGAAr=3gxhHJObL4xmOLtX0w@mail.gmail.com
	https://postgr.es/m/14968.1506101414@sss.pgh.pa.us
2017-09-22 11:35:26 -07:00
Andres Freund 5ada1fcd0c Accept that server might not be able to send error in crash recovery test.
As it turns out we can't rely that the script's monitoring session is
terminated with a proper error by the server, because the session
might be terminated while already trying to send data.

Also improve robustness and error reporting facilities of the test,
developed while debugging this issue.

Discussion: https://postgr.es/m/20170920020038.kllxgilo7xzwmtto@alap3.anarazel.de
2017-09-19 21:37:24 -07:00
Andres Freund 1910353675 Make new crash restart test a bit more robust.
Add timeouts in case psql doesn't deliver the expected output, and try
to cause the monitoring psql to be fully connected to a backend.  This
isn't necessarily everything needed, but at least the timeouts should
reduce the pain for buildfarm owners.

Author: Andres Freund
Reported-By: Tom Lane, BF animals prairiedog and calliphoridae
Discussion: https://postgr.es/m/E1du6ZT-00043I-91@gemulon.postgresql.org
2017-09-19 10:39:52 -07:00
Andres Freund a1924a4ea2 Add test for postmaster crash restarts.
Given that I managed to break this...  We probably should extend the
tests to also cover other sub-processes dying, but that's something
for later.

Author: Andres Freund
Discussion: https://postgr.es/m/20170917080752.rcmihzfmgbeuqjk2@alap3.anarazel.de
2017-09-18 17:25:53 -07:00
Tom Lane e183530550 Fix RecursiveCopy.pm to cope with disappearing files.
When copying from an active database tree, it's possible for files to be
deleted after we see them in a readdir() scan but before we can open them.
(Once we've got a file open, we don't expect any further errors from it
getting unlinked, though.)  Tweak RecursiveCopy so it can cope with this
case, so as to avoid irreproducible test failures.

Back-patch to 9.6 where this code was added.  In v10 and HEAD, also
remove unused "use RecursiveCopy" in one recovery test script.

Michael Paquier and Tom Lane

Discussion: https://postgr.es/m/24621.1504924323@sss.pgh.pa.us
2017-09-11 22:02:58 -04:00
Alvaro Herrera 89c59b742a Fix two-phase commit test for recovery mode
The original code had a race condition because it never ensured the
standby was caught up before proceeding; add a wait similar to every
other place that does this.

Author: Michaël Paquier
Discussion: https://postgr.es/m/CAB7nPqTm9p+LCm1mVJYvgpwagRK+uibT-pKq0O2-paOWxT62jw@mail.gmail.com
2017-09-01 16:51:55 +02:00
Simon Riggs 4f27c674fd Avoid race condition in logical replication test
Wait for slot to become inactive before continuing.

Author: Petr Jelinek
2017-09-01 14:55:44 +01:00
Tom Lane 21d304dfed Final pgindent + perltidy run for v10. 2017-08-14 17:29:33 -04:00
Tom Lane 3043c1ddd1 Simplify fetch-slot-xmins logic in recovery TAP tests.
Merge wait_slot_xmins() into get_slot_xmins().  At this point the only
place that wasn't doing a wait was the initial-state test, and a wait
there seems pretty harmless.

Michael Paquier

Discussion: https://postgr.es/m/CAB7nPqSp_SLQb2uU7am+sn4V3g1UKv8j3yZU385oAG1cG_BN9Q@mail.gmail.com
2017-08-12 12:08:54 -04:00
Peter Eisentraut a1ef920e27 Remove uses of "slave" in replication contexts
This affects mostly code comments, some documentation, and tests.
Official APIs already used "standby".
2017-08-10 22:55:41 -04:00
Tom Lane 4576a69354 Fix yet another race condition in recovery/t/001_stream_rep.pl.
In commit 5c77690f6, we added polling in front of most of the
get_slot_xmins calls in 001_stream_rep.pl, but today's results from
buildfarm member nightjar show that at least one more poll loop
is needed.

Proactively add a poll loop before the next-to-last get_slot_xmins call
as well.  It may be that there is no race condition there because the
standby_2 server is shut down at that point, but I'm quite tired of
fighting with this test script.  The empirical evidence that it's safe,
from the buildfarm, is no stronger than the evidence for the other
call that nightjar just proved unsafe.

The only remaining get_slot_xmins calls without wait_slot_xmins
protection are the first two, which should be OK since nothing has
happened at that point.  It's tempting to ignore that special case
and merge get_slot_xmins and wait_slot_xmins into a single function.
I didn't go that far though.

Discussion: https://postgr.es/m/18436.1502228036@sss.pgh.pa.us
2017-08-08 18:03:30 -04:00
Tom Lane ec86af9175 Fix another race-condition-ish issue in recovery/t/001_stream_rep.pl.
Buildfarm members hornet and sungazer have shown multiple instances of
"Failed test 'xmin of non-cascaded slot with hs feedback has changed'".
The reason seems to be that the test is checking the current xmin of the
master server's replication slot against a past xmin of the first slave
server's replication slot.  Even though the latter slot is downstream of
the former, it's possible for its reported xmin to be ahead of the former's
reported xmin, because those numbers are updated whenever the respective
downstream walreceiver feels like it (see logic in WalReceiverMain).
Instrumenting this test shows that indeed the slave slot's xmin does often
advance before the master's does, especially if an autovacuum transaction
manages to occur during the relevant window.  If we happen to capture such
an advanced xmin as $xmin, then the subsequent wait_slot_xmins call can
fall through before the master's xmin has advanced at all, and then if it
advances before the get_slot_xmins call, we can get the observed failure.
Yeah, that's a bit of a long chain of deduction, but it's hard to explain
any other way how the test can get past an "xmin <> '$xmin'" check only
to have the next query find that xmin does equal $xmin.

Fix by keeping separate images of the master and slave slots' xmins
and testing their has-xmin-advanced conditions independently.
2017-07-05 23:59:20 -04:00
Peter Eisentraut 6deb52b202 Remove unnecessary pg_is_in_recovery calls in tests
Since pg_ctl promote already waits for recovery to end, these calls are
obsolete.
2017-07-05 13:37:08 -04:00
Tom Lane 647675228f Fix race condition in recovery/t/009_twophase.pl test.
Since reducing pg_ctl's reaction time in commit c61559ec3, some
slower buildfarm members have shown erratic failures in this test.
The reason turns out to be that the test assumes synchronous
replication (because it does not provide any lag time for a commit
to replicate before shutting down the servers), but it had only
enabled sync rep in one direction.  The observed symptoms correspond
to failure to replicate the last committed transaction in the other
direction, which can be expected to happen if the shutdown command
is issued soon enough and we are providing no synchronous-commit
guarantees.

Fix that, and add a bit more paranoid state checking at the bottom
of the script.

Michael Paquier and myself

Discussion: https://postgr.es/m/908.1498965681@sss.pgh.pa.us
2017-07-02 22:01:19 -04:00
Tom Lane 4e15387d2d Try to improve readability of recovery/t/009_twophase.pl test.
The original coding here was very confusing, because it named the
two servers it set up "master" and "slave" even though it swapped
their replication roles multiple times.  At any given point in the
script it was very unobvious whether "$node_master" actually referred
to the server named "master" or the other one.  Instead, pick arbitrary
names for the two servers --- I used "london" and "paris" --- and
distinguish those permanent names from the nonce references $cur_master
and $cur_slave.  Add logging to help distinguish which is which at
any given point.  Also, use distinct data and transaction names to
make all the prepared transactions easily distinguishable in the
postmaster logs.  (There was one place where we intentionally tested
that the server could cope with re-use of a transaction name, but
it seems like one place is sufficient for that purpose.)

Also, add checks at the end to make sure that all the transactions
that were supposed to be committed did survive.

Discussion: https://postgr.es/m/28238.1499010855@sss.pgh.pa.us
2017-07-02 15:27:29 -04:00
Tom Lane de3de0afd7 Improve TAP test function PostgresNode::poll_query_until().
Add an optional "expected" argument to override the default assumption
that we're waiting for the query to return "t".  This allows replacing
a handwritten polling loop in recovery/t/007_sync_rep.pl with use of
poll_query_until(); AFAICS that's the only remaining ad-hoc polling
loop in our TAP tests.

Change poll_query_until() to probe ten times per second not once per
second.  Like some similar changes I've been making recently, the
one-second interval seems to be rooted in ancient traditions rather
than the actual likely wait duration on modern machines.  I'd consider
reducing it further if there were a convenient way to spawn just one
psql for the whole loop rather than one per probe attempt.

Discussion: https://postgr.es/m/12486.1498938782@sss.pgh.pa.us
2017-07-02 14:03:41 -04:00
Tom Lane b0f069d931 Clean up misuse and nonuse of poll_query_until().
Several callers of PostgresNode::poll_query_until() neglected to check
for failure; I do not think that's optional.  Also, rewrite one place
that had reinvented poll_query_until() for no very good reason.
2017-07-01 14:25:09 -04:00
Tom Lane 08aed6604d Eat XIDs more efficiently in recovery TAP test.
The point of this loop is to insert 1000 rows into the test table
and consume 1000 XIDs.  I can't see any good reason why it's useful
to launch 1000 psqls and 1000 backend processes to accomplish that.
Pushing the looping into a plpgsql DO block shaves about 10 seconds
off the runtime of the src/test/recovery TAP tests on my machine;
that's over 10% of the runtime of that test suite.

It is, in fact, sufficiently more efficient that we now demonstrably
need wait_slot_xmins() afterwards, or the slaves' xmins may not have
moved yet.
2017-06-28 22:11:12 -04:00
Tom Lane 5c77690f6f Improve wait logic in TAP tests for streaming replication.
Remove hard-wired sleep(2) delays in 001_stream_rep.pl in favor of using
poll_query_until to check for the desired state to appear.  In addition,
add such a wait before the last test in the script, as it's possible
to demonstrate failures there after upcoming improvements in pg_ctl.

(We might end up adding polling before each of the get_slot_xmins calls in
this script, but I feel no great need to do that until shown necessary.)

In passing, clarify the description strings for some of the test cases.

Michael Paquier and Craig Ringer, pursuant to a complaint from me

Discussion: https://postgr.es/m/8962.1498425057@sss.pgh.pa.us
2017-06-26 14:39:55 -04:00
Bruce Momjian ce55481032 Post-PG 10 beta1 pgperltidy run 2017-05-17 19:01:23 -04:00
Andrew Dunstan 734cb4c2e7 Avoid tests which crash the calling process on Windows
Certain recovery tests use the Perl IPC::Run module's start/kill_kill
method of processing. On at least some versions of perl this causes the
whole process and its caller to crash. If we ever find a better way of
doing these tests they can be re-enabled on this platform. This does not
affect Mingw or Cygwin builds, which use a different perl and a
different shell and so are not affected.
2017-05-12 06:41:23 -04:00
Tom Lane d10c626de4 Rename WAL-related functions and views to use "lsn" not "location".
Per discussion, "location" is a rather vague term that could refer to
multiple concepts.  "LSN" is an unambiguous term for WAL locations and
should be preferred.  Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position".  Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.

David Rowley, minor additional docs hacking by me

Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
2017-05-11 11:49:59 -04:00
Simon Riggs 0352c15e5a Additional tests for subtransactions in recovery
Tests for normal and prepared transactions

Author: Nikhil Sontakke, placed in new test file by me
2017-04-27 14:26:57 +02:00
Fujii Masao 346199dcab Set the priorities of all quorum synchronous standbys to 1.
In quorum-based synchronous replication, all the standbys listed in
synchronous_standby_names equally have chances to be chosen
as synchronous standbys. So they should have the same priority.
However, previously, quorum standbys whose names appear earlier
in the list were given higher priority values though the difference of
those priority values didn't affect the selection of synchronous standbys.
Users could see those "meaningless" priority values in pg_stat_replication
and this was confusing.

This commit gives all the quorum synchronous standbys the same
highest priority, i.e., 1, in order to remove such confusion.

Author: Fujii Masao
Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi
Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com
2017-04-26 01:07:13 +09:00
Tom Lane 8a19c1a373 Make PostgresNode::append_conf append a newline automatically.
Although the documentation for append_conf said clearly that it didn't
add a newline, many test authors seem to have forgotten that ... or maybe
they just consulted the example at the top of the POD documentation,
which clearly shows adding a config entry without bothering to add a
trailing newline.  The worst part of that is that it works, as long as
you don't do it more than once, since the backend isn't picky about
whether config files end with newlines.  So there's not a strong forcing
function reminding test authors not to do it like that.  Upshot is that
this is a terribly fragile way to go about things, and there's at least
one existing test case that is demonstrably broken and not testing what
it thinks it is.

Let's just make append_conf append a newline, instead; that is clearly
way safer than the old definition.

I also cleaned up a few call sites that were unnecessarily ugly.
(I left things alone in places where it's plausible that additional
config lines would need to be added someday.)

Back-patch the change in append_conf itself to 9.6 where it was added,
as having a definitional inconsistency between branches would obviously
be pretty hazardous for back-patching TAP tests.  The other changes are
just cosmetic and don't need to be back-patched.

Discussion: https://postgr.es/m/19751.1492892376@sss.pgh.pa.us
2017-04-22 16:58:15 -04:00
Peter Eisentraut 2e74e636bd Change 'diag' to 'note' in TAP tests
Reduce noise from TAP tests by changing 'diag' to 'note', so output only
goes to the test's log file not stdout, unless in verbose mode.  This
also removes the junk on screen when running the TAP tests in parallel.

Author: Craig Ringer <craig@2ndquadrant.com>
2017-03-28 20:38:06 -04:00
Simon Riggs ff539da316 Cleanup slots during drop database
Automatically drop all logical replication slots associated with a
database when the database is dropped. Previously we threw an ERROR
if a slot existed. Now we throw ERROR only if a slot is active in
the database being dropped.

Craig Ringer
2017-03-28 10:05:21 -04:00
Simon Riggs 5737c12df0 Report catalog_xmin separately in hot_standby_feedback
If the upstream walsender is using a physical replication slot, store the
catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a
slot and has only a PGPROC entry behaviour doesn't change, as we store the
combined xmin and catalog_xmin in the PGPROC entry.

Author: Craig Ringer
2017-03-25 14:07:27 +00:00
Peter Eisentraut cd07f73d32 Fix recovery test hang
The test would hang if a sufficient ~/.psqlrc was present.  Fix by using
psql -X.
2017-03-25 00:10:57 -04:00
Robert Haas 857ee8e391 Add a txid_status function.
If your connection to the database server is lost while a COMMIT is
in progress, it may be difficult to figure out whether the COMMIT was
successful or not.  This function will tell you, provided that you
don't wait too long to ask.  It may be useful in other situations,
too.

Craig Ringer, reviewed by Simon Riggs and by me

Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
2017-03-24 12:00:53 -04:00
Simon Riggs 1148e22a82 Teach xlogreader to follow timeline switches
Uses page-based mechanism to ensure we’re using the correct timeline.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with later features.

Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-22 07:05:12 +00:00
Peter Eisentraut 9ca2dd578d Avoid Perl warning
Perl versions before 5.12 would warn "Use of implicit split to @_ is
deprecated".

Author: Jeff Janes <jeff.janes@gmail.com>
2017-03-22 00:18:49 -04:00
Simon Riggs eb2a6131be Add a pg_recvlogical wrapper to PostgresNode
Allows testing of logical decoding using SQL interface and/or pg_recvlogical
Most logical decoding tests are in contrib/test_decoding. This module
is for work that doesn't fit well there, like where server restarts
are required.

Craig Ringer
2017-03-21 14:04:49 +00:00
Robert Haas caa6c1f193 TAP tests for target_session_attrs connection parameter.
Michael Paquier
2017-02-26 23:41:23 +05:30
Alvaro Herrera 30820982b2 Add tests for two-phase commit
There's some ongoing performance work on this area, so let's make sure
we don't break things.

Extracted from a larger patch originally by Stas Kelvich.

Authors: Stas Kelvich, Nikhil Sontakke, Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxfsuLLOg=h5cTg3g77Jjk-UGnt=RW7zK57zBSoFsapiWA@mail.gmail.com
2017-02-21 19:00:45 -03:00