postgresql

mirror of https://git.postgresql.org/git/postgresql.git synced 2024-10-07 22:46:52 +02:00

Author	SHA1	Message	Date
Michael Paquier	f9c1b8dba4	Remove some unstable parts from new TAP test for archive status check The test is proving to have timing issues when looking at archive status files on standbys after crash recovery, while other parts of the test rely on pg_stat_archiver as a wait point to make sure that a given state of the archiving is reached. The coverage is not heavily impacted by the removal those extra tests. Per reports from several buildfarm animals, like crake, piculet, culicidae and francolin. Discussion: https://postgr.es/m/20200424005929.GK33034@paquier.xyz Backpatch-through: 9.5	2020-04-24 11:33:41 +09:00
Michael Paquier	4e87c4836a	Fix handling of WAL segments ready to be archived during crash recovery `78ea8b5` has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5	2020-04-24 08:48:28 +09:00
Noah Misch	d60cfb6bf2	Add a wait_for_catchup() before immediate stop of a test master. Per buildfarm member hoverfly, a slow walsender could make the test fail. Back-patch to v10, where the test was introduced. Discussion: https://postgr.es/m/20200414013849.GA886648@rfd.leadboat.com	2020-04-13 18:47:28 -07:00
Tom Lane	e083fa34ce	Further stabilize results of 019_replslot_limit.pl. Depending on specific values of restart_lsn or pg_current_wal_lsn() is obviously unsafe. The previous coding tried to dodge this issue by rounding off, but that's not good enough, as shown by multiple buildfarm members. Nuke all the uses of these values except for null-ness checks, pending some credible argument why we should think something else could be usefully stable. Kyotaro Horiguchi, further modified by me Discussion: https://postgr.es/m/E1jM1Sa-0003mS-99@gemulon.postgresql.org	2020-04-09 17:28:58 -04:00
Alexander Korotkov	1aac32df89	Revert `0f5ca02f53` `0f5ca02f53` introduces 3 new keywords. It appears to be too much for relatively small feature. Given now we past feature freeze, it's already late for discussion of the new syntax. So, revert. Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us	2020-04-08 11:37:27 +03:00
Alvaro Herrera	9e9abed746	Remove testing for precise LSN/reserved bytes in new TAP test Trying to ensure that a slot's restart_lsn or amount of reserved bytes exactly match some specific values seems unnecessary, and fragile as shown by failures in multiple buildfarm members. Discussion: https://postgr.es/m/20200407232602.GA21559@alvherre.pgsql	2020-04-07 23:28:27 -04:00
Alvaro Herrera	7e2ffb3885	Appease perlcritic Food for the gods must always be found somehow, even when the land starves.	2020-04-07 19:09:55 -04:00
Alvaro Herrera	c655077639	Allow users to limit storage reserved by replication slots Replication slots are useful to retain data that may be needed by a replication system. But experience has shown that allowing them to retain excessive data can lead to the primary failing because of running out of space. This new feature allows the user to configure a maximum amount of space to be reserved using the new option max_slot_wal_keep_size. Slots that overrun that space are invalidated at checkpoint time, enabling the storage to be released. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp	2020-04-07 18:35:00 -04:00
Alexander Korotkov	0f5ca02f53	Implement waiting for given lsn at transaction start This commit adds following optional clause to BEGIN and START TRANSACTION commands. WAIT FOR LSN lsn [ TIMEOUT timeout ] New clause pospones transaction start till given lsn is applied on standby. This clause allows user be sure, that changes previously made on primary would be visible on standby. New shared memory struct is used to track awaited lsn per backend. Recovery process wakes up backend once required lsn is applied. Author: Ivan Kartyshov, Anna Akenteva Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs Reviewed-by: Amit Kapila, Alexander Korotkov Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru	2020-04-07 23:51:10 +03:00
Thomas Munro	4c04be9b05	Introduce xid8-based functions to replace txid_XXX. The txid_XXX family of fmgr functions exposes 64 bit transaction IDs to users as int8. Now that we have an SQL type xid8 for FullTransactionId, define a new set of functions including pg_current_xact_id() and pg_current_snapshot() based on that. Keep the old functions around too, for now. It's a bit sneaky to use the same C functions for both, but since the binary representation is identical except for the signedness of the type, and since older functions are the ones using the wrong signedness, and since we'll presumably drop the older ones after a reasonable period of time, it seems reasonable to switch to FullTransactionId internally and share the code for both. Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com> Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com> Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de	2020-04-07 12:04:32 +12:00
Noah Misch	70de4e950c	Add perl2host call missing from a new test file. Oversight in today's commit `c6b92041d3`. Per buildfarm member jacana. Discussion: http://postgr.es/m/20200404223212.GC3442685@rfd.leadboat.com	2020-04-04 15:45:45 -07:00
Noah Misch	c6b92041d3	Skip WAL for new relfilenodes, under wal_level=minimal. Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN. Future servers accept older WAL, so this bump is discretionary. Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org	2020-04-04 12:25:34 -07:00
Alvaro Herrera	1e6148032e	Allow walreceiver configuration to change on reload The parameters primary_conninfo, primary_slot_name and wal_receiver_create_temp_slot can now be changed with a simple "reload" signal, no longer requiring a server restart. This is achieved by signalling the walreceiver process to terminate and having it start again with the new values. Thanks to Andres Freund, Kyotaro Horiguchi, Fujii Masao for discussion. Author: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/19513901543181143@sas1-19a94364928d.qloud-c.yandex.net	2020-03-27 19:51:37 -03:00
Noah Misch	de9396326e	Revert "Skip WAL for new relfilenodes, under wal_level=minimal." This reverts commit `cb2fd7eac2`. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com	2020-03-22 09:24:09 -07:00
Noah Misch	cb2fd7eac2	Skip WAL for new relfilenodes, under wal_level=minimal. Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org	2020-03-21 09:38:26 -07:00
Michael Paquier	7ca8c9706a	Fix typo in recently-added TAP test for replication slots Oversight in commit `b0afdca`.	2020-01-31 13:57:56 +09:00
Michael Paquier	b0afdcad21	Fix slot data persistency when advancing physical replication slots Advancing a physical replication slot with pg_replication_slot_advance() did not mark the slot as dirty if any advancing was done, preventing the follow-up checkpoint to flush the slot data to disk. This caused the advancing to be lost even on clean restarts. This does not happen for logical slots as any advancing marked the slot as dirty. Per discussion, the original feature has been implemented so as in the event of a crash the slot may move backwards to a past LSN. This property is kept and more documentation is added about that. This commit adds some new TAP tests to check the persistency of physical and logical slots after advancing across clean restarts. Author: Alexey Kondratov, Michael Paquier Reviewed-by: Andres Freund, Kyotaro Horiguchi, Craig Ringer Discussion: https://postgr.es/m/059cc53a-8b14-653a-a24d-5f867503b0ee@postgrespro.ru Backpatch-through: 11	2020-01-30 11:14:02 +09:00
Peter Eisentraut	dc788668bb	Fail if recovery target is not reached Before, if a recovery target is configured, but the archive ended before the target was reached, recovery would end and the server would promote without further notice. That was deemed to be pretty wrong. With this change, if the recovery target is not reached, it is a fatal error. Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no	2020-01-29 15:58:14 +01:00
Amit Kapila	ed7bb5c311	Revert test added by commit `d207038053`. This test was trying to test the mechanism to release kernel FDs as needed to get us under the max_safe_fds limit in case of spill files. To do that, it needs to set max_files_per_process to a very low value which doesn't even permit starting of the server in the case when there are a few already opened files. This test also won't work on platforms where we use one FD per semaphore. Backpatch-through: 10, till where this test was added Discussion: https://postgr.es/m/CAA4eK1LHhERi06Q+MmP9qBXBBboi+7WV3910J0aUgz71LcnKAw@mail.gmail.com https://postgr.es/m/6485.1578583522@sss.pgh.pa.us	2020-01-14 07:53:50 +05:30
Amit Kapila	d207038053	Fix running out of file descriptors for spill files. Currently while decoding changes, if the number of changes exceeds a certain threshold, we spill those to disk. And this happens for each (sub)transaction. Now, while reading all these files, we don't close them until we read all the files. While reading these files, if the number of such files exceeds the maximum number of file descriptors, the operation errors out. Use PathNameOpenFile interface to open these files as that internally has the mechanism to release kernel FDs as needed to get us under the max_safe_fds limit. Reported-by: Amit Khandekar Author: Amit Khandekar Reviewed-by: Amit Kapila Backpatch-through: 9.4 Discussion: https://postgr.es/m/CAJ3gD9c-sECEn79zXw4yBnBdOttacoE-6gAyP0oy60nfs_sabQ@mail.gmail.com	2020-01-02 11:41:04 +05:30
Bruce Momjian	7559d8ebfa	Update copyrights for 2020 Backpatch-through: update all files in master, backpatch legal files through 9.4	2020-01-01 12:21:45 -05:00
Amit Kapila	98a9b37ba7	Revert commits `290acac92b` and `8a7e9e9dad`. This commit revert the commits to add a test case that tests the 'force' option when there is an active backend connected to the database being dropped. This feature internally sends SIGTERM to all the backends connected to the database being dropped and then the same is reported to the client. We found that on Windows, the client end of the socket is not able to read the data once we close the socket in the server which leads to loss of error message which is not what we expect. We also observed similar behavior in other cases like pg_terminate_backend(), pg_ctl kill TERM <pid>. There are probably a few others like that. The fix for this requires further study. Discussion: https://postgr.es/m/E1iaD8h-0004us-K9@gemulon.postgresql.org	2019-11-30 08:01:20 +05:30
Amit Kapila	290acac92b	Move pump_until to TestLib.pm. The subroutine pump_until provides the functionality to poll until the given string is matched, or a timeout occurs. This can be used from other places as well, so moving it to TestLib.pm. The immediate need is for an upcoming regression test patch for dropdb utility. Author: Vignesh C Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CAP_rwwmLJJbn70vLOZFpxGw3XD7nLB_7+NKz46H5EOO2k5H7OQ@mail.gmail.com	2019-11-28 08:28:17 +05:30
Michael Paquier	df86e52cac	Remove temporary WAL and history files at the end of archive recovery `cbc55da` has reworked the order of some actions at the end of archive recovery. Unfortunately this overlooked the fact that the startup process needs to remove RECOVERYXLOG (for temporary WAL segment newly recovered from archives) and RECOVERYHISTORY (for temporary history file) at this step, leaving the files around even after recovery ended. Backpatch to 9.5, like the previous commit. Author: Sawada Masahiko Reviewed-by: Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/CAD21AoBO_eDQub6zojFnWtnmutRBWvYf7=cW4Hsqj+U_R26w3Q@mail.gmail.com Backpatch-through: 9.5	2019-10-02 15:53:07 +09:00
Andrew Dunstan	8e5ce1c3f8	Always skip recovery SysV shared memory tests on Windows The test for SysV support currently involves looking for the perl modules IPC::SharedMem and IPC::SysV. However, the perl on msys2 has these modules but the tests fail. Therefore, force skipping the tests on Windows platforms unconditionally. Discussion: https://postgr.es/m/176e86ba-1a46-9d8c-5ae4-9865a463b411@2ndQuadrant.com	2019-09-06 15:50:53 -04:00
Tom Lane	7de19fbc0b	Use data directory inode number, not port, to select SysV resource keys. This approach provides a much tighter binding between a data directory and the associated SysV shared memory block (and SysV or named-POSIX semaphores, if we're using those). Key collisions are still possible, but only between data directories stored on different filesystems, so the situation should be negligible in practice. More importantly, restarting the postmaster with a different port number no longer risks failing to identify a relevant shared memory block, even when postmaster.pid has been removed. A standalone backend is likewise much more certain to detect conflicting leftover backends. (In the longer term, we might now think about deprecating the port as a cluster-wide value, so that one postmaster could support sockets with varying port numbers. But that's for another day.) The hazards fixed here apply only on Unix systems; our Windows code paths already use identifiers derived from the data directory path name rather than the port. src/test/recovery/t/017_shm.pl, which intends to test key-collision cases, has been substantially rewritten since it can no longer use two postmasters with identical port numbers to trigger the case. Instead, use Perl's IPC::SharedMem module to create a conflicting shmem segment directly. The test script will be skipped if that module is not available. (This means that some older buildfarm members won't run it, but I don't think that that results in any meaningful coverage loss.) Patch by me; thanks to Noah Misch and Peter Eisentraut for discussion and review. Discussion: https://postgr.es/m/16908.1557521200@sss.pgh.pa.us	2019-09-05 13:31:46 -04:00
Tom Lane	fb57f40eec	Fix 007_sync_rep.pl to notice failures in ALTER SYSTEM SET. If a test case tried to set an invalid value of synchronous_standby_names, the test script didn't detect that, which seems like a bad idea. Noticed while testing a proposed patch that broke some of these test cases.	2019-08-26 17:02:52 -04:00
Michael Paquier	66bde49d96	Fix inconsistencies and typos in the tree, take 10 This addresses some issues with unnecessary code comments, fixes various typos in docs and comments, and removes some orphaned structures and definitions. Author: Alexander Lakhin Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com	2019-08-13 13:53:41 +09:00
Michael Paquier	c8e177f0bb	Fix failure with pgperlcritic from the TAP test of synchronous replication Oversight in `7d81bdc`, which introduced a new routine in perl lacking a return clause. Per buildfarm member crake. Backpatch down to 9.6 like its parent. Reported-by: Andrew Dunstan Discussion: https://postgr.es/m/16da29fa-d504-1380-7095-40de586dc038@2ndQuadrant.com Backpatch-through: 9.6	2019-07-25 07:55:23 +09:00
Michael Paquier	7d81bdc8c0	Improve stability of TAP test for synchronous replication Slow buildfarm machines have run into issues with this TAP test caused by a race condition related to the startup of a set of standbys, where it is possible to finish with an unexpected order in the WAL sender array of the primary. This closes the race condition by making sure that any standby started is registered into the WAL sender array of the primary before starting the next one based on lookups of pg_stat_replication. Backpatch down to 9.6 where the test has been introduced. Author: Michael Paquier Reviewed-by: Álvaro Herrera, Noah Misch Discussion: https://postgr.es/m/20190617055145.GB18917@paquier.xyz Backpatch-through: 9.6	2019-07-24 10:53:39 +09:00
Michael Paquier	23bccc823d	Fix inconsistencies and typos in the tree This is numbered take 7, and addresses a set of issues with code comments, variable names and unreferenced variables. Author: Alexander Lakhin Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com	2019-07-22 10:01:50 +09:00
Noah Misch	660a2b1903	Consolidate methods for translating a Perl path to a Windows path. This fixes some TAP suites when using msys Perl and a builddir located in an msys mount point other than "/". For example, builddir=/c/pg exhibited the problem, since /c/pg falls in mount point "/c". Back-patch to 9.6, where tests first started to perform such translations. In back branches, offer both new and old APIs. Reviewed by Andrew Dunstan. Discussion: https://postgr.es/m/20190610045838.GA238501@rfd.leadboat.com	2019-06-21 20:34:23 -07:00
Alvaro Herrera	b976845815	Fix double-word typos Discussion: https://postgr.es/m/20190612184527.GA24266@alvherre.pgsql Reviewed-by: Michaël Paquier	2019-06-13 10:03:56 -04:00
Noah Misch	31d250e049	Update stale comments, and fix comment typos.	2019-06-08 10:12:26 -07:00
Tom Lane	db6e2b4c52	Initial pgperltidy run for v12. Make all the perl code look nice, too (for some value of "nice").	2019-05-22 13:36:19 -04:00
Michael Paquier	5a9323eab6	Remove dependency to pageinspect in recovery tests If contrib/pageinspect is not installed, this causes the test checking the minimum recovery point to fail. The point is that the dependency with pageinspect is not really necessary as the test does also all checks with an offline cluster by scanning directly the on-disk pages, which is enough for the purpose of the test. Per complaint from Tom Lane. Author: Michael Paquier Discussion: https://postgr.es/m/17806.1555566345@sss.pgh.pa.us	2019-04-19 15:51:23 +09:00
Noah Misch	e12a472612	Don't write to stdin of a test process that could have already exited. Instead, close that stdin. Per buildfarm member conchuela. Back-patch to 9.6, where the test was introduced. Discussion: https://postgr.es/m/26478.1555373328@sss.pgh.pa.us	2019-04-15 18:13:44 -07:00
Michael Paquier	c34677fdaa	Fix SHOW ALL command for non-superusers with replication connection Since Postgres 10, SHOW commands can be triggered with replication connections in a WAL sender context, however it missed that a transaction context is needed for syscache lookups. This commit makes sure that the syscache lookups can happen correctly by setting a transaction context when running SHOW commands in a WAL sender. Superuser-only parameters can be displayed using SHOW commands not only to superusers, but also to members of system role pg_read_all_settings, which requires a syscache lookup to check if the connected role is a member of this system role or not, or the instance crashes. Superusers do not need to check the syscache so it worked correctly in this case. New tests are added to cover this issue. Reported-by: Alexander Kukushkin Author: Michael Paquier Reviewed-by: Álvaro Herrera Discussion: https://postgr.es/m/15734-2daa8761eeed8e20@postgresql.org Backpatch-through: 10	2019-04-15 12:34:32 +09:00
Noah Misch	9daefff122	MSYS: Translate REGRESS_SHLIB to a Windows file name. Per buildfarm member jacana. Back-patch to v11; earlier branches skip the affected test under msys. Discussion: https://postgr.es/m/GrdLgAdUK9FdyZg8VIcTDKVOkys122ZINEb3CjjoySfGj2KyPiMKTh1zqtRp0TAD7FJ27G-OBB3eplxIB5GhcQH5o8zzGZfp0MuJaXJxVxk=@yesql.se	2019-04-14 00:42:34 -07:00
Noah Misch	c098509927	Consistently test for in-use shared memory. postmaster startup scrutinizes any shared memory segment recorded in postmaster.pid, exiting if that segment matches the current data directory and has an attached process. When the postmaster.pid file was missing, a starting postmaster used weaker checks. Change to use the same checks in both scenarios. This increases the chance of a startup failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1 postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster will no longer stop if shmat() of an old segment fails with EACCES. A postmaster will no longer recycle segments pertaining to other data directories. That's good for production, but it's bad for integration tests that crash a postmaster and immediately delete its data directory. Such a test now leaks a segment indefinitely. No "make check-world" test does that. win32_shmem.c already avoided all these problems. In 9.6 and later, enhance PostgresNode to facilitate testing. Back-patch to 9.4 (all supported versions). Reviewed (in earlier versions) by Daniel Gustafsson and Kyotaro HORIGUCHI. Discussion: https://postgr.es/m/20190408064141.GA2016666@rfd.leadboat.com	2019-04-12 22:36:38 -07:00
Noah Misch	82150a05be	Revert "Consistently test for in-use shared memory." This reverts commits `2f932f71d9`, `16ee6eaf80` and `6f0e190056`. The buildfarm has revealed several bugs. Back-patch like the original commits. Discussion: https://postgr.es/m/20190404145319.GA1720877@rfd.leadboat.com	2019-04-05 00:00:52 -07:00
Noah Misch	16ee6eaf80	Make src/test/recovery/t/017_shm.pl safe for concurrent execution. Buildfarm members idiacanthus and komodoensis, which share a host, both executed this test in the same second. That failed. Back-patch to 9.6, where the test first appeared. Discussion: https://postgr.es/m/20190404020543.GA1319573@rfd.leadboat.com	2019-04-03 23:16:37 -07:00
Noah Misch	2f932f71d9	Consistently test for in-use shared memory. postmaster startup scrutinizes any shared memory segment recorded in postmaster.pid, exiting if that segment matches the current data directory and has an attached process. When the postmaster.pid file was missing, a starting postmaster used weaker checks. Change to use the same checks in both scenarios. This increases the chance of a startup failure, in lieu of data corruption, if the DBA does "kill -9 `head -n1 postmaster.pid` && rm postmaster.pid && pg_ctl -w start". A postmaster will no longer recycle segments pertaining to other data directories. That's good for production, but it's bad for integration tests that crash a postmaster and immediately delete its data directory. Such a test now leaks a segment indefinitely. No "make check-world" test does that. win32_shmem.c already avoided all these problems. In 9.6 and later, enhance PostgresNode to facilitate testing. Back-patch to 9.4 (all supported versions). Reviewed by Daniel Gustafsson and Kyotaro HORIGUCHI. Discussion: https://postgr.es/m/20130911033341.GD225735@tornado.leadboat.com	2019-04-03 17:03:46 -07:00
Peter Eisentraut	8e93a516e6	Don't propagate PGAPPNAME through pg_ctl in tests When libpq is loaded in the server (for instance, by libpqwalreceiver), it may use libpq environment variables set in the postmaster environment for connection parameter defaults. This has some confusing effects in our test suites. For example, the TAP test infrastructure sets PGAPPNAME to allow identifying clients in the server log. But this environment variable is also inherited by temporary servers started with pg_ctl and is then in turn used by libpqwalreceiver as the application_name for connecting to remote servers where it then shows up in pg_stat_replication and is relevant for things like synchronous_standby_names. Replication already has a suitable default for application_name, and overriding that accidentally then requires the individual test cases to re-override that, which is all very confusing and unnecessary. To fix, unset PGAPPNAME temporarily before running pg_ctl start or restart in the tests. More comprehensive approaches like unsetting all environment variables in pg_ctl were considered but might be too complicated to achieve portably. The now unnecessary re-overriding of application_name by test cases is also removed. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/flat/33383613-690e-6f1b-d5ba-4957ff40f6ce@2ndquadrant.com	2019-03-16 08:58:40 +01:00
Michael Paquier	364298be22	Fix race condition in recently-added TAP test for recovery consistency A couple of queries are run on the primary to create and fill in a test table, which gets checked on the standby afterwards. However the test was not waiting for the confirmation that the necessary records have been replayed on the standby, leading to spurious failures. Per buildfarm member loach. Thanks to Thomas Munro for the report and Tom Lane for the failure analysis. Discussion: https://postgr.es/m/CA+hUKGLUpqG52xtriUz5RpmeKPoEfNxNc-CginG+Cx+X2-Ycew@mail.gmail.com	2019-03-14 12:41:45 +09:00
Michael Paquier	b0825d28ea	Add TAP test to check consistency of minimum recovery LSN `c186ba13` has fixed an issue related to the updates of the minimum recovery LSN across multiple processes on standbys, but we never really had a test case able to reliably check its logic. This commit introduces a new test case to close the gap, and is designed to check the consistency of data based on the minimum recovery point set by either the startup process or the checkpointer for both an offline cluster (by looking at the on-disk page headers) and an online cluster (using pageinspect). Note that with `c186ba13` reverted, this test fails badly for both the online and offline cases, as designed. Author: Michael Paquier, Andrew Gierth Reviewed-by: Andrew Gierth, Georgios Kokolatos, Arthur Zakirov Discussion: https://postgr.es/m/20181108044525.GA17482@paquier.xyz	2019-03-13 14:58:24 +09:00
Andrew Dunstan	4eff1e9f0b	Allow recovery tests to run on Windows as an admin user This is the only test that fails when run as an admin user. The reason is that when Postgres is started via pg_ctl its admin privileges are lowered. However, this test called 'postgres -D datadir' directly, resulting in a failure. Replace that by calling pg_ctl and then checking the result for the expected failure, and the logfile for the expected error message.	2019-03-04 15:54:02 -05:00
Michael Paquier	b108676708	Add TAP tests for 2PC post-commit callbacks of multixacts at recovery The current set of TAP tests for two-phase transactions include some coverage for post-commit callbacks of multixact, but it lacked tests for testing the recovery of those callbacks. This commit adds some tests with soft and hard restarts in this case, using transactions which include DDLs. Author: Michael Paquier Reviewed-by: Oleksii Kliukin Discussion: https://postgr.es/m/20190221055431.GO15532@paquier.xyz	2019-02-23 08:19:31 +09:00
Tom Lane	d6f6f0fc2d	Allow for yet another crash symptom in 013_crash_restart.pl. Given the right timing, psql could emit "connection to server was lost" rather than one of the other messages that this test script checked for. It looks like commit `4247db625` may have made this more likely, but I don't really believe it was impossible before then. Rather than stress about it, just add that spelling as one of the crash-successfully- detected cases. Discussion: https://postgr.es/m/19344.1548554028@sss.pgh.pa.us	2019-01-26 22:12:48 -05:00
Peter Eisentraut	0acb3bc33a	Change default of recovery_target_timeline to 'latest' This is what one usually wants for recovery and almost always wants for a standby. Discussion: https://www.postgresql.org/message-id/flat/6dd2c23a-4162-8469-410f-bfe146e28c0c@2ndquadrant.com/ Reviewed-by: David Steele <david@pgmasters.net> Reviewed-by: Michael Paquier <michael@paquier.xyz>	2019-01-13 10:01:05 +01:00
Bruce Momjian	97c39498e5	Update copyright for 2019 Backpatch-through: certain files through 9.4	2019-01-02 12:44:25 -05:00
Noah Misch	1db439ad49	Raise some timeouts to 180s, in test code. Slow runs of buildfarm members chipmunk, hornet and mandrill saw the shorter timeouts expire. The 180s timeout in poll_query_until has been trouble-free since `2a0f89cd71` introduced it two years ago, so use 180s more widely. Back-patch to 9.6, where the first of these timeouts was introduced. Reviewed by Michael Paquier. Discussion: https://postgr.es/m/20181209001601.GC2973271@rfd.leadboat.com	2018-12-10 20:15:42 -08:00
Peter Eisentraut	f2cbffc7a6	Only allow one recovery target setting The previous recovery.conf regime accepted multiple recovery_target* settings and used the last one. This does not translate well to the general GUC system. Specifically, under EXEC_BACKEND, the settings are written out not in any particular order, so the order in which they were originally set is not available to new processes. Rather than redesign the GUC system, it was decided to abandon the old behavior and only allow one recovery target setting. A second setting will cause an error. However, it is allowed to set the same parameter multiple times or unset a parameter and set a different one. Discussion: https://www.postgresql.org/message-id/flat/27802171543235530%40iva2-6ec8f0a6115e.qloud-c.yandex.net#701a59c837ad0bf8c244344aaf3ef5a4	2018-11-28 13:55:54 +01:00
Peter Eisentraut	2dedf4d9a8	Integrate recovery.conf into postgresql.conf recovery.conf settings are now set in postgresql.conf (or other GUC sources). Currently, all the affected settings are PGC_POSTMASTER; this could be refined in the future case by case. Recovery is now initiated by a file recovery.signal. Standby mode is initiated by a file standby.signal. The standby_mode setting is gone. If a recovery.conf file is found, an error is issued. The trigger_file setting has been renamed to promote_trigger_file as part of the move. The documentation chapter "Recovery Configuration" has been integrated into "Server Configuration". pg_basebackup -R now appends settings to postgresql.auto.conf and creates a standby.signal file. Author: Fujii Masao <masao.fujii@gmail.com> Author: Simon Riggs <simon@2ndquadrant.com> Author: Abhijit Menon-Sen <ams@2ndquadrant.com> Author: Sergei Kornilov <sk@zsrv.org> Discussion: https://www.postgresql.org/message-id/flat/607741529606767@web3g.yandex.ru/	2018-11-25 16:33:40 +01:00
Andres Freund	691d79a079	Disallow starting server with insufficient wal_level for existing slot. Previously it was possible to create a slot, change wal_level, and restart, even if the new wal_level was insufficient for the slot. That's a problem for both logical and physical slots, because the necessary WAL records are not generated. This removes a few tests in newer versions that, somewhat inexplicably, whether restarting with a too low wal_level worked (a buggy behaviour!). Reported-By: Joshua D. Drake Author: Andres Freund Discussion: https://postgr.es/m/20181029191304.lbsmhshkyymhw22w@alap3.anarazel.de Backpatch: 9.4-, where replication slots where introduced	2018-10-31 15:46:39 -07:00
Michael Paquier	10074651e3	Add pg_promote function This function is able to promote a standby with this new SQL-callable function. Execution access can be granted to non-superusers so that failover tools can observe the principle of least privilege. Catalog version is bumped. Author: Laurenz Albe Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at	2018-10-25 09:46:00 +09:00
Tom Lane	d11eae09e4	Fix bogus loop logic in 013_crash_restart test's pump_until subroutine. The pump_nb() step might've already received the desired data, so we must check for that at the top of the loop not the bottom. Otherwise, the call to pump() will sit with nothing to do until the timeout elapses. pump_until then falls out with apparent success ... but the timeout has been used up, causing the next call of pump_until to report a timeout failure. I believe this explains the intermittent timeout failures we've seen in the buildfarm ever since this test went in. I was able to reproduce the problem on gaur semi-repeatably, and this appears to fix it. In passing, remove a duplicate assignment, fix one stdin-assignment to look like the rest, and document the test's dependency on test_decoding.	2018-08-12 18:05:49 -04:00
Tom Lane	a360f952ff	Remove race-prone hot_standby_feedback test cases in 001_stream_rep.pl. This script supposed that if it turned hot_standby_feedback on and then shut down the standby server, at least one feedback message would be guaranteed to be sent before the standby stops. But there is no such guarantee, if the standby's walreceiver process is slow enough --- and we've seen multiple failures in the buildfarm showing that that does happen in practice. While we could rearrange the walreceiver logic to make it less likely, it seems probably impossible to create a really bulletproof guarantee of that sort; and if we tried, we might create situations where the walreceiver wouldn't react in a timely manner to shutdown commands. It seems better instead to remove the script's assumption that feedback will occur before shutdown. But once we do that, these last few tests seem quite redundant with the earlier tests in the script. So let's just drop them altogether and save some buildfarm cycles. Backpatch to v10 where these tests were added. Discussion: https://postgr.es/m/1922.1531592205@sss.pgh.pa.us	2018-07-18 17:39:27 -04:00
Michael Paquier	3c64dcb1e3	Prevent references to invalid relation pages after fresh promotion If a standby crashes after promotion before having completed its first post-recovery checkpoint, then the minimal recovery point which marks the LSN position where the cluster is able to reach consistency may be set to a position older than the first end-of-recovery checkpoint while all the WAL available should be replayed. This leads to the instance thinking that it contains inconsistent pages, causing a PANIC and a hard instance crash even if all the WAL available has not been replayed for certain sets of records replayed. When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the WAL available, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery or if crash recovery finishes, then the local copy minRecoveryPoint can be safely updated. Pavan Deolasee has reported and diagnosed the failure in the first place, and the base fix idea to rely on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee, which I have modified to make it faster and more reliable with sleep phases. Backpatch down to all supported versions where the bug appears, aka 9.3 which is where the end-of-recovery checkpoint is not run by the startup process anymore. The test gets easily supported down to 10, still it has been tested on all branches. Reported-by: Pavan Deolasee Diagnosed-by: Pavan Deolasee Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com	2018-07-05 10:46:18 +09:00
Tom Lane	c992dca26e	Clarify the README files for the various separate TAP-based test suites. Explain the difference between "make check" and "make installcheck". Mention the need for --enable-tap-tests (only some of these did so before). Standardize their wording about how to run the tests.	2018-06-19 19:30:50 -04:00
Andrew Dunstan	3a7cc727c7	Don't fall off the end of perl functions This complies with the perlcritic policy Subroutines::RequireFinalReturn, which is a severity 4 policy. Since we only currently check at severity level 5, the policy is raised to that level until we move to level 4 or lower, so that any new infringements will be caught. A small cosmetic piece of tidying of the pgperlcritic script is included. Mike Blackwell Discussion: https://postgr.es/m/CAESHdJpfFm_9wQnQ3koY3c91FoRQsO-fh02za9R3OEMndOn84A@mail.gmail.com	2018-05-27 09:08:42 -04:00
Andrew Dunstan	35361ee788	Restrict vertical tightness to parentheses in Perl code The vertical tightness settings collapse vertical whitespace between opening and closing brackets (parentheses, square brakets and braces). This can make data structures in particular harder to read, and is not very consistent with our style in non-Perl code. This patch restricts that setting to parentheses only, and reformats all the perl code accordingly. Not applying this to parentheses has some unfortunate effects, so the consensus is to keep the setting for parentheses and not for the others. The diff for this patch does highlight some places where structures should have trailing commas. They can be added manually, as there is no automatic tool to do so. Discussion: https://postgr.es/m/a2f2b87c-56be-c070-bfc0-36288b4b41c1@2ndQuadrant.com	2018-05-09 10:14:46 -04:00
Peter Eisentraut	76ece16974	perltidy: Add option --nooutdent-long-comments	2018-04-27 11:37:43 -04:00
Peter Eisentraut	d4f16d5071	perltidy: Add option --nooutdent-long-quotes	2018-04-27 11:37:43 -04:00
Tom Lane	f04d4ac919	Reindent Perl files with perltidy version 20170521. Discussion: https://postgr.es/m/CABUevEzK3cNiHZQ18f5tK0guoT+cN_jWeVzhYYxY=r+1Q3SmoA@mail.gmail.com	2018-04-25 14:00:19 -04:00
Andrew Dunstan	9ad21a6957	Don't use an Msys virtual path to create a tablespace The new unlogged_reinit recovery tests create a new tablespace using TestLib.pm's tempdir. However, on msys that function returns a virtual path that isn't understood by Postgres. Here we add a new function to TestLib.pm to turn such a path into a real path on the underlying file system, and use it in the new test to create the tablespace. The new function is essentially a NOOP everywhere but msys.	2018-03-20 08:25:36 +10:30
Peter Eisentraut	df411e7c66	Add tests for reinit.c This tests how the different forks of unlogged tables behave in crash recovery. Author: David Steele <david@pgmasters.net>	2018-03-14 09:03:44 -04:00
Peter Eisentraut	f5da5683a8	Add installcheck support to more test suites Several of the test suites under src/test/ were missing an installcheck target.	2018-01-23 07:11:38 -05:00
Bruce Momjian	9d4649ca49	Update copyright for 2018 Backpatch-through: certain files through 9.3	2018-01-02 23:30:12 -05:00
Noah Misch	6078770c1a	In tests, await an LSN no later than the recovery target. Otherwise, the test fails with "Timed out while waiting for standby to catch up". This happened rarely, perhaps only when autovacuum wrote WAL between our choosing the recovery target and choosing the LSN to await. Commit `b26f7fa6ae` fixed one case of this. Fix two more. Back-patch to 9.6, which introduced the affected test. Discussion: https://postgr.es/m/20180101055227.GA2952815@rfd.leadboat.com	2017-12-31 21:58:29 -08:00
Robert Haas	575cead991	Remove redundant line from Makefile. Masahiko Sawada, reviewed by Michael Paquier Discussion: http://postgr.es/m/CAD21AoDFes_Mgye-1K89rmTgeU3RxYF3zgTjzCJVq2KzzcpC4A@mail.gmail.com	2017-11-16 15:30:56 -05:00
Andres Freund	784905795f	Try to make crash restart test work on windows. Author: Andres Freund Tested-By: Andrew Dunstan Discussion: https://postgr.es/m/20170930224424.ud5ilchmclbl5y5n@alap3.anarazel.de	2017-10-01 15:24:58 -07:00
Tom Lane	01c7d3ef85	Ten-second timeout in 013_crash_restart.pl is not enough, let's try 60. Per buildfarm member topminnow.	2017-09-23 12:56:31 -04:00
Andres Freund	8d926029e8	Expand expected output for recovery test even further. I'd assumed that the backend being killed should be able to get out an error message - but it turns out it's not guaranteed that it's not still sending a ready-for-query. Really need to do something about getting these error message to the client. Reported-By: Thomas Munro, Tom Lane Discussion: https://postgr.es/m/CAEepm=0TE90nded+bNthP45_PEvGAAr=3gxhHJObL4xmOLtX0w@mail.gmail.com https://postgr.es/m/14968.1506101414@sss.pgh.pa.us	2017-09-22 11:35:26 -07:00
Andres Freund	5ada1fcd0c	Accept that server might not be able to send error in crash recovery test. As it turns out we can't rely that the script's monitoring session is terminated with a proper error by the server, because the session might be terminated while already trying to send data. Also improve robustness and error reporting facilities of the test, developed while debugging this issue. Discussion: https://postgr.es/m/20170920020038.kllxgilo7xzwmtto@alap3.anarazel.de	2017-09-19 21:37:24 -07:00
Andres Freund	1910353675	Make new crash restart test a bit more robust. Add timeouts in case psql doesn't deliver the expected output, and try to cause the monitoring psql to be fully connected to a backend. This isn't necessarily everything needed, but at least the timeouts should reduce the pain for buildfarm owners. Author: Andres Freund Reported-By: Tom Lane, BF animals prairiedog and calliphoridae Discussion: https://postgr.es/m/E1du6ZT-00043I-91@gemulon.postgresql.org	2017-09-19 10:39:52 -07:00
Andres Freund	a1924a4ea2	Add test for postmaster crash restarts. Given that I managed to break this... We probably should extend the tests to also cover other sub-processes dying, but that's something for later. Author: Andres Freund Discussion: https://postgr.es/m/20170917080752.rcmihzfmgbeuqjk2@alap3.anarazel.de	2017-09-18 17:25:53 -07:00
Tom Lane	e183530550	Fix RecursiveCopy.pm to cope with disappearing files. When copying from an active database tree, it's possible for files to be deleted after we see them in a readdir() scan but before we can open them. (Once we've got a file open, we don't expect any further errors from it getting unlinked, though.) Tweak RecursiveCopy so it can cope with this case, so as to avoid irreproducible test failures. Back-patch to 9.6 where this code was added. In v10 and HEAD, also remove unused "use RecursiveCopy" in one recovery test script. Michael Paquier and Tom Lane Discussion: https://postgr.es/m/24621.1504924323@sss.pgh.pa.us	2017-09-11 22:02:58 -04:00
Alvaro Herrera	89c59b742a	Fix two-phase commit test for recovery mode The original code had a race condition because it never ensured the standby was caught up before proceeding; add a wait similar to every other place that does this. Author: Michaël Paquier Discussion: https://postgr.es/m/CAB7nPqTm9p+LCm1mVJYvgpwagRK+uibT-pKq0O2-paOWxT62jw@mail.gmail.com	2017-09-01 16:51:55 +02:00
Simon Riggs	4f27c674fd	Avoid race condition in logical replication test Wait for slot to become inactive before continuing. Author: Petr Jelinek	2017-09-01 14:55:44 +01:00
Tom Lane	21d304dfed	Final pgindent + perltidy run for v10.	2017-08-14 17:29:33 -04:00
Tom Lane	3043c1ddd1	Simplify fetch-slot-xmins logic in recovery TAP tests. Merge wait_slot_xmins() into get_slot_xmins(). At this point the only place that wasn't doing a wait was the initial-state test, and a wait there seems pretty harmless. Michael Paquier Discussion: https://postgr.es/m/CAB7nPqSp_SLQb2uU7am+sn4V3g1UKv8j3yZU385oAG1cG_BN9Q@mail.gmail.com	2017-08-12 12:08:54 -04:00
Peter Eisentraut	a1ef920e27	Remove uses of "slave" in replication contexts This affects mostly code comments, some documentation, and tests. Official APIs already used "standby".	2017-08-10 22:55:41 -04:00
Tom Lane	4576a69354	Fix yet another race condition in recovery/t/001_stream_rep.pl. In commit `5c77690f6`, we added polling in front of most of the get_slot_xmins calls in 001_stream_rep.pl, but today's results from buildfarm member nightjar show that at least one more poll loop is needed. Proactively add a poll loop before the next-to-last get_slot_xmins call as well. It may be that there is no race condition there because the standby_2 server is shut down at that point, but I'm quite tired of fighting with this test script. The empirical evidence that it's safe, from the buildfarm, is no stronger than the evidence for the other call that nightjar just proved unsafe. The only remaining get_slot_xmins calls without wait_slot_xmins protection are the first two, which should be OK since nothing has happened at that point. It's tempting to ignore that special case and merge get_slot_xmins and wait_slot_xmins into a single function. I didn't go that far though. Discussion: https://postgr.es/m/18436.1502228036@sss.pgh.pa.us	2017-08-08 18:03:30 -04:00
Tom Lane	ec86af9175	Fix another race-condition-ish issue in recovery/t/001_stream_rep.pl. Buildfarm members hornet and sungazer have shown multiple instances of "Failed test 'xmin of non-cascaded slot with hs feedback has changed'". The reason seems to be that the test is checking the current xmin of the master server's replication slot against a past xmin of the first slave server's replication slot. Even though the latter slot is downstream of the former, it's possible for its reported xmin to be ahead of the former's reported xmin, because those numbers are updated whenever the respective downstream walreceiver feels like it (see logic in WalReceiverMain). Instrumenting this test shows that indeed the slave slot's xmin does often advance before the master's does, especially if an autovacuum transaction manages to occur during the relevant window. If we happen to capture such an advanced xmin as $xmin, then the subsequent wait_slot_xmins call can fall through before the master's xmin has advanced at all, and then if it advances before the get_slot_xmins call, we can get the observed failure. Yeah, that's a bit of a long chain of deduction, but it's hard to explain any other way how the test can get past an "xmin <> '$xmin'" check only to have the next query find that xmin does equal $xmin. Fix by keeping separate images of the master and slave slots' xmins and testing their has-xmin-advanced conditions independently.	2017-07-05 23:59:20 -04:00
Peter Eisentraut	6deb52b202	Remove unnecessary pg_is_in_recovery calls in tests Since pg_ctl promote already waits for recovery to end, these calls are obsolete.	2017-07-05 13:37:08 -04:00
Tom Lane	647675228f	Fix race condition in recovery/t/009_twophase.pl test. Since reducing pg_ctl's reaction time in commit `c61559ec3`, some slower buildfarm members have shown erratic failures in this test. The reason turns out to be that the test assumes synchronous replication (because it does not provide any lag time for a commit to replicate before shutting down the servers), but it had only enabled sync rep in one direction. The observed symptoms correspond to failure to replicate the last committed transaction in the other direction, which can be expected to happen if the shutdown command is issued soon enough and we are providing no synchronous-commit guarantees. Fix that, and add a bit more paranoid state checking at the bottom of the script. Michael Paquier and myself Discussion: https://postgr.es/m/908.1498965681@sss.pgh.pa.us	2017-07-02 22:01:19 -04:00
Tom Lane	4e15387d2d	Try to improve readability of recovery/t/009_twophase.pl test. The original coding here was very confusing, because it named the two servers it set up "master" and "slave" even though it swapped their replication roles multiple times. At any given point in the script it was very unobvious whether "$node_master" actually referred to the server named "master" or the other one. Instead, pick arbitrary names for the two servers --- I used "london" and "paris" --- and distinguish those permanent names from the nonce references $cur_master and $cur_slave. Add logging to help distinguish which is which at any given point. Also, use distinct data and transaction names to make all the prepared transactions easily distinguishable in the postmaster logs. (There was one place where we intentionally tested that the server could cope with re-use of a transaction name, but it seems like one place is sufficient for that purpose.) Also, add checks at the end to make sure that all the transactions that were supposed to be committed did survive. Discussion: https://postgr.es/m/28238.1499010855@sss.pgh.pa.us	2017-07-02 15:27:29 -04:00
Tom Lane	de3de0afd7	Improve TAP test function PostgresNode::poll_query_until(). Add an optional "expected" argument to override the default assumption that we're waiting for the query to return "t". This allows replacing a handwritten polling loop in recovery/t/007_sync_rep.pl with use of poll_query_until(); AFAICS that's the only remaining ad-hoc polling loop in our TAP tests. Change poll_query_until() to probe ten times per second not once per second. Like some similar changes I've been making recently, the one-second interval seems to be rooted in ancient traditions rather than the actual likely wait duration on modern machines. I'd consider reducing it further if there were a convenient way to spawn just one psql for the whole loop rather than one per probe attempt. Discussion: https://postgr.es/m/12486.1498938782@sss.pgh.pa.us	2017-07-02 14:03:41 -04:00
Tom Lane	b0f069d931	Clean up misuse and nonuse of poll_query_until(). Several callers of PostgresNode::poll_query_until() neglected to check for failure; I do not think that's optional. Also, rewrite one place that had reinvented poll_query_until() for no very good reason.	2017-07-01 14:25:09 -04:00
Tom Lane	08aed6604d	Eat XIDs more efficiently in recovery TAP test. The point of this loop is to insert 1000 rows into the test table and consume 1000 XIDs. I can't see any good reason why it's useful to launch 1000 psqls and 1000 backend processes to accomplish that. Pushing the looping into a plpgsql DO block shaves about 10 seconds off the runtime of the src/test/recovery TAP tests on my machine; that's over 10% of the runtime of that test suite. It is, in fact, sufficiently more efficient that we now demonstrably need wait_slot_xmins() afterwards, or the slaves' xmins may not have moved yet.	2017-06-28 22:11:12 -04:00
Tom Lane	5c77690f6f	Improve wait logic in TAP tests for streaming replication. Remove hard-wired sleep(2) delays in 001_stream_rep.pl in favor of using poll_query_until to check for the desired state to appear. In addition, add such a wait before the last test in the script, as it's possible to demonstrate failures there after upcoming improvements in pg_ctl. (We might end up adding polling before each of the get_slot_xmins calls in this script, but I feel no great need to do that until shown necessary.) In passing, clarify the description strings for some of the test cases. Michael Paquier and Craig Ringer, pursuant to a complaint from me Discussion: https://postgr.es/m/8962.1498425057@sss.pgh.pa.us	2017-06-26 14:39:55 -04:00
Bruce Momjian	ce55481032	Post-PG 10 beta1 pgperltidy run	2017-05-17 19:01:23 -04:00
Andrew Dunstan	734cb4c2e7	Avoid tests which crash the calling process on Windows Certain recovery tests use the Perl IPC::Run module's start/kill_kill method of processing. On at least some versions of perl this causes the whole process and its caller to crash. If we ever find a better way of doing these tests they can be re-enabled on this platform. This does not affect Mingw or Cygwin builds, which use a different perl and a different shell and so are not affected.	2017-05-12 06:41:23 -04:00
Tom Lane	d10c626de4	Rename WAL-related functions and views to use "lsn" not "location". Per discussion, "location" is a rather vague term that could refer to multiple concepts. "LSN" is an unambiguous term for WAL locations and should be preferred. Some function names, view column names, and function output argument names used "lsn" already, but others used "location", as well as yet other terms such as "wal_position". Since we've already renamed a lot of things in this area from "xlog" to "wal" for v10, we may as well incur a bit more compatibility pain and make these names all consistent. David Rowley, minor additional docs hacking by me Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com	2017-05-11 11:49:59 -04:00
Simon Riggs	0352c15e5a	Additional tests for subtransactions in recovery Tests for normal and prepared transactions Author: Nikhil Sontakke, placed in new test file by me	2017-04-27 14:26:57 +02:00
Fujii Masao	346199dcab	Set the priorities of all quorum synchronous standbys to 1. In quorum-based synchronous replication, all the standbys listed in synchronous_standby_names equally have chances to be chosen as synchronous standbys. So they should have the same priority. However, previously, quorum standbys whose names appear earlier in the list were given higher priority values though the difference of those priority values didn't affect the selection of synchronous standbys. Users could see those "meaningless" priority values in pg_stat_replication and this was confusing. This commit gives all the quorum synchronous standbys the same highest priority, i.e., 1, in order to remove such confusion. Author: Fujii Masao Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com	2017-04-26 01:07:13 +09:00
Tom Lane	8a19c1a373	Make PostgresNode::append_conf append a newline automatically. Although the documentation for append_conf said clearly that it didn't add a newline, many test authors seem to have forgotten that ... or maybe they just consulted the example at the top of the POD documentation, which clearly shows adding a config entry without bothering to add a trailing newline. The worst part of that is that it works, as long as you don't do it more than once, since the backend isn't picky about whether config files end with newlines. So there's not a strong forcing function reminding test authors not to do it like that. Upshot is that this is a terribly fragile way to go about things, and there's at least one existing test case that is demonstrably broken and not testing what it thinks it is. Let's just make append_conf append a newline, instead; that is clearly way safer than the old definition. I also cleaned up a few call sites that were unnecessarily ugly. (I left things alone in places where it's plausible that additional config lines would need to be added someday.) Back-patch the change in append_conf itself to 9.6 where it was added, as having a definitional inconsistency between branches would obviously be pretty hazardous for back-patching TAP tests. The other changes are just cosmetic and don't need to be back-patched. Discussion: https://postgr.es/m/19751.1492892376@sss.pgh.pa.us	2017-04-22 16:58:15 -04:00
Peter Eisentraut	2e74e636bd	Change 'diag' to 'note' in TAP tests Reduce noise from TAP tests by changing 'diag' to 'note', so output only goes to the test's log file not stdout, unless in verbose mode. This also removes the junk on screen when running the TAP tests in parallel. Author: Craig Ringer <craig@2ndquadrant.com>	2017-03-28 20:38:06 -04:00
Simon Riggs	ff539da316	Cleanup slots during drop database Automatically drop all logical replication slots associated with a database when the database is dropped. Previously we threw an ERROR if a slot existed. Now we throw ERROR only if a slot is active in the database being dropped. Craig Ringer	2017-03-28 10:05:21 -04:00
Simon Riggs	5737c12df0	Report catalog_xmin separately in hot_standby_feedback If the upstream walsender is using a physical replication slot, store the catalog_xmin in the slot's catalog_xmin field. If the upstream doesn't use a slot and has only a PGPROC entry behaviour doesn't change, as we store the combined xmin and catalog_xmin in the PGPROC entry. Author: Craig Ringer	2017-03-25 14:07:27 +00:00
Peter Eisentraut	cd07f73d32	Fix recovery test hang The test would hang if a sufficient ~/.psqlrc was present. Fix by using psql -X.	2017-03-25 00:10:57 -04:00
Robert Haas	857ee8e391	Add a txid_status function. If your connection to the database server is lost while a COMMIT is in progress, it may be difficult to figure out whether the COMMIT was successful or not. This function will tell you, provided that you don't wait too long to ask. It may be useful in other situations, too. Craig Ringer, reviewed by Simon Riggs and by me Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com	2017-03-24 12:00:53 -04:00
Simon Riggs	1148e22a82	Teach xlogreader to follow timeline switches Uses page-based mechanism to ensure we’re using the correct timeline. Tests are included to exercise the functionality using a cold disk-level copy of the master that's started up as a replica with slots intact, but the intended use of the functionality is with later features. Craig Ringer, reviewed by Simon Riggs and Andres Freund	2017-03-22 07:05:12 +00:00
Peter Eisentraut	9ca2dd578d	Avoid Perl warning Perl versions before 5.12 would warn "Use of implicit split to @_ is deprecated". Author: Jeff Janes <jeff.janes@gmail.com>	2017-03-22 00:18:49 -04:00
Simon Riggs	eb2a6131be	Add a pg_recvlogical wrapper to PostgresNode Allows testing of logical decoding using SQL interface and/or pg_recvlogical Most logical decoding tests are in contrib/test_decoding. This module is for work that doesn't fit well there, like where server restarts are required. Craig Ringer	2017-03-21 14:04:49 +00:00
Robert Haas	caa6c1f193	TAP tests for target_session_attrs connection parameter. Michael Paquier	2017-02-26 23:41:23 +05:30
Alvaro Herrera	30820982b2	Add tests for two-phase commit There's some ongoing performance work on this area, so let's make sure we don't break things. Extracted from a larger patch originally by Stas Kelvich. Authors: Stas Kelvich, Nikhil Sontakke, Michael Paquier Discussion: https://postgr.es/m/CAMGcDxfsuLLOg=h5cTg3g77Jjk-UGnt=RW7zK57zBSoFsapiWA@mail.gmail.com	2017-02-21 19:00:45 -03:00
Robert Haas	806091c96f	Remove all references to "xlog" from SQL-callable functions in pg_proc. Commit `f82ec32ac3` renamed the pg_xlog directory to pg_wal. To make things consistent, and because "xlog" is terrible terminology for either "transaction log" or "write-ahead log" rename all SQL-callable functions that contain "xlog" in the name to instead contain "wal". (Note that this may pose an upgrade hazard for some users.) Similarly, rename the xlog_position argument of the functions that create slots to be called wal_position. Discussion: https://www.postgresql.org/message-id/CA+Tgmob=YmA=H3DbW1YuOXnFVgBheRmyDkWcD9M8f=5bGWYEoQ@mail.gmail.com	2017-02-09 15:10:09 -05:00
Simon Riggs	ec4b975016	Reset hot standby xmin on master after restart Hot_standby_feedback could be reset by reload and worked correctly, but if the server was restarted rather than reloaded the xmin was not reset. Force reset always if hot_standby_feedback is enabled at startup. Ants Aasma, Craig Ringer Reported-by: Ants Aasma	2017-01-26 18:14:02 +00:00
Magnus Hagander	f6d6d2920d	Change default values for backup and replication parameters This changes the default values of the following parameters: wal_level = replica max_wal_senders = 10 max_replication_slots = 10 in order to make it possible to make a backup and set up simple replication on the default settings, without requiring a system restart. Discussion: https://postgr.es/m/CABUevEy4PR_EAvZEzsbF5s+V0eEvw7shJ2t-AUwbHOjT+yRb3A@mail.gmail.com Reviewed by Peter Eisentraut. Benchmark help from Tomas Vondra.	2017-01-14 17:14:56 +01:00
Simon Riggs	0813216cb4	Add 18 new recovery TAP tests Add new tests for physical repl slots and hot standby feedback. Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs	2017-01-04 16:54:28 +00:00
Simon Riggs	fb093e4cb3	Allow PostgresNode.pm tests to wait for catchup Add methods to the core test framework PostgresNode.pm to allow us to test that standby nodes have caught up with the master, as well as basic LSN handling. Used in tests recovery/t/001_stream_rep.pl and recovery/t/004_timeline_switch.pl Craig Ringer, reviewed by Aleksander Alekseev and Simon Riggs	2017-01-04 16:50:23 +00:00
Bruce Momjian	1d25779284	Update copyright via script for 2017	2017-01-03 13:48:53 -05:00
Fujii Masao	3901fd70cc	Support quorum-based synchronous replication. This feature is also known as "quorum commit" especially in discussion on pgsql-hackers. This commit adds the following new syntaxes into synchronous_standby_names GUC. By using FIRST and ANY keywords, users can specify the method to choose synchronous standbys from the listed servers. FIRST num_sync (standby_name [, ...]) ANY num_sync (standby_name [, ...]) The keyword FIRST specifies a priority-based synchronous replication which was available also in 9.6 or before. This method makes transaction commits wait until their WAL records are replicated to num_sync synchronous standbys chosen based on their priorities. The keyword ANY specifies a quorum-based synchronous replication and makes transaction commits wait until their WAL records are replicated to at least num_sync listed standbys. In this method, the values of sync_state.pg_stat_replication for the listed standbys are reported as "quorum". The priority is still assigned to each standby, but not used in this method. The existing syntaxes having neither FIRST nor ANY keyword are still supported. They are the same as new syntax with FIRST keyword, i.e., a priorirty-based synchronous replication. Author: Masahiko Sawada Reviewed-By: Michael Paquier, Amit Kapila and me Discussion: <CAD21AoAACi9NeC_ecm+Vahm+MMA6nYh=Kqs3KB3np+MBOS_gZg@mail.gmail.com> Many thanks to the various individuals who were involved in discussing and developing this feature.	2016-12-19 21:15:30 +09:00
Robert Haas	f267c1c244	Fix possible pg_basebackup failure on standby with "include WAL". If a restartpoint flushed no dirty buffers, it could fail to update the minimum recovery point, leading to a minimum recovery point prior to the starting REDO location. perform_base_backup() would interpret that as meaning that no WAL files at all needed to be included in the backup, failing an internal sanity check. To fix, have restartpoints always update the minimum recovery point to just after the checkpoint record itself, so that the file (or files) containing the checkpoint record will always be included in the backup. Code by Amit Kapila, per a design suggestion by me, with some additional work on the code comment by me. Test case by Michael Paquier. Report by Kyotaro Horiguchi.	2016-10-27 11:19:51 -04:00
Peter Eisentraut	e5a9bcb529	Use pg_ctl promote -w in TAP tests Switch TAP tests to use the new wait mode of pg_ctl promote. This allows avoiding extra logic with poll_query_until() to be sure that a promoted standby is ready for read-write queries. From: Michael Paquier <michael.paquier@gmail.com>	2016-10-19 09:18:50 -04:00
Heikki Linnakangas	917dc7d239	Fix WAL-logging of FSM and VM truncation. When a relation is truncated, it is important that the FSM is truncated as well. Otherwise, after recovery, the FSM can return a page that has been truncated away, leading to errors like: ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes We were using MarkBufferDirtyHint() to dirty the buffer holding the last remaining page of the FSM, but during recovery, that might in fact not dirty the page, and the FSM update might be lost. To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty() requires us to do WAL-logging ourselves, to protect from a torn page, if checksumming is enabled. Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log when checksumming is enabled. Analysis by Pavan Deolasee. Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>	2016-10-19 14:26:05 +03:00
Simon Riggs	d851bef2d6	Dirty replication slots when using sql interface When pg_logical_slot_get_changes(...) sets confirmed_flush_lsn to the point at which replay stopped, it doesn't dirty the replication slot. So if the replay didn't cause restart_lsn or catalog_xmin to change as well, this change will not get written out to disk. Even on a clean shutdown. If Pg crashes or restarts, a subsequent pg_logical_slot_get_changes(...) call will see the same changes already replayed since it uses the slot's confirmed_flush_lsn as the start point for fetching changes. The caller can't specify a start LSN when using the SQL interface. Mark the slot as dirty after reading changes using the SQL interface so that users won't see repeated changes after a clean shutdown. Repeated changes still occur when using the walsender interface or after an unclean shutdown. Craig Ringer	2016-09-05 09:44:38 +01:00
Simon Riggs	35250b6ad7	New recovery target recovery_target_lsn Michael Paquier	2016-09-03 17:48:01 +01:00
Tom Lane	b5bce6c1ec	Final pgindent + perltidy run for 9.6.	2016-08-15 13:42:51 -04:00
Alvaro Herrera	b26f7fa6ae	Fix assorted problems in recovery tests In test 001_stream_rep we're using pg_stat_replication.write_location to determine catch-up status, but we care about xlog having been applied not just received, so change that to apply_location. In test 003_recovery_targets, we query the database for a recovery target specification and later for the xlog position supposedly corresponding to that recovery specification. If for whatever reason more WAL is written between the two queries, the recovery specification is earlier than the xlog position used by the query in the test harness, so we wait forever, leading to test failures. Deal with this by using a single query to extract both items. In `2a0f89cd71` we tried to deal with it by giving them more tests to run, but in hindsight that was obviously doomed to failure (no revert of that, though). Per hamster buildfarm failures. Author: Michaël Paquier	2016-08-03 13:21:23 -04:00
Peter Eisentraut	f6ced51f91	Consistently capitalize names of recovery tests	2016-08-02 10:47:03 -04:00
Noah Misch	3be0a62ffe	Finish pgindent run for 9.6: Perl files.	2016-06-12 04:19:56 -04:00
Alvaro Herrera	c1543a81a7	Revert timeline following in replication slots This reverts commits `f07d18b6e9`, `82c83b3372`, `3a3b309041`, and `24c5f1a103`. This feature has shown enough immaturity that it was deemed better to rip it out before rushing some more fixes at the last minute. There are discussions on larger changes in this area for the next release.	2016-05-04 17:32:22 -03:00
Fujii Masao	36c1c91604	Make regression test for multiple synchronous standbys more stable. The regression test checks whether the output of pg_stat_replication is expected or not after changing synchronous_standby_names and reloading the configuration file. Regarding this test logic, previously there was a timing issue which made the test result unstable. That is, pg_stat_replication could return unexpected result during small window after the configuration file was reloaded before new setting value took effect, and which made the test fail. This commit changes the test logic so that it uses a loop with a timeout to give some room for the test to pass. Now the test fails only when pg_stat_replication keeps returning unexpected result for 30 seconds. Michael Paquier	2016-04-15 13:58:14 +09:00
Fujii Masao	196b72fb9a	Add regression tests for multiple synchronous standbys. Authors: Suraj Kharage, Michael Paquier, Masahiko Sawada, refactored by me Reviewed-By: Kyotaro Horiguchi	2016-04-08 16:48:53 +09:00
Alvaro Herrera	82c83b3372	Fix logical_decoding_timelines test crashes In the test_slot_timelines test module, we were abusing passing NULL values which was received as zeroes in x86, but this breaks in ARM (buildfarm member hamster) by crashing instead. Fix the breakage by marking these functions as STRICT; the InvalidXid value that was previously implicit in NULL values (on x86 at least) can now be passed as 0. Failing to follow the fmgr protocol to check for NULLs beforehand was causing ARM to fail, as evidenced by segmentation faults in buildfarm member hamster. In order to use the new functionality in the test script, use COALESCE in the right spot to avoid forwarding NULL values. This was diagnosed from the hamster crash by Craig Ringer, who also proposed a different patch (checking for NULL values explicitely in the C function code, and keeping the non-strictness in the C functions). I decided to go with this approach instead.	2016-04-01 16:47:00 -03:00
Alvaro Herrera	61608d3836	Fix recovery_min_apply_delay test Previously this test was relying too much on WAL replay to occur in the exact configured interval, which was unreliable on slow or overly busy servers. Use a custom loop instead of poll_query_until, which is hopefully more reliable. Per continued failures on buildfarm member hamster (which is probably the only one running this test suite) Author: Michaël Paquier	2016-03-31 16:06:32 -03:00
Alvaro Herrera	24c5f1a103	Enable logical slots to follow timeline switches When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines; otherwise, decoding fails after failover, because the archives are in the historical timeline. This is required to make "failover logical slots" possible; it currently has no other use, although theoretically it could be used by an extension that creates a slot on a standby and continues to replay from the slot when the standby is promoted. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Author: Craig Ringer Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek	2016-03-30 20:07:05 -03:00
Alvaro Herrera	2c83f435a3	Rework PostgresNode's psql method This makes the psql() method much more capable: it captures both stdout and stderr; it now returns the psql exit code rather than stdout; a timeout can now be specified, as can ON_ERROR_STOP behavior; it gained a new "on_error_die" (defaulting to off) parameter to raise an exception if there's any problem. Finally, additional parameters to psql can be passed if there's need for further tweaking. For convenience, a new safe_psql() method retains much of the old behavior of psql(), except that it uses on_error_die on, so that problems like syntax errors in SQL commands can be detected more easily. Many existing TAP test files now use safe_psql, which is what is really wanted. A couple of ->psql() calls are now added in the commit_ts tests, which verify that the right thing is happening on certain errors. Some ->command_fails() calls in recovery tests that were verifying that psql failed also became ->psql() calls now. Author: Craig Ringer. Some tweaks by Álvaro Herrera Reviewed-By: Michaël Paquier	2016-03-03 17:58:30 -03:00
Alvaro Herrera	5bec1ad464	Fix mistakes in recovery tests One test was relying on method remove_tree that isn't implemented in the oldest Perl we support; fix it by using the older rmtree instead. Another test had a typo in a SQL command, which isn't noticed because the PostgresNode->psql() method doesn't check that queries return correctly. That's undesirable and will also be fixed later on, but for now let's make the test actually work. Author: Craig Ringer	2016-03-03 12:51:47 -03:00
Alvaro Herrera	5847397dec	Minor tweaks for new src/test/recovery Author: Michael Paquier	2016-02-29 18:16:59 -03:00
Alvaro Herrera	74d58425c7	Apply last revision of recovery patch I applied the previous-to-last revision of Michaël's submitted patch instead of the last; these two tweaks pointed out by Craig were left out of the previous commit by accident.	2016-02-26 16:22:53 -03:00
Alvaro Herrera	49148645f7	Add a test framework for recovery This long-awaited framework is an expansion of the existing PostgresNode stuff to support additional features for recovery testing; the recovery tests included in this commit are a starting point that cover some of the recovery features we have. More scripts are expected to be added later. Author: Michaël Paquier, a bit of help from Amir Rohan Reviewed by: Amir Rohan, Stas Kelvich, Kyotaro Horiguchi, Victor Wagner, Craig Ringer, Álvaro Herrera Discussion: http://www.postgresql.org/message-id/CAB7nPqTf7V6rswrFa=q_rrWeETUWagP=h8LX8XAov2Jcxw0DRg@mail.gmail.com Discussion: http://www.postgresql.org/message-id/trinity-b4a8035d-59af-4c42-a37e-258f0f28e44a-1443795007012@3capp-mailcom-lxa08	2016-02-26 16:13:30 -03:00

1 2 3 4 5

235 Commits