postgresql

Commit Graph

Author	SHA1	Message	Date
Heikki Linnakangas	85ba0748ed	Fix bug in WAL_DEBUG. The record header was not copied correctly to the buffer that was passed to the rm_desc function. Broken by my rm_desc signature refactoring patch.	2014-06-23 12:22:36 +03:00
Heikki Linnakangas	0ef0b6784c	Change the signature of rm_desc so that it's passed a XLogRecord. Just feels more natural, and is more consistent with rm_redo.	2014-06-14 10:46:48 +03:00
Andres Freund	e04a9ccd2c	Consistency improvements for slot and decoding code. Change the order of checks in similar functions to be the same; remove a parameter that's not needed anymore; rename a memory context and expand a couple of comments. Per review comments from Amit Kapila	2014-06-12 13:33:27 +02:00
Tom Lane	5f93c37805	Add defenses against running with a wrong selection of LOBLKSIZE. It's critical that the backend's idea of LOBLKSIZE match the way data has actually been divided up in pg_largeobject. While we don't provide any direct way to adjust that value, doing so is a one-line source code change and various people have expressed interest recently in changing it. So, just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in pg_control and cross-check that the backend's compiled-in setting matches the on-disk data. Also tweak the code in inv_api.c so that fetches from pg_largeobject explicitly verify that the length of the data field is not more than LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection at all in production builds. In some of the call sites an overlength data value would translate directly to a security-relevant stack clobber, so it seems worth one extra runtime comparison to be sure. In the back branches, we can't change the contents of pg_control; but we can still make the extra checks in inv_api.c, which will offer some amount of protection against running with the wrong value of LOBLKSIZE.	2014-06-05 11:31:06 -04:00
Andres Freund	f0c108560b	Consistently spell a replication slot's name as slot_name. Previously there's been a mix between 'slotname' and 'slot_name'. It's not nice to be unneccessarily inconsistent in a new feature. As a post beta1 initdb now is required in the wake of `eeca4cd35e`, fix the inconsistencies. Most the changes won't affect usage of replication slots because the majority of changes is around function parameter names. The prominent exception to that is that the recovery.conf parameter 'primary_slotname' is now named 'primary_slot_name'.	2014-06-05 16:29:20 +02:00
Tom Lane	c1907f0cc4	Fix a bunch of functions that were declared static then defined not-static. Per testing with a compiler that whines about this.	2014-05-17 17:57:53 -04:00
Tom Lane	0d0b2bf175	Rename min_recovery_apply_delay to recovery_min_apply_delay. Per discussion, this seems like a more consistent choice of name. Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut; some additional documentation wordsmithing by me	2014-05-10 19:46:19 -04:00
Bruce Momjian	0a78320057	pgindent run for 9.4 This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.	2014-05-06 12:12:18 -04:00
Tom Lane	5035701e07	Improve generation algorithm for database system identifier. As noted some time ago, the original coding had a typo ("\|" for "^") that made the result less unique than intended. Even the intended behavior is obsolete since it was based on wanting to produce a usable value even if we didn't have int64 arithmetic --- a limitation we stopped supporting years ago. Instead, let's redefine the system identifier as tv_sec in the upper 32 bits (same as before), tv_usec in the next 20 bits, and the low 12 bits of getpid() in the remaining bits. This is still hardly guaranteed-universally-unique, but it's noticeably better than before. Per my proposal at <29019.1374535940@sss.pgh.pa.us>	2014-04-26 15:11:10 -04:00
Bruce Momjian	83defef8c7	report stat() error in trigger file check Permissions might prevent the existence of the trigger file from being checked. Per report from Andres Freund	2014-04-17 11:55:57 -04:00
Heikki Linnakangas	848b9f05ab	Use correctly-sized buffer when zero-filling a WAL file. I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is allocated a couple of weeks ago. With the default settings, they are both 8k, but they can be changed at compile-time.	2014-04-16 10:26:36 +03:00
Heikki Linnakangas	787064cd00	Fix typo in comment. Tomonari Katsumata	2014-04-10 13:11:49 +03:00
Robert Haas	59202fae04	Fix some compiler warnings that clang emits with -pedantic. Andres Freund	2014-04-04 11:29:50 -04:00
Heikki Linnakangas	d9e7873bbb	In checkpoint, move the check for in-progress xacts out of critical section. GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical section. I thought I covered this case with the exemption for the checkpointer, but CreateCheckPoint is also called from the startup process.	2014-04-04 17:31:22 +03:00
Heikki Linnakangas	877b088785	Avoid allocations in critical sections. If a palloc in a critical section fails, it becomes a PANIC.	2014-04-04 13:35:44 +03:00
Heikki Linnakangas	c2a6724823	Pass more than the first XLogRecData entry to rm_desc, with WAL_DEBUG. If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to only pass the first XLogRecData entry to the rm_desc routine. I think the original assumprion was that the first XLogRecData entry contains all the necessary information for the rm_desc routine, but that's a pretty shaky assumption. At least standby_redo didn't get the memo. To fix, piece together all the data in a temporary buffer, and pass that to the rm_desc routine. It's been like this forever, but the patch didn't apply cleanly to back-branches. Probably wouldn't be hard to fix the conflicts, but it's not worth the trouble.	2014-03-26 18:17:53 +02:00
Fujii Masao	49638868f8	Don't forget to flush XLOG_PARAMETER_CHANGE record. Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.	2014-03-26 02:12:39 +09:00
Heikki Linnakangas	3ed249b741	Fix "the the" typos. Erik Rijkers	2014-03-24 08:42:13 +02:00
Heikki Linnakangas	68a2e52bba	Replace the XLogInsert slots with regular LWLocks. The special feature the XLogInsert slots had over regular LWLocks is the insertingAt value that was updated atomically with releasing backends waiting on it. Add new functions to the LWLock API to do that, and replace the slots with LWLocks. This reduces the amount of duplicated code. (There's still some duplication, but at least it's all in lwlock.c now.) Reviewed by Andres Freund.	2014-03-21 15:10:48 +01:00
Heikki Linnakangas	59a5ab3f42	Remove rm_safe_restartpoint machinery. It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.	2014-03-18 22:10:35 +02:00
Bruce Momjian	886c0be3f6	C comments: remove odd blank lines after #ifdef WIN32 lines	2014-03-13 01:34:42 -04:00
Heikki Linnakangas	a3115f0d9e	Only WAL-log the modified portion in an UPDATE, if possible. When a row is updated, and the new tuple version is put on the same page as the old one, only WAL-log the part of the new tuple that's not identical to the old. This saves significantly on the amount of WAL that needs to be written, in the common case that most fields are not modified. Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.	2014-03-12 23:28:36 +02:00
Heikki Linnakangas	956685f82b	Do wal_level and hot standby checks when doing crash-then-archive recovery. CheckRequiredParameterValues() should perform the checks if archive recovery was requested, even if we are going to perform crash recovery first. Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive recovery mode.	2014-03-05 14:48:14 +02:00
Heikki Linnakangas	af246c37c0	Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint. When entering crash recovery followed by archive recovery, and the latest checkpoint is a shutdown checkpoint, and there are no more WAL records to replay before transitioning from crash to archive recovery, we would not immediately allow read-only connections in hot standby mode even if we could. That's because when starting from a shutdown checkpoint, we set lastReplayedEndRecPtr incorrectly to the record before the checkpoint record, instead of the checkpoint record itself. We don't run the redo routine of the shutdown checkpoint record, but starting recovery from it goes through the same motions, so it should be considered as replayed. Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected, so backpatch to 9.0.	2014-03-05 13:51:19 +02:00
Robert Haas	b89e151054	Introduce logical decoding. This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the effected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.	2014-03-03 16:32:18 -05:00
Heikki Linnakangas	d8a42b150f	Remove bogus while-loop. Commit `abf5c5c9a4` added a bogus while- statement after the for(;;)-loop. It went unnoticed in testing, because it was dead code. Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced this was also applied to 9.2, but not the bogus while-loop part, because the code in 9.2 looks quite different.	2014-02-28 13:33:41 +02:00
Heikki Linnakangas	8f09ca436d	Improve comment on setting data_checksum GUC. There was an extra space there, and "fixed" wasn't very descriptive.	2014-02-20 10:58:30 +02:00
Heikki Linnakangas	057152b37c	Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2. Amit Langote	2014-02-18 09:48:18 +02:00
Tom Lane	01824385ae	Prevent potential overruns of fixed-size buffers. Coverity identified a number of places in which it couldn't prove that a string being copied into a fixed-size buffer would fit. We believe that most, perhaps all of these are in fact safe, or are copying data that is coming from a trusted source so that any overrun is not really a security issue. Nonetheless it seems prudent to forestall any risk by using strlcpy() and similar functions. Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports. In addition, fix a potential null-pointer-dereference crash in contrib/chkpass. The crypt(3) function is defined to return NULL on failure, but chkpass.c didn't check for that before using the result. The main practical case in which this could be an issue is if libc is configured to refuse to execute unapproved hashing algorithms (e.g., "FIPS mode"). This ideally should've been a separate commit, but since it touches code adjacent to one of the buffer overrun changes, I included it in this commit to avoid last-minute merge issues. This issue was reported by Honza Horak. Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()	2014-02-17 11:20:21 -05:00
Heikki Linnakangas	4d894b41cd	Change the order that pg_xlog and WAL archive are polled for WAL segments. If there is a WAL segment with same ID but different TLI present in both the WAL archive and pg_xlog, prefer the one with higher TLI. Before this patch, the archive was polled first, for all expected TLIs, and only if no file was found was pg_xlog scanned. This was a change in behavior from 9.3, which first scanned archive and pg_xlog for the highest TLI, then archive and pg_xlog for the next highest TLI and so forth. This patch reverts the behavior back to what it was in 9.2. The reason for this is that if for example you try to do archive recovery to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is not archived yet, we would replay past the timeline switch point on timeline 1 using the archived files, before even looking timeline 2's files in pg_xlog Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior was changed.	2014-02-14 15:15:09 +02:00
Heikki Linnakangas	d699ba4134	Fix WakeupWaiters() to not wake up an exclusive locker unnecessarily. WakeupWaiters() is supposed to wake up all LW_WAIT_UNTIL_FREE waiters of the slot, but the loop incorrectly also woke up the first LW_EXCLUSIVE waiter, if there was no LW_WAIT_UNTIL_FREE waiters in the queue. Noted by Andres Freund. This code is new in 9.4, so no backpatching.	2014-02-10 15:18:18 +02:00
Robert Haas	858ec11858	Introduce replication slots. Replication slots are a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation. In a few places, we refer to the type of replication slots introduced by this patch as "physical" slots, because forthcoming patches for logical decoding will also have slots, but with somewhat different properties. Andres Freund and Robert Haas	2014-01-31 22:45:36 -05:00
Heikki Linnakangas	71c6a8e375	Add recovery_target='immediate' option. This allows ending recovery as a consistent state has been reached. Without this, there was no easy way to e.g restore an online backup, without replaying any extra WAL after the backup ended. MauMau and me.	2014-01-25 17:34:04 +02:00
Tom Lane	ac4ef637ad	Allow use of "z" flag in our printf calls, and use it where appropriate. Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me	2014-01-23 17:18:33 -05:00
Tom Lane	061b079f89	Fix multiple bugs in index page locking during hot-standby WAL replay. In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.	2014-01-14 17:35:21 -05:00
Heikki Linnakangas	c945af80cf	Refactor checking whether we've reached the recovery target. Makes the replay loop slightly more readable, by separating the concerns of whether to stop and whether to delay, and how to extract the timestamp from a record. This has the user-visible change that the timestamp of the last applied record is now updated after actually applying it. Before, it was updated just before applying it. That meant that pg_last_xact_replay_timestamp() could return the timestamp of a commit record that is in process of being replayed, but not yet applied. Normally the difference is small, but if min_recovery_apply_delay is set, there could be a significant delay between reading a record and applying it. Another behavioral change is that if you recover to a restore point, we stop after the restore point record, not before it. It makes no difference as far as running queries on the server is concerned, as applying a restore point record changes nothing, but if examine the timeline history you will see that the new timeline branched off just after the restore point record, not before it. One practical consequence is that if you do PITR to the new timeline, and set recovery target to the same named restore point again, it will find and stop recovery at the same restore point. Conceptually, I think it makes more sense to consider the restore point as part of the new timeline's history than not. In principle, setting the last-replayed timestamp before actually applying the record was a bug all along, but it doesn't seem worth the risk to backpatch, since min_recovery_apply_delay was only added in 9.4.	2014-01-09 14:00:39 +02:00
Heikki Linnakangas	3739e5ab93	Fix pause_at_recovery_target + recovery_target_inclusive combination. If pause_at_recovery_target is set, recovery pauses before applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were tring to salvage with PITR. Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.	2014-01-08 23:28:52 +02:00
Heikki Linnakangas	815d71deed	If multiple recovery_targets are specified, use the latest one. The docs say that only one of recovery_target_xid, recovery_target_time, or recovery_target_name can be specified. But the code actually did something different, so that a name overrode time, and xid overrode both time and name. Now the target specified last takes effect, whether it's an xid, time or name. With this patch, we still accept multiple recovery_target settings, even though docs say that only one can be specified. It's a general property of the recovery.conf file parser that you if you specify the same option twice, the last one takes effect, like with postgresql.conf.	2014-01-08 22:26:39 +02:00
Heikki Linnakangas	d59ff6c110	Fix bug in determining when recovery has reached consistency. When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages". Fix by initializing the variables to the REDO location instead. In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway. Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.	2014-01-08 15:03:09 +02:00
Bruce Momjian	7e04792a1c	Update copyright for 2014 Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.	2014-01-07 16:05:30 -05:00
Magnus Hagander	9544cc0d65	Move permissions check from do_pg_start_backup to pg_start_backup And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place. Reported by Antonin Houska, diagnosed by Andres Freund	2014-01-07 17:50:56 +01:00
Robert Haas	4b351841fa	Rename walLogHints to wal_log_hints for easier grepping. Michael Paquier	2014-01-01 20:17:00 -05:00
Fujii Masao	961bf59fb7	Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers. Sawada Masahiko	2013-12-21 03:33:16 +09:00
Heikki Linnakangas	dde6282500	Fix more instances of "the the" in comments. Plus one instance of "to to" in the docs.	2013-12-13 20:02:01 +02:00
Heikki Linnakangas	50e547096c	Add GUC to enable WAL-logging of hint bits, even with checksums disabled. WAL records of hint bit updates is useful to tools that want to examine which pages have been modified. In particular, this is required to make the pg_rewind tool safe (without checksums). This can also be used to test how much extra WAL-logging would occur if you enabled checksums, without actually enabling them (which you can't currently do without re-initdb'ing). Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with further changes by me.	2013-12-13 16:26:14 +02:00
Simon Riggs	36da3cfb45	Allow time delayed standbys and recovery Set min_recovery_apply_delay to force a delay in recovery apply for commit and restore point WAL records. Other records are replayed immediately. Delay is measured between WAL record time and local standby time. Robert Haas, Fabrízio de Royes Mello and Simon Riggs Detailed review by Mitsumasa Kondo	2013-12-12 10:53:20 +00:00
Tom Lane	22310b808d	Remove bogus executable permissions on xlog.c. Apparently fat-fingered in `1a3d104475`. Noted by Peter Geoghegan.	2013-12-11 22:12:25 -05:00
Robert Haas	e55704d8b2	Add new wal_level, logical, sufficient for logical decoding. When wal_level=logical, we'll log columns from the old tuple as configured by the REPLICA IDENTITY facility added in commit `07cacba983`. This makes it possible a properly-configured logical replication solution to correctly follow table updates even if they change the chosen key columns, or, with REPLICA IDENTITY FULL, even if the table has no key at all. Note that updates which do not modify the replica identity column won't log anything extra, making the choice of a good key (i.e. one that will rarely be changed) important to performance when wal_level=logical is configured. Each insert, update, or delete to a catalog table will also log the CMIN and/or CMAX values of stamped by the current transaction. This is necessary because logical decoding will require access to historical snapshots of the catalog in order to decode some data types, and the CMIN/CMAX values that we may need in order to judge row visibility may have been overwritten by the time we need them. Andres Freund, reviewed in various versions by myself, Heikki Linnakangas, KONDO Mitsumasa, and many others.	2013-12-10 19:01:40 -05:00
Alvaro Herrera	1df0122daa	Truncate pg_multixact/'s contents during crash recovery Commit `9dc842f08` of 8.2 era prevented MultiXact truncation during crash recovery, because there was no guarantee that enough state had been setup, and because it wasn't deemed to be a good idea to remove data during crash recovery anyway. Since then, due to Hot-Standby, streaming replication and PITR, the amount of time a cluster can spend doing crash recovery has increased significantly, to the point that a cluster may even never come out of it. This has made not truncating the content of pg_multixact/ not defensible anymore. To fix, take care to setup enough state for multixact truncation before crash recovery starts (easy since checkpoints contain the required information), and move the current end-of-recovery actions to a new TrimMultiXact() function, analogous to TrimCLOG(). At some later point, this should probably done similarly to the way clog.c is doing it, which is to just WAL log truncations, but we can't do that for the back branches. Back-patch to 9.0. 8.4 also has the problem, but since there's no hot standby there, it's much less pressing. In 9.2 and earlier, this patch is simpler than in newer branches, because multixact access during recovery isn't required. Add appropriate checks to make sure that's not happening. Andres Freund	2013-11-29 21:47:15 -03:00
Heikki Linnakangas	1a3d104475	Avoid acquiring spinlock when checking if recovery has finished, for speed. RecoveryIsInProgress() can be called very frequently. During normal operation, it just checks a backend-local variable and returns quickly, but during hot standby, it checks a spinlock-protected shared variable. Those spinlock acquisitions can become a point of contention on a busy hot standby system. Replace the spinlock acquisition with a memory barrier. Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.	2013-11-22 13:07:23 +02:00
Robert Haas	cacbdd7810	Use appendStringInfoString instead of appendStringInfo where possible. This shaves a few cycles, and generally seems like good programming practice. David Rowley	2013-10-31 10:55:59 -04:00
Heikki Linnakangas	5962519b36	TYPEALIGN doesn't work on int64 on 32-bit platforms. The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with values larger than intptr_t, because TYPEALIGN casts the argument to intptr_t to do the arithmetic. That's not a problem when dealing with pointers or lengths or offsets related to pointers, but the XLogInsert scaling patch added a call to MAXALIGN with an XLogRecPtr argument. To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64, which are just like the existing variants but work with uint64 instead of intptr_t. Report and patch by David Rowley, analysis by Andres Freund.	2013-10-08 01:59:57 +03:00
Heikki Linnakangas	0892ecbc01	Add a GUC to report whether data page checksums are enabled. Bernd Helmle	2013-09-16 14:36:01 +03:00
Jeff Davis	b1892aaeaa	Revert WAL posix_fallocate() patches. This reverts commit `269e780822` and commit `5b571bb8c8`. Unfortunately, the initial patch had insufficient performance testing, and resulted in a regression. Per report by Thom Brown.	2013-09-04 23:43:41 -07:00
Heikki Linnakangas	375d8526f2	Keep heavily-contended fields in XLogCtlInsert on different cache lines. Performance testing shows that if the insertpos_lck spinlock and the fields that it protects are on the same cache line with other variables that are frequently accessed, the false sharing can hurt performance a lot. Keep them apart by adding some padding.	2013-09-04 23:14:33 +03:00
Heikki Linnakangas	3619a20d33	Rename the "fast_promote" file to just "promote". This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.	2013-08-19 20:59:51 +03:00
Peter Eisentraut	072457b360	Message punctuation and pluralization fixes	2013-08-09 08:02:44 -04:00
Peter Eisentraut	626092a2e1	Message style improvements	2013-07-28 07:01:13 -04:00
Heikki Linnakangas	107cbc90a7	Fix variable names mentioned in comment to match the code. Also, in another comment, explain why holding an insertion slot is a critical section. Per review by Amit Kapila.	2013-07-17 23:32:32 +03:00
Heikki Linnakangas	59c02a36f0	Fix assert failure at end of recovery, broken by XLogInsert scaling patch. Initialization of the first XLOG buffer at end-of-recovery was broken for the case that the last read WAL record ended at a page boundary. Instead of trying to copy the last full xlog page to the buffer cache in that case, just set shared state so that the next page is initialized when the first WAL record after startup is inserted. (that's what we did in earlier version, too) To make the shared state required for that case less surprising, replace the XLogCtl->curridx variable, which was the index of the latest initialized buffer, with an XLogRecPtr of how far the buffers have been initialized. That also allows us to get rid of the XLogRecEndPtrToBufIdx macro. While we're at it, make a similar change for XLogCtl->Write.curridx, getting rid of that variable and calculating the next buffer to write from XLogCtl->LogwrtResult instead.	2013-07-17 23:12:22 +03:00
Heikki Linnakangas	f489470f8a	Fix Windows build. Was broken by my xloginsert scaling patch. XLogCtl global variable needs to be initialized in each process, as it's not inherited by fork() on Windows.	2013-07-08 17:28:48 +03:00
Heikki Linnakangas	9a20a9b21b	Improve scalability of WAL insertions. This patch replaces WALInsertLock with a number of WAL insertion slots, allowing multiple backends to insert WAL records to the WAL buffers concurrently. This is particularly useful for parallel loading large amounts of data on a system with many CPUs. This has one user-visible change: switching to a new WAL segment with pg_switch_xlog() now fills the remaining unused portion of the segment with zeros. This potentially adds some overhead, but it has been a very common practice by DBA's to clear the "tail" of the segment with an external pg_clearxlogtail utility anyway, to make the WAL files compress better. With this patch, it's no longer necessary to do that. This patch adds a new GUC, xloginsert_slots, to tune the number of WAL insertion slots. Performance testing suggests that the default, 8, works pretty well for all kinds of worklods, but I left the GUC in place to allow others with different hardware to test that easily. We might want to remove that before release. Reviewed by Andres Freund.	2013-07-08 11:23:56 +03:00
Jeff Davis	5b571bb8c8	Handle posix_fallocate() errors. On some platforms, posix_fallocate() is available but may still return EINVAL if the underlying filesystem does not support it. So, in case of an error, fall through to the alternate implementation that just writes zeros. Per buildfarm failure and analysis by Tom Lane.	2013-07-06 13:46:04 -07:00
Jeff Davis	269e780822	Use posix_fallocate() for new WAL files, where available. This function is more efficient than actually writing out zeroes to the new file, per microbenchmarks by Jon Nelson. Also, it may reduce the likelihood of WAL file fragmentation. Jon Nelson, with review by Andres Freund, Greg Smith and me.	2013-07-05 12:30:29 -07:00
Robert Haas	6bc8ef0b7f	Add new GUC, max_worker_processes, limiting number of bgworkers. In 9.3, there's no particular limit on the number of bgworkers; instead, we just count up the number that are actually registered, and use that to set MaxBackends. However, that approach causes problems for Hot Standby, which needs both MaxBackends and the size of the lock table to be the same on the standby as on the master, yet it may not be desirable to run the same bgworkers in both places. 9.3 handles that by failing to notice the problem, which will probably work fine in nearly all cases anyway, but is not theoretically sound. A further problem with simply counting the number of registered workers is that new workers can't be registered without a postmaster restart. This is inconvenient for administrators, since bouncing the postmaster causes an interruption of service. Moreover, there are a number of applications for background processes where, by necessity, the background process must be started on the fly (e.g. parallel query). While this patch doesn't actually make it possible to register new background workers after startup time, it's a necessary prerequisite. Patch by me. Review by Michael Paquier.	2013-07-04 11:24:24 -04:00
Heikki Linnakangas	79ce29c734	Retry short writes when flushing WAL. We don't normally bother retrying when the number of bytes written by write() is short of what was requested. It is generally assumed that a write() to disk doesn't return short, unless you run out of disk space. While writing the WAL, however, it seems prudent to try a bit harder, because a failure leads to PANIC. The write() is also much larger than most write()s in the backend (up to wal_buffers), so there's more room for surprises. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive.	2013-07-01 09:36:00 +03:00
Simon Riggs	1f09121b4e	Ensure no xid gaps during Hot Standby startup In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund	2013-06-23 11:05:02 +01:00
Jeff Davis	b8fd1a09f3	Add buffer_std flag to MarkBufferDirtyHint(). MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.	2013-06-17 08:02:12 -07:00
Tom Lane	c62866eeaf	Remove special-case treatment of LOG severity level in standalone mode. elog.c has historically treated LOG messages as low-priority during bootstrap and standalone operation. This has led to confusion and even masked a bug, because the normal expectation of code authors is that elog(LOG) will put something into the postmaster log, and that wasn't happening during initdb. So get rid of the special-case rule and make the priority order the same as it is in normal operation. To keep from cluttering initdb's output and the behavior of a standalone backend, tweak the severity level of three messages routinely issued by xlog.c during startup and shutdown so that they won't appear in these cases. Per my proposal back in December.	2013-06-13 23:15:15 -04:00
Noah Misch	fb435f40d5	Observe array length in HaveVirtualXIDsDelayingChkpt(). Since commit `f21bb9cfb5`, this function ignores the caller-provided length and loops until it finds a terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the previous loop control logic. In passing, revert the addition of an unused variable by the same commit, presumably a debugging relic.	2013-06-12 19:50:14 -04:00
Heikki Linnakangas	f73cb5567c	Fix typo in comment.	2013-06-06 18:27:01 +03:00
Heikki Linnakangas	e1e2bb34f1	Code review of recycling WAL segments in a restartpoint. Seems cleaner to get the currently-replayed TLI in the same call to GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the comment what the code does when recovery has already ended (RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make resetting ThisTimeLineID afterwards more explicit.	2013-06-03 09:25:12 +03:00
Stephen Frost	551938ae22	Post-pgindent cleanup Make slightly better decisions about indentation than what pgindent is capable of. Mostly breaking out long function calls into one line per argument, with a few other minor adjustments. No functional changes- all whitespace. pgindent ran cleanly (didn't change anything) after. Passes all regressions.	2013-06-01 09:38:15 -04:00
Bruce Momjian	9af4159fce	pgindent run for release 9.3 This is the first run of the Perl-based pgindent script. Also update pgindent instructions.	2013-05-29 16:58:43 -04:00
Simon Riggs	22a27ef113	After fast promotion use CHECKPOINT_FORCE Not necessary for correctness, just to make log_checkpoints output look less singular. Requested by Fujii Masao	2013-05-21 21:27:12 +01:00
Simon Riggs	75a192638f	Maintain ThisTimeLineID correctly in checkpointer checkpointer needs to reset ThisTimeLineID after a restartpoint to allow installing/recycling new WAL files. If recovery has already ended this would leave ThisTimeLineID set incorrectly and so we must reset it otherwise later checkpoints do not have the correct timeline. Bug report by Heikki Linnakangas. Further investigation by Heikki and myself.	2013-05-21 21:17:04 +01:00
Simon Riggs	d4337a0dcb	Init crash recovery using the latest available TLI This simplifies the handling of crashes after fast promotion and various minor cases that can exist in short timing windows around that case. Broad fix to bug reported by Michael Paquier on -hackers, approach prompted by Heikki Linnakangas	2013-05-19 17:31:07 +01:00
Simon Riggs	1781744cfc	Emit msg correctly for timeline-crossing crash	2013-05-19 17:00:18 +01:00
Simon Riggs	c94dff4c3c	Remove single space on end of a line in xlog.c Michael Paquier	2013-05-19 15:38:47 +01:00
Heikki Linnakangas	2ffa66f497	Fix walsender failure at promotion. If a standby server has a cascading standby server connected to it, it's possible that WAL has already been sent up to the next WAL page boundary, splitting a WAL record in the middle, when the first standby server is promoted. Don't throw an assertion failure or error in walsender if that happens. Also, fix a variant of the same bug in pg_receivexlog: if it had already received WAL on previous timeline up to a segment boundary, when the upstream standby server is promoted so that the timeline switch record falls on the previous segment, pg_receivexlog would miss the segment containing the timeline switch. To fix that, have walsender send the position of the timeline switch at end-of-streaming, in addition to the next timeline's ID. It was previously assumed that the switch happened exactly where the streaming stopped. Note: this is an incompatible change in the streaming protocol. You might get an error if you try to stream over timeline switches, if the client is running 9.3beta1 and the server is more recent. It should be fine after a reconnect, however. Reported by Fujii Masao.	2013-05-08 20:30:17 +03:00
Simon Riggs	443951748c	Record data_checksum_version in control file. The value is not used anywhere in code, but will allow future changes to the checksum version should that become necessary in the future.	2013-04-30 12:27:12 +01:00
Simon Riggs	2317a63328	Make fast promotion the default promotion mode. Continue to allow a request for synchronous checkpoints as a mechanism in case of problems.	2013-04-24 12:21:18 +01:00
Heikki Linnakangas	594041311c	Fix calculation of how many segments to retain for wal_keep_segments. KeepLogSeg function was broken when we switched to use a 64-bit int for the segment number. Per report from Jeff Janes.	2013-04-08 16:29:56 +03:00
Simon Riggs	5787c6730e	Skip extraneous locking in XLogCheckBuffer(). Heikki reported comment was wrong, so fixed code to match the comment: we only need to take additional locking precautions when we have a shared lock on the buffer.	2013-04-08 09:11:49 +01:00
Simon Riggs	47c4333189	Avoid tricky race condition recording XLOG_HINT We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors caused by concurrent hint writes to buffer while share locked. To make this work we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal path for backup blocks, which assumes the underlying buffer is exclusive locked. Resulting code completely changes layout of XLOG_HINT WAL records, but this isn't even beta code, so this is a low impact change. In passing, avoid taking WALInsertLock for full page writes on checksummed hints, remove related cruft from XLogInsert() and improve xlog_desc record for XLOG_HINT. Andres Freund Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova, review by Jeff Davis and Simon Riggs. Applied with changes from review and some comment editing.	2013-04-08 08:52:39 +01:00
Tom Lane	ce9ab88981	Make REPLICATION privilege checks test current user not authenticated user. The pg_start_backup() and pg_stop_backup() functions checked the privileges of the initially-authenticated user rather than the current user, which is wrong. For example, a user-defined index function could successfully call these functions when executed by ANALYZE within autovacuum. This could allow an attacker with valid but low-privilege database access to interfere with creation of routine backups. Reported and fixed by Noah Misch. Security: CVE-2013-1901	2013-04-01 13:09:24 -04:00
Simon Riggs	593c39d156	Revoke `bc5334d867`	2013-03-28 09:18:02 +00:00
Simon Riggs	bc5334d867	Allow external recovery_config_directory If required, recovery.conf can now be located outside of the data directory. Server needs read/write permissions on this directory.	2013-03-27 11:45:42 +00:00
Simon Riggs	96ef3b8ff1	Allow I/O reliability checks using 16-bit checksums Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.	2013-03-22 13:54:07 +00:00
Simon Riggs	bb7cc2623f	Remove PageSetTLI and rename pd_tli to pd_checksum Remove use of PageSetTLI() from all page manipulation functions and adjust README to indicate change in the way we make changes to pages. Repurpose those bytes into the pd_checksum field and explain how that works in comments about page header. Refactoring ahead of actual feature patch which would make use of the checksum field, arriving later. Jeff Davis, with comments and doc changes by Simon Riggs Direction suggested by Robert Haas; many others providing review comments.	2013-03-18 13:46:42 +00:00
Tom Lane	da5aeccf64	Move pqsignal() to libpgport. We had two copies of this function in the backend and libpq, which was already pretty bogus, but it turns out that we need it in some other programs that don't use libpq (such as pg_test_fsync). So put it where it probably should have been all along. The signal-mask-initialization support in src/backend/libpq/pqsignal.c stays where it is, though, since we only need that in the backend.	2013-03-17 12:06:42 -04:00
Heikki Linnakangas	7ccefe8610	Fix tli history file fetching, broken by the archive after crash recevery patch. If we were about to enter archive recovery after crash recovery, we scanned the archive for the latest tli history file, and set the recovery target timeline to that. However, when we actually tried to read the history file, we would not fetch the file from the archive, because we were not in archive recovery yet. To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch the file from archive if archive recovery is requested, even if we're not in archive recovery yet. Backpatch to 9.2. Mitsumasa KONDO	2013-03-07 12:33:24 +02:00
Heikki Linnakangas	6c4f6664b2	Fix thinko in previous commit. We must still initialize minRecoveryPoint if we start straight with archive recovery, e.g when recovering from a normal base backup taken with pg_start/stop_backup. Otherwise we never consider the system consistent.	2013-02-22 13:12:43 +02:00
Heikki Linnakangas	abf5c5c9a4	If recovery.conf is created after "pg_ctl stop -m i", do crash recovery. If you create a base backup using an atomic filesystem snapshot, and try to perform PITR starting from that base backup, or if you just kill a master server and create recovery.conf to put it into standby mode, we don't know how far we need to recover before reaching consistency. Normally in crash recovery, we replay all the WAL present in pg_xlog, and assume that we're consistent after that. And normally in archive recovery, minRecoveryPoint, backupEndRequired, or backupEndPoint is set in the control file, indicating how far we need to replay to reach consistency. But if the server was previously up and running normally, and you kill -9 it or take an atomic filesystem snapshot, none of those fields are set in the control file. The solution is to perform crash recovery first, replaying all the WAL in pg_xlog. After that's done, we assume that the system is consistent like in normal crash recovery, and switch to archive recovery mode after that. Per report from Kyotaro HORIGUCHI. In his scenario, recovery.conf was created after "pg_ctl stop -m i". I'm not sure we need to support that exact scenario, but we should support backing up using a filesystem snapshot, which looks identical. This issue goes back to at least 9.0, where hot standby was introduced and we started to track when consistency is reached. In 9.1 and 9.2, we would open up for hot standby too early, and queries could briefly see an inconsistent state. But 9.2 made it more visible, as we started to PANIC if we see a reference to a non-existing page during recovery, if we've already reached consistency. This is a fairly big patch, so back-patch to 9.2 only, where the issue is more visible. We can consider back-patching further after this has received some more testing in 9.2 and master.	2013-02-22 12:32:41 +02:00
Heikki Linnakangas	1bd42cd70a	Better fix for "unarchived WAL files get deleted on crash recovery" bug. Revert my earlier fix for the bug that unarchived WAL files get deleted on crash recovery, commit `c9cc7e05c6`. We create a .done file for files streamed or restored from archive, so the WAL file recycling logic used during normal operation works just as well during archive recovery. Per Fujii Masao's suggestion.	2013-02-15 19:33:31 +02:00
Heikki Linnakangas	c9cc7e05c6	Don't delete unarchived WAL files during crash recovery. Bug reported by Jehan-Guillaume (ioguix) de Rorthais. This was introduced with the change to keep WAL files restored from archive in pg_xlog, in 9.2.	2013-02-15 17:43:59 +02:00
Heikki Linnakangas	62401db45c	Support unlogged GiST index. The reason this wasn't supported before was that GiST indexes need an increasing sequence to detect concurrent page-splits. In a regular WAL- logged GiST index, the LSN of the page-split record is used for that purpose, and in a temporary index, we can get away with a backend-local counter. Neither of those methods works for an unlogged relation. To provide such an increasing sequence of numbers, create a "fake LSN" counter that is saved and restored across shutdowns. On recovery, unlogged relations are blown away, so the counter doesn't need to survive that either. Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.	2013-02-11 23:07:09 +02:00
Heikki Linnakangas	b669f416ce	Fix checkpoint after fast promotion. The intention was to request a regular online checkpoint immediately after end of recovery, when performing "fast promotion". However, because the checkpoint was requested before other backends were allowed to write WAL, the checkpointer process performed a restartpoint rather than a checkpoint. Delay the RequestCheckPoint call until after recovery has truly ended, so that you get a real checkpoint.	2013-02-11 22:22:08 +02:00
Heikki Linnakangas	7803e9327d	Include previous TLI in end-of-recovery and shutdown checkpoint records. This isn't used for anything but a sanity check at the moment, but it could be highly valuable for debugging purposes. It could also be used to recreate timeline history by traversing WAL, which seems useful.	2013-02-11 18:16:25 +02:00
Simon Riggs	072521b8c8	Rely only on checkpoint 1 at end of recovery. Searching for checkpoint 2 (previous) is not correct in all cases. Bug report from Heikki Linnakangas	2013-02-07 16:33:05 +00:00
Simon Riggs	3f0ab05233	Switch timelines if we crash soon after promotion. Previous patch to skip checkpoints at end of recovery didn't correctly perform crash recovery, fumbling the timeline switch. Now we record the minRecoveryPointTLI of the newly selected timeline, so that we crash recover to the correct timeline. Bug report from Fujii Masao, investigated by me.	2013-01-31 19:29:32 +00:00
Simon Riggs	fd4ced5230	Fast promote mode skips checkpoint at end of recovery. pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we can achieve very fast failover when the apply delay is low. Write new WAL record XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log readers. If we skip synchronous end of recovery checkpoint we request a normal spread checkpoint so that the window of re-recovery is low. Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao. Review by Heikki Linnakangas	2013-01-29 00:06:15 +00:00
Alvaro Herrera	0ac5ad5134	Improve concurrency of foreign key locking This patch introduces two additional lock modes for tuples: "SELECT FOR KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each other, in contrast with already existing "SELECT FOR SHARE" and "SELECT FOR UPDATE". UPDATE commands that do not modify the values stored in the columns that are part of the key of the tuple now grab a SELECT FOR NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently with tuple locks of the FOR KEY SHARE variety. Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this means the concurrency improvement applies to them, which is the whole point of this patch. The added tuple lock semantics require some rejiggering of the multixact module, so that the locking level that each transaction is holding can be stored alongside its Xid. Also, multixacts now need to persist across server restarts and crashes, because they can now represent not only tuple locks, but also tuple updates. This means we need more careful tracking of lifetime of pg_multixact SLRU files; since they now persist longer, we require more infrastructure to figure out when they can be removed. pg_upgrade also needs to be careful to copy pg_multixact files over from the old server to the new, or at least part of multixact.c state, depending on the versions of the old and new servers. Tuple time qualification rules (HeapTupleSatisfies routines) need to be careful not to consider tuples with the "is multi" infomask bit set as being only locked; they might need to look up MultiXact values (i.e. possibly do pg_multixact I/O) to find out the Xid that updated a tuple, whereas they previously were assured to only use information readily available from the tuple header. This is considered acceptable, because the extra I/O would involve cases that would previously cause some commands to block waiting for concurrent transactions to finish. Another important change is the fact that locking tuples that have previously been updated causes the future versions to be marked as locked, too; this is essential for correctness of foreign key checks. This causes additional WAL-logging, also (there was previously a single WAL record for a locked tuple; now there are as many as updated copies of the tuple there exist.) With all this in place, contention related to tuples being checked by foreign key rules should be much reduced. As a bonus, the old behavior that a subtransaction grabbing a stronger tuple lock than the parent (sub)transaction held on a given tuple and later aborting caused the weaker lock to be lost, has been fixed. Many new spec files were added for isolation tester framework, to ensure overall behavior is sane. There's probably room for several more tests. There were several reviewers of this patch; in particular, Noah Misch and Andres Freund spent considerable time in it. Original idea for the patch came from Simon Riggs, after a problem report by Joel Jacobson. Most code is from me, with contributions from Marti Raudsepp, Alexander Shulgin, Noah Misch and Andres Freund. This patch was discussed in several pgsql-hackers threads; the most important start at the following message-ids: AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com 1290721684-sup-3951@alvh.no-ip.org 1294953201-sup-2099@alvh.no-ip.org 1320343602-sup-2290@alvh.no-ip.org 1339690386-sup-8927@alvh.no-ip.org 4FE5FF020200002500048A3D@gw.wicourts.gov 4FEAB90A0200002500048B7D@gw.wicourts.gov	2013-01-23 12:04:59 -03:00
Heikki Linnakangas	990fe3c4ed	Fix more issues with cascading replication and timeline switches. When a standby server follows the master using WAL archive, and it chooses a new timeline (recovery_target_timeline='latest'), it only fetches the timeline history file for the chosen target timeline, not any other history files that might be missing from pg_xlog. For example, if the current timeline is 2, and we choose 4 as the new recovery target timeline, the history file for timeline 3 is not fetched, even if it's part of this server's history. That's enough for the standby itself - the history file for timeline 4 includes timeline 3 as well - but if a cascading standby server wants to recover to timeline 3, it needs the history file. To fix, when a new recovery target timeline is chosen, try to copy any missing history files from the archive to pg_xlog between the old and new target timeline. A second similar issue was with the WAL files. When a standby recovers from archive, and it reaches a segment that contains a switch to a new timeline, recovery fetches only the WAL file labelled with the new timeline's ID. The file from the new timeline contains a copy of the WAL from the old timeline up to the point where the switch happened, and recovery recovers it from the new file. But in streaming replication, walsender only tries to read it from the old timeline's file. To fix, change walsender to read it from the new file, so that it behaves the same as recovery in that sense, and doesn't try to open the possibly nonexistent file with the old timeline's ID.	2013-01-23 10:19:20 +02:00
Alvaro Herrera	8c17144c75	Fix off-by-one bug in xlog reading logic Bug reported by Michael Paquier Author: Andres Freund	2013-01-18 11:19:53 -03:00
Heikki Linnakangas	2ff6555313	Use the right timeline when beginning to stream from master. The xlogreader refactoring broke the logic to decide which timeline to start streaming from. XLogPageRead() uses the timeline history to check which timeline the requested WAL position falls into. However, after the refactoring, XLogPageRead() is always first called with the first page in the segment, to verify the segment header, and only then with the actual WAL position we're interested in. That first read of the segment's header made XLogPageRead() to always start streaming from the old timeline containing the segment header, not the timeline containing the actual record, if there was a timeline switch within the segment. I thought I fixed this yesterday, but that fix was too narrow and only fixed this for the corner-case that the timeline switch happened in the first page of the segment. To fix this more robustly, pass explicitly the position of the record we're actually interested in to XLogPageRead, and use that to decide which timeline to read from, rather than deduce it from the page and offset. Per report from Fujii Masao.	2013-01-18 11:46:49 +02:00
Heikki Linnakangas	0b6329130e	Make pg_receivexlog and pg_basebackup -X stream work across timeline switches. This mirrors the changes done earlier to the server in standby mode. When receivelog reaches the end of a timeline, as reported by the server, it fetches the timeline history file of the next timeline, and restarts streaming from the new timeline by issuing a new START_STREAMING command. When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the last segment on the old timeline. This helps you to tell apart a partial segment left in the directory because of a timeline switch, and a completed segment. If you just follow a single server, it won't make a difference, but it can be significant in more complicated scenarios where new WAL is still generated on the old timeline. This includes two small changes to the streaming replication protocol: First, when you reach the end of timeline while streaming, the server now sends the TLI of the next timeline in the server's history to the client. pg_receivexlog uses that as the next timeline, so that it doesn't need to parse the timeline history file like a standby server does. Second, when BASE_BACKUP command sends the begin and end WAL positions, it now also sends the timeline IDs corresponding the positions.	2013-01-17 20:23:00 +02:00
Heikki Linnakangas	1296d5c53c	Fix a couple of error-handling bugs in the xlogreader patch. XLogReadRecord should reset its state on every error, to make sure it re-reads the page on next call. It was inconsistent in that some errors did that, but some did not. In ReadRecord(), don't give up on an error if we're in standby mode. The loop was set up to retry, but the checks within the loop broke out of the loop on any error. Andres Freund, with some tweaking by me.	2013-01-17 19:27:04 +02:00
Alvaro Herrera	7fcbf6a405	Split out XLog reading as an independent facility This new facility can not only be used by xlog.c to carry out crash recovery, but also by external programs. By supplying a function to read XLog pages from somewhere, all the WAL reading can be used for completely different purposes. For the standard backend use, the behavior should be pretty much the same as previously. As for non-backend programs, an hypothetical pg_xlogdump program is now closer to reality, but some more backend support is still necessary. This patch was originally submitted by Andres Freund in a different form, but Heikki Linnakangas opted for and authored another design of the concept. Andres has advanced the patch since Heikki's initial version. Review and some (mostly cosmetics) changes by me.	2013-01-16 16:12:53 -03:00
Heikki Linnakangas	b0daba57bb	Tolerate timeline switches while "pg_basebackup -X fetch" is running. If you take a base backup from a standby server with "pg_basebackup -X fetch", and the timeline switches while the backup is being taken, the backup used to fail with an error "requested WAL segment %s has already been removed". This is because the server-side code that sends over the required WAL files would not construct the WAL filename with the correct timeline after a switch. Fix that by using readdir() to scan pg_xlog for all the WAL segments in the range, regardless of timeline. Also, include all timeline history files in the backup, if taken with "-X fetch". That fixes another related bug: If a timeline switch happened just before the backup was initiated in a standby, the WAL segment containing the initial checkpoint record contains WAL from the older timeline too. Recovery will not accept that without a timeline history file that lists the older timeline. Backpatch to 9.2. Versions prior to that were not affected as you could not take a base backup from a standby before 9.2.	2013-01-03 19:51:00 +02:00
Heikki Linnakangas	ee994272ca	Delay reading timeline history file until it's fetched from master. Streaming replication can fetch any missing timeline history files from the master, but recovery would read the timeline history file for the target timeline before reading the checkpoint record, and before walreceiver has had a chance to fetch it from the master. Delay reading it, and the sanity checks involving timeline history, until after reading the checkpoint record. There is at least one scenario where this makes a difference: if you take a base backup from a standby server right after a timeline switch, the WAL segment containing the initial checkpoint record will begin with an older timeline ID. Without the timeline history file, recovering that file will fail as the older timeline ID is not recognized to be an ancestor of the target timeline. If you try to recover from such a backup, using only streaming replication to fetch the WAL, this patch is required for that to work.	2013-01-03 10:41:58 +02:00
Heikki Linnakangas	d194d7a526	Fix bug in streaming replication over multiple tli switches. After receiving some WAL over streaming replication, try to open the file from the timeline we're currently recieving, not recoveryTargetTLI. They are usually the same, which is why wasn't noticed before, but you'd get an error if there have been more than one timeline switch between the current point in WAL and the recovery target.	2013-01-02 14:35:15 +02:00
Heikki Linnakangas	4ffd589f44	Fix silly typo in code, which broke the check for reaching consistency.	2013-01-02 13:44:59 +02:00
Bruce Momjian	bd61a623ac	Update copyrights for 2013 Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.	2013-01-01 17:15:01 -05:00
Heikki Linnakangas	60df192aea	Keep timeline history files restored from archive in pg_xlog. The cascading standby patch in 9.2 changed the way WAL files are treated when restored from the archive. Before, they were restored under a temporary filename, and not kept in pg_xlog, but after the patch, they were copied under pg_xlog. This is necessary for a cascading standby to find them, but it also means that if the archive goes offline and a standby is restarted, it can recover back to where it was using the files in pg_xlog. It also means that if you take an offline backup from a standby server, it includes all the required WAL files in pg_xlog. However, the same change was not made to timeline history files, so if the WAL segment containing the checkpoint record contains a timeline switch, you will still get an error if you try to restart recovery without the archive, or recover from an offline backup taken from the standby. With this patch, timeline history files restored from archive are copied into pg_xlog like WAL files are, so that pg_xlog contains all the files required to recover. This is a corner-case pre-existing issue in 9.2, but even more important in master where it's possible for a standby to follow a timeline switch through streaming replication. To make that possible, the timeline history files must be present in pg_xlog.	2012-12-30 14:29:45 +02:00
Alvaro Herrera	5ab3af46dd	Remove obsolete XLogRecPtr macros This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance. These were useful for brevity when XLogRecPtrs were split in xlogid/xrecoff; but now that they are simple uint64's, they are just clutter. The only downside to making this change would be ease of backporting patches, but that has been negated by other substantive changes to the involved code anyway. The clarity of simpler expressions makes the change worthwhile. Most of the changes are mechanical, but in a couple of places, the patch author chose to invert the operator sense, making the code flow more logical (and more in line with preceding comments). Author: Andres Freund Eyeballed by Dimitri Fontaine and Alvaro Herrera	2012-12-28 13:06:15 -03:00
Alvaro Herrera	24eca7977e	Assign InvalidXLogRecPtr instead of MemSet(0) For consistency. Author: Andres Freund	2012-12-27 18:33:03 -03:00
Peter Eisentraut	a0bfb7b36e	Fix grammatical mistake in error message	2012-12-20 23:36:13 -05:00
Heikki Linnakangas	343ee00b73	Fix recycling of WAL segments after switching timeline during recovery. This was broken before, we would recycle old WAL segments on wrong timeline after the recovery target timeline had changed, but my recent commit to not initialize ThisTimeLineID at all in a standby's checkpointer process broke this completely. The problem is that when installing a recycled WAL segment as a future one, ThisTimeLineID is used to construct the filename. To fix, always update ThisTimeLineID to the current timeline being recovered, before recycling WAL segments at a restartpoint. This still leaves a small window where we might install WAL segments under wrong timeline ID, if the timeline is changed just as we're about to start recycling. Also, even if we're replaying timeline X at the momnent, there's no guarantee that we'll need as many WAL segments on that timeline as we recycle. We might be just about to reach the point where we switch to next timeline, so might only need one more WAL segment on the current timeline. We'll live with the waste in that situation. Bug pointed out by Fujii Masao. 9.1 and 9.2 had the same issue, when recovery target timeline was changed, but I committed a slightly different version of this patch on those branches.	2012-12-20 22:00:58 +02:00
Heikki Linnakangas	af275a12df	Follow TLI of last replayed record, not recovery target TLI, in walsenders. Most of the time, the last replayed record comes from the recovery target timeline, but there is a corner case where it makes a difference. When the startup process scans for a new timeline, and decides to change recovery target timeline, there is a window where the recovery target TLI has already been bumped, but there are no WAL segments from the new timeline in pg_xlog yet. For example, if we have just replayed up to point 0/30002D8, on timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog that contains the WAL up to that point. When recovery switches recovery target timeline to 2, a walsender can immediately try to read WAL from 0/30002D8, from timeline 2, so it will try to open WAL file 000000020000000000000003. However, that doesn't exist yet - the startup process hasn't copied that file from the archive yet nor has the walreceiver streamed it yet, so walsender fails with error "requested WAL segment 000000020000000000000003 has already been removed". That's harmless, in that the standby will try to reconnect later and by that time the segment is already created, but error messages that should be ignored are not good. To fix that, have walsender track the TLI of the last replayed record, instead of the recovery target timeline. That way walsender will not try to read anything from timeline 2, until the WAL segment has been created and at least one record has been replayed from it. The recovery target timeline is now xlog.c's internal affair, it doesn't need to be exposed in shared memory anymore. This fixes the error reported by Thom Brown. depesz the same error message, but I'm not sure if this fixes his scenario.	2012-12-20 14:39:04 +02:00
Heikki Linnakangas	1a11d4609e	Don't set ThisTimeLineID in checkpointer & bgwriter during recovery. We used to set it to the current recovery target timeline, but the recovery target timeline can change during recovery, leaving ThisTimeLineID at an old value. That seems worse than always leaving it at zero to begin with. AFAICS there was no good reason to set it in the first place. ThisTimeLineID is not needed in checkpointer or bgwriter process, until it's time to write the end-of-recovery checkpoint, and at that point ThisTimeLineID is updated anyway.	2012-12-20 14:39:04 +02:00
Heikki Linnakangas	e43f947bf3	Check if we've reached end-of-backup point also if no redo is required. If you restored from a backup taken from a standby, and the last record in the backup is the checkpoint record, ie. there is no redo required except for the checkpoint record, we would fail to notice that we've reached the end-of-backup point, and the database is consistent. The result was an error "WAL ends before end of online backup". To fix, move the have-we-reached-end-of-backup check into CheckRecoveryConsistency(), which is already responsible for similar checks with minRecoveryPoint, and is called in the right places. Backpatch to 9.2, this check and bug did not exist before that.	2012-12-19 14:22:00 +02:00
Heikki Linnakangas	abfd192b1b	Allow a streaming replication standby to follow a timeline switch. Before this patch, streaming replication would refuse to start replicating if the timeline in the primary doesn't exactly match the standby. The situation where it doesn't match is when you have a master, and two standbys, and you promote one of the standbys to become new master. Promoting bumps up the timeline ID, and after that bump, the other standby would refuse to continue. There's significantly more timeline related logic in streaming replication now. First of all, when a standby connects to primary, it will ask the primary for any timeline history files that are missing from the standby. The missing files are sent using a new replication command TIMELINE_HISTORY, and stored in standby's pg_xlog directory. Using the timeline history files, the standby can follow the latest timeline present in the primary (recovery_target_timeline='latest'), just as it can follow new timelines appearing in an archive directory. START_REPLICATION now takes a TIMELINE parameter, to specify exactly which timeline to stream WAL from. This allows the standby to request the primary to send over WAL that precedes the promotion. The replication protocol is changed slightly (in a backwards-compatible way although there's little hope of streaming replication working across major versions anyway), to allow replication to stop when the end of timeline reached, putting the walsender back into accepting a replication command. Many thanks to Amit Kapila for testing and reviewing various versions of this patch.	2012-12-13 19:17:32 +02:00
Heikki Linnakangas	970fb12de1	Consistency check should compare last record replayed, not last record read. EndRecPtr is the last record that we've read, but not necessarily yet replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the last replayed record instead. This caused recovery to think it's reached consistency too early. Now that we do the check in CheckRecoveryConsistency correctly, we have to move the call of that function to after redoing a record. The current place, after reading a record but before replaying it, is wrong. In particular, if there are no more records after the one ending at minRecoveryPoint, we don't enter hot standby until one extra record is generated and read by the standby, and CheckRecoveryConsistency is called. These two bugs conspired to make the code appear to work correctly, except for the small window between reading the last record that reaches minRecoveryPoint, and replaying it. In the passing, rename recoveryLastRecPtr, which is the last record replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing with replayEndRecPtr, which is the last record read that we're about to replay. Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao. Backpatch to 9.0, where Hot Standby subtly changed the test from "minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The former works because where the test is performed, we have always read one more record than we've replayed.	2012-12-11 18:54:02 +02:00
Heikki Linnakangas	6be799664a	Fix the tracking of min recovery point timeline. Forgot to update it at the right place. Also, consider checkpoint record that switches to new timelne to be on the new timeline. This fixes erroneous "requested timeline 2 does not contain minimum recovery point" errors, pointed out by Amit Kapila while testing another patch.	2012-12-10 16:04:26 +02:00
Tom Lane	af4aba2f05	Ensure recovery pause feature doesn't pause unless users can connect. If we're not in hot standby mode, then there's no way for users to connect to reset the recoveryPause flag, so we shouldn't pause. The code was aware of this but the test to see if pausing was safe was seriously inadequate: it wasn't paying attention to reachedConsistency, and besides what it was testing was that we could legally enter hot standby, not that we have done so. Get rid of that in favor of checking LocalHotStandbyActive, which because of the coding in CheckRecoveryConsistency is tantamount to checking that we have told the postmaster to enter hot standby. Also, move the recoveryPausesHere() call that reacts to asynchronous recoveryPause requests so that it's not in the middle of application of a WAL record. I put it next to the recoveryStopsHere() call --- in future those are going to need to interact significantly, so this seems like a good waystation. Also, don't bother trying to read another WAL record if we've already decided not to continue recovery. This was no big deal when the code was written originally, but now that reading a record might entail actions like fetching an archive file, it seems a bit silly to do it like that. Per report from Jeff Janes and subsequent discussion. The pause feature needs quite a lot more work, but this gets rid of some indisputable bugs, and seems safe enough to back-patch.	2012-12-05 18:27:50 -05:00
Heikki Linnakangas	d67b06fe3e	Oops, meant to change the comment in writeTimeLineHistory.	2012-12-05 21:00:59 +02:00
Simon Riggs	6aa2e49a87	Must not reach consistency before XLOG_BACKUP_RECORD When waiting for an XLOG_BACKUP_RECORD the minRecoveryPoint will be incorrect, so we must not declare recovery as consistent before we have seen the record. Major bug allowing recovery to end too early in some cases, allowing people to see inconsistent db. This patch to HEAD and 9.2, other fix required for 9.1 and 9.0 Simon Riggs and Andres Freund, bug report by Jeff Janes	2012-12-05 13:28:03 +00:00
Heikki Linnakangas	90991c40eb	Downgrade a status message from LOG to DEBUG2. I never intended this to be anything other than a debugging aid, but forgot to change the level before committing.	2012-12-04 17:29:44 +02:00
Heikki Linnakangas	32f4de0adf	Write exact xlog position of timeline switch in the timeline history file. This allows us to do some more rigorous sanity checking for various incorrect point-in-time recovery scenarios, and provides more information for debugging purposes. It will also come handy in the upcoming patch to allow timeline switches to be replicated by streaming replication.	2012-12-04 17:29:07 +02:00
Heikki Linnakangas	5ce108bf32	Track the timeline associated with minRecoveryPoint, for more sanity checks. This allows recovery to notice certain incorrect recovery scenarios. If a server has recovered to point X on timeline 5, and you restart recovery, it better be on timeline 5 when it reaches point X again, not on some timeline with a higher ID. This can happen e.g if you a standby server is shut down, a new timeline appears in the WAL archive, and the standby server is restarted. It will try to follow the new timeline, which is wrong because some WAL on the old timeline was already replayed before shutdown. Requires an initdb (or at least pg_resetxlog), because this adds a field to the control file.	2012-12-04 11:31:00 +02:00
Andrew Dunstan	d5652e50d5	Attempt to unbreak MSVC builds broken by `f21bb9cfb5`. We can't use type uint, so use uint32.	2012-12-03 10:23:22 -05:00
Simon Riggs	f21bb9cfb5	Refactor inCommit flag into generic delayChkpt flag. Rename PGXACT->inCommit flag into delayChkpt flag, and generalise comments to allow use in other situations, such as the forthcoming potential use in checksum patch. Replace wait loop to look for VXIDs with delayChkpt set. No user visible changes, not behaviour changes at present. Simon Riggs, reviewed and rebased by Jeff Davis	2012-12-03 13:13:53 +00:00
Simon Riggs	7a764990d8	Clarify locking for PageGetLSN() in XLogCheckBuffer()	2012-12-03 12:20:31 +00:00
Heikki Linnakangas	a068c391ab	Refactor the code implementing standby-mode logic. It is now easier to see that it's a state machine, making the code easier to understand overall.	2012-12-03 12:32:44 +02:00
Tom Lane	3114cb60a1	Don't advance checkPoint.nextXid near the end of a checkpoint sequence. This reverts commit `c11130690d` in favor of actually fixing the problem: namely, that we should never have been modifying the checkpoint record's nextXid at this point to begin with. The nextXid should match the state as of the checkpoint's logical WAL position (ie the redo point), not the state as of its physical position. It's especially bogus to advance it in some wal_levels and not others. In any case there is no need for the checkpoint record to carry the same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by LogStandbySnapshot, as any replay operation will already have adopted that value as current. This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug #6291 from Daniel Farina, in that if a checkpoint were in progress at the instant of XID wraparound, the epoch bump would be lost as reported. (And, of course, these days there's at least a 50-50 chance of a checkpoint being in progress at any given instant.) Diagnosed by me and independently by Andres Freund. Back-patch to all branches supporting hot standby.	2012-12-02 15:20:41 -05:00
Simon Riggs	5c11725867	Rearrange storage of data in xl_running_xacts. Previously we stored all xids mixed together. Now we store top-level xids first, followed by all subxids. Also skip logging any subxids if the snapshot is suboverflowed, since there are potentially large numbers of them and they are not useful in that case anyway. Has value in the envisaged design for decoding of WAL. No planned effect on Hot Standby. Andres Freund, reviewed by me	2012-12-02 19:39:37 +00:00
Simon Riggs	c11130690d	XidEpoch++ if wraparound during checkpoint. If wal_level = hot_standby we update the checkpoint nextxid, though in the case where a wraparound occurred half-way through a checkpoint we would neglect updating the epoch also. Updating the nextxid is arguably the wrong thing to do, but changing that may introduce subtle bugs into hot standby startup, while updating the value doesn't cause any known bugs yet. Minimal fix now to HEAD and backbranches, wider fix later in HEAD. Bug reported in #6291 by Daniel Farina and slightly differently in Cause analysis and recommended fixes from Tom Lane and Andres Freund. Applied patch is minimal version of Andres Freund's work.	2012-12-02 14:57:44 +00:00
Simon Riggs	9f98704b82	Clarify operation of online checkpoints. Previous comments left, but were too obscure for such an important aspect of the system.	2012-12-02 13:09:55 +00:00
Alvaro Herrera	1577b46b7c	Split out rmgr rm_desc functions into their own files This is necessary (but not sufficient) to have them compilable outside of a backend environment.	2012-11-28 13:01:15 -03:00
Heikki Linnakangas	dd7353dde8	If we don't have a backup-end-location, don't claim we've reached it. This was apparently a typo, which caused recovery to think that it immediately reached the end of backup, and allowed the database to start up too early. Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.	2012-11-28 15:14:27 +02:00
Heikki Linnakangas	1f67078ea3	Add OpenTransientFile, with automatic cleanup at end-of-xact. Files opened with BasicOpenFile or PathNameOpenFile are not automatically cleaned up on error. That puts unnecessary burden on callers that only want to keep the file open for a short time. There is AllocateFile, but that returns a buffered FILE * stream, which in many cases is not the nicest API to work with. So add function called OpenTransientFile, which returns a unbuffered fd that's cleaned up like the FILE* returned by AllocateFile(). This plugs a few rare fd leaks in error cases: 1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile 2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't use OpenTransientFile here because the fd is supposed to persist over transaction boundaries. 3. lo_import/lo_export - fixed by using OpenTransientFile instead of PathNameOpenFile. In addition to plugging those leaks, this replaces many BasicOpenFile() calls with OpenTransientFile() that were not leaking, because the code meticulously closed the file on error. That wasn't strictly necessary, but IMHO it's good for robustness. The same leaks exist in older versions, but given the rarity of the issues, I'm not backpatching this. Not yet, anyway - it might be good to backpatch later, after this mechanism has had some more testing in master branch.	2012-11-27 10:25:50 +02:00
Heikki Linnakangas	24c19e6bf9	Avoid bogus "out-of-sequence timeline ID" errors in standby-mode. When startup process opens a WAL segment after replaying part of it, it validates the first page on the WAL segment, even though the page it's really interested in later in the file. As part of the validation, it checks that the TLI on the page header is >= the TLI it saw on the last page it read. If the segment contains a timeline switch, and we have already replayed it, and then re-open the WAL segment (because of streaming replication got disconnected and reconnected, for example), the TLI check will fail when the first page is validated. Fix that by relaxing the TLI check when re-opening a WAL segment. Backpatch to 9.0. Earlier versions had the same code, but before standby mode was introduced in 9.0, recovery never tried to re-read a segment after partially replaying it. Reported by Amit Kapila, while testing a new feature.	2012-11-22 11:44:44 +02:00
Heikki Linnakangas	644a0a6379	Fix archive_cleanup_command. When I moved ExecuteRecoveryCommand() from xlog.c to xlogarchive.c, I didn't realize that it's called from the checkpoint process, not the startup process. I tried to use InRedo variable to decide whether or not to attempt cleaning up the archive (must not do so before we have read the initial checkpoint record), but that variable is only valid within the startup process. Instead, let ExecuteRecoveryCommand() always clean up the archive, and add an explicit argument to RestoreArchivedFile() to say whether that's allowed or not. The caller knows better. Reported by Erik Rijkers, diagnosis by Fujii Masao. Only 9.3devel is affected.	2012-11-19 10:14:20 +02:00
Tom Lane	3bbf668de9	Fix multiple problems in WAL replay. Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.	2012-11-12 22:05:53 -05:00
Heikki Linnakangas	dbdf9679d7	Use correct text domain for translating errcontext() messages. errcontext() is typically used in an error context callback function, not within an ereport() invocation like e.g errmsg and errdetail are. That means that the message domain that the TEXTDOMAIN magic in ereport() determines is not the right one for the errcontext() calls. The message domain needs to be determined by the C file containing the errcontext() call, not the file containing the ereport() call. Fix by turning errcontext() into a macro that passes the TEXTDOMAIN to use for the errcontext message. "errcontext" was used in a few places as a variable or struct field name, I had to rename those out of the way, now that errcontext is a macro. We've had this problem all along, but this isn't doesn't seem worth backporting. It's a fairly minor issue, and turning errcontext from a function to a macro requires at least a recompile of any external code that calls errcontext().	2012-11-12 17:07:29 +02:00
Alvaro Herrera	2f1692d213	Fix erroneous choice of timeline variable, too	2012-10-31 17:05:55 -03:00
Alvaro Herrera	9b8dd7e8aa	Fix erroneous choices of segNo variables Commit `dfda6eba` (which changed segment numbers to use a single 64 bit variable instead of log/seg) introduced a couple of bogus choices of exactly which log segment number variable to use in each case. This is currently pretty harmless; in one place, the bogus number was only being used in an error message for a pretty unlikely condition (failure to fsync a WAL segment file). In the other, it was using a global variable instead of the local variable; but all callsites were passing the value of the global variable anyway. No need to backpatch because that commit is not on earlier branches.	2012-10-31 11:05:28 -03:00
Heikki Linnakangas	2d8c81ac86	Fix silly bug in previous refactoring. I extracted the refactoring patch from a larger patch that contained other changes too, but missed one unintentional change and didn't test enough...	2012-10-09 19:33:12 +03:00
Heikki Linnakangas	ff8f160bf4	Put the logic to wait for WAL in standby mode to a separate function. This is just refactoring with no user-visible effect, to make the code more readable.	2012-10-09 19:20:17 +03:00
Heikki Linnakangas	1a956481ba	Fix typo in comment, and reword it slightly while we're at it.	2012-10-04 10:35:48 +03:00
Heikki Linnakangas	93b6d78cf0	Add #includes needed on some platforms in the new files. Hopefully this makes the *BSD buildfarm animals happy.	2012-10-02 17:19:52 +03:00
Heikki Linnakangas	d5497b95f3	Split off functions related to timeline history files and XLOG archiving. This is just refactoring, to make the functions accessible outside xlog.c. A followup patch will make use of that, to allow fetching timeline history files over streaming replication.	2012-10-02 13:37:19 +03:00
Heikki Linnakangas	ab9a14e903	Fix WAL file replacement during cascading replication on Windows. When the startup process restores a WAL file from the archive, it deletes any old file with the same name and renames the new file in its place. On Windows, however, when a file is deleted, it still lingers as long as a process holds a file handle open on it. With cascading replication, a walsender process can hold the old file open, so the rename() in the startup process would fail. To fix that, rename the old file to a temporary name, to make the original file name available for reuse, before deleting the old file.	2012-09-05 18:52:12 -07:00
Tom Lane	2e0cc1f031	Fix inappropriate error messages for Hot Standby misconfiguration errors. Give the correct name of the GUC parameter being complained of. Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE, not the default INTERNAL_ERROR). Gurjeet Singh, errcode adjustment by me	2012-09-05 21:49:08 -04:00
Heikki Linnakangas	358ff99d70	Fix compiler warnings about unused variables, caused by my previous commit. Reported by Peter Eisentraut.	2012-09-04 22:07:35 -07:00
Heikki Linnakangas	c4c227477b	Fix bugs in cascading replication with recovery_target_timeline='latest' The cascading replication code assumed that the current RecoveryTargetTLI never changes, but that's not true with recovery_target_timeline='latest'. The obvious upshot of that is that RecoveryTargetTLI in shared memory needs to be protected by a lock. A less obvious consequence is that when a cascading standby is connected, and the standby switches to a new target timeline after scanning the archive, it will continue to stream WAL to the cascading standby, but from a wrong file, ie. the file of the previous timeline. For example, if the standby is currently streaming from the middle of file 000000010000000000000005, and the timeline changes, the standby will continue to stream from that file. However, the WAL on the new timeline is in file 000000020000000000000005, so the standby sends garbage from 000000010000000000000005 to the cascading standby, instead of the correct WAL from file 000000020000000000000005. This also fixes a related bug where a partial WAL segment is restored from the archive and streamed to a cascading standby. The code assumed that when a WAL segment is copied from the archive, it can immediately be fully streamed to a cascading standby. However, if the segment is only partially filled, ie. has the right size, but only N first bytes contain valid WAL, that's not safe. That can happen if a partial WAL segment is manually copied to the archive, or if a partial WAL segment is archived because a server is started up on a new timeline within that segment. The cascading standby will get confused if the WAL it received is not valid, and will get stuck until it's restarted. This patch fixes that problem by not allowing WAL restored from the archive to be streamed to a cascading standby until it's been replayed, and thus validated.	2012-09-04 19:33:21 -07:00
Tom Lane	2a2352e07d	Replace memcpy() calls in xlog.c critical sections with struct assignments. This gets rid of a dangerous-looking use of the not-volatile XLogCtl pointer in a couple of spinlock-protected sections, where the normal coding rule is that you should only access shared memory through a pointer-to-volatile. I think the risk is only hypothetical not actual, since for there to be a bug the compiler would have to move the spinlock acquire or release across the memcpy() call, which one sincerely hopes it will not. Still, it looks cleaner this way. Per comment from Daniel Farina and subsequent discussion.	2012-09-03 15:39:15 -04:00
Tom Lane	10685ec082	Avoid somewhat-theoretical overflow risks in RecordIsValid(). This improves on commit `51fed14d73` by eliminating the assumption that we can form <some pointer value> + <some offset> without overflow. The entire point of those tests is that we don't trust the offset value, so coding them in a way that could wrap around if the buffer happens to be near the top of memory doesn't seem sound. Instead, track the remaining space as a size_t variable and compare offsets against that. Also, improve comment about why we need the extra early check on xl_tot_len.	2012-08-21 18:41:52 -04:00
Heikki Linnakangas	51fed14d73	Don't get confused if a WAL partial record header has xl_tot_len == 0. If a WAL record header was split across pages, but xl_tot_len was 0, we would get confused and conclude that we had already read the whole record, and proceed to CRC check it. That can lead to a crash in RecordIsValid(), which isn't careful to not read beyond end-of-record, as defined by xl_tot_len. Add an explicit sanity check for xl_tot_len <= SizeOfXlogRecord. Also, make RecordIsValid() more robust by checking in each step that it doesn't try to access memory beyond end of record, even if a length field in the record's or a backup block's header is bogus. Per report and analysis by Tom Lane.	2012-08-20 19:58:21 +03:00
Simon Riggs	8143a56854	Fix minor bug in XLogFileRead() that accidentally worked. Cascading replication copied the incoming file into pg_xlog but didn't set path correctly, so the first attempt to open file failed causing it to loop around and look for file in pg_xlog. So the earlier coding worked, but accidentally rather than by design. Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs	2012-08-08 21:25:23 +01:00
Simon Riggs	0f04fc67f7	fsync backup_label after pg_start_backup() Dave Kerr	2012-08-07 16:19:13 +01:00
Tom Lane	4a9c30a8a1	Fix management of pendingOpsTable in auxiliary processes. mdinit() was misusing IsBootstrapProcessingMode() to decide whether to create an fsync pending-operations table in the current process. This led to creating a table not only in the startup and checkpointer processes as intended, but also in the bgwriter process, not to mention other auxiliary processes such as walwriter and walreceiver. Creation of the table in the bgwriter is fatal, because it absorbs fsync requests that should have gone to the checkpointer; instead they just sit in bgwriter local memory and are never acted on. So writes performed by the bgwriter were not being fsync'd which could result in data loss after an OS crash. I think there is no live bug with respect to walwriter and walreceiver because those never perform any writes of shared buffers; but the potential is there for future breakage in those processes too. To fix, make AuxiliaryProcessMain() export the current process's AuxProcType as a global variable, and then make mdinit() test directly for the types of aux process that should have a pendingOpsTable. Having done that, we might as well also get rid of the random bool flags such as am_walreceiver that some of the aux processes had grown. (Note that we could not have fixed the bug by examining those variables in mdinit(), because it's called from BaseInit() which is run by AuxiliaryProcessMain() before entering any of the process-type-specific code.) Back-patch to 9.2, where the problem was introduced by the split-up of bgwriter and checkpointer processes. The bogus pendingOpsTable exists in walwriter and walreceiver processes in earlier branches, but absent any evidence that it causes actual problems there, I'll leave the older branches alone.	2012-07-18 15:28:10 -04:00
Robert Haas	3cf39e6ddb	Fix a stupid bug I introduced into XLogFlush(). Commit `f11e8be3e8` broke this; it was right in Peter's original patch, but I messed it up before committing.	2012-07-02 15:33:59 -04:00
Robert Haas	3bb592bb20	Fix position of WalSndWakeupRequest call. This avoids discriminating against wal_sync_method = open_sync or open_datasync. Fujii Masao, reviewed by Andres Freund	2012-07-02 14:44:10 -04:00
Peter Eisentraut	2b44306315	Assorted message style improvements	2012-07-02 21:12:46 +03:00
Robert Haas	82cdd2df75	Work a little harder on comments for walsender wakeup patch. Per gripe from Tom Lane.	2012-07-02 11:28:53 -04:00
Robert Haas	f11e8be3e8	Make commit_delay much smarter. Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.	2012-07-02 10:26:31 -04:00
Robert Haas	f83b59997d	Make walsender more responsive. Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.	2012-07-02 09:41:01 -04:00
Heikki Linnakangas	567787f216	Validate xlog record header before enlarging the work area to store it. If the record header is garbled, we're now quite likely to notice it before we try to make a bogus memory allocation and run out of memory. That can still happen, if the xlog record is split across pages (we cannot verify the record header until reading the next page in that scenario), but this reduces the chances. An out-of-memory is treated as a corrupt record anyway, so this isn't a correctness issue, just a case of giving a better error message. Per Amit Kapila's suggestion.	2012-06-30 23:14:35 +03:00
Heikki Linnakangas	7a5c9ca93a	Initialize shared memory copy of ckptXidEpoch correctly when not in recovery. This bug was introduced by commit `20d98ab6e4`, so backpatch this to 9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi Pillessaar	2012-06-29 19:32:15 +03:00
Heikki Linnakangas	8f85667a86	Update outdated commit; xlp_rem_len field is in page header now. Spotted by Amit Kapila	2012-06-28 20:35:18 +03:00
Heikki Linnakangas	a8f97b39c7	Fix two more neglected comments, still referring to log/seg. Fujii Masao	2012-06-27 19:11:26 +03:00
Heikki Linnakangas	ec786c6c81	I neglected many comments in the log+seg -> 64-bit segno patch. Fix. Reported by Amit Kapila.	2012-06-27 17:53:53 +03:00
Heikki Linnakangas	a218e23a08	Oops. Remove stray paren. I didn't notice this on my laptop as I don't HAVE_FSYNC_WRITETHROUGH.	2012-06-24 20:03:57 +03:00
Heikki Linnakangas	0ab9d1c4b3	Replace XLogRecPtr struct with a 64-bit integer. This simplifies code that needs to do arithmetic on XLogRecPtrs. To avoid changing on-disk format of data pages, the LSN on data pages is still stored in the old format. That should keep pg_upgrade happy. However, we have XLogRecPtrs embedded in the control file, and in the structs that are sent over the replication protocol, so this changes breaks compatibility of pg_basebackup and server. I didn't do anything about this in this patch, per discussion on -hackers, the right thing to do would to be to change the replication protocol to be architecture-independent, so that you could use a newer version of pg_receivexlog, for example, against an older server version.	2012-06-24 19:19:45 +03:00
Heikki Linnakangas	061e7efb1b	Allow WAL record header to be split across pages. This saves a few bytes of WAL space, but the real motivation is to make it predictable how much WAL space a record requires, as it no longer depends on whether we need to waste the last few bytes at end of WAL page because the header doesn't fit. The total length field of WAL record, xl_tot_len, is moved to the beginning of the WAL record header, so that it is still always found on the first page where a WAL record begins. Bump WAL version number again as this is an incompatible change.	2012-06-24 18:35:56 +03:00
Heikki Linnakangas	20ba5ca64c	Move WAL continuation record information to WAL page header. The continuation record only contained one field, xl_rem_len, so it makes things simpler to just include it in the WAL page header. This wastes four bytes on pages that don't begin with a continuation from previos page, plus four bytes on every page, because of padding. The motivation of this is to make it easier to calculate how much space a WAL record needs. Before this patch, it depended on how many page boundaries the record crosses. The motivation of that, in turn, is to separate the allocation of space in the WAL from the copying of the record data to the allocated space. Keeping the calculation of space required simple helps to keep the critical section of allocating the space from WAL short. But that's not included in this patch yet. Bump WAL version number again, as this is an incompatible change.	2012-06-24 18:35:30 +03:00
Heikki Linnakangas	dfda6ebaec	Don't waste the last segment of each 4GB logical log file. The comments claimed that wasting the last segment made it easier to do calculations with XLogRecPtrs, because you don't have problems representing last-byte-position-plus-1 that way. In my experience, however, it only made things more complicated, because the there was two ways to represent the boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0, or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were picky about which representation was used. Also, use a 64-bit segment number instead of the log/seg combination, to point to a certain WAL segment. We assume that all platforms have a working 64-bit integer type nowadays. This is an incompatible change in WAL format, so bumping WAL version number.	2012-06-24 18:35:29 +03:00
Tom Lane	b8b69d8990	Revert "Reduce checkpoints and WAL traffic on low activity database server" This reverts commit `18fb9d8d21`. Per discussion, it does not seem like a good idea to allow committed changes to go un-checkpointed indefinitely, as could happen in a low-traffic server; that makes us entirely reliant on the WAL stream with no redundancy that might aid data recovery in case of disk failure. This re-introduces the original problem of hot-standby setups generating a small continuing stream of WAL traffic even when idle, but there are other ways to address that without compromising crash recovery, so we'll revisit that issue in a future release cycle.	2012-06-13 18:48:44 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Simon Riggs	2c8a4e9be2	Wake WALSender to reduce data loss at failover for async commit. WALSender now woken up after each background flush by WALwriter, avoiding multi-second replication delay for an all-async commit workload. Replication delay reduced from 7s with default settings to 200ms and often much less, allowing significantly reduced data loss at failover. Andres Freund and Simon Riggs	2012-06-07 19:22:47 +01:00
Tom Lane	acd4c7d58b	Fix an issue in recent walwriter hibernation patch. Users of asynchronous-commit mode expect there to be a guaranteed maximum delay before an async commit's WAL records get flushed to disk. The original version of the walwriter hibernation patch broke that. Add an extra shared-memory flag to allow async commits to kick the walwriter out of hibernation mode, without adding any noticeable overhead in cases where no action is needed.	2012-05-08 23:06:40 -04:00
Tom Lane	5461564a9d	Reduce idle power consumption of walwriter and checkpointer processes. This patch modifies the walwriter process so that, when it has not found anything useful to do for many consecutive wakeup cycles, it extends its sleep time to reduce the server's idle power consumption. It reverts to normal as soon as it's done any successful flushes. It's still true that during any async commit, backends check for completed, unflushed pages of WAL and signal the walwriter if there are any; so that in practice the walwriter can get awakened and returned to normal operation sooner than the sleep time might suggest. Also, improve the checkpointer so that it uses a latch and a computed delay time to not wake up at all except when it has something to do, replacing a previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for reducing the server's power consumption when idle. In passing, get rid of the dedicated latch for signaling the walwriter in favor of using its procLatch, since that comports better with possible generic signal handlers using that latch. Also, fix a pre-existing bug with failure to save/restore errno in walwriter's signal handlers. Peter Geoghegan, somewhat simplified by Tom	2012-05-08 20:03:26 -04:00
Tom Lane	809e7e21af	Converge all SQL-level statistics timing values to float8 milliseconds. This patch adjusts the core statistics views to match the decision already taken for pg_stat_statements, that values representing elapsed time should be represented as float8 and measured in milliseconds. By using float8, we are no longer tied to a specific maximum precision of timing data. (Internally, it's still microseconds, but we could now change that without needing changes at the SQL level.) The columns affected are pg_stat_bgwriter.checkpoint_write_time pg_stat_bgwriter.checkpoint_sync_time pg_stat_database.blk_read_time pg_stat_database.blk_write_time pg_stat_user_functions.total_time pg_stat_user_functions.self_time pg_stat_xact_user_functions.total_time pg_stat_xact_user_functions.self_time The first four of these are new in 9.2, so there is no compatibility issue from changing them. The others require a release note comment that they are now double precision (and can show a fractional part) rather than bigint as before; also their underlying statistics functions now match the column definitions, instead of returning bigint microseconds.	2012-04-30 14:03:33 -04:00
Robert Haas	0d2235a25b	Remove duplicate word in comment. Noted by Peter Geoghegan.	2012-04-30 13:14:46 -04:00
Robert Haas	5d4b60f2f2	Lots of doc corrections. Josh Kupershmidt	2012-04-23 22:43:09 -04:00
Peter Eisentraut	a33fcd7e79	Fix typo Kyotaro HORIGUCHI	2012-04-16 15:36:40 +03:00
Robert Haas	b736aef2ec	Publish checkpoint timing information to pg_stat_bgwriter. Greg Smith, Peter Geoghegan, and Robert Haas	2012-04-05 14:04:37 -04:00
Simon Riggs	68219aaf6b	Correct epoch of txid_current() when executed on a Hot Standby server. Initialise ckptXidEpoch from starting checkpoint and maintain the correct value as we roll forwards. This allows GetNextXidAndEpoch() to return the correct epoch when executed during recovery. Backpatch to 9.0 when the problem is first observable by a user. Bug report from Daniel Farina	2012-03-29 14:55:30 +01:00
Peter Eisentraut	e684ab5e1e	Add additional safety check against invalid backup label file It was already checking for invalid data after "BACKUP FROM", but would possibly crash if "BACKUP FROM" was missing altogether. found by Coverity	2012-03-14 22:41:50 +02:00
Heikki Linnakangas	d93f209f48	Silence warning about unused variable, when building without assertions.	2012-03-08 11:10:02 +02:00
Robert Haas	bc97c38115	Typo fix. Fujii Masao	2012-03-06 08:23:51 -05:00
Heikki Linnakangas	e587e2e3e3	Make the comments more clear on the fact that UpdateFullPageWrites() is not safe to call concurrently from multiple processes.	2012-03-06 10:45:58 +02:00
Heikki Linnakangas	7714c63829	Remove extra copies of LogwrtResult. This simplifies the code a little bit. The new rule is that to update XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas before we had two copies, one that was protected by WALWriteLock and another protected by info_lck. The code that updates them was already holding both locks, so merging the two is trivial. The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it was used in AdvanceXLInsertBuffer to update the backend-local copy, before acquiring the info_lck to read the up-to-date value. But the value of that seems dubious; at best it's saving one spinlock acquisition per completed WAL page, which is not significant compared to all the other work involved. And in practice, it's probably not saving even that much.	2012-03-06 10:18:33 +02:00
Heikki Linnakangas	3b682df326	Simplify the way changes to full_page_writes are logged. It's harmless to do full page writes even when not strictly necessary, so when turning full_page_writes on, we can set the global flag first, and then call XLogInsert. Likewise, when turning it off, we can write the WAL record first, and then clear the flag. This way XLogInsert doesn't need any special handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated enough already, so anything we can keep away from there is a good thing. Actually I don't think the atomicity of the shared memory flag matters, anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery, when there are no concurrent WAL insertions going on. But might as well make it safe, in case we allow changing full_page_writes on the fly in the future.	2012-03-06 09:48:30 +02:00
Heikki Linnakangas	1a01560cbb	Rename LWLockWaitUntilFree to LWLockAcquireOrWait. LWLockAcquireOrWait makes it more clear that the lock is acquired if it's free.	2012-02-08 09:17:13 +02:00
Tom Lane	c6d76d7c82	Add locking around WAL-replay modification of shared-memory variables. Originally, most of this code assumed that no Postgres backends could be running concurrently with it, and so no locking could be needed. That assumption fails in Hot Standby. While it's still true that Hot Standby backends should never change values like nextXid, they can examine them, and consistency is important in some cases such as when computing a snapshot. Therefore, prudence requires that WAL replay code obtain the relevant locks when modifying such variables, even though it can examine them without taking a lock. We were following that coding rule in some places but not all. This commit applies the coding rule uniformly to all updates of ShmemVariableCache and MultiXactState fields; a search of the replay routines did not find any other cases that seemed to be at risk. In addition, this commit fixes a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway. The additional locking seems to be more in the nature of future-proofing than fixing any live bug, so I am not going to back-patch it. The NEXTOID fix will be back-patched separately.	2012-02-06 12:34:10 -05:00
Tom Lane	17118825b8	Fix transient clobbering of shared buffers during WAL replay. RestoreBkpBlocks was in the habit of zeroing and refilling the target buffer; which was perfectly safe when the code was written, but is unsafe during Hot Standby operation. The reason is that we have coding rules that allow backends to continue accessing a tuple in a heap relation while holding only a pin on its buffer. Such a backend could see transiently zeroed data, if WAL replay had occasion to change other data on the page. This has been shown to be the cause of bug #6425 from Duncan Rance (who deserves kudos for developing a sufficiently-reproducible test case) as well as Bridget Frey's re-report of bug #6200. It most likely explains the original report as well, though we don't yet have confirmation of that. To fix, change the code so that only bytes that are supposed to change will change, even transiently. This actually saves cycles in RestoreBkpBlocks, since it's not writing the same bytes twice. Also fix seq_redo, which has the same disease, though it has to work a bit harder to meet the requirement. So far as I can tell, no other WAL replay routines have this type of bug. In particular, the index-related replay routines, which would certainly be broken if they had to meet the same standard, are not at risk because we do not have coding rules that allow access to an index page when not holding a buffer lock on it. Back-patch to 9.0 where Hot Standby was added.	2012-02-05 15:49:17 -05:00
Heikki Linnakangas	9b38d46d9f	Make group commit more effective. When a backend needs to flush the WAL, and someone else is already flushing the WAL, wait until it releases the WALInsertLock and check if we still need to do the flush or if the other backend already did the work for us, before acquiring WALInsertLock. This helps group commit, because when the WAL flush finishes, all the backends that were waiting for it can be woken up in one go, and the can all concurrently observe that they're done, rather than waking them up one by one in a cascading fashion. This is based on a new LWLock function, LWLockWaitUntilFree(), which has peculiar semantics. If the lock is immediately free, it grabs the lock and returns true. If it's not free, it waits until it is released, but then returns false without grabbing the lock. This is used in XLogFlush(), so that when the lock is acquired, the backend flushes the WAL, but if it's not, the backend first checks the current flush location before retrying. Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although this patch as committed ended up being very different from that.	2012-01-30 16:53:48 +02:00
Simon Riggs	8366c7803e	Allow pg_basebackup from standby node with safety checking. Base backup follows recommended procedure, plus goes to great lengths to ensure that partial page writes are avoided. Jun Ishizuka and Fujii Masao, with minor modifications	2012-01-25 18:02:04 +00:00
Simon Riggs	5530623d03	Correctly initialise shared recoveryLastRecPtr in recovery. Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao	2012-01-13 13:02:44 +00:00
Heikki Linnakangas	1b9dea04b5	Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always passed as 'true'.	2012-01-11 11:01:47 +02:00
Heikki Linnakangas	9c808f89c2	Refactor XLogInsert a bit. The rdata entries for backup blocks are now constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way.	2012-01-11 11:01:47 +02:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Simon Riggs	64233902d2	Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.	2011-12-31 13:30:26 +00:00
Tom Lane	2dd9322ba6	Move BKP_REMOVABLE bit from individual WAL records to WAL page headers. Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.	2011-12-12 16:22:14 -05:00
Heikki Linnakangas	9f0d2bdc88	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery, we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.	2011-12-09 15:21:12 +02:00
Heikki Linnakangas	1e616f6391	During recovery, if we reach consistent state and still have entries in the invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao	2011-12-02 10:49:54 +02:00
Simon Riggs	4de82f7d7c	Wakeup WALWriter as needed for asynchronous commit performance. Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.	2011-11-13 09:00:57 +00:00
Simon Riggs	a030bfa6e4	Move user functions related to WAL into xlogfuncs.c	2011-11-04 09:37:17 +00:00
Simon Riggs	750f70b0fe	Update more comments about checkpoints being done by bgwriter	2011-11-02 17:15:35 +00:00
Simon Riggs	18fb9d8d21	Reduce checkpoints and WAL traffic on low activity database server Previously, we skipped a checkpoint if no WAL had been written since last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written at least one enough WAL to switch the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.	2011-11-02 15:26:33 +00:00
Simon Riggs	9aceb6ab3c	Refactor xlog.c to create src/backend/postmaster/startup.c Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c	2011-11-02 14:25:01 +00:00
Simon Riggs	86e3364899	Derive oldestActiveXid at correct time for Hot Standby. There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop	2011-11-02 08:54:56 +00:00
Simon Riggs	f8409b39d1	Fix timing of Startup CLOG and MultiXact during Hot Standby Patch by me, bug report by Chris Redekop, analysis by Florian Pflug	2011-11-02 08:07:44 +00:00
Simon Riggs	f3ebaad45b	Comment changes to show bgwriter no longer performs checkpoints.	2011-11-01 18:48:47 +00:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are not security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	aa90e148ca	Suppress -Wunused-result warnings about write() and fwrite(). This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.	2011-10-18 21:37:51 -04:00
Tom Lane	d56b3afc03	Restructure error handling in reading of postgresql.conf. This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf. Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so. The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't. In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting. Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane	2011-10-02 16:50:04 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Alvaro Herrera	56a9ed92b6	Adjust translator comment format to xgettext expectations	2011-09-05 19:04:30 -03:00
Alvaro Herrera	b64f18c583	Mark some untranslatable messages with errmsg_internal	2011-09-05 17:48:07 -03:00
Heikki Linnakangas	1d0392b245	Fix comment about which version had BACKUP METHOD line in backup_lable, again. It was invalidated again by Fujii's patch to 9.1.	2011-08-17 12:31:23 +03:00
Heikki Linnakangas	2877c67bc2	Fix bogus comment that claimed that the new BACKUP METHOD line in backup_label was new in 9.0. Spotted by Fujii Masao.	2011-08-16 12:23:51 +03:00
Tom Lane	4dab3d5ae1	Change the autovacuum launcher to use WaitLatch instead of a poll loop. In pursuit of this (and with the expectation that WaitLatch will be needed in more places), convert the latch field that was already added to PGPROC for sync rep into a generic latch that is activated for all PGPROC-owning processes, and change many of the standard backend signal handlers to set that latch when a signal happens. This will allow WaitLatch callers to be wakened properly by these signals. In passing, fix a whole bunch of signal handlers that had been hacked to do things that might change errno, without adding the necessary save/restore logic for errno. Also make some minor fixes in unix_latch.c, and clean up bizarre and unsafe scheme for disowning the process's latch. Much of this has to be back-patched into 9.1. Peter Geoghegan, with additional work by Tom	2011-08-10 12:22:21 -04:00
Heikki Linnakangas	41f9ffd928	If backup-end record is not seen, and we reach end of recovery from a streamed backup, throw an error and refuse to start up. The restore has not finished correctly in that case and the data directory is possibly corrupt. We already errored out in case of archive recovery, but could not during crash recovery because we couldn't distinguish between the case that pg_start_backup() was called and the database then crashed (must not error, data is OK), and the case that we're restoring from a backup and not all the needed WAL was replayed (data can be corrupt). To distinguish those cases, add a line to backup_label to indicate whether the backup was taken with pg_start/stop_backup(), or by streaming (ie. pg_basebackup). This requires re-initdb, because of a new field added to the control file.	2011-08-10 09:22:49 +03:00
Tom Lane	9f17ffd866	Measure WaitLatch's timeout parameter in milliseconds, not microseconds. The original definition had the problem that timeouts exceeding about 2100 seconds couldn't be specified on 32-bit machines. Milliseconds seem like sufficient resolution, and finer grain than that would be fantasy anyway on many platforms. Back-patch to 9.1 so that this aspect of the latch API won't change between 9.1 and later releases. Peter Geoghegan	2011-08-09 18:52:29 -04:00
Simon Riggs	5286105800	Cascading replication feature for streaming log-based replication. Standby servers can now have WALSender processes, which can work with either WALReceiver or archive_commands to pass data. Fully updated docs, including new conceptual terms of sending server, upstream and downstream servers. WALSenders terminated when promote to master. Fujii Masao, review, rework and doc rewrite by Simon Riggs	2011-07-19 03:40:03 +01:00
Heikki Linnakangas	89fd72cbf2	Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism, expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.	2011-07-08 18:44:07 +03:00
Peter Eisentraut	21f1e15aaf	Unify spelling of "canceled", "canceling", "cancellation" We had previously (`af26857a27`) established the U.S. spellings as standard.	2011-06-29 09:28:46 +03:00
Simon Riggs	465883b0a2	Introduce compact WAL record for the common case of commit (non-DDL). XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes, saving considerable space for the vast majority of transaction commits. XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier. Leonardo Francalanci and Simon Riggs	2011-06-28 22:58:17 +01:00
Robert Haas	503c7305a1	Make the visibility map crash-safe. This involves two main changes from the previous behavior. First, when we set a bit in the visibility map, emit a new WAL record of type XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and the visibility map bit. Second, when inserting, updating, or deleting a tuple, we can no longer get away with clearing the visibility map bit after releasing the lock on the corresponding heap page, because an intervening crash might leave the visibility map bit set and the page-level bit clear. Making this work requires a bit of interface refactoring. In passing, a few minor but related cleanups: change the test in visibilitymap_set and visibilitymap_clear to throw an error if the wrong page (or no page) is pinned, rather than silently doing nothing; this case should never occur. Also, remove duplicate definitions of InvalidXLogRecPtr. Patch by me, review by Noah Misch.	2011-06-21 23:04:40 -04:00
Tom Lane	c2ba0121c7	Work around gcc 4.6.0 bug that breaks WAL replay. ReadRecord's habit of using both direct references to tmpRecPtr and references to *RecPtr (which is pointing at tmpRecPtr) triggers an optimization bug in gcc 4.6.0, which apparently has forgotten about aliasing rules. Avoid the compiler bug, and make the code more readable to boot, by getting rid of the direct references. Improve the comments while at it. Back-patch to all supported versions, in case they get built with 4.6.0. Tom Lane, with some cosmetic suggestions from Alex Hunsaker	2011-06-10 17:04:29 -04:00
Bruce Momjian	6560407c7d	Pgindent run before 9.1 beta2.	2011-06-09 14:32:50 -04:00
Heikki Linnakangas	a0c8514149	Shut down WAL receiver if it's still running at end of recovery. We used to just check that it's not running and PANIC if it was, but that can rightfully happen if recovery stops at recovery target.	2011-05-11 12:46:08 +03:00
Robert Haas	aea1f24c2c	recoveryStopsHere() must check the resource manager ID. Before commit `c016ce7281`, this wasn't needed, but now that multiple resource manager IDs can percolate down through here, we have to make sure we know which one we've got. Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record with an XLOG_CHECKPOINT_SHUTDOWN record. Review by Jaime Casanova	2011-04-18 08:27:19 -04:00
Heikki Linnakangas	54685b1c2b	Revert the patch to check if we've reached end-of-backup also when doing crash recovery, and throw an error if not. hubert depesz lubaczewski pointed out that that situation also happens in the crash recovery following a system crash that happens during an online backup. We might want to do something smarter in 9.1, like put the check back for backups taken with pg_basebackup, but that's for another patch.	2011-04-13 22:05:40 +03:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Tom Lane	2594cf0e8c	Revise the API for GUC variable assign hooks. The previous functions of assign hooks are now split between check hooks and assign hooks, where the former can fail but the latter shouldn't. Aside from being conceptually clearer, this approach exposes the "canonicalized" form of the variable value to guc.c without having to do an actual assignment. And that lets us fix the problem recently noted by Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus log messages about "parameter "wal_buffers" cannot be changed without restarting the server". There may be some speed advantage too, because this design lets hook functions avoid re-parsing variable values when restoring a previous state after a rollback (they can store a pre-parsed representation of the value instead). This patch also resolves a longstanding annoyance about custom error messages from variable assign hooks: they should modify, not appear separately from, guc.c's own message about "invalid parameter value".	2011-04-07 00:12:02 -04:00
Heikki Linnakangas	1f0bab8494	Improve error message when WAL ends before reaching end of online backup.	2011-03-31 10:09:49 +03:00
Heikki Linnakangas	acf4740132	Check that we've reached end-of-backup also when we're not performing archive recovery. It's possible to restore an online backup without recovery.conf, by simply copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that too. That's the use case where this cross-check is useful. Backpatch to 9.0. We used to do this in earlier versins, but in 9.0 the code was inadvertently changed so that the check is only performed after archive recovery. Fujii Masao.	2011-03-30 10:53:28 +03:00
Simon Riggs	b5f2f2a712	Minor changes to recovery pause behaviour. Change location LOG message so it works each time we pause, not just for final pause. Ensure that we pause only if we are in Hot Standby and can connect to allow us to run resume function. This change supercedes the code to override parameter recoveryPauseAtTarget to false if not attempting to enter Hot Standby, which is now removed.	2011-03-23 19:35:53 +00:00
Simon Riggs	b98ac467f5	Prevent intermittent hang in recovery from bgwriter interaction. Startup process waited for cleanup lock but when hot_standby = off the pid was not registered, so that the bgwriter would not wake the waiting process as intended.	2011-03-23 13:30:05 +00:00
Heikki Linnakangas	6d8096e2f3	When two base backups are started at the same time with pg_basebackup, ensure that they use different checkpoints as the starting point. We use the checkpoint redo location as a unique identifier for the base backup in the end-of-backup record, and in the backup history file name. Bug spotted by Fujii Masao.	2011-03-21 11:25:25 +02:00
Robert Haas	777e8c0015	Remove bogus semicolons in recoveryPausesHere. Without this, the startup process goes into a tight loop, consuming 100% of one CPU and failing to respond to interrupts.	2011-03-18 08:09:09 -04:00
Bruce Momjian	5ca543fb2e	Clarify C comment that O_SYNC/O_FSYNC are really the same settting, as opposed to O_DSYNC.	2011-03-10 20:02:52 -05:00
Robert Haas	d16e290a8a	Emit a LOG message when pausing at the recovery target. Fujii Masao	2011-03-10 14:37:14 -05:00
Heikki Linnakangas	4cd3fb6e12	Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer than doing it aggressively whenever the tail-XID pointer is advanced, because this way we don't need to do it while holding SerializableXactHashLock. This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an obsolete comment spotted by Kevin Grittner.	2011-03-08 12:12:54 +02:00
Heikki Linnakangas	1a4ab9ec23	If recovery_target_timeline is set to 'latest' and standby mode is enabled, periodically rescan the archive for new timelines, while waiting for new WAL segments to arrive. This allows you to set up a standby server that follows the TLI change if another standby server is promoted to master. Before this, you had to restart the standby server to make it notice the new timeline. This patch only scans the archive for TLI changes, it won't follow a TLI change in streaming replication. That is much needed too, but it would be a much bigger patch than I dare to sneak in this late in the release cycle. There was discussion on improving the sanity checking of the WAL segments so that the system would notice more reliably if the new timeline isn't an ancestor of the current one, but that is not included in this patch. Reviewed by Fujii Masao.	2011-03-07 21:14:47 +02:00

... 3 4 5 6 7 ...

916 Commits