postgresql

Commit Graph

Author	SHA1	Message	Date
Simon Riggs	8366c7803e	Allow pg_basebackup from standby node with safety checking. Base backup follows recommended procedure, plus goes to great lengths to ensure that partial page writes are avoided. Jun Ishizuka and Fujii Masao, with minor modifications	2012-01-25 18:02:04 +00:00
Simon Riggs	5530623d03	Correctly initialise shared recoveryLastRecPtr in recovery. Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao	2012-01-13 13:02:44 +00:00
Heikki Linnakangas	1b9dea04b5	Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always passed as 'true'.	2012-01-11 11:01:47 +02:00
Heikki Linnakangas	9c808f89c2	Refactor XLogInsert a bit. The rdata entries for backup blocks are now constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way.	2012-01-11 11:01:47 +02:00
Robert Haas	33aaa139e6	Make the number of CLOG buffers adaptive, based on shared_buffers. Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available. Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.	2012-01-06 14:32:18 -05:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Simon Riggs	64233902d2	Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.	2011-12-31 13:30:26 +00:00
Tom Lane	d0024cd188	Avoid crashing when we have problems unlinking files post-commit. smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit `3396000684`, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.	2011-12-20 15:00:36 -05:00
Tom Lane	dd45d3ad33	Fix some long-obsolete references to XLogOpenRelation. These were missed in commit `a213f1ee6c`, which removed that function.	2011-12-17 18:26:52 -05:00
Tom Lane	8daeb5ddd6	Add SP-GiST (space-partitioned GiST) index access method. SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable. Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane	2011-12-17 16:42:30 -05:00
Tom Lane	2dd9322ba6	Move BKP_REMOVABLE bit from individual WAL records to WAL page headers. Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.	2011-12-12 16:22:14 -05:00
Heikki Linnakangas	9f0d2bdc88	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery, we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.	2011-12-09 15:21:12 +02:00
Heikki Linnakangas	1e616f6391	During recovery, if we reach consistent state and still have entries in the invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao	2011-12-02 10:49:54 +02:00
Robert Haas	ed0b409d22	Move "hot" members of PGPROC into a separate PGXACT array. This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas	2011-11-25 08:02:10 -05:00
Simon Riggs	4de82f7d7c	Wakeup WALWriter as needed for asynchronous commit performance. Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.	2011-11-13 09:00:57 +00:00
Simon Riggs	a030bfa6e4	Move user functions related to WAL into xlogfuncs.c	2011-11-04 09:37:17 +00:00
Simon Riggs	750f70b0fe	Update more comments about checkpoints being done by bgwriter	2011-11-02 17:15:35 +00:00
Simon Riggs	18fb9d8d21	Reduce checkpoints and WAL traffic on low activity database server. Previously, we skipped a checkpoint if no WAL had been written since last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written enough WAL to switch to the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.	2011-11-02 15:26:33 +00:00
Simon Riggs	9aceb6ab3c	Refactor xlog.c to create src/backend/postmaster/startup.c. Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c	2011-11-02 14:25:01 +00:00
Simon Riggs	86e3364899	Derive oldestActiveXid at correct time for Hot Standby. There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop	2011-11-02 08:54:56 +00:00
Simon Riggs	f8409b39d1	Fix timing of Startup CLOG and MultiXact during Hot Standby. Patch by me, bug report by Chris Redekop, analysis by Florian Pflug	2011-11-02 08:07:44 +00:00
Simon Riggs	f3ebaad45b	Comment changes to show bgwriter no longer performs checkpoints.	2011-11-01 18:48:47 +00:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are not security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	aa90e148ca	Suppress -Wunused-result warnings about write() and fwrite(). This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.	2011-10-18 21:37:51 -04:00
Tom Lane	fa56a0c3e0	Fix uninitialized-variable bug.	2011-10-04 17:08:18 -04:00
Alvaro Herrera	09e196e453	Use callbacks in SlruScanDirectory for the actual action. Previously, the code assumed that the only possible action to take was to delete files behind a certain cutoff point. The async notify code was already a crock: it used a different "pagePrecedes" function for truncation than for regular operation. By allowing it to pass a callback to SlruScanDirectory it can do cleanly exactly what it needs to do. The clog.c code also had its own use for SlruScanDirectory, which is made a bit simpler with this.	2011-10-04 14:03:23 -03:00
Tom Lane	d56b3afc03	Restructure error handling in reading of postgresql.conf. This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf. Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so. The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't. In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting. Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane	2011-10-02 16:50:04 -04:00
Tom Lane	57eb009092	Allow snapshot references to still work during transaction abort. In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi. While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too. I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.	2011-09-26 22:25:28 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Simon Riggs	df383b03e6	Partially revoke attempt to improve performance with many savepoints. Maintain difference between subtransaction release and commit introduced by earlier patch.	2011-09-07 12:11:26 +01:00
Alvaro Herrera	56a9ed92b6	Adjust translator comment format to xgettext expectations	2011-09-05 19:04:30 -03:00
Alvaro Herrera	b64f18c583	Mark some untranslatable messages with errmsg_internal	2011-09-05 17:48:07 -03:00
Tom Lane	1609797c25	Clean up the #include mess a little. walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again. This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.	2011-09-04 01:13:16 -04:00
Peter Eisentraut	f1e4f3d44f	Whitespace adjustment for consistency in the file	2011-09-03 01:28:05 +03:00
Bruce Momjian	6416a82a62	Remove unnecessary #include references, per pgrminclude script.	2011-09-01 10:04:27 -04:00
Robert Haas	eab2ef6164	Remove some tabs from README file. Some of the ASCII art expected 8-space tab stops, and some of it expected 4-space tab stops. Per report from YAMAMOTO Takashi.	2011-08-29 22:26:29 -04:00
Bruce Momjian	f261deb4b4	Add missing includes after pgrminclude run.	2011-08-26 18:15:14 -04:00
Heikki Linnakangas	1d0392b245	Fix comment about which version had BACKUP METHOD line in backup_label, again. It was invalidated again by Fujii's patch to 9.1.	2011-08-17 12:31:23 +03:00
Tom Lane	2ada6779c5	Fix race condition in relcache init file invalidation. The previous code tried to synchronize by unlinking the init file twice, but that doesn't actually work: it leaves a window wherein a third process could read the already-stale init file but miss the SI messages that would tell it the data is stale. The result would be bizarre failures in catalog accesses, typically "could not read block 0 in file ..." later during startup. Instead, hold RelCacheInitLock across both the unlink and the sending of the SI messages. This is more straightforward, and might even be a bit faster since only one unlink call is needed. This has been wrong since it was put in (in 2002!), so back-patch to all supported releases.	2011-08-16 13:11:54 -04:00
Heikki Linnakangas	2877c67bc2	Fix bogus comment that claimed that the new BACKUP METHOD line in backup_label was new in 9.0. Spotted by Fujii Masao.	2011-08-16 12:23:51 +03:00
Tom Lane	4dab3d5ae1	Change the autovacuum launcher to use WaitLatch instead of a poll loop. In pursuit of this (and with the expectation that WaitLatch will be needed in more places), convert the latch field that was already added to PGPROC for sync rep into a generic latch that is activated for all PGPROC-owning processes, and change many of the standard backend signal handlers to set that latch when a signal happens. This will allow WaitLatch callers to be wakened properly by these signals. In passing, fix a whole bunch of signal handlers that had been hacked to do things that might change errno, without adding the necessary save/restore logic for errno. Also make some minor fixes in unix_latch.c, and clean up bizarre and unsafe scheme for disowning the process's latch. Much of this has to be back-patched into 9.1. Peter Geoghegan, with additional work by Tom	2011-08-10 12:22:21 -04:00
Heikki Linnakangas	41f9ffd928	If backup-end record is not seen, and we reach end of recovery from a streamed backup, throw an error and refuse to start up. The restore has not finished correctly in that case and the data directory is possibly corrupt. We already errored out in case of archive recovery, but could not during crash recovery because we couldn't distinguish between the case that pg_start_backup() was called and the database then crashed (must not error, data is OK), and the case that we're restoring from a backup and not all the needed WAL was replayed (data can be corrupt). To distinguish those cases, add a line to backup_label to indicate whether the backup was taken with pg_start/stop_backup(), or by streaming (ie. pg_basebackup). This requires re-initdb, because of a new field added to the control file.	2011-08-10 09:22:49 +03:00
Tom Lane	9f17ffd866	Measure WaitLatch's timeout parameter in milliseconds, not microseconds. The original definition had the problem that timeouts exceeding about 2100 seconds couldn't be specified on 32-bit machines. Milliseconds seem like sufficient resolution, and finer grain than that would be fantasy anyway on many platforms. Back-patch to 9.1 so that this aspect of the latch API won't change between 9.1 and later releases. Peter Geoghegan	2011-08-09 18:52:29 -04:00
Simon Riggs	7cb7122800	Remove O(N^2) performance issue with multiple SAVEPOINTs. Subtransaction locks now released en masse at main commit, rather than repeatedly re-scanning for locks as we ascend the nested transaction tree. Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT and TBLOCK_SUBRELEASE to allow the commit path to be optimised using the existing code in ResourceOwnerRelease() which appears to have been intended for this usage, judging from comments therein.	2011-07-19 17:21:24 +01:00
Simon Riggs	5286105800	Cascading replication feature for streaming log-based replication. Standby servers can now have WALSender processes, which can work with either WALReceiver or archive_commands to pass data. Fully updated docs, including new conceptual terms of sending server, upstream and downstream servers. WALSenders are terminated when the standby is promoted to master. Fujii Masao, review, rework and doc rewrite by Simon Riggs	2011-07-19 03:40:03 +01:00
Heikki Linnakangas	89fd72cbf2	Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism, expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.	2011-07-08 18:44:07 +03:00
Peter Eisentraut	21f1e15aaf	Unify spelling of "canceled", "canceling", "cancellation". We had previously (`af26857a27`) established the U.S. spellings as standard.	2011-06-29 09:28:46 +03:00
Simon Riggs	465883b0a2	Introduce compact WAL record for the common case of commit (non-DDL). XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes, saving considerable space for the vast majority of transaction commits. XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier. Leonardo Francalanci and Simon Riggs	2011-06-28 22:58:17 +01:00
Robert Haas	503c7305a1	Make the visibility map crash-safe. This involves two main changes from the previous behavior. First, when we set a bit in the visibility map, emit a new WAL record of type XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and the visibility map bit. Second, when inserting, updating, or deleting a tuple, we can no longer get away with clearing the visibility map bit after releasing the lock on the corresponding heap page, because an intervening crash might leave the visibility map bit set and the page-level bit clear. Making this work requires a bit of interface refactoring. In passing, a few minor but related cleanups: change the test in visibilitymap_set and visibilitymap_clear to throw an error if the wrong page (or no page) is pinned, rather than silently doing nothing; this case should never occur. Also, remove duplicate definitions of InvalidXLogRecPtr. Patch by me, review by Noah Misch.	2011-06-21 23:04:40 -04:00
Heikki Linnakangas	cb94db91b2	pgindent run of recent SSI changes. Also, remove an unnecessary #include. Kevin Grittner	2011-06-16 16:17:22 +03:00
Heikki Linnakangas	85ea93384a	Oops, forgot to change the order of entries in 2PC callback arrays when I renumbered the resource managers. This should fix the buildfarm.	2011-06-14 15:16:36 +03:00
Tom Lane	c2ba0121c7	Work around gcc 4.6.0 bug that breaks WAL replay. ReadRecord's habit of using both direct references to tmpRecPtr and references to *RecPtr (which is pointing at tmpRecPtr) triggers an optimization bug in gcc 4.6.0, which apparently has forgotten about aliasing rules. Avoid the compiler bug, and make the code more readable to boot, by getting rid of the direct references. Improve the comments while at it. Back-patch to all supported versions, in case they get built with 4.6.0. Tom Lane, with some cosmetic suggestions from Alex Hunsaker	2011-06-10 17:04:29 -04:00
Bruce Momjian	6560407c7d	Pgindent run before 9.1 beta2.	2011-06-09 14:32:50 -04:00
Alvaro Herrera	c6eb5740b3	Fix assorted typos	2011-05-12 08:52:56 -04:00
Heikki Linnakangas	a0c8514149	Shut down WAL receiver if it's still running at end of recovery. We used to just check that it's not running and PANIC if it was, but that can rightfully happen if recovery stops at recovery target.	2011-05-11 12:46:08 +03:00
Tom Lane	d2088ae949	Move RegisterPredicateLockingXid() call to a safer place. The SSI patch inserted a call of RegisterPredicateLockingXid into GetNewTransactionId, which was a bad idea on a couple of grounds. First, it's not necessary to hold XidGenLock while manipulating that shared memory, and doing so is bad because XidGenLock is a high-contention lock that should be held for as short a time as possible. (Not to mention that it adds an entirely unnecessary deadlock hazard, since we must take SerializableXactHashLock as well.) Second, the specific place where it was put was between extending CLOG and advancing nextXid, which could result in unpleasant behavior in case of a failure there. Pull the call out to AssignTransactionId, which is much safer and arguably better from a modularity standpoint too. There is more work to do to clean up the failure-before-advancing-nextXid issue, but that is a separate change that will need to be back-patched. So for the moment I just want to make GetNewTransactionId look the same as it did in prior versions.	2011-05-06 12:57:28 -04:00
Robert Haas	aea1f24c2c	recoveryStopsHere() must check the resource manager ID. Before commit `c016ce7281`, this wasn't needed, but now that multiple resource manager IDs can percolate down through here, we have to make sure we know which one we've got. Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record with an XLOG_CHECKPOINT_SHUTDOWN record. Review by Jaime Casanova	2011-04-18 08:27:19 -04:00
Heikki Linnakangas	54685b1c2b	Revert the patch to check if we've reached end-of-backup also when doing crash recovery, and throw an error if not. hubert depesz lubaczewski pointed out that that situation also happens in the crash recovery following a system crash that happens during an online backup. We might want to do something smarter in 9.1, like put the check back for backups taken with pg_basebackup, but that's for another patch.	2011-04-13 22:05:40 +03:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Tom Lane	2594cf0e8c	Revise the API for GUC variable assign hooks. The previous functions of assign hooks are now split between check hooks and assign hooks, where the former can fail but the latter shouldn't. Aside from being conceptually clearer, this approach exposes the "canonicalized" form of the variable value to guc.c without having to do an actual assignment. And that lets us fix the problem recently noted by Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus log messages about "parameter "wal_buffers" cannot be changed without restarting the server". There may be some speed advantage too, because this design lets hook functions avoid re-parsing variable values when restoring a previous state after a rollback (they can store a pre-parsed representation of the value instead). This patch also resolves a longstanding annoyance about custom error messages from variable assign hooks: they should modify, not appear separately from, guc.c's own message about "invalid parameter value".	2011-04-07 00:12:02 -04:00
Simon Riggs	88f32b7ca2	Avoid assuming there will be only 3 states for synchronous_commit. Also avoid hardcoding the current default state by giving it the name "on" and replace with a meaningful name that reflects its behaviour. Coding only, no change in behaviour.	2011-04-04 23:23:13 +01:00
Robert Haas	240067b3b0	Merge synchronous_replication setting into synchronous_commit. This means one less thing to configure when setting up synchronous replication, and also avoids some ambiguity around what the behavior should be when the settings of these variables conflict. Fujii Masao, with additional hacking by me.	2011-04-04 16:25:52 -04:00
Heikki Linnakangas	1f0bab8494	Improve error message when WAL ends before reaching end of online backup.	2011-03-31 10:09:49 +03:00
Heikki Linnakangas	acf4740132	Check that we've reached end-of-backup also when we're not performing archive recovery. It's possible to restore an online backup without recovery.conf, by simply copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that too. That's the use case where this cross-check is useful. Backpatch to 9.0. We used to do this in earlier versions, but in 9.0 the code was inadvertently changed so that the check is only performed after archive recovery. Fujii Masao.	2011-03-30 10:53:28 +03:00
Simon Riggs	b5f2f2a712	Minor changes to recovery pause behaviour. Change location LOG message so it works each time we pause, not just for final pause. Ensure that we pause only if we are in Hot Standby and can connect to allow us to run resume function. This change supersedes the code to override parameter recoveryPauseAtTarget to false if not attempting to enter Hot Standby, which is now removed.	2011-03-23 19:35:53 +00:00
Simon Riggs	b98ac467f5	Prevent intermittent hang in recovery from bgwriter interaction. Startup process waited for cleanup lock but when hot_standby = off the pid was not registered, so that the bgwriter would not wake the waiting process as intended.	2011-03-23 13:30:05 +00:00
Heikki Linnakangas	6d8096e2f3	When two base backups are started at the same time with pg_basebackup, ensure that they use different checkpoints as the starting point. We use the checkpoint redo location as a unique identifier for the base backup in the end-of-backup record, and in the backup history file name. Bug spotted by Fujii Masao.	2011-03-21 11:25:25 +02:00
Robert Haas	777e8c0015	Remove bogus semicolons in recoveryPausesHere. Without this, the startup process goes into a tight loop, consuming 100% of one CPU and failing to respond to interrupts.	2011-03-18 08:09:09 -04:00
Robert Haas	84abea76f6	Add pause_at_recovery_target to recovery.conf.sample; improve docs. Fujii Masao, but with the proposed behavior change reverted, and the rest adjusted accordingly.	2011-03-17 14:04:11 -04:00
Bruce Momjian	5ca543fb2e	Clarify C comment that O_SYNC/O_FSYNC are really the same setting, as opposed to O_DSYNC.	2011-03-10 20:02:52 -05:00
Robert Haas	d16e290a8a	Emit a LOG message when pausing at the recovery target. Fujii Masao	2011-03-10 14:37:14 -05:00
Heikki Linnakangas	4cd3fb6e12	Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer than doing it aggressively whenever the tail-XID pointer is advanced, because this way we don't need to do it while holding SerializableXactHashLock. This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an obsolete comment spotted by Kevin Grittner.	2011-03-08 12:12:54 +02:00
Heikki Linnakangas	1a4ab9ec23	If recovery_target_timeline is set to 'latest' and standby mode is enabled, periodically rescan the archive for new timelines, while waiting for new WAL segments to arrive. This allows you to set up a standby server that follows the TLI change if another standby server is promoted to master. Before this, you had to restart the standby server to make it notice the new timeline. This patch only scans the archive for TLI changes, it won't follow a TLI change in streaming replication. That is much needed too, but it would be a much bigger patch than I dare to sneak in this late in the release cycle. There was discussion on improving the sanity checking of the WAL segments so that the system would notice more reliably if the new timeline isn't an ancestor of the current one, but that is not included in this patch. Reviewed by Fujii Masao.	2011-03-07 21:14:47 +02:00
Simon Riggs	a8a8a3e096	Efficient transaction-controlled synchronous replication. If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to takeover in case of standby failure. This version has very strict behaviour; more relaxed options may be added at a later date. Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime Casanova, Heikki Linnakangas and Robert Haas, plus the assistance of many other design reviewers.	2011-03-06 22:49:16 +00:00
Tom Lane	a874fe7b4c	Refactor the executor's API to support data-modifying CTEs better. The originally committed patch for modifying CTEs didn't interact well with EXPLAIN, as noted by myself, and also had corner-case problems with triggers, as noted by Dean Rasheed. Those problems show it is really not practical for ExecutorEnd to call any user-defined code; so split the cleanup duties out into a new function ExecutorFinish, which must be called between the last ExecutorRun call and ExecutorEnd. Some Asserts have been added to these functions to help verify correct usage. It is no longer necessary for callers of the executor to call AfterTriggerBeginQuery/AfterTriggerEndQuery for themselves, as this is now done by ExecutorStart/ExecutorFinish respectively. If you really need to suppress that and do it for yourself, pass EXEC_FLAG_SKIP_TRIGGERS to ExecutorStart. Also, refactor portal commit processing to allow for the possibility that PortalDrop will invoke user-defined code. I think this is not actually necessary just yet, since the portal-execution-strategy logic forces any non-pure-SELECT query to be run to completion before we will consider committing. But it seems like good future-proofing.	2011-02-27 13:44:12 -05:00
Robert Haas	79ad8fc5f8	Named restore point improvements. Emit a log message when creating a named restore point, and improve documentation for pg_create_restore_point(). Euler Taveira de Oliveira, per suggestions from Thom Brown, with some additional wordsmithing by me.	2011-02-24 19:02:00 -05:00
Simon Riggs	bca8b7f16a	Hot Standby feedback for avoidance of cleanup conflicts on standby. Standby optionally sends back information about oldestXmin of queries which is then checked and applied to the WALSender's proc->xmin. GetOldestXmin() is modified slightly to agree with GetSnapshotData(), so that all backends on primary include WALSender within their snapshots. Note this does nothing to change the snapshot xmin on either master or standby. Feedback piggybacks on the standby reply message. vacuum_defer_cleanup_age is no longer used on standby, though parameter still exists on primary, since some use cases still exist. Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas	2011-02-16 19:29:37 +00:00
Robert Haas	4695da5ae9	pg_ctl promote. Fujii Masao, reviewed by Robert Haas, Stephen Frost, and Magnus Hagander.	2011-02-15 21:30:23 -05:00
Simon Riggs	5c588be729	PITR can stop at a named restore point when recovery target = time though must not update the last transaction timestamp. Plus comment and message cleanup for recent named restore point. Fujii Masao, minor changes by me	2011-02-15 00:51:39 +00:00
Heikki Linnakangas	b186523fd9	Send status updates back from standby server to master, indicating how far the standby has written, flushed, and applied the WAL. At the moment, this is for informational purposes only, the values are only shown in pg_stat_replication system view, but in the future they will also be needed for synchronous replication. Extracted from Simon Riggs' synchronous replication patch by Robert Haas, with some tweaking by me.	2011-02-10 21:04:02 +02:00
Magnus Hagander	3144c33a2f	Implement NOWAIT option for BASE_BACKUP command. Specifying this option makes the server not wait for the xlog to be archived, or emit a warning that it can't, instead leaving the responsibility with the client. This is useful when the log is being streamed using the streaming protocol in parallel with the backup, without having log archiving enabled.	2011-02-09 10:59:53 +01:00
Simon Riggs	c016ce7281	Named restore points in recovery. Users can record named points, then new recovery.conf parameter recovery_target_name allows PITR to specify named points as recovery targets. Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits	2011-02-08 19:39:08 +00:00
Simon Riggs	8c6e3adbf7	Basic Recovery Control functions for use in Hot Standby. Pause, Resume, Status check functions only. Also, new recovery.conf parameter to pause_at_recovery_target, default on. Simon Riggs, reviewed by Fujii Masao	2011-02-08 18:30:22 +00:00
Simon Riggs	faa0550572	Remove rare corner case for data loss when triggering standby server. If the standby was streaming when trigger file arrives, check also in the archive for additional WAL files. This is a corner case since it is unlikely that we would trigger a failover while the master is still available and sending data to standby, while at the same time running in archive mode and also while the streaming standby has fallen behind archive. Someone would eventually be unlucky; we must plug all gaps however small. Fujii Masao	2011-02-08 14:38:02 +00:00
Heikki Linnakangas	dafaa3efb7	Implement genuine serializable isolation level. Until now, our Serializable mode has in fact been what's called Snapshot Isolation, which allows some anomalies that could not occur in any serialized ordering of the transactions. This patch fixes that using a method called Serializable Snapshot Isolation, based on research papers by Michael J. Cahill (see README-SSI for full references). In Serializable Snapshot Isolation, transactions run like they do in Snapshot Isolation, but a predicate lock manager observes the reads and writes performed and aborts transactions if it detects that an anomaly might occur. This method produces some false positives, ie. it sometimes aborts transactions even though there is no anomaly. To track reads we implement predicate locking, see storage/lmgr/predicate.c. Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared memory is finite, so when a transaction takes many tuple-level locks on a page, the locks are promoted to a single page-level lock, and further to a single relation-level lock if necessary. To lock key values with no matching tuple, a sequential scan always takes a relation-level lock, and an index scan acquires a page-level lock that covers the search key, whether or not there are any matching keys at the moment. A predicate lock doesn't conflict with any regular locks or with other predicate locks in the normal sense. They're only used by the predicate lock manager to detect the danger of anomalies. Only serializable transactions participate in predicate locking, so there should be no extra overhead for other transactions. Predicate locks can't be released at commit, but must be remembered until all the transactions that overlapped with it have completed. That means that we need to remember an unbounded amount of predicate locks, so we apply a lossy but conservative method of tracking locks for committed transactions. 
If we run short of shared memory, we overflow to a new "pg_serial" SLRU pool. We don't currently allow Serializable transactions in Hot Standby mode. That would be hard, because even read-only transactions can cause anomalies that wouldn't otherwise occur. Serializable isolation mode now means the new fully serializable level. Repeatable Read gives you the old Snapshot Isolation level that we have always had. Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and Anssi Kääriäinen	2011-02-08 00:09:08 +02:00
Robert Haas	0af695fd43	Log restartpoints in the same fashion as checkpoints. Prior to 9.0, restartpoints never created, deleted, or recycled WAL files, but now they can. This code makes log_checkpoints treat checkpoints and restartpoints symmetrically. It also adjusts up the documentation of the parameter to mention restartpoints. Fujii Masao. Docs by me, as suggested by Itagaki Takahiro.	2011-02-02 21:08:53 -05:00
Heikki Linnakangas	997b48ed96	Support multiple concurrent pg_basebackup backups. With this patch, pg_basebackup doesn't write a backup_label file in the data directory, so it doesn't interfere with a pg_start/stop_backup() based backup anymore. backup_label is still included in the backup, but it is injected directly into the tar stream. Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.	2011-01-31 18:25:39 +02:00
Tom Lane	0f73aae13d	Allow the wal_buffers setting to be auto-tuned to a reasonable value. If wal_buffers is initially set to -1 (which is now the default), it's replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default) and a maximum of the XLOG segment size. The allowed range for manual settings is still from 4 up to whatever will fit in shared memory. Greg Smith, with implementation correction by me.	2011-01-22 20:31:24 -05:00
Magnus Hagander	4448917d51	Split pg_start_backup() and pg_stop_backup() into two pieces. Move the actual functionality into a separate function that's easier to call internally, and change the SQL-callable function to be a wrapper calling this. Also create a pg_abort_backup() function, only callable internally, that does only the most vital parts of pg_stop_backup(), making it safe(r) to call from error handlers.	2011-01-09 21:00:28 +01:00
Robert Haas	a9f72b4083	Improve recovery.conf.sample comments. Jehan-Guillaume de Rorthais, with some additional wordsmithing by me.	2011-01-07 11:01:25 -05:00
Robert Haas	dc8a14311a	Update comments in RecordTransactionCommit() to mention unlogged tables.	2011-01-03 10:29:22 -05:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Alvaro Herrera	55573990ca	Avoid unnecessary public struct declaration in slru.h. Instead, declare a public wrapper of the sole function using it for external callers, so that they don't have to always pass a NULL argument. Author: Kevin Grittner	2010-12-30 12:09:17 -03:00
Robert Haas	53dbc27c62	Support unlogged tables. The contents of an unlogged table are not WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.	2010-12-29 06:48:53 -05:00
Magnus Hagander	9b8aff8c19	Add REPLICATION privilege for ROLEs. This privilege is required to do Streaming Replication, instead of superuser, making it possible to set up a SR slave that doesn't have write permissions on the master. Superuser privileges do NOT override this check, so in order to use the default superuser account for replication it must be explicitly granted the REPLICATION permissions. This is a backwards-incompatible change, in the interest of higher default security.	2010-12-29 11:05:03 +01:00
Bruce Momjian	5000472112	Remove quotes from boolean recovery.conf.sample parameters, now that the quotes are not required. This now matches postgresql.conf's specification of booleans.	2010-12-24 11:51:51 -05:00
Heikki Linnakangas	9de3aa65f0	Rewrite the GiST insertion logic so that we don't need the post-recovery cleanup stage to finish incomplete inserts or splits anymore. There were two reasons for the cleanup step: 1. When a new tuple was inserted to a leaf page, the downlink in the parent needed to be updated to contain (ie. to be consistent with) the new key. Updating the parent in turn might require recursively updating the parent of the parent. We now handle that by updating the parent while traversing down the tree, so that when we insert the leaf tuple, all the parents are already consistent with the new key, and the tree is consistent at every step. 2. When a page is split, we need to insert the downlink for the new right page(s), and update the downlink for the original page to not include keys that moved to the right page(s). We now handle that by setting a new flag, F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is set, scans always follow the rightlink, regardless of the NSN mechanism used to detect concurrent page splits. That way the tree is consistent right after split, even though the downlink is still missing. This is very similar to the way B-tree splits are handled. When the downlink is inserted in the parent, the flag is cleared. To keep the insertion algorithm simple, when an insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it finishes the split before doing anything else. These changes allow removing the whole "invalid tuple" mechanism, but I retained the scan code to still follow invalid tuples correctly. While we don't create any such tuples anymore, we want to handle them gracefully in case you pg_upgrade a GiST index that has them. If we encounter any on an insert, though, we just throw an error saying that you need to REINDEX. 
The issue that got me into doing this is that if you did a checkpoint while an insert or split was in progress, and the checkpoint finishes quickly so that there is no WAL record related to the insert between RedoRecPtr and the checkpoint record, recovery from that checkpoint would not know to finish the incomplete insert. IOW, we have the same issue we solved with the rm_safe_restartpoint mechanism during normal operation too. It's highly unlikely to happen in practice, and this fix is far too large to backpatch, so we're just going to live with it in previous versions, but this refactoring fixes it going forward. With this patch, you don't get the annoying 'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices anymore if you crash at an unfortunate moment.	2010-12-23 16:21:47 +02:00
Robert Haas	f6a0863e3c	Allow transactions that don't write WAL to commit asynchronously. This case can arise if a transaction has written data, but only to temporary tables. Loss of the commit record in case of a crash won't matter, because the temporary tables will be lost anyway. Reviewed by Heikki Linnakangas and Simon Riggs.	2010-12-20 12:59:33 -05:00
Robert Haas	34c70c7ac4	Instrument checkpoint sync calls. Greg Smith, reviewed by Jeff Janes	2010-12-14 09:26:19 -05:00
Tom Lane	04f4e10cfc	Use symbolic names not octal constants for file permission flags. Purely cosmetic patch to make our coding standards more consistent --- we were doing symbolic some places and octal other places. This patch fixes all C-coded uses of mkdir, chmod, and umask. There might be some other calls I missed. Inconsistency noted while researching tablespace directory permissions issue.	2010-12-10 17:35:33 -05:00
Simon Riggs	e620ee35b2	Optimize commit_siblings in two ways to improve group commit. First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith	2010-12-08 18:48:03 +00:00
Heikki Linnakangas	5a031a5556	Fix bugs in the hot standby known-assigned-xids tracking logic. If there's an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recovery we know that the snapshot contains all transactions running at that point in WAL.	2010-12-07 09:23:30 +01:00
Heikki Linnakangas	95e42a2c29	Fix two typos, by Fujii Masao.	2010-12-06 12:38:05 +01:00
Robert Haas	5ef6c91383	Remove now-outdated mention of quotes being required in recovery.conf. Noted by Itagaki Takahiro.	2010-12-03 09:00:18 -05:00
Robert Haas	970a18687f	Use GUC lexer for recovery.conf parsing. This eliminates some crufty, special-purpose code and, as a non-trivial side benefit, allows recovery.conf parameters to be unquoted. Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki Takahiro, and me.	2010-12-03 08:56:44 -05:00
Peter Eisentraut	fc946c39ae	Remove useless whitespace at end of lines	2010-11-23 22:34:55 +02:00
Heikki Linnakangas	542bdb2146	Fix bug introduced by the recent patch to check that the checkpoint redo location read from backup label file can be found: wasShutdown was set incorrectly when a backup label file was found. Jeff Davis, with a little tweaking by me.	2010-11-11 19:32:11 +02:00
Robert Haas	7ba6e4f0e0	Add monitoring function pg_last_xact_replay_timestamp. Fujii Masao, with a little wordsmithing by me.	2010-11-09 22:52:19 -05:00
Heikki Linnakangas	8c843fff2d	Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001) rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a more future-proof fix for the corner-case bug in streaming replication that was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in February as well. Avoiding 0/0 as a valid value should prevent bugs like that in the future. Per Tom Lane's idea. Back-patch to 9.0. Since this only affects bootstrapping, it makes no difference to existing installations. We don't need to worry about the bug in existing installations, because if you've managed to get past the initial base backup already, you won't hit the bug in the future either.	2010-11-02 11:39:48 +02:00
Heikki Linnakangas	931b6db39b	Fix corner-case bug in tracking of latest removed WAL segment during streaming replication. We used log/seg 0/0 to indicate that no WAL segments have been removed since startup, but 0/0 is a valid value for the very first WAL segment after initdb. To make that unambiguous, store (latest removed WAL segment + 1) in the global variable. Per report from Matt Chesler, also reproduced by Greg Smith.	2010-11-01 10:05:15 +02:00
Heikki Linnakangas	0c6293dd03	Before removing backup_label and irrevocably changing pg_control file, check that WAL file containing the checkpoint redo-location can be found. This avoids making the cluster irrecoverable if the redo location is in an earlier WAL file than the checkpoint record. Report, analysis and patch by Jeff Davis, with small changes by me.	2010-10-26 21:43:52 +03:00
Tom Lane	def30e84c4	Don't try to fetch database name when SetTransactionIdLimit() is executed outside a transaction. This repairs brain fade in my patch of 2009-08-30: the reason we had been storing oldest-database name, not OID, in ShmemVariableCache was of course to avoid having to do a catalog lookup at times when it might be unsafe. This error explains why Aleksandr Dushein is having trouble getting out of an XID wraparound state in bug #5718, though not how he got into that state in the first place. I suspect pg_upgrade is at fault there.	2010-10-20 12:48:51 -04:00
Alvaro Herrera	17a16663d0	Remove AtStart_Cache() call in CommandCounterIncrement(). This call was present in the aboriginal code from Berkeley, and has never been touched; it may very well be that it was there to mask effects of bugs in other places and it may no longer be necessary. The removal has been foreseen in a code comment since 2007; this seems to be a good time to test this hypothesis.	2010-10-20 11:33:57 -03:00
Simon Riggs	3bbcc5c999	Make startup process respond to signals to cancel waiting on latch. A tidy up for recently committed changes to startup latch. Fujii Masao	2010-10-14 19:15:26 +01:00
Simon Riggs	45cd9199c2	Fix bug in comment of timeline history file. Fujii Masao	2010-10-14 19:06:06 +01:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Tom Lane	54d0e2886a	Add some documentation about how we WAL-log filesystem actions. Per a question from Robert Haas.	2010-09-17 00:42:39 +00:00
Heikki Linnakangas	79b54816db	Fix two typos in comments, spotted by Fujii Masao and Thom Brown	2010-09-15 13:58:22 +00:00
Heikki Linnakangas	723d0184e2	Use a latch to make startup process wake up and replay immediately when new WAL arrives via streaming replication. This reduces the latency, and also allows us to use a longer polling interval, which is good for energy efficiency. We still need to poll to check for the appearance of a trigger file, but the interval is now 5 seconds (instead of 100ms), like when waiting for a new WAL segment to appear in WAL archive.	2010-09-15 10:35:05 +00:00
Heikki Linnakangas	2746e5f21d	Introduce latches. A latch is a boolean variable, with the capability to wait until it is set. Latches can be used to reliably wait until a signal arrives, which is hard otherwise because signals don't interrupt select() on some platforms, and even when they do, there are race conditions. On Unix, latches use the so-called self-pipe trick under the covers to implement the sleep until the latch is set, without race conditions. On Windows, Windows events are used. Use the new latch abstraction to sleep in walsender, so that as soon as a transaction finishes, walsender is woken up to immediately send the WAL to the standby. This reduces the latency between master and standby, which is good. Preliminary work by Fujii Masao. The latch implementation is by me, with helpful comments from many people.	2010-09-11 15:48:04 +00:00
Tom Lane	eb36d1ad51	Fix oversight in RelFileNodeBackend patch: CreateFakeRelcacheEntry needs to initialize the rd_backend field of a fake Relation entry correctly. Fortunately, that is easy, since only non-temp relations should ever be mentioned in the WAL stream.	2010-08-30 16:46:23 +00:00
Simon Riggs	ac791d3ca1	Fix misleading DEBUG2 issued during RemoveOldXlogFiles()	2010-08-30 15:37:41 +00:00
Simon Riggs	e72f15ed60	Truncate subtrans after each restartpoint. Issue reported by Harald Kolb, patch by Fujii Masao, review by me.	2010-08-30 14:22:05 +00:00
Alvaro Herrera	3a1b51de19	Remove duplicate translatable phrase	2010-08-26 19:23:41 +00:00
Robert Haas	debcec7dc3	Include the backend ID in the relpath of temporary relations. This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.	2010-08-13 20:10:54 +00:00
Robert Haas	95ef7cd40d	Make RecordTransactionCommit() respect wal_level. Since the only purpose of WAL-logging SharedInvalidationMessages is to support Hot Standby operation, they needn't be included when wal_level < hot_standby. Back-patch to 9.0. Review by Heikki Linnakangas and Fujii Masao.	2010-08-13 15:42:21 +00:00
Robert Haas	30c22eb8fc	Correct sundry errors in Hot Standby-related comments. Fujii Masao	2010-08-12 23:24:54 +00:00
Simon Riggs	5b8bd0529e	Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0. Transaction aborts now record their LSN to avoid corner case behaviour in SR/HS, hence change of name of variables and functions. As pointed out by Fujii Masao. Cosmetic changes only.	2010-07-29 22:27:27 +00:00
Robert Haas	7be8946c78	Avoid deep recursion when assigning XIDs to multiple levels of subxacts. Backpatch to 8.0. Andres Freund, with cleanup and adjustment for older branches by me.	2010-07-23 00:43:00 +00:00
Tom Lane	672efc0865	Update obsolete comment. Noted by Josh Tolley.	2010-07-08 16:08:30 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Tom Lane	8771634666	Don't set recoveryLastXTime when replaying a checkpoint --- that was a bogus idea from the start since the variable is only meant to track commit/abort events. This patch reverts the logic around the variable to what it was in 8.4, except that the value is now kept in shared memory rather than a static variable, so that it can be reported correctly by CreateRestartPoint (which is executed in the bgwriter).	2010-07-03 22:15:45 +00:00
Tom Lane	e76c1a0f4d	Replace max_standby_delay with two parameters, max_standby_archive_delay and max_standby_streaming_delay, and revise the implementation to avoid assuming that timestamps found in WAL records can meaningfully be compared to clock time on the standby server. Instead, the delay limits are compared to the elapsed time since we last obtained a new WAL segment from archive or since we were last "caught up" to WAL data arriving via streaming replication. This avoids problems with clock skew between primary and standby, as well as other corner cases that the original coding would misbehave in, such as the primary server having significant idle time between transactions. Per my complaint some time ago and considerable ensuing discussion. Do some desultory editing on the hot standby documentation, too.	2010-07-03 20:43:58 +00:00
Bruce Momjian	b57ddccf05	Add C comment about why synchronous_commit=off behavior can lose committed transactions in a postmaster crash.	2010-06-29 18:44:58 +00:00
Robert Haas	400916b6d7	emode_for_corrupt_record shouldn't reduce LOG messages to WARNING. In non-interactive sessions, WARNING sorts below LOG.	2010-06-28 19:46:19 +00:00
Tom Lane	09698bb5fb	Make RemoveOldXlogFiles's debug printout match style used elsewhere: log and seg aren't an XLogRecPtr and shouldn't be printed like one. Fujii Masao	2010-06-17 17:37:23 +00:00
Tom Lane	07e8b6aabc	Don't allow walsender to send WAL data until it's been safely fsync'd on the master. Otherwise a subsequent crash could cause the master to lose WAL that has already been applied on the slave, resulting in the slave being out of sync and soon corrupt. Per recent discussion and an example from Robert Haas. Fujii Masao	2010-06-17 16:41:25 +00:00
Heikki Linnakangas	6da07cd80d	If a corrupt WAL record is received by streaming replication, disconnect and retry. If the record is genuinely corrupt in the master database, there's little hope of recovering, but it's better than simply retrying to apply the corrupt WAL record in a tight loop without even trying to retransmit it, which is what we used to do.	2010-06-14 06:04:21 +00:00
Peter Eisentraut	c86efdde5f	Fix typo/bug, found by Clang compiler	2010-06-12 09:14:52 +00:00
Itagaki Takahiro	56834fc759	Rename restartpoint_command to archive_cleanup_command.	2010-06-10 08:13:50 +00:00
Heikki Linnakangas	0a7cb85531	Make TriggerFile variable static. It's not used outside xlog.c. Fujii Masao	2010-06-10 07:49:23 +00:00
Heikki Linnakangas	346d7cd7fa	Return NULL instead of 0/0 in pg_last_xlog_receive_location() and pg_last_xlog_replay_location(). Per Robert Haas's suggestion, after Itagaki Takahiro pointed out an issue in the docs. Also, some wording changes in the docs by me.	2010-06-10 07:00:27 +00:00
Heikki Linnakangas	71815306e9	In standby mode, respect checkpoint_segments in addition to checkpoint_timeout to trigger restartpoints. We used to deliberately only do time-based restartpoints, because if checkpoint_segments is small we would spend time doing restartpoints more often than really necessary. But now that restartpoints are done in bgwriter, they're not as disruptive as they used to be. Secondly, because streaming replication stores the streamed WAL files in pg_xlog, we want to clean it up more often to avoid running out of disk space when checkpoint_timeout is large and checkpoint_segments small. Patch by Fujii Masao, with some minor changes by me.	2010-06-09 15:04:07 +00:00
Magnus Hagander	8c873bbfa7	Make the walwriter close its handle to an old xlog segment if it's no longer the current one. Not doing this would leave the walwriter with a handle to a deleted file if there was nothing for it to do for a long period of time, preventing the file from being completely removed. Reported by Tollef Fog Heen, and thanks to Heikki for some hand-holding with the patch.	2010-06-09 10:54:45 +00:00
Peter Eisentraut	cb6038c168	Fix some inconsistent quoting of wal_level values in messages. When referring to postgresql.conf syntax, then it's without quotes (wal_level=archive); in narrative it's with double quotes. But never single quotes.	2010-06-03 21:02:12 +00:00
Robert Haas	d561430b66	On clean shutdown during recovery, don't warn about possible corruption. Fujii Masao. Review by Heikki Linnakangas and myself.	2010-06-03 03:20:00 +00:00
Heikki Linnakangas	6b24036365	Fix obsolete comments that I neglected to update in a previous patch. Fujii Masao	2010-06-02 09:28:44 +00:00
Heikki Linnakangas	c5bd8feac6	Adjust comment to reflect that we now have Hot Standby. Pointed out by Robert Haas.	2010-05-27 00:38:39 +00:00
Robert Haas	ea9968c331	Rename PM_RECOVERY_CONSISTENT and PMSIGNAL_RECOVERY_CONSISTENT. The new names PM_HOT_STANDBY and PMSIGNAL_BEGIN_HOT_STANDBY more accurately reflect their actual function.	2010-05-15 20:01:32 +00:00
Simon Riggs	4a24c9a063	Fix bug in processing of checkpoint time for max_standby_delay. Latest log time was incorrectly set, typically leading to dates in the past, which would cause more cancellations in Hot Standby on a quiet server.	2010-05-15 07:14:43 +00:00
Simon Riggs	fd34374b17	Add many new Asserts in code and fix simple bug that slipped through without them, related to previous commit. Report by Bruce Momjian.	2010-05-14 07:11:49 +00:00
Simon Riggs	463f151a23	Ensure that top level aborts call XLogSetAsyncCommit(). Not doing so simply leads to data waiting in wal_buffers which then causes later commits to potentially do emergency writes and for all forms of replication to be potentially delayed without need or benefit. Issue pointed out exactly by Fujii Masao, following bug report by Robert Haas on a separate though related topic.	2010-05-13 11:39:30 +00:00
Simon Riggs	8431e296ea	Cleanup initialization of Hot Standby. Clarify working with reanalysis of requirements and documentation on LogStandbySnapshot(). Fixes two minor bugs reported by Tom Lane that would lead to an incorrect snapshot after transaction wraparound. Also fix two other problems discovered that would give incorrect snapshots in certain cases. ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().	2010-05-13 11:15:38 +00:00
Heikki Linnakangas	ffe8c7c677	Need to hold ControlFileLock while updating control file. Update minRecoveryPoint in control file when replaying a parameter change record, to ensure that we don't allow hot standby on WAL generated without wal_level='hot_standby' after a standby restart.	2010-05-03 11:17:52 +00:00
Tom Lane	f9ed327f76	Clean up some awkward, inaccurate, and inefficient processing around MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more appropriate timestamp functions for performing tests with it. Make the ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have behavior similar to ps_activity code elsewhere, notably not updating the display when update_process_title is off and not truncating the display contents at an arbitrarily-chosen length. Improve the docs to be explicit about what MaxStandbyDelay actually measures, viz the difference between primary and standby servers' clocks, and the possible hazards if their clocks aren't in sync.	2010-05-02 02:10:33 +00:00
Tom Lane	69f7a4d8e3	Adjust error checks in pg_start_backup and pg_stop_backup to make it possible to perform a backup without archive_mode being enabled. This gives up some user-error protection in order to improve usefulness for streaming-replication scenarios. Per discussion.	2010-04-29 21:49:03 +00:00
Tom Lane	f0488bd57c	Rename the parameter recovery_connections to hot_standby, to reduce possible confusion with streaming-replication settings. Also, change its default value to "off", because of concern about executing new and poorly-tested code during ordinary non-replicating operation. Per discussion. In passing do some minor editing of related documentation.	2010-04-29 21:36:19 +00:00
Tom Lane	77acab75df	Modify ShmemInitStruct and ShmemInitHash to throw errors internally, rather than returning NULL for some-but-not-all failures as they used to. Remove now-redundant tests for NULL from call sites. We had to do something about this because many call sites were failing to check for NULL; and changing it like this seems a lot more useful and mistake-proof than adding checks to the call sites without them.	2010-04-28 16:54:16 +00:00
Heikki Linnakangas	9b8a73326e	Introduce wal_level GUC to explicitly control if information needed for archival or hot standby should be WAL-logged, instead of deducing that from other options like archive_mode. This replaces recovery_connections GUC in the primary, where it now has no effect, but it's still used in the standby to enable/disable hot standby. Remove the WAL-logging of "unlogged operations", like creating an index without WAL-logging and fsyncing it at the end. Instead, we keep a copy of the wal_mode setting and the settings that affect how much shared memory a hot standby server needs to track master transactions (max_connections, max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings change, at server restart, write a WAL record noting the new settings and update pg_control. This allows us to notice the change in those settings in the standby at the right moment, they used to be included in checkpoint records, but that meant that a changed value was not reflected in the standby until the first checkpoint after the change. Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to the sequence it used to follow, before hot standby and subsequent patches changed it to 0x9003.	2010-04-28 16:10:43 +00:00
Tom Lane	2871b4618a	Replace the KnownAssignedXids hash table with a sorted-array data structure, and be more tense about the locking requirements for it, to improve performance in Hot Standby mode. In passing fix a few bugs and improve a number of comments in the existing HS code. Simon Riggs, with some editorialization by Tom	2010-04-28 00:09:05 +00:00
Heikki Linnakangas	3efba16d56	If a base backup is cancelled by server shutdown or crash, throw an error in WAL recovery when it sees the shutdown checkpoint record. It's more user-friendly to find out about it at that point than at the end of recovery, and you're not left wondering why your hot standby server never opens up for read-only connections.	2010-04-27 09:25:18 +00:00
Simon Riggs	491d1ea5b3	Previous patch revoked following objections.	2010-04-23 20:21:31 +00:00
Simon Riggs	6ca23b1a29	Make CheckRequiredParameterValues() depend upon correct combination of parameters. Fix bug report by Robert Haas that error message and hint was incorrect if wrong mode parameters specified on master. Internal changes only. Proposals for parameter simplification on master/primary still under way.	2010-04-23 19:57:19 +00:00
Robert Haas	481cb5d9b5	Rename standby_keep_segments to wal_keep_segments. Also, make the name of the GUC and the name of the backing variable match. Along the way, clean up a couple of slight typographical errors in the related docs.	2010-04-20 11:15:06 +00:00
Simon Riggs	d38603bd97	Improve sequence and sense of messages from pg_stop_backup(). Now doesn't report it is waiting until it actually is waiting, plus message doesn't appear until at least 5 seconds wait, so we avoid reporting the wait before we've given the archiver a reasonable time to wake up and archive the file we just created earlier in the function. Also add new unconditional message to confirm safe completion. Now a normal, healthy execution does not report waiting at all, just safe completion.	2010-04-18 18:44:53 +00:00
Simon Riggs	2847de9df2	Remove some additional changes in previous commit that belong elsewhere.	2010-04-18 18:17:12 +00:00
Simon Riggs	21d6a6a128	Tune GetSnapshotData() during Hot Standby by avoiding loop through normal backends. Makes code clearer also, since we avoid various Assert()s. Performance of snapshots taken during recovery no longer depends upon number of read-only backends.	2010-04-18 18:06:07 +00:00
Heikki Linnakangas	78974cfb9b	In standby mode, suppress repeated LOG messages about a corrupt record, which just indicates that we've reached the end of valid WAL found in the standby.	2010-04-16 08:58:16 +00:00
Bruce Momjian	ec4b9bcc3d	Doc change: effect -> affect, per Robert Haas	2010-04-15 03:05:59 +00:00
Simon Riggs	55d7556a4d	Fix minor typo in comment in xlog.c	2010-04-14 10:29:07 +00:00
Heikki Linnakangas	361bd1662e	Allow Hot Standby to begin from a shutdown checkpoint. Patch by Simon Riggs & me	2010-04-13 14:17:46 +00:00
Heikki Linnakangas	30556568f5	Update the location of last removed WAL segment in shared memory only after actually removing one, so that if we can't remove segments because WAL archiving is lagging behind, we don't unnecessarily forbid streaming the old not-yet-archived segments that are still perfectly valid. Per suggestion from Fujii Masao.	2010-04-12 10:40:43 +00:00
Heikki Linnakangas	e57cd7f0a1	Change the logic to decide when to delete old WAL segments, so that it doesn't take into account how far the WAL senders are. This way a hung WAL sender doesn't prevent old WAL segments from being recycled/removed in the primary, ultimately causing the disk to fill up. Instead add standby_keep_segments setting to control how many old WAL segments are kept in the primary. This also makes it more reliable to use streaming replication without WAL archiving, assuming that you set standby_keep_segments high enough.	2010-04-12 09:52:29 +00:00
Heikki Linnakangas	0f11ed5886	Allow quotes to be escaped in recovery.conf, by doubling them. This patch also makes the parsing a little bit stricter, rejecting garbage after the parameter value and values with missing ending quotes, for example.	2010-04-07 10:58:49 +00:00
Heikki Linnakangas	370f770c15	Forbid using pg_xlogfile_name() and pg_xlogfile_name_offset() during recovery. We might want to relax this in the future, but ThisTimeLineID isn't currently correct in backends during recovery, so the filename returned was wrong.	2010-04-07 06:12:52 +00:00
Simon Riggs	89c5008158	Further message changes when recovery.conf parameters missing.	2010-04-06 17:51:58 +00:00
Heikki Linnakangas	492d9f2309	Rename "Log-streaming replication parameters" header to "Standby server parameters" in recovery.conf, to match the grouping in the documentation. Fujii Masao	2010-04-06 14:53:20 +00:00
Simon Riggs	cf2575b8c4	Check compulsory parameters in recovery.conf in standby_mode, per docs.	2010-04-02 21:50:40 +00:00
Simon Riggs	31f00d163b	Move system startup message prior to any calls out of data directory. This allows us to see what mode the server is in before it starts to perform actions that can block or hang. Otherwise server messages may not appear until after messages that say FATAL the database server is starting up.	2010-04-02 13:10:56 +00:00
Robert Haas	54943734f8	Refer to max_wal_senders in a more consistent fashion. The error message now makes explicit reference to the GUC that must be changed to fix the problem, using wording suggested by Tom Lane. Along the way, rename the GUC from MaxWalSenders to max_wal_senders for consistency and grep-ability.	2010-04-01 00:43:29 +00:00
Bruce Momjian	55a01b4c0a	Change recovery.conf.sample to match postgresql.conf by showing only default values, with example comments.	2010-03-31 14:18:45 +00:00
Heikki Linnakangas	2a77355ea1	Change the retry-loop in standby mode to also try restoring files from pg_xlog directory. This is essential for replaying WAL records that were streamed from the master, after a standby server restart. If a corrupt record is seen in a file restored from the archive or streamed from the master, log it as a WARNING and keep retrying. If the corruption is permanent, and not just a glitch in whatever copies the files to the archive or a network error not caught by CRC checks in TCP for example, we will keep retrying and logging the WARNING indefinitely. But that's better than shutting down completely, the standby is still useful for running read-only queries. In PITR the recovery ends at such a corrupt record, which is a bit questionable, but that's the behavior we had in previous releases and we don't feel like changing it now. It does make sense for tools like pg_standby.	2010-03-30 16:23:57 +00:00
Simon Riggs	de66effede	Edit recovery.conf.sample so it matches docs. Change standby_mode example to 'on' or 'off' rather than 'true' or 'false', as shown in docs. Add restartpoint_command. Add section header for recovery target parameters, matching docs.	2010-03-29 18:50:36 +00:00
Peter Eisentraut	c248d17120	Message tuning	2010-03-21 00:17:59 +00:00
Simon Riggs	3cdafe40e7	Adjust comment in .history file to match recovery target specified. Comment present since 8.0 was never fully meaningful, since two recovery targets cannot be specified. Refactor recovery target type to make this change and associated code easier to understand. No change in function. Bug report arising from internal support question.	2010-03-19 11:05:15 +00:00
Heikki Linnakangas	c21ac0b58e	Add restartpoint_command option to recovery.conf. Fix bug in %r handling in recovery_end_command, it always came out as 0 because InRedo was cleared before recovery_end_command was executed. Also, always take ControlFileLock when reading checkpoint location for %r. The recovery_end_command bug and the missing locking was present in 8.4 as well, that part of this patch will be backported separately.	2010-03-18 09:17:18 +00:00
Simon Riggs	1a163a0c68	Remove incorrect comment from GetWriteRecPtr(): the return value is always correct, as described in comments at start of xlog.c	2010-03-15 18:49:17 +00:00
Itagaki Takahiro	17d8de0e61	pg_start_backup() can use a share lock to lock ControlFileLock instead of an exclusive lock. The change is almost for code cleanup. Since there seems to be no performance benefits from it, backports should not be needed. Fujii Masao	2010-03-10 02:04:48 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	a2239b96e0	Make pg_stop_backup's reporting a bit more verbose in hopes of making error cases less intimidating for novices. Per discussion. Greg Smith	2010-02-25 02:17:50 +00:00
Tom Lane	05d8a561ff	Clean up handling of XactReadOnly and RecoveryInProgress checks. Add some checks that seem logically necessary, in particular let's make real sure that HS slave sessions cannot create temp tables. (If they did they would think that temp tables belonging to the master's session with the same BackendId were theirs. We must not allow myTempNamespace to become set in a slave session.) Change setval() and nextval() so that they are only allowed on temp sequences in a read-only transaction. This seems consistent with what we allow for table modifications in read-only transactions. Since an HS slave can't have a temp sequence, this also provides a nicer cure for the setval PANIC reported by Erik Rijkers. Make the error messages more uniform, and have them mention the specific command being complained of. This seems worth the trifling amount of extra code, since people are likely to see such messages a lot more than before.	2010-02-20 21:24:02 +00:00
Heikki Linnakangas	ad458cfe81	Don't use O_DIRECT when writing WAL files if archiving or streaming is enabled. Bypassing the kernel cache is counter-productive in that case, because the archiver/walsender process will read from the WAL file soon after it's written, and if it's not cached the read will cause a physical read, eating I/O bandwidth available on the WAL drive. Also, walreceiver process does unaligned writes, so disable O_DIRECT in walreceiver process for that reason too.	2010-02-19 10:51:04 +00:00
Itagaki Takahiro	3230fd056a	Fix STOP WAL LOCATION in backup history files not to return the next segment of XLOG_BACKUP_END record even if the record is placed at a segment boundary. Furthermore the previous implementation could return a nonexistent segment file name when the boundary is in segments that have "FE" suffix; we never use segments with "FF" suffix. Backpatch to 8.0, where hot backup was introduced. Reported by Fujii Masao.	2010-02-19 01:04:03 +00:00
Tom Lane	50a90fac40	Stamp HEAD as 9.0devel, and update various places that were referring to 8.5 (hope I got 'em all). Per discussion, this release will be 9.0 not 8.5.	2010-02-17 04:19:41 +00:00
Tom Lane	c64339face	When updating ShmemVariableCache from a checkpoint record, be sure to set all the values derived from oldestXid, not just that field. Brain fade in one of my patches associated with flat file removal, exposed by a report from Fujii Masao. With this change, xidVacLimit should always be valid, so remove a couple of bits of complexity associated with the previous assumption that sometimes it wouldn't get set right away.	2010-02-17 03:10:33 +00:00
Tom Lane	d1e027221d	Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue. In addition, add support for a "payload" string to be passed along with each notify event. This implementation should be significantly more efficient than the old one, and is also more compatible with Hot Standby usage. There is not yet any facility for HS slaves to receive notifications generated on the master, although such a thing is possible in future. Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.	2010-02-16 22:34:57 +00:00
Robert Haas	e26c539e9f	Wrap calls to SearchSysCache and related functions using macros. The purpose of this change is to eliminate the need for every caller of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists, GetSysCacheOid, and SearchSysCacheList to know the maximum number of allowable keys for a syscache entry (currently 4). This will make it far easier to increase the maximum number of keys in a future release should we choose to do so, and it makes the code shorter, too. Design and review by Tom Lane.	2010-02-14 18:42:19 +00:00
Simon Riggs	dd428c79a4	Fix relcache init file invalidation during Hot Standby for the case where a database has a non-default tablespaceid. Pass thru MyDatabaseId and MyDatabaseTableSpace to allow file path to be re-created in standby and correct invalidation to take place in all cases. Update and rework xact_commit_desc() debug messages. Bug report from Tom by code inspection. Fix by me.	2010-02-13 16:15:48 +00:00
Heikki Linnakangas	e465390d03	Reduce the chatter to the log when starting a standby server. Don't echo all the recovery.conf options. Don't emit the "initializing recovery connections" message, which doesn't mean anything to a user. Remove the "starting archive recovery" message and replace the "automatic recovery in progress" message with a more informative message saying whether the server is doing PITR, normal archive recovery, or standby mode.	2010-02-12 09:49:08 +00:00
Heikki Linnakangas	54cbd1757e	If primary_conninfo is not set, don't try to establish streaming connection.	2010-02-12 07:56:36 +00:00
Heikki Linnakangas	9fa01f6c8a	Check for partial WAL files in standby mode. If restore_command restores a partial WAL file, assume it's because the file is just being copied to the archive and treat it the same as "file not found" in standby mode. pg_standby has a similar check, so it seems reasonable to have the same level of protection in the built-in standby mode.	2010-02-12 07:36:44 +00:00
Heikki Linnakangas	161d9d51b3	Now that streaming replication switches between streaming mode and restoring from archive, the last WAL segment is not necessarily open at the end of recovery. Fix assertion that assumed that. Fujii Masao, fixing the assertion failure reported by Martin Pihlak.	2010-02-10 08:25:25 +00:00
Tom Lane	cbe9d6beb4	Fix up rickety handling of relation-truncation interlocks. Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr relation entries, so that they will get reset to InvalidBlockNumber whenever an smgr-level flush happens. Because we now send smgr invalidation messages immediately (not at end of transaction) when a relation truncation occurs, this ensures that other backends will reset their values before they next access the relation. We no longer need the unreliable assumption that a VACUUM that's doing a truncation will hold its AccessExclusive lock until commit --- in fact, we can intentionally release that lock as soon as we've completed the truncation. This patch therefore reverts (most of) Alvaro's patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can also get rid of assorted no-longer-needed relcache flushes, which are far more expensive than an smgr flush because they kill a lot more state. In passing this patch fixes smgr_redo's failure to perform visibility-map truncation, and cleans up some rather dubious assumptions in freespace.c and visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of date.	2010-02-09 21:43:30 +00:00
Heikki Linnakangas	4cea603128	Remove piece of code to zero out minRecoveryPoint when starting crash recovery. It's zeroed out whenever a checkpoint is written, so the only scenario where the removed code did anything is when you kill archive recovery, remove recovery.conf, and start up the server, so that it goes into crash recovery instead. That's a "don't do that" scenario, but it seems better to not clear minRecoveryPoint but instead update it like we do in archive recovery, which is what will now happen.	2010-02-08 09:08:51 +00:00
Tom Lane	0a469c8769	Remove old-style VACUUM FULL (which was known for a little while as VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity. Per discussion, the use case for this method of vacuuming is no longer large enough to justify maintaining it; not to mention that we don't wish to invest the work that would be needed to make it play nicely with Hot Standby. Aside from the code directly related to old-style VACUUM FULL, this commit removes support for certain WAL record types that could only be generated within VACUUM FULL, redirect-pointer removal in heap_page_prune, and nontransactional generation of cache invalidation sinval messages (the last being the sticking point for Hot Standby). We still have to retain all code that copes with finding HEAP_MOVED_OFF and HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long as we want to support in-place update from pre-9.0 databases.	2010-02-08 04:33:55 +00:00
Tom Lane	b9b8831ad6	Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests.	2010-02-07 20:48:13 +00:00
Simon Riggs	296578feb4	Revoke augmentation of WAL records for btree delete, per discussion.	2010-02-01 13:40:28 +00:00
Simon Riggs	6d2bc0a6cf	Augment WAL records for btree delete with GetOldestXmin() to reduce false positives during Hot Standby conflict processing. Simple patch to enhance conflict processing, following previous discussions. Controlled by parameter minimize_standby_conflicts = on | off, with default off allows measurement of performance impact to see whether it should be set on all the time.	2010-01-29 18:39:05 +00:00
Heikki Linnakangas	b0509ef601	Fix crashing bug at the end of recovery in Streaming Replication, when restore_command is not given. Fujii Masao.	2010-01-28 19:17:22 +00:00
Heikki Linnakangas	83cb7da7dc	Fix bug in walsender's xlogid boundary handling, reported by Erik Rijkers. LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't try to send that in XLogSend(). Also fix similar bug in ReadRecord(), which I just introduced in the ReadRecord() refactoring patch.	2010-01-27 16:41:09 +00:00
Heikki Linnakangas	1bb2558046	Make standby server continuously retry restoring the next WAL segment with restore_command, if the connection to the primary server is lost. This ensures that the standby can recover automatically, if the connection is lost for a long time and standby falls behind so much that the required WAL segments have been archived and deleted in the master. This also makes standby_mode useful without streaming replication; the server will keep retrying restore_command every few seconds until the trigger file is found. That's the same basic functionality pg_standby offers, but without the bells and whistles. To implement that, refactor the ReadRecord/FetchRecord functions. The FetchRecord() function introduced in the original streaming replication patch is removed, and all the retry logic is now in a new function called XLogReadPage(). XLogReadPage() is now responsible for executing restore_command, launching walreceiver, and waiting for new WAL to arrive from primary, as required. This also changes the life cycle of walreceiver. When launched, it now only tries to connect to the master once, and exits if the connection fails, or is lost during streaming for any reason. The startup process detects the death, and re-launches walreceiver if necessary.	2010-01-27 15:27:51 +00:00
Simon Riggs	aed1a0121a	Fix longstanding gripe that we check for 0000000001.history at start of archive recovery, even when we know it is never present.	2010-01-26 00:07:13 +00:00
Tom Lane	875353b99f	Fix assorted core dumps and Assert failures that could occur during AbortTransaction or AbortSubTransaction, when trying to clean up after an error that prevented (sub)transaction start from completing: * access to TopTransactionResourceOwner that might not exist * assert failure in AtEOXact_GUC, if AtStart_GUC not called yet * assert failure or core dump in AfterTriggerEndSubXact, if AfterTriggerBeginSubXact not called yet Per testing by injecting elog(ERROR) at successive steps in StartTransaction and StartSubTransaction. It's not clear whether all of these cases could really occur in the field, but at least one of them is easily exposed by simple stress testing, as per my accidental discovery yesterday.	2010-01-24 21:49:17 +00:00
Simon Riggs	959ac58c04	In HS, Startup process sets SIGALRM when waiting for buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.	2010-01-23 16:37:12 +00:00
Heikki Linnakangas	09b115f706	Write a WAL record whenever we perform an operation without WAL-logging that would've been WAL-logged if archiving was enabled. If we encounter such records in archive recovery anyway, we know that some data is missing from the log. A WARNING is emitted in that case. Original patch by Fujii Masao, with changes by me.	2010-01-20 19:43:40 +00:00
Simon Riggs	a8ce974cdd	Teach standby conflict resolution to use SIGUSR1. Conflict reason is passed through directly to the backend, so we can take decisions about the effect of the conflict based upon the local state. No specific changes, as yet, though this prepares for later work. CancelVirtualTransaction() sends signals while holding ProcArrayLock. Introduce errdetail_abort() to give message detail explaining that the abort was caused by conflict processing. Remove CONFLICT_MODE states in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.	2010-01-16 10:05:59 +00:00
Heikki Linnakangas	40f908bdcd	Introduce Streaming Replication. This includes two new kinds of postmaster processes, walsenders and walreceiver. Walreceiver is responsible for connecting to the primary server and streaming WAL to disk, while walsender runs in the primary server and streams WAL from disk to the client. Documentation still needs work, but the basics are there. We will probably pull the replication section to a new chapter later on, as well as the sections describing file-based replication. But let's do that as a separate patch, so that it's easier to see what has been added/changed. This patch also adds a new section to the chapter about FE/BE protocol, documenting the protocol used by walsender/walreceiver. Bump catalog version because of two new functions, pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for monitoring the progress of replication. Fujii Masao, with additional hacking by me	2010-01-15 09:19:10 +00:00
Simon Riggs	42edbd16fb	During Hot Standby, set DatabasePath correctly during relcache init file deletion, so that we attempt to unlink the correct filepath. unlink() errors are ignorable there, so lack of a DatabasePath initialization step did not cause visible problems until a related bug showed up on Solaris. Code refactored from xact_redo_commit() to ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay shared invalidation messages for many databases, so we cannot SetDatabasePath() once as we do in normal backends. Read the databaseid from the shared invalidation messages, then set DatabasePath temporarily before calling RelationCacheInitFileInvalidate(). Problem report by Robert Treat, analysis and fix by me.	2010-01-09 16:49:27 +00:00
Heikki Linnakangas	06f82b2961	Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at recovery instead of reading the backup history file. This is more robust, as it stops you from prematurely starting up an inconsistent cluster if the backup history file is lost for some reason, or if the base backup was never finished with pg_stop_backup(). This also paves the way for a simpler streaming replication patch, which doesn't need to care about backup history files anymore. The backup history file is still created and archived as before, but it's not used by the system anymore. It's just for informational purposes now. Bump PG_CONTROL_VERSION as the location of the backup startpoint is now written to a new field in pg_control, and catversion because initdb is required. Original patch by Fujii Masao per Simon's idea, with further fixes by me.	2010-01-04 12:50:50 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Heikki Linnakangas	ff1e1e45b9	Reset minRecoveryPoint at checkpoints, so that we don't uselessly update it in the control file at crash recovery following an archive recovery. Per Fujii Masao and subsequent discussion.	2009-12-30 08:37:21 +00:00
Simon Riggs	efc16ea520	Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.	2009-12-19 01:32:45 +00:00
Tom Lane	62aba76568	Prevent indirect security attacks via changing session-local state within an allegedly immutable index function. It was previously recognized that we had to prevent such a function from executing SET/RESET ROLE/SESSION AUTHORIZATION, or it could trivially obtain the privileges of the session user. However, since there is in general no privilege checking for changes of session-local state, it is also possible for such a function to change settings in a way that might subvert later operations in the same session. Examples include changing search_path to cause an unexpected function to be called, or replacing an existing prepared statement with another one that will execute a function of the attacker's choosing. The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against these threats, which are the same places previously deemed to need protection against the SET ROLE issue. GUC changes are still allowed, since there are many useful cases for that, but we prevent security problems by forcing a rollback of any GUC change after completing the operation. Other cases are handled by throwing an error if any change is attempted; these include temp table creation, closing a cursor, and creating or deleting a prepared statement. (In 7.4, the infrastructure to roll back GUC changes doesn't exist, so we settle for rejecting changes of "search_path" in these contexts.) Original report and patch by Gurjeet Singh, additional analysis by Tom Lane. Security: CVE-2009-4136	2009-12-09 21:57:51 +00:00
Heikki Linnakangas	cd87b6f8a5	Fix an old bug in multixact and two-phase commit. Prepared transactions can be part of multixacts, so allocate a slot for each prepared transaction in the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer the oldest member value from the current backends slot to the prepared xact slot. Also save and recover the value from the 2pc state file. The symptom of the bug was that after a transaction prepared, a shared lock still held by the prepared transaction was sometimes ignored by other transactions. Fix back to 8.1, where both 2PC and multixact were introduced.	2009-11-23 09:58:36 +00:00
Heikki Linnakangas	7f2a10fecd	Don't error out if recycling or removing an old WAL segment fails at the end of checkpoint. Although the checkpoint has been written to WAL at that point already, so that all data is safe, and we'll retry removing the WAL segment at the next checkpoint, if such a failure persists we won't be able to remove any other old WAL segments either and will eventually run out of disk space. It's better to treat the failure as non-fatal, and move on to clean any other WAL segment and continue with any other end-of-checkpoint cleanup. We don't normally expect any such failures, but on Windows it can happen with some anti-virus or backup software that lock files without FILE_SHARE_DELETE flag. Also, the loop in pgrename() to retry when the file is locked was broken. If a file is locked on Windows, you get ERROR_SHARE_VIOLATION, not ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct in some environment), and added ERROR_LOCK_VIOLATION to be consistent with similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to 10s, on the grounds that since it's been broken, we've effectively had a timeout of 0s and no-one has complained, so a smaller timeout is actually closer to the old behavior. A longer timeout would mean that if recycling a WAL file fails because it's locked for some reason, InstallXLogFileSegment() will hold ControlFileLock for longer, potentially blocking other backends, so a long timeout isn't totally harmless. While we're at it, set errno correctly in pgrename(). Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c changes would make sense on other platforms and thus on older versions as well, but since there's no such locking issues on other platforms, it's not worth it.	2009-09-13 18:32:08 +00:00
Heikki Linnakangas	4e2d5efc6a	On Windows, when a file is deleted and another process still has an open file handle on it, the file goes into "pending deletion" state where it still shows up in directory listing, but isn't accessible otherwise. That confuses RemoveOldXLogFiles(), making it think that the file hasn't been archived yet, while it actually was, and it was deleted along with the .done file. Fix that by renaming the file with ".deleted" extension before deleting it. Also check the return value of rename() and unlink(), so that if the removal fails for any reason (e.g. another process is holding the file locked), we don't delete the .done file until the WAL file is really gone. Backpatch to 8.2, which is the oldest version supported on Windows.	2009-09-10 09:42:10 +00:00
Tom Lane	794e3e81a0	Force VACUUM to recalculate oldestXmin even when we haven't changed our own database's datfrozenxid, if the current value is old enough to be forcing autovacuums or warning messages. This ensures that a bogus value is replaced as soon as possible. Per a comment from Heikki.	2009-09-01 04:46:49 +00:00
Tom Lane	14f445fccf	Actually, we need to bump the format identifier on twophase files because of readjustment of 2PC rmgr IDs for flatfile removal.	2009-09-01 04:15:45 +00:00
Alvaro Herrera	a8bb8eb583	Remove flatfiles.c, which is now obsolete. Recent commits have removed the various uses it was supporting. It was a performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems it slowed down user creation after a billion users.	2009-09-01 02:54:52 +00:00
Tom Lane	25ec228ef7	Track the current XID wrap limit (or more accurately, the oldest unfrozen XID) in checkpoint records. This eliminates the need to recompute the value from scratch during database startup, which is one of the two remaining reasons for the flatfile code to exist. It should also simplify life for hot-standby operation. To avoid bloating the checkpoint records unreasonably, I switched from tracking the oldest database by name to tracking it by OID. This turns out to save cycles in general (everywhere but the warning-generating paths, which we hardly care about) and also helps us deal with the case that the oldest database got dropped instead of being vacuumed. The prior coding might go for a long time without updating the wrap limit in that case, which is bad because it might result in a lot of useless autovacuum activity.	2009-08-31 02:23:23 +00:00
Heikki Linnakangas	9cd6685f91	In the checkpoint written at the end of archive recovery, the WAL page header was incorrectly initialized with timeline ID 0. That rendered the WAL page unrecoverable, making a subsequent archive recovery stop at that point. ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer(). This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug was introduced by the changes to use of bgwriter for writing the end-of-archive-recovery checkpoint. Patch by Tom Lane.	2009-08-27 07:15:41 +00:00
Tom Lane	04011cc970	Allow backends to start up without use of the flat-file copy of pg_database. To make this work in the base case, pg_database now has a nailed-in-cache relation descriptor that is initialized using hardwired knowledge in relcache.c. This means pg_database is added to the set of relations that need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this path is taken, we'll have to do a seqscan of pg_database to find the row we need. In the normal case, we are able to do an indexscan to find the database's row by name. This is made possible by storing a global relcache init file that describes only the shared catalogs and their indexes (and therefore is usable by all backends in any database). A new backend loads this cache file, finds its database OID after an indexscan on pg_database, and then loads the local relcache init file for that database. This change should effectively eliminate number of databases as a factor in backend startup time, even with large numbers of databases. However, the real reason for doing it is as a first step towards getting rid of the flat files altogether. There are still several other sub-projects to be tackled before that can happen.	2009-08-12 20:53:31 +00:00
Tom Lane	97e14f6e93	Document that LocalSetXLogInsertAllowed can be re-executed. Per comment from Simon.	2009-08-08 16:39:17 +00:00
Tom Lane	87740caa01	rm_cleanup functions need to be allowed to write WAL entries. This oversight appears to explain the recent reports of "PANIC: cannot make new WAL entries during recovery".	2009-08-07 19:29:49 +00:00
Tom Lane	2de48a83e6	Cleanup and code review for the patch that made bgwriter active during archive recovery. Invent a separate state variable and inquiry function for XLogInsertAllowed() to clarify some tests and make the management of writing the end-of-recovery checkpoint less klugy. Fix several places that were incorrectly testing InRecovery when they should be looking at RecoveryInProgress or XLogInsertAllowed (because they will now be executed in the bgwriter not startup process). Clarify handling of bad LSNs passed to XLogFlush during recovery. Use a spinlock for setting/testing SharedRecoveryInProgress. Improve quite a lot of comments. Heikki and Tom	2009-06-26 20:29:04 +00:00
Heikki Linnakangas	7e48b77b1c	Fix some serious bugs in archive recovery, now that bgwriter is active during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active. When bgwriter is active (= archive recovery), the startup process must not accumulate fsync requests to its own pendingOpsTable, since bgwriter won't see them there when it performs restartpoints. Make startup process drop its pendingOpsTable when bgwriter is launched to avoid that. Update minimum recovery point one last time when leaving archive recovery. It won't be updated by the end-of-recovery checkpoint because XLogFlush() sees us as out of recovery already. This fixes bug #4879 reported by Fujii Masao.	2009-06-25 21:36:00 +00:00
Heikki Linnakangas	ebaa1952f1	The code to unlink dropped relations in FinishPreparedTransaction() was acting like it runs inside WAL recovery, but it doesn't. I must've copy-pasted this from a redo-function in the relation forks patch. Noticed by Tom Lane while he was looking through callers of smgrdounlink().	2009-06-25 19:05:52 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Heikki Linnakangas	7c8d7a2eec	Only recycle normal files in pg_xlog as WAL segments. pg_standby creates symbolic links with the -l option, and as Fujii Masao pointed out we ended up overwriting files in the archive directory before this patch. Patch by Aidan Van Dyk, Fujii Masao and me. Backpatch to 8.3, where pg_standby was introduced.	2009-06-02 06:18:06 +00:00
Heikki Linnakangas	2e6107cb62	When archiving is enabled, rotate the last WAL segment at shutdown so that all transactions are archived. Original patch by Guillaume Smet.	2009-05-28 11:02:16 +00:00
Tom Lane	4616d57dad	Fix all the server-side SIGQUIT handlers (grumble ... why so many identical copies?) to ensure they really don't run proc_exit/shmem_exit callbacks, as was intended. I broke this behavior recently by installing atexit callbacks without thinking about the one case where we truly don't want to run those callback functions. Noted in an example from Dave Page.	2009-05-15 15:56:39 +00:00
Tom Lane	bfab3f19e3	Include recovery_end_command in recovery.conf.sample. Per suggestion of Jaime Casanova.	2009-05-14 22:22:01 +00:00
Tom Lane	284e12c398	Improve a couple of comments.	2009-05-14 21:28:35 +00:00
Heikki Linnakangas	9e403c2587	Add recovery_end_command option to recovery.conf. recovery_end_command is run at the end of archive recovery, providing a chance to do external cleanup. Modify pg_standby so that it no longer removes the trigger file, that is to be done using the recovery_end_command now. Provide a "smart" failover mode in pg_standby, where we don't fail over immediately, but only after recovering all unapplied WAL from the archive. That gives you zero data loss assuming all WAL was archived before failover, which is what most users of pg_standby actually want. recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and myself.	2009-05-14 20:31:09 +00:00
Tom Lane	23543c732b	Rewrite xml.c's memory management (yet again). Give up on the idea of redirecting libxml's allocations into a Postgres context. Instead, just let it use malloc directly, and add PG_TRY blocks as needed to be sure we release libxml data structures in error recovery code paths. This is ugly but seems much more likely to play nicely with third-party uses of libxml, as seen in recent trouble reports about using Perl XML facilities in pl/perl and bug #4774 about contrib/xml2. I left the code for allocation redirection in place, but it's only built/used if you #define USE_LIBXMLCONTEXT. This is because I found it useful to corral libxml's allocations in a palloc context when hunting for libxml memory leaks, and we're surely going to have more of those in the future with this type of approach. But we don't want it turned on in a normal build because it breaks exactly what we need to fix. I have not re-indented most of the code sections that are now wrapped by PG_TRY(); that's for ease of review. pg_indent will fix it. This is a pre-existing bug in 8.3, but I don't dare back-patch this change until it's gotten a reasonable amount of field testing.	2009-05-13 20:27:17 +00:00
Heikki Linnakangas	223431cba1	Request XLOG switch before writing checkpoint in pg_start_backup(). Otherwise you can end up with an unrecoverable backup if you start a new base backup right after finishing archive recovery. In that scenario, the redo pointer of the checkpoint that pg_start_backup() writes points to the XLOG segment where the timeline-changing end-of-archive-recovery checkpoint is. The beginning of that segment contains pages with the old timeline ID, and we don't accept that in recovery unless we find a history file covering the old timeline ID. If you omit pg_xlog from the base backup and clear the archive directory before starting the backup, there will be no such history file available. The bug is present in all versions since PITR was introduced in 8.0, but I'm back-patching only back to 8.2. Earlier versions didn't have XLOG switch records, making this fix unfeasible. Given the lack of reports until now, it doesn't seem worthwhile to spend more effort to fix 8.0 and 8.1. Per report and suggestion by Mikael Krantz	2009-05-07 11:25:25 +00:00
Tom Lane	8d4f2ecd41	Change the default value of max_prepared_transactions to zero, and add documentation warnings against setting it nonzero unless active use of prepared transactions is intended and a suitable transaction manager has been installed. This should help to prevent the type of scenario we've seen several times now where a prepared transaction is forgotten and eventually causes severe maintenance problems (or even anti-wraparound shutdown). The only real reason we had the default be nonzero in the first place was to support regression testing of the feature. To still be able to do that, tweak pg_regress to force a nonzero value during "make check". Since we cannot force a nonzero value in "make installcheck", add a variant regression test "expected" file that shows the results that will be obtained when max_prepared_transactions is zero. Also, extend the HINT messages for transaction wraparound warnings to mention the possibility that old prepared transactions are causing the problem. All per today's discussion.	2009-04-23 00:23:46 +00:00
Heikki Linnakangas	bae8102f52	After archive recovery, mark the last WAL segment from the parent timeline ready for archival. It was marked at the next checkpoint anyway, but waiting for the next checkpoint is an unnecessary delay. Fujii Masao	2009-04-22 19:51:12 +00:00
Tom Lane	387060951e	Add an optional parameter to pg_start_backup() that specifies whether to do the checkpoint in immediate or lazy mode. This is to address complaints that pg_start_backup() takes a long time even when there's no need to minimize its I/O consumption.	2009-04-07 00:31:26 +00:00
Bruce Momjian	0e550ff617	Revert DTrace patch from Robert Lor	2009-04-02 20:59:10 +00:00
Bruce Momjian	227f817c1f	Add support for additional DTrace probes. Robert Lor	2009-04-02 19:14:34 +00:00
Tom Lane	e04810e8c4	Code review for dtrace probes added (so far) to 8.4. Adjust placement of some bufmgr probes, take out redundant and memory-leak-inducing path arguments to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to recalculate space used in sort__done, clean up formatting in places where I'm not sure pgindent will do a nice job by itself.	2009-03-11 23:19:25 +00:00
Heikki Linnakangas	fb7df896fc	Reload config file in startup process on SIGHUP. Fujii Masao	2009-03-04 13:56:40 +00:00
Heikki Linnakangas	bc134d7a51	Change the signaling of end-of-recovery. Startup process now indicates end of recovery by exiting with exit code 0, like in previous releases. Per Tom's suggestion.	2009-02-23 09:28:50 +00:00
Heikki Linnakangas	cdd46c7654	Start background writer during archive recovery. Background writer now performs its usual buffer cleaning duties during archive recovery, and it's responsible for performing restartpoints. This requires some changes in postmaster. When the startup process has done all the initialization and is ready to start WAL redo, it signals the postmaster to launch the background writer. The postmaster is signaled again when the point in recovery is reached where we know that the database is in consistent state. Postmaster isn't interested in that at the moment, but that's the point where we could let other backends in to perform read-only queries. The postmaster is signaled a third time when the recovery has ended, so that postmaster knows that it's safe to start accepting connections. The startup process now traps SIGTERM, and performs a "clean" shutdown. If you do a fast shutdown during recovery, a shutdown restartpoint is performed, like a shutdown checkpoint, and postmaster kills the processes cleanly. You still have to continue the recovery at next startup, though. Currently, the background writer is only launched during archive recovery. We could launch it during crash recovery as well, but it seems better to keep that codepath as simple as possible, for the sake of robustness. And it couldn't do any restartpoints during crash recovery anyway, so it wouldn't be that useful. log_restartpoints is gone. Use log_checkpoints instead. This is yet to be documented. This whole operation is a pre-requisite for Hot Standby, but has some value of its own whether the hot standby patch makes 8.4 or not. Simon Riggs, with lots of modifications by me.	2009-02-18 15:58:41 +00:00
Heikki Linnakangas	b75b66332a	Fix obsolete comment. Zdenek Kotala	2009-02-07 10:49:36 +00:00
Heikki Linnakangas	9187cedd7c	Put back fast-path for the case that there's no backup blocks in RestoreBkpBlocks. Went missing in my recent refactoring patch, as pointed out by Simon's hot standby patch.	2009-01-23 11:19:34 +00:00
Heikki Linnakangas	b2a667b9ee	Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should be used instead of the normal exclusive lock, and make WAL redo functions responsible for calling RestoreBkpBlocks(). They know better what kind of a lock they need. At the moment, this just moves things around with no functional change, but makes the hot standby patch that's under review cleaner.	2009-01-20 18:59:37 +00:00
Tom Lane	1a37056a74	Re-enable the old code in xlog.c that tried to use posix_fadvise(), so that we can get some buildfarm feedback about whether that function is still problematic. (Note that the planned async-preread patch will not really prove anything one way or the other in buildfarm testing, since it will be inactive with default GUC settings.)	2009-01-11 18:02:17 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Bruce Momjian	4ee79fd20d	Change the name of dtrace wal tracepoints: TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY Robert Lor	2008-12-24 20:41:29 +00:00
Bruce Momjian	5a90bc1fbe	The attached patch contains a couple of fixes in the existing probes and includes a few new ones. - Fixed compilation errors on OS X for probes that use typedefs - Fixed a number of probes to pass ForkNumber per the relation forks patch - The new probes are those that were taken out from the previous submitted patch and required simple fixes. Will submit the other probes that may require more discussion in a separate patch. Robert Lor	2008-12-17 01:39:04 +00:00
Tom Lane	17dc173660	To reduce confusion over whether VACUUM FULL is needed for anti-wraparound vacuuming (it's not), say "database-wide VACUUM" instead of "full-database VACUUM" in the relevant hint messages. Also, document the permissions needed to do this. Per today's discussion.	2008-12-11 18:16:18 +00:00
Heikki Linnakangas	dea81a6cf6	Revert SIGUSR1 multiplexing patch, per Tom's objection.	2008-12-09 15:59:39 +00:00
Heikki Linnakangas	7b05b3fa39	Provide support for multiplexing SIGUSR1 signal. The upcoming synchronous replication patch needs a signal, but we've already used SIGUSR1 and SIGUSR2 in normal backends. This patch allows reusing SIGUSR1 for that, and for other purposes too if the need arises.	2008-12-09 14:28:20 +00:00
Alvaro Herrera	7b640b0345	Fix a couple of snapshot management bugs in the new ResourceOwner world: non-writable large objects need to have their snapshots registered on the transaction resowner, not the current portal's, because it must persist until the large object is closed (which the portal does not). Also, ensure that the serializable snapshot is recorded by the transaction resource owner too, even when a subtransaction has changed the current resource owner before serializable is taken. Per bug reports from Pavan Deolasee.	2008-12-04 14:51:02 +00:00
Heikki Linnakangas	608195a3a3	Introduce visibility map. The visibility map is a bitmap with one bit per heap page, where a set bit indicates that all tuples on the page are visible to all transactions, and the page therefore doesn't need vacuuming. It is stored in a new relation fork. Lazy vacuum uses the visibility map to skip pages that don't need vacuuming. Vacuum is also responsible for setting the bits in the map. In the future, this can hopefully be used to implement index-only-scans, but we can't currently guarantee that the visibility map is always 100% up-to-date. In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on each heap page, also indicating that all tuples on the page are visible to all transactions. It's important that this flag is kept up-to-date. It is also used to skip visibility tests in sequential scans, which gives a small performance gain on seqscans.	2008-12-03 13:05:22 +00:00
Heikki Linnakangas	b457b2a24e	If pg_stop_backup() is called just after switching to a new xlog file, wait for the previous instead of the new file to be archived. Based on patch by Simon Riggs.	2008-12-03 08:20:11 +00:00
Heikki Linnakangas	9858a8c81c	Rely on relcache invalidation to update the cached size of the FSM.	2008-11-26 17:08:58 +00:00
Heikki Linnakangas	3396000684	Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map.	2008-11-19 10:34:52 +00:00
Tom Lane	cad3a26a95	Fix sloppy omission of now-required #include's.	2008-11-11 14:17:02 +00:00
Heikki Linnakangas	7e8b0b9ab1	Change error messages to print the physical path, like "base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1", per Alvaro's suggestion. I didn't change the messages in the higher-level index, heap and FSM routines, though, where the fork is implicit.	2008-11-11 13:19:16 +00:00
Tom Lane	1d577f5e49	Add a startup check that pg_xlog and pg_xlog/archive_status exist. If the latter doesn't exist, automatically recreate it. (We don't do this for pg_xlog, though, per discussion.) Jonah Harris	2008-11-09 17:51:15 +00:00
Alvaro Herrera	4ff0468371	Fix silly typo in previous commit.	2008-11-03 19:26:07 +00:00
Alvaro Herrera	d698bf83d1	Fix TransactionIdSetStatusBit so that it doesn't try to change a transaction from COMMITTED to SUBCOMMITTED during recovery. This wasn't previously possible, but it is now due to the recent changes on clog commit protocol for subtransactions. Simon Riggs	2008-11-03 19:24:03 +00:00
Alvaro Herrera	b107299c40	Fix mistakes in comment headers	2008-11-03 15:10:17 +00:00
Tom Lane	d7112cfa88	Remove the last vestiges of the MAKE_PTR/MAKE_OFFSET mechanism. We haven't allowed different processes to have different addresses for the shmem segment in quite a long time, but there were still a few places left that used the old coding convention. Clean them up to reduce confusion and improve the compiler's ability to detect pointer type mismatches. Kris Jurka	2008-11-02 21:24:52 +00:00
Heikki Linnakangas	19c8dc839b	Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function, that takes the strategy and mode as argument. There's three modes, RBM_NORMAL which is the default used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happen if we crash after extending an FSM file, and the new page is "torn". Add fork number to some error messages in bufmgr.c, that still lacked it.	2008-10-31 15:05:00 +00:00
Tom Lane	2314baef38	Fix recoveryLastXTime logic so that it actually does what one would expect. Per gripe from Kevin Grittner. Backpatch to 8.3, where the bug was introduced.	2008-10-30 04:06:16 +00:00
Alvaro Herrera	97227e9ec0	These functions no longer return a value, per complaint from gothic_moth via Zdenek Kotala.	2008-10-20 20:38:24 +00:00
Alvaro Herrera	06da3c570f	Rework subtransaction commit protocol for hot standby. This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog during their commit; instead they remain in-progress until main transaction commit. At main transaction commit, the commit protocol is atomic-by-page instead of one transaction at a time. To avoid a race condition with some subtransactions appearing committed before others in the case where they span more than one pg_clog page, we conserve the logic that marks them subcommitted before marking the parent committed. Simon Riggs with minor help from me	2008-10-20 19:18:18 +00:00
Heikki Linnakangas	15c121b3ed	Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the free space information is stored in a dedicated FSM relation fork, with each relation (except for hash indexes; they don't use FSM). This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any trace of them from the backend, initdb, and documentation. Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also introduce a new variant of the get_raw_page(regclass, int4, int4) function in contrib/pageinspect that lets you return pages from any relation fork, and a new fsm_page_contents() function to inspect the new FSM pages.	2008-09-30 10:52:14 +00:00
Heikki Linnakangas	61d9674988	Make LC_COLLATE and LC_CTYPE database-level settings. Collation and ctype are now more like encoding, stored in new datcollate and datctype columns in pg_database. This is a stripped-down version of Radek Strnad's patch, with further changes by me.	2008-09-23 09:20:39 +00:00
Tom Lane	ead21631e8	Fix a couple of problems pointed out by Fujii Masao in the 2008-Apr-05 patch for pg_stop_backup. First, it is possible that the history file name is not alphabetically later than the last WAL file name, so we should explicitly check that both have been archived. Second, the previous coding would wait forever if a checkpoint had managed to remove the WAL file before we look for it. Simon Riggs, plus some code cleanup by me.	2008-09-08 16:42:15 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00
Alvaro Herrera	e36e6b1cab	Add a few more DTrace probes to the backend. Robert Lor	2008-08-01 13:16:09 +00:00
Tom Lane	9d035f4254	Clean up the use of some page-header-access macros: principally, use SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that makes the code clearer, and avoid casting between Page and PageHeader where possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas. I did not apply the parts of the proposed patch that would have resulted in slightly changing the on-disk format of hash indexes; it seems to me that's not a win as long as there's any chance of having in-place upgrade for 8.4.	2008-07-13 20:45:47 +00:00
Bruce Momjian	6b797c852b	Fix recovery.conf boolean variables to take the same range of string values as postgresql.conf.	2008-06-30 22:10:43 +00:00
Alvaro Herrera	a3540b0f65	Improve our #include situation by moving pointer types away from the corresponding struct definitions. This allows other headers to avoid including certain highly-loaded headers such as rel.h and relscan.h, instead using just relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less unnecessary dependencies.	2008-06-19 00:46:06 +00:00
Heikki Linnakangas	a213f1ee6c	Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation forks. XLogOpenRelation() and the associated light-weight relation cache in xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument, instead of Relation. For functions that still need a Relation struct during WAL replay, there's a new function called CreateFakeRelcacheEntry() that returns a fake entry like XLogOpenRelation() used to.	2008-06-12 09:12:31 +00:00
Alvaro Herrera	cc87402d6e	Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is more logical that way, and also it reduces the amount of unnecessary includes in bufpage.h, which is widely used. Zdenek Kotala. My previous patch to bufpage.h should also have credited him as author, but I forgot (sorry about that).	2008-06-08 22:00:48 +00:00
Magnus Hagander	8eee526c19	Set hidden field for guc enum missed in previous commit.	2008-05-28 15:22:05 +00:00
Heikki Linnakangas	50ff07d5b1	Remove arbitrary 10MB limit on two-phase state file size. It's not that hard to go beyond 10MB, as demonstrated by Gavin Sharry's example of dropping a schema with ~25000 objects. The really bogus thing about the limit was that it was enforced when a state file was read in, not when it was written, so you would end up with a prepared transaction that you can't commit or abort, and the only recourse was to shut down the server and remove the file by hand. Raise the limit to MaxAllocSize, and enforce it also when a state file is written. We could've removed the limit altogether, but reading in a file larger than MaxAllocSize would fail anyway because we read it into a palloc'd buffer. Backpatch down to 8.1, where 2PC and this issue was introduced.	2008-05-19 18:16:26 +00:00
Tom Lane	1a604b4e31	Fix a subtle bug exposed by recent wal_sync_method rearrangements. Formerly, the default value of wal_sync_method was determined inside xlog.c, but now it is determined inside guc.c. guc.c was reading xlogdefs.h without having read <fcntl.h>, leading to wrong determination of DEFAULT_SYNC_METHOD. Obviously xlogdefs.h needs to include <fcntl.h> for itself to ensure stable results.	2008-05-17 17:24:57 +00:00
Tom Lane	8a2f5d221b	Reduce unnecessary PANIC to ERROR, improve a couple of comments.	2008-05-16 19:15:05 +00:00
Magnus Hagander	9bf1db04c0	Remove the special variable for open_sync_bit used in O_SYNC and O_DSYNC modes, replacing it with a call to a function that derives it from the sync_method variable, now that it has distinct values for these two cases. This means that assign_xlog_sync_method() no longer changes any settings, thus fixing the bug introduced in the change to use a guc enum for wal_sync_method.	2008-05-14 14:02:57 +00:00
Magnus Hagander	72e2db86b9	Don't try to close negative file descriptors, since this can cause crashes on certain platforms. In particular, the MSVC runtime is known to do this. Fixes bug #4162, reported and diagnosed by Javier Pimas	2008-05-13 20:53:52 +00:00
Alvaro Herrera	5da9da71c4	Improve snapshot manager by keeping explicit track of snapshots. There are two ways to track a snapshot: there's the "registered" list, which is used for arbitrary long-lived snapshots; and there's the "active stack", which is used for the snapshot that is considered "active" at any time. This also allows users of snapshots to stop worrying about snapshot memory allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot assignment. This is all done automatically now. As a consequence, this allows us to reset MyProc->xmin when there are no more snapshots registered in the current backend, reducing the impact that long-running transactions have on VACUUM.	2008-05-12 20:02:02 +00:00
Magnus Hagander	aa82790fca	Fix breakage by the wal_sync_method patch in installations that use O_DSYNC (specifically this broke all the Windows buildfarm members)	2008-05-12 19:45:23 +00:00
Alvaro Herrera	9084399782	Put back bufmgr.h in bufpage.h -- it is needed by some macros. Remove #include bufmgr.h from (most?) source files which already include bufpage.h.	2008-05-12 16:06:10 +00:00
Magnus Hagander	2739a4e1d2	Report which WAL sync method we are trying to change to when it fails, not which one we had before (that worked, and thus is completely irrelevant)	2008-05-12 14:27:47 +00:00
Magnus Hagander	f99760c19f	Convert wal_sync_method to guc enum.	2008-05-12 08:35:05 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Heikki Linnakangas	c5f42ce8d5	Fix Assert introduced in previous patch.	2008-05-09 15:27:17 +00:00
Heikki Linnakangas	f0eb3e5e58	Fix incorrect archive truncation point calculation in the %r recovery_command parameter. This fixes bug 4137 reported by Wojciech Strzalka, where a WAL file is deleted too early when starting the recovery of a warm standby server. Also add a sanity check in pg_standby so that it will refuse to delete anything earlier than the file being restored, and improve the debug message in case nothing is deleted. Simon Riggs. Backpatch to 8.3, which is where %r was introduced.	2008-05-09 14:27:47 +00:00
Magnus Hagander	380d1ee69e	Update error messages, per notes from Tom. Laurenz Albe	2008-04-24 14:23:43 +00:00
Magnus Hagander	c979a1fefa	Prevent shutdown in normal mode if online backup is running, and have pg_ctl warn about this. Cancel running online backups (by renaming the backup_label file, thus rendering the backup useless) when shutting down in fast mode. Laurenz Albe	2008-04-23 13:44:59 +00:00
Tom Lane	8472bf7a73	Allow float8, int8, and related datatypes to be passed by value on machines where Datum is 8 bytes wide. Since this will break old-style C functions (those still using version 0 calling convention) that have arguments or results of these types, provide a configure option to disable it and retain the old pass-by-reference behavior. Likewise, provide a configure option to disable the recently-committed float4 pass-by-value change. Zoltan Boszormenyi, plus configurability stuff by me.	2008-04-21 00:26:47 +00:00
Tom Lane	d1cbd26ded	Repair two places where SIGTERM exit could leave shared memory state corrupted. (Neither is very important if SIGTERM is used to shut down the whole database cluster together, but there's a problem if someone tries to SIGTERM individual backends.) To do this, introduce new infrastructure macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care of transiently pushing an on_shmem_exit cleanup hook. Also use this method for createdb cleanup --- that wasn't a shared-memory-corruption problem, but SIGTERM abort of createdb could leave orphaned files lying around. Backpatch as far as 8.2. The shmem corruption cases don't exist in 8.1, and the createdb usage doesn't seem important enough to risk backpatching further.	2008-04-16 23:59:40 +00:00
Bruce Momjian	2a1cf97c22	Have pg_stop_backup() wait for all archive files to be sent, rather than returning right away. This guarantees that when pg_stop_backup() returns, you have a valid backup. Simon Riggs	2008-04-05 01:34:06 +00:00
Alvaro Herrera	78f02ca1f5	Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files. Per complaint from Tom Lane.	2008-03-26 18:48:59 +00:00
Alvaro Herrera	d43b085d57	Separate snapshot management code from tuple visibility code, create a snapmgmt.c file for the former. The header files have also been reorganized in three parts: the most basic snapshot definitions are now in a new file snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c. tqual.h has been reduced to the bare minimum. This patch is just a first step towards managing live snapshots within a transaction; there is no functionality change. Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and subsequent discussion.	2008-03-26 16:20:48 +00:00
Tom Lane	220db7ccd8	Simplify and standardize conversions between TEXT datums and ordinary C strings. This patch introduces four support functions cstring_to_text, cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and two macros CStringGetTextDatum and TextDatumGetCString. A number of existing macros that provided variants on these themes were removed. Most of the places that need to make such conversions now require just one function or macro call, in place of the multiple notational layers that used to be needed. There are no longer any direct calls of textout or textin, and we got most of the places that were using handmade conversions via memcpy (there may be a few still lurking, though). This commit doesn't make any serious effort to eliminate transient memory leaks caused by detoasting toasted text objects before they reach text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few places where it was easy, but much more could be done. Brendan Jurd and Tom Lane	2008-03-25 22:42:46 +00:00
Bruce Momjian	fca9fff41b	More README src cleanups.	2008-03-21 13:23:29 +00:00
Bruce Momjian	4e228447aa	Make source code READMEs more consistent. Add CVS tags to all README files.	2008-03-20 17:55:15 +00:00
Peter Eisentraut	a7b7b07af3	Enable probes to work with Mac OS X Leopard and other OSes that will support DTrace in the future. Switch from using DTRACE_PROBEn macros to the dynamically generated macros. Use "dtrace -h" to create a header file that contains the dynamically generated macros to be used in the source code instead of the DTRACE_PROBEn macros. A dummy header file is generated for builds without DTrace support. Author: Robert Lor <Robert.Lor@sun.com>	2008-03-17 19:44:41 +00:00
Tom Lane	32846f8152	Fix TransactionIdIsCurrentTransactionId() to use binary search instead of linear search when checking child-transaction XIDs. This makes for an important speedup in transactions that have large numbers of children, as in a recent example from Craig Ringer. We can also get rid of an ugly kluge that represented lists of TransactionIds as lists of OIDs. Heikki Linnakangas	2008-03-17 02:18:55 +00:00
Tom Lane	611b4393f2	Make TransactionIdIsInProgress check transam.c's single-item XID status cache before it goes groveling through the ProcArray. In situations where the same recently-committed transaction ID is checked repeatedly by tqual.c, this saves a lot of shared-memory searches. And it's cheap enough that it shouldn't hurt noticeably when it doesn't help. Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.	2008-03-11 20:20:35 +00:00
Tom Lane	2fc2795456	Remove no-longer-used XLogCacheByte field of XLogCtl. Itagaki Takahiro	2008-03-10 02:13:22 +00:00
Tom Lane	7d6e6e2e97	Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a temporary table; we can't support that because there's no way to clean up the source backend's internal state if the eventual COMMIT PREPARED is done by another backend. This was checked correctly in 8.1 but I broke it in 8.2 :-(. Patch by Heikki Linnakangas, original trouble report by John Smith.	2008-03-04 19:54:06 +00:00
Peter Eisentraut	0474dcb608	Refactor backend makefiles to remove lots of duplicate code	2008-02-19 10:30:09 +00:00
Tom Lane	cd00406774	Replace time_t with pg_time_t (same values, but always int64) in on-disk data structures and backend internal APIs. This solves problems we've seen recently with inconsistent layout of pg_control between machines that have 32-bit time_t and those that have already migrated to 64-bit time_t. Also, we can get out from under the problem that Windows' Unix-API emulation is not consistent about the width of time_t. There are a few remaining places where local time_t variables are used to hold the current or recent result of time(NULL). I didn't bother changing these since they do not affect any cross-module APIs and surely all platforms will have 64-bit time_t before overflow becomes an actual risk. time_t should be avoided for anything visible to extension modules, however.	2008-02-17 02:09:32 +00:00
Peter Eisentraut	6f8f8d2daa	Provide a clearer error message if the pg_control version number looks wrong because of mismatched byte ordering.	2008-01-21 11:17:46 +00:00
Tom Lane	ac12412ede	Revise memory management for libxml calls. Instead of keeping libxml's data in whichever context happens to be current during a call of an xml.c function, use a dedicated context that will not go away until we explicitly delete it (which we do at transaction end or subtransaction abort). This makes recovery after an error much simpler --- we don't have to individually delete the data structures created by libxml. Also, we need to initialize and cleanup libxml only once per transaction (if there's no error) instead of once per function call, so it should be a bit faster. We'll need to keep an eye out for intra-transaction memory leaks, though. Alvaro and Tom.	2008-01-15 18:57:00 +00:00
Tom Lane	eedb068c0a	Make standard maintenance operations (including VACUUM, ANALYZE, REINDEX, and CLUSTER) execute as the table owner rather than the calling user, using the same privilege-switching mechanism already used for SECURITY DEFINER functions. The purpose of this change is to ensure that user-defined functions used in index definitions cannot acquire the privileges of a superuser account that is performing routine maintenance. While a function used in an index is supposed to be IMMUTABLE and thus not able to do anything very interesting, there are several easy ways around that restriction; and even if we could plug them all, there would remain a risk of reading sensitive information and broadcasting it through a covert channel such as CPU usage. To prevent bypassing this security measure, execution of SET SESSION AUTHORIZATION and SET ROLE is now forbidden within a SECURITY DEFINER context. Thanks to Itagaki Takahiro for reporting this vulnerability. Security: CVE-2007-6600	2008-01-03 21:23:15 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Tom Lane	895a94de6d	Avoid incrementing the CommandCounter when CommandCounterIncrement is called but no database changes have been made since the last CommandCounterIncrement. This should result in a significant improvement in the number of "commands" that can typically be performed within a transaction before hitting the 2^32 CommandId size limit. In particular this buys back (and more) the possible adverse consequences of my previous patch to fix plan caching behavior. The implementation requires tracking whether the current CommandCounter value has been "used" to mark any tuples. CommandCounter values stored into snapshots are presumed not to be used for this purpose. This requires some small executor changes, since the executor used to conflate the curcid of the snapshot it was using with the command ID to mark output tuples with. Separating these concepts allows some small simplifications in executor APIs. Something for the TODO list: look into having CommandCounterIncrement not do AcceptInvalidationMessages. It seems fairly bogus to be doing it there, but exactly where to do it instead isn't clear, and I'm disinclined to mess with asynchronous behavior during late beta.	2007-11-30 21:22:54 +00:00
Bruce Momjian	f639df0d61	Small comment spacing improvement.	2007-11-16 01:51:22 +00:00
Bruce Momjian	7d4c99b414	Fix pgindent to properly handle 'else' and single-line comments on the same line; previous fix was only partial. Re-run pgindent on files that need it.	2007-11-15 23:23:44 +00:00
Bruce Momjian	f6e8730d11	Re-run pgindent with updated list of typedefs. (Updated README should avoid this problem in the future.)	2007-11-15 22:25:18 +00:00
Peter Eisentraut	b30769ee54	When logging the recovery.conf parameters, show them quoted as they would appear in the configuration file.	2007-11-15 22:02:12 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	6cc4451b5c	Prevent re-use of a deleted relation's relfilenode until after the next checkpoint. This guards against an unlikely data-loss scenario in which we re-use the relfilenode, then crash, then replay the deletion and recreation of the file. Even then we'd be OK if all insertions into the new relation had been WAL-logged ... but that's not guaranteed given all the no-WAL-logging optimizations that have recently been added. Patch by Heikki Linnakangas, per a discussion last month.	2007-11-15 20:36:40 +00:00
Bruce Momjian	82748bc253	Reduce error level of ROLLBACK outside a transaction from WARNING to NOTICE.	2007-11-10 14:36:44 +00:00
Alvaro Herrera	745c1b2c2a	Rearrange vacuum-related bits in PGPROC as a bitmask, to better support having several of them. Add two more flags: whether the process is executing an ANALYZE, and whether a vacuum is for Xid wraparound (which is obviously only set by autovacuum). Sneakily move the worker's recently-acquired PostAuthDelay to a more useful place.	2007-10-24 20:55:36 +00:00
Tom Lane	5c8eb929e6	When telling the bgwriter that we need a checkpoint because too much xlog has been consumed, recheck against the latest value of RedoRecPtr before really sending the signal. This avoids useless checkpoint activity if XLogWrite is executed when we have a very stale local copy of RedoRecPtr. The potential for useless checkpoint is very much worse in 8.3 because of the walwriter process (which never does XLogInsert), so while this behavior was intentional, it needs to be changed. Per report from Itagaki Takahiro.	2007-10-12 19:39:59 +00:00
Tom Lane	ab051bd293	Adjust recovery PS display as agreed with Simon: 'waiting for XXX' while the restore_command does its thing, then 'recovering XXX' while processing the segment file. These operations are heavyweight enough that an extra PS display set shouldn't bother anyone.	2007-09-30 17:28:56 +00:00
Tom Lane	77ccbe64dd	Make recovery show the current input WAL segment name in the startup process' PS display. After a suggestion by Simon (not exactly his patch though).	2007-09-29 18:32:56 +00:00
Tom Lane	b46bd55a6c	Make archive recovery always start a new timeline, rather than only when a recovery stop time was used. This avoids a corner-case risk of trying to overwrite an existing archived copy of the last WAL segment, and seems simpler and cleaner all around than the original definition. Per example from Jon Colverson and subsequent analysis by Simon.	2007-09-29 01:36:10 +00:00
Tom Lane	f18dfc4835	Minor improvements in backup and recovery: - create a separate archive_mode GUC, on which archive_command is dependent - %r option in recovery.conf sends last restartpoint to recovery command - %r used in pg_standby, updated README - minor other code cleanup in pg_standby - doc on Warm Standby now mentions pg_standby and %r - log_restartpoints recovery option emits LOG message at each restartpoint - end of recovery now displays last transaction end time, as requested by Warren Little; also shown at each restartpoint - restart archiver if needed to carry away WAL files at shutdown Simon Riggs	2007-09-26 22:36:30 +00:00
Tom Lane	bd0af827da	Fix comments that misspelled TransactionIdIsInProgress, per Heikki.	2007-09-21 16:32:19 +00:00
Tom Lane	ef4d38c86c	Rename recently-added pg_stat_activity column from txn_start to xact_start, for consistency with other column names such as in pg_stat_database.	2007-09-11 03:28:05 +00:00
Tom Lane	6bd4f401b0	Replace the former method of determining snapshot xmax --- to wit, calling ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is updated during transaction commit or abort. Since latestCompletedXid is written only in places that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes held across I/O this can be a significant win. Some preliminary benchmarking suggested that this patch has no effect on average throughput but can significantly improve the worst-case transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.	2007-09-08 20:31:15 +00:00
Tom Lane	0a51e7073c	Don't take ProcArrayLock while exiting a transaction that has no XID; there is no need for serialization against snapshot-taking because the xact doesn't affect anyone else's snapshot anyway. Per discussion. Also, move various info about the interlocking of transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion in access/transam/README. Also, remove a couple of now-obsolete comments about having to force some WAL to be written to persuade RecordTransactionCommit to do its thing.	2007-09-07 20:59:26 +00:00
Tom Lane	4bf2dfb9a2	Quick hack to make the VXID of a prepared transaction be -1/XID, so that different prepared xacts can be told apart in the pg_locks view. Per suggestion from Florian.	2007-09-05 20:53:17 +00:00
Tom Lane	295e63983d	Implement lazy XID allocation: transactions that do not modify any database rows will normally never obtain an XID at all. We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID-assignments, as from reduction of overhead that's driven by the rate of XID consumption. We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them. Florian Pflug, with some editorialization by Tom.	2007-09-05 18:10:48 +00:00
Tom Lane	2abae34a2e	Implement function-local GUC parameter settings, as per recent discussion. There are still some loose ends: I didn't do anything about the SET FROM CURRENT idea yet, and it's not real clear whether we are happy with the interaction of SET LOCAL with function-local settings. The documentation is a bit spartan, too.	2007-09-03 00:39:26 +00:00
Tom Lane	a52e4408b9	Add a debug logging message when a resource manager rejects an attempted restart point. Per suggestion from Simon Riggs.	2007-08-28 23:17:47 +00:00
Tom Lane	647fd9a108	Fix two bugs induced in VACUUM FULL by async-commit patch. First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be settable, because clog.c's inexact LSN bookkeeping results in windows where a previously flushed transaction is considered unhintable because it shares an LSN slot with a later unflushed transaction. But repair_frag requires XMIN_COMMITTED to be correct so that it can distinguish tuples moved by the current vacuum. Since not being able to set the bit is an uncommon corner case, the most practical way of dealing with it seems to be to abandon shrinking (ie, don't invoke repair_frag) when we find a non-dead tuple whose XMIN_COMMITTED bit couldn't be set. Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not get its XMAX_COMMITTED bit set during scan_heap. But by the time repair_frag examines the tuple it might be possible to set the bit. We therefore must take buffer content lock when calling HeapTupleSatisfiesVacuum a second time, else we can get an Assert failure in SetBufferCommitInfoNeedsSave. This latter bug is latent in existing releases, but I think it cannot actually occur without async commit, since the first HeapTupleSatisfiesVacuum call should always have set the bit. So I'm not going to back-patch it. In passing, reduce the existing "cannot shrink relation" messages from NOTICE to LOG level. The new message must be no higher than LOG if we don't want unpredictable regression test failures, and consistency seems like a good idea. Also arrange that only one such message is reported per VACUUM FULL; in typical scenarios you could get spammed with many such messages, which seems a bit useless.	2007-08-13 19:08:26 +00:00
Tom Lane	bdd6b62245	Switch over to using the src/timezone functions for formatting timestamps displayed in the postmaster log. This avoids Windows-specific problems with localized time zone names that are in the wrong encoding, and generally seems like a good idea to forestall other potential platform-dependent issues. To preserve the existing behavior that all backends will log in the same time zone, create a new GUC variable log_timezone that can only be changed on a system-wide basis, and reference log-related calculations to that zone instead of the TimeZone variable. This fixes the issue reported by Hiroshi Saito that timestamps printed by xlog.c startup could be improperly localized on Windows. We still need a simpler patch for that problem in the back branches, however.	2007-08-04 01:26:54 +00:00

1217 Commits