postgresql

Commit Graph

Author	SHA1	Message	Date
Heikki Linnakangas	1a956481ba	Fix typo in comment, and reword it slightly while we're at it.	2012-10-04 10:35:48 +03:00
Heikki Linnakangas	93b6d78cf0	Add #includes needed on some platforms in the new files. Hopefully this makes the *BSD buildfarm animals happy.	2012-10-02 17:19:52 +03:00
Heikki Linnakangas	d5497b95f3	Split off functions related to timeline history files and XLOG archiving. This is just refactoring, to make the functions accessible outside xlog.c. A followup patch will make use of that, to allow fetching timeline history files over streaming replication.	2012-10-02 13:37:19 +03:00
Heikki Linnakangas	ab9a14e903	Fix WAL file replacement during cascading replication on Windows. When the startup process restores a WAL file from the archive, it deletes any old file with the same name and renames the new file in its place. On Windows, however, when a file is deleted, it still lingers as long as a process holds a file handle open on it. With cascading replication, a walsender process can hold the old file open, so the rename() in the startup process would fail. To fix that, rename the old file to a temporary name, to make the original file name available for reuse, before deleting the old file.	2012-09-05 18:52:12 -07:00
Tom Lane	2e0cc1f031	Fix inappropriate error messages for Hot Standby misconfiguration errors. Give the correct name of the GUC parameter being complained of. Also, emit a more suitable SQLSTATE (INVALID_PARAMETER_VALUE, not the default INTERNAL_ERROR). Gurjeet Singh, errcode adjustment by me	2012-09-05 21:49:08 -04:00
Heikki Linnakangas	358ff99d70	Fix compiler warnings about unused variables, caused by my previous commit. Reported by Peter Eisentraut.	2012-09-04 22:07:35 -07:00
Heikki Linnakangas	c4c227477b	Fix bugs in cascading replication with recovery_target_timeline='latest'. The cascading replication code assumed that the current RecoveryTargetTLI never changes, but that's not true with recovery_target_timeline='latest'. The obvious upshot of that is that RecoveryTargetTLI in shared memory needs to be protected by a lock. A less obvious consequence is that when a cascading standby is connected, and the standby switches to a new target timeline after scanning the archive, it will continue to stream WAL to the cascading standby, but from a wrong file, ie. the file of the previous timeline. For example, if the standby is currently streaming from the middle of file 000000010000000000000005, and the timeline changes, the standby will continue to stream from that file. However, the WAL on the new timeline is in file 000000020000000000000005, so the standby sends garbage from 000000010000000000000005 to the cascading standby, instead of the correct WAL from file 000000020000000000000005. This also fixes a related bug where a partial WAL segment is restored from the archive and streamed to a cascading standby. The code assumed that when a WAL segment is copied from the archive, it can immediately be fully streamed to a cascading standby. However, if the segment is only partially filled, ie. has the right size, but only N first bytes contain valid WAL, that's not safe. That can happen if a partial WAL segment is manually copied to the archive, or if a partial WAL segment is archived because a server is started up on a new timeline within that segment. The cascading standby will get confused if the WAL it received is not valid, and will get stuck until it's restarted. This patch fixes that problem by not allowing WAL restored from the archive to be streamed to a cascading standby until it's been replayed, and thus validated.	2012-09-04 19:33:21 -07:00
Tom Lane	2a2352e07d	Replace memcpy() calls in xlog.c critical sections with struct assignments. This gets rid of a dangerous-looking use of the not-volatile XLogCtl pointer in a couple of spinlock-protected sections, where the normal coding rule is that you should only access shared memory through a pointer-to-volatile. I think the risk is only hypothetical not actual, since for there to be a bug the compiler would have to move the spinlock acquire or release across the memcpy() call, which one sincerely hopes it will not. Still, it looks cleaner this way. Per comment from Daniel Farina and subsequent discussion.	2012-09-03 15:39:15 -04:00
Tom Lane	10685ec082	Avoid somewhat-theoretical overflow risks in RecordIsValid(). This improves on commit `51fed14d73` by eliminating the assumption that we can form <some pointer value> + <some offset> without overflow. The entire point of those tests is that we don't trust the offset value, so coding them in a way that could wrap around if the buffer happens to be near the top of memory doesn't seem sound. Instead, track the remaining space as a size_t variable and compare offsets against that. Also, improve comment about why we need the extra early check on xl_tot_len.	2012-08-21 18:41:52 -04:00
Heikki Linnakangas	51fed14d73	Don't get confused if a WAL partial record header has xl_tot_len == 0. If a WAL record header was split across pages, but xl_tot_len was 0, we would get confused and conclude that we had already read the whole record, and proceed to CRC check it. That can lead to a crash in RecordIsValid(), which isn't careful to not read beyond end-of-record, as defined by xl_tot_len. Add an explicit sanity check for xl_tot_len <= SizeOfXlogRecord. Also, make RecordIsValid() more robust by checking in each step that it doesn't try to access memory beyond end of record, even if a length field in the record's or a backup block's header is bogus. Per report and analysis by Tom Lane.	2012-08-20 19:58:21 +03:00
Simon Riggs	8143a56854	Fix minor bug in XLogFileRead() that accidentally worked. Cascading replication copied the incoming file into pg_xlog but didn't set path correctly, so the first attempt to open file failed causing it to loop around and look for file in pg_xlog. So the earlier coding worked, but accidentally rather than by design. Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs	2012-08-08 21:25:23 +01:00
Simon Riggs	0f04fc67f7	fsync backup_label after pg_start_backup(). Dave Kerr	2012-08-07 16:19:13 +01:00
Tom Lane	4a9c30a8a1	Fix management of pendingOpsTable in auxiliary processes. mdinit() was misusing IsBootstrapProcessingMode() to decide whether to create an fsync pending-operations table in the current process. This led to creating a table not only in the startup and checkpointer processes as intended, but also in the bgwriter process, not to mention other auxiliary processes such as walwriter and walreceiver. Creation of the table in the bgwriter is fatal, because it absorbs fsync requests that should have gone to the checkpointer; instead they just sit in bgwriter local memory and are never acted on. So writes performed by the bgwriter were not being fsync'd which could result in data loss after an OS crash. I think there is no live bug with respect to walwriter and walreceiver because those never perform any writes of shared buffers; but the potential is there for future breakage in those processes too. To fix, make AuxiliaryProcessMain() export the current process's AuxProcType as a global variable, and then make mdinit() test directly for the types of aux process that should have a pendingOpsTable. Having done that, we might as well also get rid of the random bool flags such as am_walreceiver that some of the aux processes had grown. (Note that we could not have fixed the bug by examining those variables in mdinit(), because it's called from BaseInit() which is run by AuxiliaryProcessMain() before entering any of the process-type-specific code.) Back-patch to 9.2, where the problem was introduced by the split-up of bgwriter and checkpointer processes. The bogus pendingOpsTable exists in walwriter and walreceiver processes in earlier branches, but absent any evidence that it causes actual problems there, I'll leave the older branches alone.	2012-07-18 15:28:10 -04:00
Robert Haas	3cf39e6ddb	Fix a stupid bug I introduced into XLogFlush(). Commit `f11e8be3e8` broke this; it was right in Peter's original patch, but I messed it up before committing.	2012-07-02 15:33:59 -04:00
Robert Haas	3bb592bb20	Fix position of WalSndWakeupRequest call. This avoids discriminating against wal_sync_method = open_sync or open_datasync. Fujii Masao, reviewed by Andres Freund	2012-07-02 14:44:10 -04:00
Peter Eisentraut	2b44306315	Assorted message style improvements	2012-07-02 21:12:46 +03:00
Robert Haas	82cdd2df75	Work a little harder on comments for walsender wakeup patch. Per gripe from Tom Lane.	2012-07-02 11:28:53 -04:00
Robert Haas	f11e8be3e8	Make commit_delay much smarter. Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.	2012-07-02 10:26:31 -04:00
Robert Haas	f83b59997d	Make walsender more responsive. Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.	2012-07-02 09:41:01 -04:00
Heikki Linnakangas	567787f216	Validate xlog record header before enlarging the work area to store it. If the record header is garbled, we're now quite likely to notice it before we try to make a bogus memory allocation and run out of memory. That can still happen, if the xlog record is split across pages (we cannot verify the record header until reading the next page in that scenario), but this reduces the chances. An out-of-memory is treated as a corrupt record anyway, so this isn't a correctness issue, just a case of giving a better error message. Per Amit Kapila's suggestion.	2012-06-30 23:14:35 +03:00
Heikki Linnakangas	7a5c9ca93a	Initialize shared memory copy of ckptXidEpoch correctly when not in recovery. This bug was introduced by commit `20d98ab6e4`, so backpatch this to 9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi Pillessaar	2012-06-29 19:32:15 +03:00
Heikki Linnakangas	8f85667a86	Update outdated comment; xlp_rem_len field is in page header now. Spotted by Amit Kapila	2012-06-28 20:35:18 +03:00
Heikki Linnakangas	a8f97b39c7	Fix two more neglected comments, still referring to log/seg. Fujii Masao	2012-06-27 19:11:26 +03:00
Heikki Linnakangas	ec786c6c81	I neglected many comments in the log+seg -> 64-bit segno patch. Fix. Reported by Amit Kapila.	2012-06-27 17:53:53 +03:00
Heikki Linnakangas	a218e23a08	Oops. Remove stray paren. I didn't notice this on my laptop as I don't HAVE_FSYNC_WRITETHROUGH.	2012-06-24 20:03:57 +03:00
Heikki Linnakangas	0ab9d1c4b3	Replace XLogRecPtr struct with a 64-bit integer. This simplifies code that needs to do arithmetic on XLogRecPtrs. To avoid changing on-disk format of data pages, the LSN on data pages is still stored in the old format. That should keep pg_upgrade happy. However, we have XLogRecPtrs embedded in the control file, and in the structs that are sent over the replication protocol, so this change breaks compatibility of pg_basebackup and server. I didn't do anything about this in this patch; per discussion on -hackers, the right thing to do would be to change the replication protocol to be architecture-independent, so that you could use a newer version of pg_receivexlog, for example, against an older server version.	2012-06-24 19:19:45 +03:00
Heikki Linnakangas	061e7efb1b	Allow WAL record header to be split across pages. This saves a few bytes of WAL space, but the real motivation is to make it predictable how much WAL space a record requires, as it no longer depends on whether we need to waste the last few bytes at end of WAL page because the header doesn't fit. The total length field of WAL record, xl_tot_len, is moved to the beginning of the WAL record header, so that it is still always found on the first page where a WAL record begins. Bump WAL version number again as this is an incompatible change.	2012-06-24 18:35:56 +03:00
Heikki Linnakangas	20ba5ca64c	Move WAL continuation record information to WAL page header. The continuation record only contained one field, xl_rem_len, so it makes things simpler to just include it in the WAL page header. This wastes four bytes on pages that don't begin with a continuation from previous page, plus four bytes on every page, because of padding. The motivation of this is to make it easier to calculate how much space a WAL record needs. Before this patch, it depended on how many page boundaries the record crosses. The motivation of that, in turn, is to separate the allocation of space in the WAL from the copying of the record data to the allocated space. Keeping the calculation of space required simple helps to keep the critical section of allocating the space from WAL short. But that's not included in this patch yet. Bump WAL version number again, as this is an incompatible change.	2012-06-24 18:35:30 +03:00
Heikki Linnakangas	dfda6ebaec	Don't waste the last segment of each 4GB logical log file. The comments claimed that wasting the last segment made it easier to do calculations with XLogRecPtrs, because you don't have problems representing last-byte-position-plus-1 that way. In my experience, however, it only made things more complicated, because there were two ways to represent the boundary at the beginning of a logical log file: xlogid = n+1 and xrecoff = 0, or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were picky about which representation was used. Also, use a 64-bit segment number instead of the log/seg combination, to point to a certain WAL segment. We assume that all platforms have a working 64-bit integer type nowadays. This is an incompatible change in WAL format, so bumping WAL version number.	2012-06-24 18:35:29 +03:00
Tom Lane	b8b69d8990	Revert "Reduce checkpoints and WAL traffic on low activity database server". This reverts commit `18fb9d8d21`. Per discussion, it does not seem like a good idea to allow committed changes to go un-checkpointed indefinitely, as could happen in a low-traffic server; that makes us entirely reliant on the WAL stream with no redundancy that might aid data recovery in case of disk failure. This re-introduces the original problem of hot-standby setups generating a small continuing stream of WAL traffic even when idle, but there are other ways to address that without compromising crash recovery, so we'll revisit that issue in a future release cycle.	2012-06-13 18:48:44 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Simon Riggs	2c8a4e9be2	Wake WALSender to reduce data loss at failover for async commit. WALSender now woken up after each background flush by WALwriter, avoiding multi-second replication delay for an all-async commit workload. Replication delay reduced from 7s with default settings to 200ms and often much less, allowing significantly reduced data loss at failover. Andres Freund and Simon Riggs	2012-06-07 19:22:47 +01:00
Tom Lane	acd4c7d58b	Fix an issue in recent walwriter hibernation patch. Users of asynchronous-commit mode expect there to be a guaranteed maximum delay before an async commit's WAL records get flushed to disk. The original version of the walwriter hibernation patch broke that. Add an extra shared-memory flag to allow async commits to kick the walwriter out of hibernation mode, without adding any noticeable overhead in cases where no action is needed.	2012-05-08 23:06:40 -04:00
Tom Lane	5461564a9d	Reduce idle power consumption of walwriter and checkpointer processes. This patch modifies the walwriter process so that, when it has not found anything useful to do for many consecutive wakeup cycles, it extends its sleep time to reduce the server's idle power consumption. It reverts to normal as soon as it's done any successful flushes. It's still true that during any async commit, backends check for completed, unflushed pages of WAL and signal the walwriter if there are any; so that in practice the walwriter can get awakened and returned to normal operation sooner than the sleep time might suggest. Also, improve the checkpointer so that it uses a latch and a computed delay time to not wake up at all except when it has something to do, replacing a previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for reducing the server's power consumption when idle. In passing, get rid of the dedicated latch for signaling the walwriter in favor of using its procLatch, since that comports better with possible generic signal handlers using that latch. Also, fix a pre-existing bug with failure to save/restore errno in walwriter's signal handlers. Peter Geoghegan, somewhat simplified by Tom	2012-05-08 20:03:26 -04:00
Tom Lane	809e7e21af	Converge all SQL-level statistics timing values to float8 milliseconds. This patch adjusts the core statistics views to match the decision already taken for pg_stat_statements, that values representing elapsed time should be represented as float8 and measured in milliseconds. By using float8, we are no longer tied to a specific maximum precision of timing data. (Internally, it's still microseconds, but we could now change that without needing changes at the SQL level.) The columns affected are: pg_stat_bgwriter.checkpoint_write_time, pg_stat_bgwriter.checkpoint_sync_time, pg_stat_database.blk_read_time, pg_stat_database.blk_write_time, pg_stat_user_functions.total_time, pg_stat_user_functions.self_time, pg_stat_xact_user_functions.total_time, pg_stat_xact_user_functions.self_time. The first four of these are new in 9.2, so there is no compatibility issue from changing them. The others require a release note comment that they are now double precision (and can show a fractional part) rather than bigint as before; also their underlying statistics functions now match the column definitions, instead of returning bigint microseconds.	2012-04-30 14:03:33 -04:00
Robert Haas	0d2235a25b	Remove duplicate word in comment. Noted by Peter Geoghegan.	2012-04-30 13:14:46 -04:00
Robert Haas	5d4b60f2f2	Lots of doc corrections. Josh Kupershmidt	2012-04-23 22:43:09 -04:00
Peter Eisentraut	a33fcd7e79	Fix typo. Kyotaro HORIGUCHI	2012-04-16 15:36:40 +03:00
Robert Haas	b736aef2ec	Publish checkpoint timing information to pg_stat_bgwriter. Greg Smith, Peter Geoghegan, and Robert Haas	2012-04-05 14:04:37 -04:00
Simon Riggs	68219aaf6b	Correct epoch of txid_current() when executed on a Hot Standby server. Initialise ckptXidEpoch from starting checkpoint and maintain the correct value as we roll forwards. This allows GetNextXidAndEpoch() to return the correct epoch when executed during recovery. Backpatch to 9.0 when the problem is first observable by a user. Bug report from Daniel Farina	2012-03-29 14:55:30 +01:00
Peter Eisentraut	e684ab5e1e	Add additional safety check against invalid backup label file. It was already checking for invalid data after "BACKUP FROM", but would possibly crash if "BACKUP FROM" was missing altogether. Found by Coverity.	2012-03-14 22:41:50 +02:00
Heikki Linnakangas	d93f209f48	Silence warning about unused variable, when building without assertions.	2012-03-08 11:10:02 +02:00
Robert Haas	bc97c38115	Typo fix. Fujii Masao	2012-03-06 08:23:51 -05:00
Heikki Linnakangas	e587e2e3e3	Make the comments more clear on the fact that UpdateFullPageWrites() is not safe to call concurrently from multiple processes.	2012-03-06 10:45:58 +02:00
Heikki Linnakangas	7714c63829	Remove extra copies of LogwrtResult. This simplifies the code a little bit. The new rule is that to update XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas before we had two copies, one that was protected by WALWriteLock and another protected by info_lck. The code that updates them was already holding both locks, so merging the two is trivial. The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it was used in AdvanceXLInsertBuffer to update the backend-local copy, before acquiring the info_lck to read the up-to-date value. But the value of that seems dubious; at best it's saving one spinlock acquisition per completed WAL page, which is not significant compared to all the other work involved. And in practice, it's probably not saving even that much.	2012-03-06 10:18:33 +02:00
Heikki Linnakangas	3b682df326	Simplify the way changes to full_page_writes are logged. It's harmless to do full page writes even when not strictly necessary, so when turning full_page_writes on, we can set the global flag first, and then call XLogInsert. Likewise, when turning it off, we can write the WAL record first, and then clear the flag. This way XLogInsert doesn't need any special handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated enough already, so anything we can keep away from there is a good thing. Actually I don't think the atomicity of the shared memory flag matters, anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery, when there are no concurrent WAL insertions going on. But might as well make it safe, in case we allow changing full_page_writes on the fly in the future.	2012-03-06 09:48:30 +02:00
Heikki Linnakangas	1a01560cbb	Rename LWLockWaitUntilFree to LWLockAcquireOrWait. LWLockAcquireOrWait makes it more clear that the lock is acquired if it's free.	2012-02-08 09:17:13 +02:00
Tom Lane	c6d76d7c82	Add locking around WAL-replay modification of shared-memory variables. Originally, most of this code assumed that no Postgres backends could be running concurrently with it, and so no locking could be needed. That assumption fails in Hot Standby. While it's still true that Hot Standby backends should never change values like nextXid, they can examine them, and consistency is important in some cases such as when computing a snapshot. Therefore, prudence requires that WAL replay code obtain the relevant locks when modifying such variables, even though it can examine them without taking a lock. We were following that coding rule in some places but not all. This commit applies the coding rule uniformly to all updates of ShmemVariableCache and MultiXactState fields; a search of the replay routines did not find any other cases that seemed to be at risk. In addition, this commit fixes a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway. The additional locking seems to be more in the nature of future-proofing than fixing any live bug, so I am not going to back-patch it. The NEXTOID fix will be back-patched separately.	2012-02-06 12:34:10 -05:00
Tom Lane	17118825b8	Fix transient clobbering of shared buffers during WAL replay. RestoreBkpBlocks was in the habit of zeroing and refilling the target buffer; which was perfectly safe when the code was written, but is unsafe during Hot Standby operation. The reason is that we have coding rules that allow backends to continue accessing a tuple in a heap relation while holding only a pin on its buffer. Such a backend could see transiently zeroed data, if WAL replay had occasion to change other data on the page. This has been shown to be the cause of bug #6425 from Duncan Rance (who deserves kudos for developing a sufficiently-reproducible test case) as well as Bridget Frey's re-report of bug #6200. It most likely explains the original report as well, though we don't yet have confirmation of that. To fix, change the code so that only bytes that are supposed to change will change, even transiently. This actually saves cycles in RestoreBkpBlocks, since it's not writing the same bytes twice. Also fix seq_redo, which has the same disease, though it has to work a bit harder to meet the requirement. So far as I can tell, no other WAL replay routines have this type of bug. In particular, the index-related replay routines, which would certainly be broken if they had to meet the same standard, are not at risk because we do not have coding rules that allow access to an index page when not holding a buffer lock on it. Back-patch to 9.0 where Hot Standby was added.	2012-02-05 15:49:17 -05:00
Heikki Linnakangas	9b38d46d9f	Make group commit more effective. When a backend needs to flush the WAL, and someone else is already flushing the WAL, wait until it releases the WALInsertLock and check if we still need to do the flush or if the other backend already did the work for us, before acquiring WALInsertLock. This helps group commit, because when the WAL flush finishes, all the backends that were waiting for it can be woken up in one go, and they can all concurrently observe that they're done, rather than waking them up one by one in a cascading fashion. This is based on a new LWLock function, LWLockWaitUntilFree(), which has peculiar semantics. If the lock is immediately free, it grabs the lock and returns true. If it's not free, it waits until it is released, but then returns false without grabbing the lock. This is used in XLogFlush(), so that when the lock is acquired, the backend flushes the WAL, but if it's not, the backend first checks the current flush location before retrying. Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although this patch as committed ended up being very different from that.	2012-01-30 16:53:48 +02:00
Simon Riggs	8366c7803e	Allow pg_basebackup from standby node with safety checking. Base backup follows recommended procedure, plus goes to great lengths to ensure that partial page writes are avoided. Jun Ishizuka and Fujii Masao, with minor modifications	2012-01-25 18:02:04 +00:00
Simon Riggs	5530623d03	Correctly initialise shared recoveryLastRecPtr in recovery. Previously we used ReadRecPtr rather than EndRecPtr, which was not a serious error but caused pg_stat_replication to report incorrect replay_location until at least one WAL record is replayed. Fujii Masao	2012-01-13 13:02:44 +00:00
Heikki Linnakangas	1b9dea04b5	Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always passed as 'true'.	2012-01-11 11:01:47 +02:00
Heikki Linnakangas	9c808f89c2	Refactor XLogInsert a bit. The rdata entries for backup blocks are now constructed before acquiring WALInsertLock, which slightly reduces the time the lock is held. Although I could not measure any benefit in benchmarks, the code is more readable this way.	2012-01-11 11:01:47 +02:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Simon Riggs	64233902d2	Send new protocol keepalive messages to standby servers. Allows streaming replication users to calculate transfer latency and apply delay via internal functions. No external functions yet.	2011-12-31 13:30:26 +00:00
Tom Lane	2dd9322ba6	Move BKP_REMOVABLE bit from individual WAL records to WAL page headers. Removing this bit from xl_info allows us to restore the old limit of four (not three) separate pages touched by a WAL record, which is needed for the upcoming SP-GiST feature, and will likely be useful elsewhere in future. When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that because no special WAL-visible action was taken when starting a backup. However, now we force a segment switch when starting a backup, so a compressing WAL archiver (such as pglesslog) that uses the state shown in the current page header will not be fooled as to removability of backup blocks. The only downside is that the archiver will not return to compressing mode for up to one WAL page after the backup is over, which is a small price to pay for getting back the extra xl_info bit. In any case the archiver could look for XLOG_BACKUP_END records if it thought it was worth the trouble to do so. Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.	2011-12-12 16:22:14 -05:00
Heikki Linnakangas	9f0d2bdc88	Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery, we don't reach consistency before replaying all of the WAL. Rename the variable to reachedConsistency, to make its intention clearer. In master, that was an active bug because of the recent patch to immediately PANIC if a reference to a missing page is found in WAL after reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0, the only consequence was a misleading "consistent recovery state reached at %X/%X" message in the log at the beginning of crash recovery (the database is not consistent at that point yet). In 8.4, the log message was not printed in crash recovery, even though there was a similar reachedMinRecoveryPoint local variable that was also set early. So, backpatch to 9.1 and 9.0.	2011-12-09 15:21:12 +02:00
Heikki Linnakangas	1e616f6391	During recovery, if we reach consistent state and still have entries in the invalid-page hash table, PANIC immediately. Immediate PANIC is much better than waiting for end-of-recovery, which is what we did before, because the end-of-recovery might not come until months later if this is a standby server. Also refrain from creating a restartpoint if there are invalid-page entries in the hash table. Restarting recovery from such a restartpoint would not see the invalid references, and wouldn't be able to cross-check them when consistency is reached. That wouldn't matter when things are going smoothly, but the more sanity checks you have the better. Fujii Masao	2011-12-02 10:49:54 +02:00
Simon Riggs	4de82f7d7c	Wakeup WALWriter as needed for asynchronous commit performance. Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.	2011-11-13 09:00:57 +00:00
Simon Riggs	a030bfa6e4	Move user functions related to WAL into xlogfuncs.c	2011-11-04 09:37:17 +00:00
Simon Riggs	750f70b0fe	Update more comments about checkpoints being done by bgwriter	2011-11-02 17:15:35 +00:00
Simon Riggs	18fb9d8d21	Reduce checkpoints and WAL traffic on low activity database server. Previously, we skipped a checkpoint if no WAL had been written since last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until we have written at least enough WAL to switch to the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.	2011-11-02 15:26:33 +00:00
Simon Riggs	9aceb6ab3c	Refactor xlog.c to create src/backend/postmaster/startup.c. Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c	2011-11-02 14:25:01 +00:00
Simon Riggs	86e3364899	Derive oldestActiveXid at correct time for Hot Standby. There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop	2011-11-02 08:54:56 +00:00
Simon Riggs	f8409b39d1	Fix timing of Startup CLOG and MultiXact during Hot Standby Patch by me, bug report by Chris Redekop, analysis by Florian Pflug	2011-11-02 08:07:44 +00:00
Simon Riggs	f3ebaad45b	Comment changes to show bgwriter no longer performs checkpoints.	2011-11-01 18:48:47 +00:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are no security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	aa90e148ca	Suppress -Wunused-result warnings about write() and fwrite(). This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.	2011-10-18 21:37:51 -04:00
Tom Lane	d56b3afc03	Restructure error handling in reading of postgresql.conf. This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf. Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so. The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't. In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting. Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane	2011-10-02 16:50:04 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Alvaro Herrera	56a9ed92b6	Adjust translator comment format to xgettext expectations	2011-09-05 19:04:30 -03:00
Alvaro Herrera	b64f18c583	Mark some untranslatable messages with errmsg_internal	2011-09-05 17:48:07 -03:00
Heikki Linnakangas	1d0392b245	Fix comment about which version had BACKUP METHOD line in backup_label, again. It was invalidated again by Fujii's patch to 9.1.	2011-08-17 12:31:23 +03:00
Heikki Linnakangas	2877c67bc2	Fix bogus comment that claimed that the new BACKUP METHOD line in backup_label was new in 9.0. Spotted by Fujii Masao.	2011-08-16 12:23:51 +03:00
Tom Lane	4dab3d5ae1	Change the autovacuum launcher to use WaitLatch instead of a poll loop. In pursuit of this (and with the expectation that WaitLatch will be needed in more places), convert the latch field that was already added to PGPROC for sync rep into a generic latch that is activated for all PGPROC-owning processes, and change many of the standard backend signal handlers to set that latch when a signal happens. This will allow WaitLatch callers to be wakened properly by these signals. In passing, fix a whole bunch of signal handlers that had been hacked to do things that might change errno, without adding the necessary save/restore logic for errno. Also make some minor fixes in unix_latch.c, and clean up bizarre and unsafe scheme for disowning the process's latch. Much of this has to be back-patched into 9.1. Peter Geoghegan, with additional work by Tom	2011-08-10 12:22:21 -04:00
Heikki Linnakangas	41f9ffd928	If backup-end record is not seen, and we reach end of recovery from a streamed backup, throw an error and refuse to start up. The restore has not finished correctly in that case and the data directory is possibly corrupt. We already errored out in case of archive recovery, but could not during crash recovery because we couldn't distinguish between the case that pg_start_backup() was called and the database then crashed (must not error, data is OK), and the case that we're restoring from a backup and not all the needed WAL was replayed (data can be corrupt). To distinguish those cases, add a line to backup_label to indicate whether the backup was taken with pg_start/stop_backup(), or by streaming (ie. pg_basebackup). This requires re-initdb, because of a new field added to the control file.	2011-08-10 09:22:49 +03:00
Tom Lane	9f17ffd866	Measure WaitLatch's timeout parameter in milliseconds, not microseconds. The original definition had the problem that timeouts exceeding about 2100 seconds couldn't be specified on 32-bit machines. Milliseconds seem like sufficient resolution, and finer grain than that would be fantasy anyway on many platforms. Back-patch to 9.1 so that this aspect of the latch API won't change between 9.1 and later releases. Peter Geoghegan	2011-08-09 18:52:29 -04:00
Simon Riggs	5286105800	Cascading replication feature for streaming log-based replication. Standby servers can now have WALSender processes, which can work with either WALReceiver or archive_commands to pass data. Fully updated docs, including new conceptual terms of sending server, upstream and downstream servers. WALSenders are terminated when the standby is promoted to master. Fujii Masao, review, rework and doc rewrite by Simon Riggs	2011-07-19 03:40:03 +01:00
Heikki Linnakangas	89fd72cbf2	Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism, expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.	2011-07-08 18:44:07 +03:00
Peter Eisentraut	21f1e15aaf	Unify spelling of "canceled", "canceling", "cancellation" We had previously (`af26857a27`) established the U.S. spellings as standard.	2011-06-29 09:28:46 +03:00
Simon Riggs	465883b0a2	Introduce compact WAL record for the common case of commit (non-DDL). XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes, saving considerable space for the vast majority of transaction commits. XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier. Leonardo Francalanci and Simon Riggs	2011-06-28 22:58:17 +01:00
Robert Haas	503c7305a1	Make the visibility map crash-safe. This involves two main changes from the previous behavior. First, when we set a bit in the visibility map, emit a new WAL record of type XLOG_HEAP2_VISIBLE. Replay sets the page-level PD_ALL_VISIBLE bit and the visibility map bit. Second, when inserting, updating, or deleting a tuple, we can no longer get away with clearing the visibility map bit after releasing the lock on the corresponding heap page, because an intervening crash might leave the visibility map bit set and the page-level bit clear. Making this work requires a bit of interface refactoring. In passing, a few minor but related cleanups: change the test in visibilitymap_set and visibilitymap_clear to throw an error if the wrong page (or no page) is pinned, rather than silently doing nothing; this case should never occur. Also, remove duplicate definitions of InvalidXLogRecPtr. Patch by me, review by Noah Misch.	2011-06-21 23:04:40 -04:00
Tom Lane	c2ba0121c7	Work around gcc 4.6.0 bug that breaks WAL replay. ReadRecord's habit of using both direct references to tmpRecPtr and references to *RecPtr (which is pointing at tmpRecPtr) triggers an optimization bug in gcc 4.6.0, which apparently has forgotten about aliasing rules. Avoid the compiler bug, and make the code more readable to boot, by getting rid of the direct references. Improve the comments while at it. Back-patch to all supported versions, in case they get built with 4.6.0. Tom Lane, with some cosmetic suggestions from Alex Hunsaker	2011-06-10 17:04:29 -04:00
Bruce Momjian	6560407c7d	Pgindent run before 9.1 beta2.	2011-06-09 14:32:50 -04:00
Heikki Linnakangas	a0c8514149	Shut down WAL receiver if it's still running at end of recovery. We used to just check that it's not running and PANIC if it was, but that can rightfully happen if recovery stops at recovery target.	2011-05-11 12:46:08 +03:00
Robert Haas	aea1f24c2c	recoveryStopsHere() must check the resource manager ID. Before commit `c016ce7281`, this wasn't needed, but now that multiple resource manager IDs can percolate down through here, we have to make sure we know which one we've got. Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record with an XLOG_CHECKPOINT_SHUTDOWN record. Review by Jaime Casanova	2011-04-18 08:27:19 -04:00
Heikki Linnakangas	54685b1c2b	Revert the patch to check if we've reached end-of-backup also when doing crash recovery, and throw an error if not. hubert depesz lubaczewski pointed out that that situation also happens in the crash recovery following a system crash that happens during an online backup. We might want to do something smarter in 9.1, like put the check back for backups taken with pg_basebackup, but that's for another patch.	2011-04-13 22:05:40 +03:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Tom Lane	2594cf0e8c	Revise the API for GUC variable assign hooks. The previous functions of assign hooks are now split between check hooks and assign hooks, where the former can fail but the latter shouldn't. Aside from being conceptually clearer, this approach exposes the "canonicalized" form of the variable value to guc.c without having to do an actual assignment. And that lets us fix the problem recently noted by Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus log messages about "parameter "wal_buffers" cannot be changed without restarting the server". There may be some speed advantage too, because this design lets hook functions avoid re-parsing variable values when restoring a previous state after a rollback (they can store a pre-parsed representation of the value instead). This patch also resolves a longstanding annoyance about custom error messages from variable assign hooks: they should modify, not appear separately from, guc.c's own message about "invalid parameter value".	2011-04-07 00:12:02 -04:00
Heikki Linnakangas	1f0bab8494	Improve error message when WAL ends before reaching end of online backup.	2011-03-31 10:09:49 +03:00
Heikki Linnakangas	acf4740132	Check that we've reached end-of-backup also when we're not performing archive recovery. It's possible to restore an online backup without recovery.conf, by simply copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that too. That's the use case where this cross-check is useful. Backpatch to 9.0. We used to do this in earlier versions, but in 9.0 the code was inadvertently changed so that the check is only performed after archive recovery. Fujii Masao.	2011-03-30 10:53:28 +03:00
Simon Riggs	b5f2f2a712	Minor changes to recovery pause behaviour. Change location LOG message so it works each time we pause, not just for final pause. Ensure that we pause only if we are in Hot Standby and can connect to allow us to run resume function. This change supersedes the code to override parameter recoveryPauseAtTarget to false if not attempting to enter Hot Standby, which is now removed.	2011-03-23 19:35:53 +00:00
Simon Riggs	b98ac467f5	Prevent intermittent hang in recovery from bgwriter interaction. Startup process waited for cleanup lock but when hot_standby = off the pid was not registered, so that the bgwriter would not wake the waiting process as intended.	2011-03-23 13:30:05 +00:00
Heikki Linnakangas	6d8096e2f3	When two base backups are started at the same time with pg_basebackup, ensure that they use different checkpoints as the starting point. We use the checkpoint redo location as a unique identifier for the base backup in the end-of-backup record, and in the backup history file name. Bug spotted by Fujii Masao.	2011-03-21 11:25:25 +02:00
Robert Haas	777e8c0015	Remove bogus semicolons in recoveryPausesHere. Without this, the startup process goes into a tight loop, consuming 100% of one CPU and failing to respond to interrupts.	2011-03-18 08:09:09 -04:00
Bruce Momjian	5ca543fb2e	Clarify C comment that O_SYNC/O_FSYNC are really the same setting, as opposed to O_DSYNC.	2011-03-10 20:02:52 -05:00
Robert Haas	d16e290a8a	Emit a LOG message when pausing at the recovery target. Fujii Masao	2011-03-10 14:37:14 -05:00
Heikki Linnakangas	4cd3fb6e12	Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer than doing it aggressively whenever the tail-XID pointer is advanced, because this way we don't need to do it while holding SerializableXactHashLock. This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an obsolete comment spotted by Kevin Grittner.	2011-03-08 12:12:54 +02:00
Heikki Linnakangas	1a4ab9ec23	If recovery_target_timeline is set to 'latest' and standby mode is enabled, periodically rescan the archive for new timelines, while waiting for new WAL segments to arrive. This allows you to set up a standby server that follows the TLI change if another standby server is promoted to master. Before this, you had to restart the standby server to make it notice the new timeline. This patch only scans the archive for TLI changes, it won't follow a TLI change in streaming replication. That is much needed too, but it would be a much bigger patch than I dare to sneak in this late in the release cycle. There was discussion on improving the sanity checking of the WAL segments so that the system would notice more reliably if the new timeline isn't an ancestor of the current one, but that is not included in this patch. Reviewed by Fujii Masao.	2011-03-07 21:14:47 +02:00
Robert Haas	79ad8fc5f8	Named restore point improvements. Emit a log message when creating a named restore point, and improve documentation for pg_create_restore_point(). Euler Taveira de Oliveira, per suggestions from Thom Brown, with some additional wordsmithing by me.	2011-02-24 19:02:00 -05:00
Simon Riggs	bca8b7f16a	Hot Standby feedback for avoidance of cleanup conflicts on standby. Standby optionally sends back information about oldestXmin of queries which is then checked and applied to the WALSender's proc->xmin. GetOldestXmin() is modified slightly to agree with GetSnapshotData(), so that all backends on primary include WALSender within their snapshots. Note this does nothing to change the snapshot xmin on either master or standby. Feedback piggybacks on the standby reply message. vacuum_defer_cleanup_age is no longer used on standby, though parameter still exists on primary, since some use cases still exist. Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas	2011-02-16 19:29:37 +00:00
Robert Haas	4695da5ae9	pg_ctl promote Fujii Masao, reviewed by Robert Haas, Stephen Frost, and Magnus Hagander.	2011-02-15 21:30:23 -05:00
Simon Riggs	5c588be729	PITR can stop at a named restore point when recovery target = time though must not update the last transaction timestamp. Plus comment and message cleanup for recent named restore point. Fujii Masao, minor changes by me	2011-02-15 00:51:39 +00:00
Heikki Linnakangas	b186523fd9	Send status updates back from standby server to master, indicating how far the standby has written, flushed, and applied the WAL. At the moment, this is for informational purposes only, the values are only shown in pg_stat_replication system view, but in the future they will also be needed for synchronous replication. Extracted from Simon Riggs' synchronous replication patch by Robert Haas, with some tweaking by me.	2011-02-10 21:04:02 +02:00
Magnus Hagander	3144c33a2f	Implement NOWAIT option for BASE_BACKUP command Specifying this option makes the server not wait for the xlog to be archived, or emit a warning that it can't, instead leaving the responsibility with the client. This is useful when the log is being streamed using the streaming protocol in parallel with the backup, without having log archiving enabled.	2011-02-09 10:59:53 +01:00
Simon Riggs	c016ce7281	Named restore points in recovery. Users can record named points, then new recovery.conf parameter recovery_target_name allows PITR to specify named points as recovery targets. Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits	2011-02-08 19:39:08 +00:00
Simon Riggs	8c6e3adbf7	Basic Recovery Control functions for use in Hot Standby. Pause, Resume, Status check functions only. Also, new recovery.conf parameter pause_at_recovery_target, default on. Simon Riggs, reviewed by Fujii Masao	2011-02-08 18:30:22 +00:00
Simon Riggs	faa0550572	Remove rare corner case for data loss when triggering standby server. If the standby was streaming when trigger file arrives, check also in the archive for additional WAL files. This is a corner case since it is unlikely that we would trigger a failover while the master is still available and sending data to standby, while at the same time running in archive mode and also while the streaming standby has fallen behind archive. Someone would eventually be unlucky; we must plug all gaps however small. Fujii Masao	2011-02-08 14:38:02 +00:00
Robert Haas	0af695fd43	Log restartpoints in the same fashion as checkpoints. Prior to 9.0, restartpoints never created, deleted, or recycled WAL files, but now they can. This code makes log_checkpoints treat checkpoints and restartpoints symmetrically. It also adjusts up the documentation of the parameter to mention restartpoints. Fujii Masao. Docs by me, as suggested by Itagaki Takahiro.	2011-02-02 21:08:53 -05:00
Heikki Linnakangas	997b48ed96	Support multiple concurrent pg_basebackup backups. With this patch, pg_basebackup doesn't write a backup_label file in the data directory, so it doesn't interfere with a pg_start/stop_backup() based backup anymore. backup_label is still included in the backup, but it is injected directly into the tar stream. Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.	2011-01-31 18:25:39 +02:00
Tom Lane	0f73aae13d	Allow the wal_buffers setting to be auto-tuned to a reasonable value. If wal_buffers is initially set to -1 (which is now the default), it's replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default) and a maximum of the XLOG segment size. The allowed range for manual settings is still from 4 up to whatever will fit in shared memory. Greg Smith, with implementation correction by me.	2011-01-22 20:31:24 -05:00
Magnus Hagander	4448917d51	Split pg_start_backup() and pg_stop_backup() into two pieces Move the actual functionality into a separate function that's easier to call internally, and change the SQL-callable function to be a wrapper calling this. Also create a pg_abort_backup() function, only callable internally, that does only the most vital parts of pg_stop_backup(), making it safe(r) to call from error handlers.	2011-01-09 21:00:28 +01:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Robert Haas	53dbc27c62	Support unlogged tables. The contents of an unlogged table are not WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.	2010-12-29 06:48:53 -05:00
Magnus Hagander	9b8aff8c19	Add REPLICATION privilege for ROLEs This privilege is required to do Streaming Replication, instead of superuser, making it possible to set up a SR slave that doesn't have write permissions on the master. Superuser privileges do NOT override this check, so in order to use the default superuser account for replication it must be explicitly granted the REPLICATION permissions. This is backwards incompatible change, in the interest of higher default security.	2010-12-29 11:05:03 +01:00
Robert Haas	34c70c7ac4	Instrument checkpoint sync calls. Greg Smith, reviewed by Jeff Janes	2010-12-14 09:26:19 -05:00
Tom Lane	04f4e10cfc	Use symbolic names not octal constants for file permission flags. Purely cosmetic patch to make our coding standards more consistent --- we were doing symbolic some places and octal other places. This patch fixes all C-coded uses of mkdir, chmod, and umask. There might be some other calls I missed. Inconsistency noted while researching tablespace directory permissions issue.	2010-12-10 17:35:33 -05:00
Heikki Linnakangas	5a031a5556	Fix bugs in the hot standby known-assigned-xids tracking logic. If there's an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recovery we know that the snapshot contains all transactions running at that point in WAL.	2010-12-07 09:23:30 +01:00
Heikki Linnakangas	95e42a2c29	Fix two typos, by Fujii Masao.	2010-12-06 12:38:05 +01:00
Robert Haas	970a18687f	Use GUC lexer for recovery.conf parsing. This eliminates some crufty, special-purpose code and, as a non-trivial side benefit, allows recovery.conf parameters to be unquoted. Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki Takahiro, and me.	2010-12-03 08:56:44 -05:00
Peter Eisentraut	fc946c39ae	Remove useless whitespace at end of lines	2010-11-23 22:34:55 +02:00
Heikki Linnakangas	542bdb2146	Fix bug introduced by the recent patch to check that the checkpoint redo location read from backup label file can be found: wasShutdown was set incorrectly when a backup label file was found. Jeff Davis, with a little tweaking by me.	2010-11-11 19:32:11 +02:00
Robert Haas	7ba6e4f0e0	Add monitoring function pg_last_xact_replay_timestamp. Fujii Masao, with a little wordsmithing by me.	2010-11-09 22:52:19 -05:00
Heikki Linnakangas	8c843fff2d	Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001) rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a more future-proof fix for the corner-case bug in streaming replication that was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in February as well. Avoiding 0/0 as a valid value should prevent bugs like that in the future. Per Tom Lane's idea. Back-patch to 9.0. Since this only affects bootstrapping, it makes no difference to existing installations. We don't need to worry about the bug in existing installations, because if you've managed to get past the initial base backup already, you won't hit the bug in the future either.	2010-11-02 11:39:48 +02:00
Heikki Linnakangas	931b6db39b	Fix corner-case bug in tracking of latest removed WAL segment during streaming replication. We used log/seg 0/0 to indicate that no WAL segments have been removed since startup, but 0/0 is a valid value for the very first WAL segment after initdb. To make that unambiguous, store (latest removed WAL segment + 1) in the global variable. Per report from Matt Chesler, also reproduced by Greg Smith.	2010-11-01 10:05:15 +02:00
Heikki Linnakangas	0c6293dd03	Before removing backup_label and irrevocably changing pg_control file, check that WAL file containing the checkpoint redo-location can be found. This avoids making the cluster irrecoverable if the redo location is in an earlier WAL file than the checkpoint record. Report, analysis and patch by Jeff Davis, with small changes by me.	2010-10-26 21:43:52 +03:00
Simon Riggs	3bbcc5c999	Make startup process respond to signals to cancel waiting on latch. A tidy up for recently committed changes to startup latch. Fujii Masao	2010-10-14 19:15:26 +01:00
Simon Riggs	45cd9199c2	Fix bug in comment of timeline history file. Fujii Masao	2010-10-14 19:06:06 +01:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Heikki Linnakangas	79b54816db	Fix two typos in comments, spotted by Fujii Masao and Thom Brown	2010-09-15 13:58:22 +00:00
Heikki Linnakangas	723d0184e2	Use a latch to make startup process wake up and replay immediately when new WAL arrives via streaming replication. This reduces the latency, and also allows us to use a longer polling interval, which is good for energy efficiency. We still need to poll to check for the appearance of a trigger file, but the interval is now 5 seconds (instead of 100ms), like when waiting for a new WAL segment to appear in WAL archive.	2010-09-15 10:35:05 +00:00
Simon Riggs	ac791d3ca1	Fix misleading DEBUG2 issued during RemoveOldXlogFiles()	2010-08-30 15:37:41 +00:00
Simon Riggs	e72f15ed60	Truncate subtrans after each restartpoint. Issue reported by Harald Kolb, patch by Fujii Masao, review by me.	2010-08-30 14:22:05 +00:00
Alvaro Herrera	3a1b51de19	Remove duplicate translatable phrase	2010-08-26 19:23:41 +00:00
Simon Riggs	5b8bd0529e	Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0. Transaction aborts now record their LSN to avoid corner case behaviour in SR/HS, hence change of name of variables and functions. As pointed out by Fujii Masao. Cosmetic changes only.	2010-07-29 22:27:27 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Tom Lane	8771634666	Don't set recoveryLastXTime when replaying a checkpoint --- that was a bogus idea from the start since the variable is only meant to track commit/abort events. This patch reverts the logic around the variable to what it was in 8.4, except that the value is now kept in shared memory rather than a static variable, so that it can be reported correctly by CreateRestartPoint (which is executed in the bgwriter).	2010-07-03 22:15:45 +00:00
Tom Lane	e76c1a0f4d	Replace max_standby_delay with two parameters, max_standby_archive_delay and max_standby_streaming_delay, and revise the implementation to avoid assuming that timestamps found in WAL records can meaningfully be compared to clock time on the standby server. Instead, the delay limits are compared to the elapsed time since we last obtained a new WAL segment from archive or since we were last "caught up" to WAL data arriving via streaming replication. This avoids problems with clock skew between primary and standby, as well as other corner cases that the original coding would misbehave in, such as the primary server having significant idle time between transactions. Per my complaint some time ago and considerable ensuing discussion. Do some desultory editing on the hot standby documentation, too.	2010-07-03 20:43:58 +00:00
Robert Haas	400916b6d7	emode_for_corrupt_record shouldn't reduce LOG messages to WARNING. In non-interactive sessions, WARNING sorts below LOG.	2010-06-28 19:46:19 +00:00
Tom Lane	09698bb5fb	Make RemoveOldXlogFiles's debug printout match style used elsewhere: log and seg aren't an XLogRecPtr and shouldn't be printed like one. Fujii Masao	2010-06-17 17:37:23 +00:00
Tom Lane	07e8b6aabc	Don't allow walsender to send WAL data until it's been safely fsync'd on the master. Otherwise a subsequent crash could cause the master to lose WAL that has already been applied on the slave, resulting in the slave being out of sync and soon corrupt. Per recent discussion and an example from Robert Haas. Fujii Masao	2010-06-17 16:41:25 +00:00
Heikki Linnakangas	6da07cd80d	If a corrupt WAL record is received by streaming replication, disconnect and retry. If the record is genuinely corrupt in the master database, there's little hope of recovering, but it's better than simply retrying to apply the corrupt WAL record in a tight loop without even trying to retransmit it, which is what we used to do.	2010-06-14 06:04:21 +00:00
Peter Eisentraut	c86efdde5f	Fix typo/bug, found by Clang compiler	2010-06-12 09:14:52 +00:00
Itagaki Takahiro	56834fc759	Rename restartpoint_command to archive_cleanup_command.	2010-06-10 08:13:50 +00:00
Heikki Linnakangas	0a7cb85531	Make TriggerFile variable static. It's not used outside xlog.c. Fujii Masao	2010-06-10 07:49:23 +00:00
Heikki Linnakangas	346d7cd7fa	Return NULL instead of 0/0 in pg_last_xlog_receive_location() and pg_last_xlog_replay_location(). Per Robert Haas's suggestion, after Itagaki Takahiro pointed out an issue in the docs. Also, some wording changes in the docs by me.	2010-06-10 07:00:27 +00:00
Heikki Linnakangas	71815306e9	In standby mode, respect checkpoint_segments in addition to checkpoint_timeout to trigger restartpoints. We used to deliberately only do time-based restartpoints, because if checkpoint_segments is small we would spend time doing restartpoints more often than really necessary. But now that restartpoints are done in bgwriter, they're not as disruptive as they used to be. Secondly, because streaming replication stores the streamed WAL files in pg_xlog, we want to clean it up more often to avoid running out of disk space when checkpoint_timeout is large and checkpoint_segments small. Patch by Fujii Masao, with some minor changes by me.	2010-06-09 15:04:07 +00:00
Magnus Hagander	8c873bbfa7	Make the walwriter close its handle to an old xlog segment if it's no longer the current one. Not doing this would leave the walwriter with a handle to a deleted file if there was nothing for it to do for a long period of time, preventing the file from being completely removed. Reported by Tollef Fog Heen, and thanks to Heikki for some hand-holding with the patch.	2010-06-09 10:54:45 +00:00
Peter Eisentraut	cb6038c168	Fix some inconsistent quoting of wal_level values in messages When referring to postgresql.conf syntax, then it's without quotes (wal_level=archive); in narrative it's with double quotes. But never single quotes.	2010-06-03 21:02:12 +00:00
Robert Haas	d561430b66	On clean shutdown during recovery, don't warn about possible corruption. Fujii Masao. Review by Heikki Linnakangas and myself.	2010-06-03 03:20:00 +00:00
Heikki Linnakangas	6b24036365	Fix obsolete comments that I neglected to update in a previous patch. Fujii Masao	2010-06-02 09:28:44 +00:00
Heikki Linnakangas	c5bd8feac6	Adjust comment to reflect that we now have Hot Standby. Pointed out by Robert Haas.	2010-05-27 00:38:39 +00:00
Robert Haas	ea9968c331	Rename PM_RECOVERY_CONSISTENT and PMSIGNAL_RECOVERY_CONSISTENT. The new names PM_HOT_STANDBY and PMSIGNAL_BEGIN_HOT_STANDBY more accurately reflect their actual function.	2010-05-15 20:01:32 +00:00
Simon Riggs	4a24c9a063	Fix bug in processing of checkpoint time for max_standby_delay. Latest log time was incorrectly set, typically leading to dates in the past, which would cause more cancellations in Hot Standby on a quiet server.	2010-05-15 07:14:43 +00:00
Simon Riggs	fd34374b17	Add many new Asserts in code and fix simple bug that slipped through without them, related to previous commit. Report by Bruce Momjian.	2010-05-14 07:11:49 +00:00
Simon Riggs	8431e296ea	Cleanup initialization of Hot Standby. Clarify working with reanalysis of requirements and documentation on LogStandbySnapshot(). Fixes two minor bugs reported by Tom Lane that would lead to an incorrect snapshot after transaction wraparound. Also fix two other problems discovered that would give incorrect snapshots in certain cases. ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().	2010-05-13 11:15:38 +00:00
Heikki Linnakangas	ffe8c7c677	Need to hold ControlFileLock while updating control file. Update minRecoveryPoint in control file when replaying a parameter change record, to ensure that we don't allow hot standby on WAL generated without wal_level='hot_standby' after a standby restart.	2010-05-03 11:17:52 +00:00
Tom Lane	f9ed327f76	Clean up some awkward, inaccurate, and inefficient processing around MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more appropriate timestamp functions for performing tests with it. Make the ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have behavior similar to ps_activity code elsewhere, notably not updating the display when update_process_title is off and not truncating the display contents at an arbitrarily-chosen length. Improve the docs to be explicit about what MaxStandbyDelay actually measures, viz the difference between primary and standby servers' clocks, and the possible hazards if their clocks aren't in sync.	2010-05-02 02:10:33 +00:00
Tom Lane	69f7a4d8e3	Adjust error checks in pg_start_backup and pg_stop_backup to make it possible to perform a backup without archive_mode being enabled. This gives up some user-error protection in order to improve usefulness for streaming-replication scenarios. Per discussion.	2010-04-29 21:49:03 +00:00
Tom Lane	f0488bd57c	Rename the parameter recovery_connections to hot_standby, to reduce possible confusion with streaming-replication settings. Also, change its default value to "off", because of concern about executing new and poorly-tested code during ordinary non-replicating operation. Per discussion. In passing do some minor editing of related documentation.	2010-04-29 21:36:19 +00:00
Heikki Linnakangas	9b8a73326e	Introduce wal_level GUC to explicitly control if information needed for archival or hot standby should be WAL-logged, instead of deducing that from other options like archive_mode. This replaces recovery_connections GUC in the primary, where it now has no effect, but it's still used in the standby to enable/disable hot standby. Remove the WAL-logging of "unlogged operations", like creating an index without WAL-logging and fsyncing it at the end. Instead, we keep a copy of the wal_mode setting and the settings that affect how much shared memory a hot standby server needs to track master transactions (max_connections, max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings change, at server restart, write a WAL record noting the new settings and update pg_control. This allows us to notice the change in those settings in the standby at the right moment, they used to be included in checkpoint records, but that meant that a changed value was not reflected in the standby until the first checkpoint after the change. Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to the sequence it used to follow, before hot standby and subsequent patches changed it to 0x9003.	2010-04-28 16:10:43 +00:00
Heikki Linnakangas	3efba16d56	If a base backup is cancelled by server shutdown or crash, throw an error in WAL recovery when it sees the shutdown checkpoint record. It's more user-friendly to find out about it at that point than at the end of recovery, and you're not left wondering why your hot standby server never opens up for read-only connections.	2010-04-27 09:25:18 +00:00
Simon Riggs	491d1ea5b3	Previous patch revoked following objections.	2010-04-23 20:21:31 +00:00
Simon Riggs	6ca23b1a29	Make CheckRequiredParameterValues() depend upon correct combination of parameters. Fix bug report by Robert Haas that error message and hint was incorrect if wrong mode parameters specified on master. Internal changes only. Proposals for parameter simplification on master/primary still under way.	2010-04-23 19:57:19 +00:00
Robert Haas	481cb5d9b5	Rename standby_keep_segments to wal_keep_segments. Also, make the name of the GUC and the name of the backing variable match. Along the way, clean up a couple of slight typographical errors in the related docs.	2010-04-20 11:15:06 +00:00
Simon Riggs	d38603bd97	Improve sequence and sense of messages from pg_stop_backup(). Now doesn't report it is waiting until it actually is waiting, plus message doesn't appear until at least 5 seconds wait, so we avoid reporting the wait before we've given the archiver a reasonable time to wake up and archive the file we just created earlier in the function. Also add new unconditional message to confirm safe completion. Now a normal, healthy execution does not report waiting at all, just safe completion.	2010-04-18 18:44:53 +00:00
Simon Riggs	2847de9df2	Remove some additional changes in previous commit that belong elsewhere.	2010-04-18 18:17:12 +00:00
Simon Riggs	21d6a6a128	Tune GetSnapshotData() during Hot Standby by avoiding loop through normal backends. Makes code clearer also, since we avoid various Assert()s. Performance of snapshots taken during recovery no longer depends upon number of read-only backends.	2010-04-18 18:06:07 +00:00
Heikki Linnakangas	78974cfb9b	In standby mode, suppress repeated LOG messages about a corrupt record, which just indicates that we've reached the end of valid WAL found in the standby.	2010-04-16 08:58:16 +00:00
Bruce Momjian	ec4b9bcc3d	Doc change: effect -> affect, per Robert Haas	2010-04-15 03:05:59 +00:00
Simon Riggs	55d7556a4d	Fix minor typo in comment in xlog.c	2010-04-14 10:29:07 +00:00
Heikki Linnakangas	361bd1662e	Allow Hot Standby to begin from a shutdown checkpoint. Patch by Simon Riggs & me	2010-04-13 14:17:46 +00:00
Heikki Linnakangas	30556568f5	Update the location of last removed WAL segment in shared memory only after actually removing one, so that if we can't remove segments because WAL archiving is lagging behind, we don't unnecessarily forbid streaming the old not-yet-archived segments that are still perfectly valid. Per suggestion from Fujii Masao.	2010-04-12 10:40:43 +00:00
Heikki Linnakangas	e57cd7f0a1	Change the logic to decide when to delete old WAL segments, so that it doesn't take into account how far the WAL senders are. This way a hung WAL sender doesn't prevent old WAL segments from being recycled/removed in the primary, ultimately causing the disk to fill up. Instead add standby_keep_segments setting to control how many old WAL segments are kept in the primary. This also makes it more reliable to use streaming replication without WAL archiving, assuming that you set standby_keep_segments high enough.	2010-04-12 09:52:29 +00:00
Heikki Linnakangas	0f11ed5886	Allow quotes to be escaped in recovery.conf, by doubling them. This patch also makes the parsing a little bit stricter, rejecting garbage after the parameter value and values with missing ending quotes, for example.	2010-04-07 10:58:49 +00:00
Heikki Linnakangas	370f770c15	Forbid using pg_xlogfile_name() and pg_xlogfile_name_offset() during recovery. We might want to relax this in the future, but ThisTimeLineID isn't currently correct in backends during recovery, so the filename returned was wrong.	2010-04-07 06:12:52 +00:00
Simon Riggs	89c5008158	Further message changes when recovery.conf parameters missing.	2010-04-06 17:51:58 +00:00
Simon Riggs	cf2575b8c4	Check compulsory parameters in recovery.conf in standby_mode, per docs.	2010-04-02 21:50:40 +00:00
Simon Riggs	31f00d163b	Move system startup message prior to any calls out of data directory. This allows us to see what mode the server is in before it starts to perform actions that can block or hang. Otherwise server messages may not appear until after messages that say FATAL the database server is starting up.	2010-04-02 13:10:56 +00:00
Robert Haas	54943734f8	Refer to max_wal_senders in a more consistent fashion. The error message now makes explicit reference to the GUC that must be changed to fix the problem, using wording suggested by Tom Lane. Along the way, rename the GUC from MaxWalSenders to max_wal_senders for consistency and grep-ability.	2010-04-01 00:43:29 +00:00
Heikki Linnakangas	2a77355ea1	Change the retry-loop in standby mode to also try restoring files from pg_xlog directory. This is essential for replaying WAL records that were streamed from the master, after a standby server restart. If a corrupt record is seen in a file restored from the archive or streamed from the master, log it as a WARNING and keep retrying. If the corruption is permanent, and not just a glitch in whatever copies the files to the archive or a network error not caught by CRC checks in TCP for example, we will keep retrying and logging the WARNING indefinitely. But that's better than shutting down completely, the standby is still useful for running read-only queries. In PITR the recovery ends at such a corrupt record, which is a bit questionable, but that's the behavior we had in previous releases and we don't feel like changing it now. It does make sense for tools like pg_standby.	2010-03-30 16:23:57 +00:00
Peter Eisentraut	c248d17120	Message tuning	2010-03-21 00:17:59 +00:00
Simon Riggs	3cdafe40e7	Adjust comment in .history file to match recovery target specified. Comment present since 8.0 was never fully meaningful, since two recovery targets cannot be specified. Refactor recovery target type to make this change and associated code easier to understand. No change in function. Bug report arising from internal support question.	2010-03-19 11:05:15 +00:00
Heikki Linnakangas	c21ac0b58e	Add restartpoint_command option to recovery.conf. Fix bug in %r handling in recovery_end_command, it always came out as 0 because InRedo was cleared before recovery_end_command was executed. Also, always take ControlFileLock when reading checkpoint location for %r. The recovery_end_command bug and the missing locking was present in 8.4 as well, that part of this patch will be backported separately.	2010-03-18 09:17:18 +00:00
Simon Riggs	1a163a0c68	Remove incorrect comment from GetWriteRecPtr(): the return value is always correct, as described in comments at start of xlog.c	2010-03-15 18:49:17 +00:00
Itagaki Takahiro	17d8de0e61	pg_start_backup() can use a share lock to lock ControlFileLock instead of an exclusive lock. The change is almost for code cleanup. Since there seems to be no performance benefits from it, backports should not be needed. Fujii Masao	2010-03-10 02:04:48 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	a2239b96e0	Make pg_stop_backup's reporting a bit more verbose in hopes of making error cases less intimidating for novices. Per discussion. Greg Smith	2010-02-25 02:17:50 +00:00
Heikki Linnakangas	ad458cfe81	Don't use O_DIRECT when writing WAL files if archiving or streaming is enabled. Bypassing the kernel cache is counter-productive in that case, because the archiver/walsender process will read from the WAL file soon after it's written, and if it's not cached the read will cause a physical read, eating I/O bandwidth available on the WAL drive. Also, walreceiver process does unaligned writes, so disable O_DIRECT in walreceiver process for that reason too.	2010-02-19 10:51:04 +00:00
Itagaki Takahiro	3230fd056a	Fix STOP WAL LOCATION in backup history files not to return the next segment of XLOG_BACKUP_END record even if the record is placed at a segment boundary. Furthermore the previous implementation could return nonexistent segment file name when the boundary is in segments that has "FE" suffix; We never use segments with "FF" suffix. Backpatch to 8.0, where hot backup was introduced. Reported by Fujii Masao.	2010-02-19 01:04:03 +00:00
Tom Lane	50a90fac40	Stamp HEAD as 9.0devel, and update various places that were referring to 8.5 (hope I got 'em all). Per discussion, this release will be 9.0 not 8.5.	2010-02-17 04:19:41 +00:00
Tom Lane	c64339face	When updating ShmemVariableCache from a checkpoint record, be sure to set all the values derived from oldestXid, not just that field. Brain fade in one of my patches associated with flat file removal, exposed by a report from Fujii Masao. With this change, xidVacLimit should always be valid, so remove a couple of bits of complexity associated with the previous assumption that sometimes it wouldn't get set right away.	2010-02-17 03:10:33 +00:00
Heikki Linnakangas	e465390d03	Reduce the chatter to the log when starting a standby server. Don't echo all the recovery.conf options. Don't emit the "initializing recovery connections" message, which doesn't mean anything to a user. Remove the "starting archive recovery" message and replace the "automatic recovery in progress" message with a more informative message saying whether the server is doing PITR, normal archive recovery, or standby mode.	2010-02-12 09:49:08 +00:00
Heikki Linnakangas	54cbd1757e	If primary_conninfo is not set, don't try to establish streaming connection.	2010-02-12 07:56:36 +00:00
Heikki Linnakangas	9fa01f6c8a	Check for partial WAL files in standby mode. If restore_command restores a partial WAL file, assume it's because the file is just being copied to the archive and treat it the same as "file not found" in standby mode. pg_standby has a similar check, so it seems reasonable to have the same level of protection in the built-in standby mode.	2010-02-12 07:36:44 +00:00
Heikki Linnakangas	161d9d51b3	Now that streaming replication switches between streaming mode and restoring from archive, the last WAL segment is not necessarily open at the end of recovery. Fix assertion that assumed that. Fujii Masao, fixing the assertion failure reported by Martin Pihlak.	2010-02-10 08:25:25 +00:00
Heikki Linnakangas	4cea603128	Remove piece of code to zero out minRecoveryPoint when starting crash recovery. It's zeroed out whenever a checkpoint is written, so the only scenario where the removed code did anything is when you kill archive recovery, remove recovery.conf, and start up the server, so that it goes into crash recovery instead. That's a "don't do that" scenario, but it seems better to not clear minRecoveryPoint but instead update it like we do in archive recovery, which is what will now happen.	2010-02-08 09:08:51 +00:00
Tom Lane	0a469c8769	Remove old-style VACUUM FULL (which was known for a little while as VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity. Per discussion, the use case for this method of vacuuming is no longer large enough to justify maintaining it; not to mention that we don't wish to invest the work that would be needed to make it play nicely with Hot Standby. Aside from the code directly related to old-style VACUUM FULL, this commit removes support for certain WAL record types that could only be generated within VACUUM FULL, redirect-pointer removal in heap_page_prune, and nontransactional generation of cache invalidation sinval messages (the last being the sticking point for Hot Standby). We still have to retain all code that copes with finding HEAP_MOVED_OFF and HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long as we want to support in-place update from pre-9.0 databases.	2010-02-08 04:33:55 +00:00
Tom Lane	b9b8831ad6	Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests.	2010-02-07 20:48:13 +00:00
Simon Riggs	296578feb4	Revoke augmentation of WAL records for btree delete, per discussion.	2010-02-01 13:40:28 +00:00
Simon Riggs	6d2bc0a6cf	Augment WAL records for btree delete with GetOldestXmin() to reduce false positives during Hot Standby conflict processing. Simple patch to enhance conflict processing, following previous discussions. Controlled by parameter minimize_standby_conflicts = on | off, with default off allows measurement of performance impact to see whether it should be set on all the time.	2010-01-29 18:39:05 +00:00
Heikki Linnakangas	b0509ef601	Fix crashing bug at the end of recovery in Streaming Replication, when restore_command is not given. Fujii Masao.	2010-01-28 19:17:22 +00:00
Heikki Linnakangas	83cb7da7dc	Fix bug in walsender's xlogid boundary handling, reported by Erik Rijkers. LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't try to send that in XLogSend(). Also fix similar bug in ReadRecord(), which I just introduced in the ReadRecord() refactoring patch.	2010-01-27 16:41:09 +00:00
Heikki Linnakangas	1bb2558046	Make standby server continuously retry restoring the next WAL segment with restore_command, if the connection to the primary server is lost. This ensures that the standby can recover automatically, if the connection is lost for a long time and standby falls behind so much that the required WAL segments have been archived and deleted in the master. This also makes standby_mode useful without streaming replication; the server will keep retrying restore_command every few seconds until the trigger file is found. That's the same basic functionality pg_standby offers, but without the bells and whistles. To implement that, refactor the ReadRecord/FetchRecord functions. The FetchRecord() function introduced in the original streaming replication patch is removed, and all the retry logic is now in a new function called XLogReadPage(). XLogReadPage() is now responsible for executing restore_command, launching walreceiver, and waiting for new WAL to arrive from primary, as required. This also changes the life cycle of walreceiver. When launched, it now only tries to connect to the master once, and exits if the connection fails, or is lost during streaming for any reason. The startup process detects the death, and re-launches walreceiver if necessary.	2010-01-27 15:27:51 +00:00
Simon Riggs	aed1a0121a	Fix longstanding gripe that we check for 0000000001.history at start of archive recovery, even when we know it is never present.	2010-01-26 00:07:13 +00:00
Simon Riggs	959ac58c04	In HS, Startup process sets SIGALRM when waiting for buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.	2010-01-23 16:37:12 +00:00
Heikki Linnakangas	09b115f706	Write a WAL record whenever we perform an operation without WAL-logging that would've been WAL-logged if archiving was enabled. If we encounter such records in archive recovery anyway, we know that some data is missing from the log. A WARNING is emitted in that case. Original patch by Fujii Masao, with changes by me.	2010-01-20 19:43:40 +00:00
Heikki Linnakangas	40f908bdcd	Introduce Streaming Replication. This includes two new kinds of postmaster processes, walsenders and walreceiver. Walreceiver is responsible for connecting to the primary server and streaming WAL to disk, while walsender runs in the primary server and streams WAL from disk to the client. Documentation still needs work, but the basics are there. We will probably pull the replication section to a new chapter later on, as well as the sections describing file-based replication. But let's do that as a separate patch, so that it's easier to see what has been added/changed. This patch also adds a new section to the chapter about FE/BE protocol, documenting the protocol used by walsender/walreceiver. Bump catalog version because of two new functions, pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for monitoring the progress of replication. Fujii Masao, with additional hacking by me	2010-01-15 09:19:10 +00:00
Heikki Linnakangas	06f82b2961	Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at recovery instead of reading the backup history file. This is more robust, as it stops you from prematurely starting up an inconsistent cluster if the backup history file is lost for some reason, or if the base backup was never finished with pg_stop_backup(). This also paves the way for a simpler streaming replication patch, which doesn't need to care about backup history files anymore. The backup history file is still created and archived as before, but it's not used by the system anymore. It's just for informational purposes now. Bump PG_CONTROL_VERSION as the location of the backup startpoint is now written to a new field in pg_control, and catversion because initdb is required. Original patch by Fujii Masao per Simon's idea, with further fixes by me.	2010-01-04 12:50:50 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Heikki Linnakangas	ff1e1e45b9	Reset minRecoveryPoint at checkpoints, so that we don't uselessly update it in the control file at crash recovery following an archive recovery. Per Fujii Masao and subsequent discussion.	2009-12-30 08:37:21 +00:00
Simon Riggs	efc16ea520	Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.	2009-12-19 01:32:45 +00:00
Heikki Linnakangas	7f2a10fecd	Don't error out if recycling or removing an old WAL segment fails at the end of checkpoint. Although the checkpoint has been written to WAL at that point already, so that all data is safe, and we'll retry removing the WAL segment at the next checkpoint, if such a failure persists we won't be able to remove any other old WAL segments either and will eventually run out of disk space. It's better to treat the failure as non-fatal, and move on to clean any other WAL segment and continue with any other end-of-checkpoint cleanup. We don't normally expect any such failures, but on Windows it can happen with some anti-virus or backup software that lock files without FILE_SHARE_DELETE flag. Also, the loop in pgrename() to retry when the file is locked was broken. If a file is locked on Windows, you get ERROR_SHARE_VIOLATION, not ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct in some environment), and added ERROR_LOCK_VIOLATION to be consistent with similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to 10s, on the grounds that since it's been broken, we've effectively had a timeout of 0s and no-one has complained, so a smaller timeout is actually closer to the old behavior. A longer timeout would mean that if recycling a WAL file fails because it's locked for some reason, InstallXLogFileSegment() will hold ControlFileLock for longer, potentially blocking other backends, so a long timeout isn't totally harmless. While we're at it, set errno correctly in pgrename(). Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c changes would make sense on other platforms and thus on older versions as well, but since there's no such locking issues on other platforms, it's not worth it.	2009-09-13 18:32:08 +00:00
Heikki Linnakangas	4e2d5efc6a	On Windows, when a file is deleted and another process still has an open file handle on it, the file goes into "pending deletion" state where it still shows up in directory listing, but isn't accessible otherwise. That confuses RemoveOldXLogFiles(), making it think that the file hasn't been archived yet, while it actually was, and it was deleted along with the .done file. Fix that by renaming the file with ".deleted" extension before deleting it. Also check the return value of rename() and unlink(), so that if the removal fails for any reason (e.g another process is holding the file locked), we don't delete the .done file until the WAL file is really gone. Backpatch to 8.2, which is the oldest version supported on Windows.	2009-09-10 09:42:10 +00:00
Alvaro Herrera	a8bb8eb583	Remove flatfiles.c, which is now obsolete. Recent commits have removed the various uses it was supporting. It was a performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems it slowed down user creation after a billion users.	2009-09-01 02:54:52 +00:00
Tom Lane	25ec228ef7	Track the current XID wrap limit (or more accurately, the oldest unfrozen XID) in checkpoint records. This eliminates the need to recompute the value from scratch during database startup, which is one of the two remaining reasons for the flatfile code to exist. It should also simplify life for hot-standby operation. To avoid bloating the checkpoint records unreasonably, I switched from tracking the oldest database by name to tracking it by OID. This turns out to save cycles in general (everywhere but the warning-generating paths, which we hardly care about) and also helps us deal with the case that the oldest database got dropped instead of being vacuumed. The prior coding might go for a long time without updating the wrap limit in that case, which is bad because it might result in a lot of useless autovacuum activity.	2009-08-31 02:23:23 +00:00
Heikki Linnakangas	9cd6685f91	In the checkpoint written at the end of archive recovery, the WAL page header was incorrectly initialized with timeline ID 0. That rendered the WAL page unrecoverable, making a subsequent archive recovery stop at that point. ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer(). This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug was introduced by the changes to use of bgwriter for writing the end-of-archive-recovery checkpoint. Patch by Tom Lane.	2009-08-27 07:15:41 +00:00
Tom Lane	04011cc970	Allow backends to start up without use of the flat-file copy of pg_database. To make this work in the base case, pg_database now has a nailed-in-cache relation descriptor that is initialized using hardwired knowledge in relcache.c. This means pg_database is added to the set of relations that need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this path is taken, we'll have to do a seqscan of pg_database to find the row we need. In the normal case, we are able to do an indexscan to find the database's row by name. This is made possible by storing a global relcache init file that describes only the shared catalogs and their indexes (and therefore is usable by all backends in any database). A new backend loads this cache file, finds its database OID after an indexscan on pg_database, and then loads the local relcache init file for that database. This change should effectively eliminate number of databases as a factor in backend startup time, even with large numbers of databases. However, the real reason for doing it is as a first step towards getting rid of the flat files altogether. There are still several other sub-projects to be tackled before that can happen.	2009-08-12 20:53:31 +00:00
Tom Lane	97e14f6e93	Document that LocalSetXLogInsertAllowed can be re-executed. Per comment from Simon.	2009-08-08 16:39:17 +00:00
Tom Lane	87740caa01	rm_cleanup functions need to be allowed to write WAL entries. This oversight appears to explain the recent reports of "PANIC: cannot make new WAL entries during recovery".	2009-08-07 19:29:49 +00:00
Tom Lane	2de48a83e6	Cleanup and code review for the patch that made bgwriter active during archive recovery. Invent a separate state variable and inquiry function for XLogInsertAllowed() to clarify some tests and make the management of writing the end-of-recovery checkpoint less klugy. Fix several places that were incorrectly testing InRecovery when they should be looking at RecoveryInProgress or XLogInsertAllowed (because they will now be executed in the bgwriter not startup process). Clarify handling of bad LSNs passed to XLogFlush during recovery. Use a spinlock for setting/testing SharedRecoveryInProgress. Improve quite a lot of comments. Heikki and Tom	2009-06-26 20:29:04 +00:00
Heikki Linnakangas	7e48b77b1c	Fix some serious bugs in archive recovery, now that bgwriter is active during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active. When bgwriter is active (= archive recovery), the startup process must not accumulate fsync requests to its own pendingOpsTable, since bgwriter won't see them there when it performs restartpoints. Make startup process drop its pendingOpsTable when bgwriter is launched to avoid that. Update minimum recovery point one last time when leaving archive recovery. It won't be updated by the end-of-recovery checkpoint because XLogFlush() sees us as out of recovery already. This fixes bug #4879 reported by Fujii Masao.	2009-06-25 21:36:00 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Heikki Linnakangas	7c8d7a2eec	Only recycle normal files in pg_xlog as WAL segments. pg_standby creates symbolic links with the -l option, and as Fujii Masao pointed out we ended up overwriting files in the archive directory before this patch. Patch by Aidan Van Dyk, Fujii Masao and me. Backpatch to 8.3, where pg_standby was introduced.	2009-06-02 06:18:06 +00:00
Heikki Linnakangas	2e6107cb62	When archiving is enabled, rotate the last WAL segment at shutdown so that all transactions are archived. Original patch by Guillaume Smet.	2009-05-28 11:02:16 +00:00
Tom Lane	4616d57dad	Fix all the server-side SIGQUIT handlers (grumble ... why so many identical copies?) to ensure they really don't run proc_exit/shmem_exit callbacks, as was intended. I broke this behavior recently by installing atexit callbacks without thinking about the one case where we truly don't want to run those callback functions. Noted in an example from Dave Page.	2009-05-15 15:56:39 +00:00
Tom Lane	284e12c398	Improve a couple of comments.	2009-05-14 21:28:35 +00:00
Heikki Linnakangas	9e403c2587	Add recovery_end_command option to recovery.conf. recovery_end_command is run at the end of archive recovery, providing a chance to do external cleanup. Modify pg_standby so that it no longer removes the trigger file, that is to be done using the recovery_end_command now. Provide a "smart" failover mode in pg_standby, where we don't fail over immediately, but only after recovering all unapplied WAL from the archive. That gives you zero data loss assuming all WAL was archived before failover, which is what most users of pg_standby actually want. recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and myself.	2009-05-14 20:31:09 +00:00
Heikki Linnakangas	223431cba1	Request XLOG switch before writing checkpoint in pg_start_backup(). Otherwise you can end up with an unrecoverable backup if you start a new base backup right after finishing archive recovery. In that scenario, the redo pointer of the checkpoint that pg_start_backup() writes points to the XLOG segment where the timeline-changing end-of-archive-recovery checkpoint is. The beginning of that segment contains pages with the old timeline ID, and we don't accept that in recovery unless we find a history file covering the old timeline ID. If you omit pg_xlog from the base backup and clear the archive directory before starting the backup, there will be no such history file available. The bug is present in all versions since PITR was introduced in 8.0, but I'm back-patching only back to 8.2. Earlier versions didn't have XLOG switch records, making this fix unfeasible. Given the lack of reports until now, it doesn't seem worthwhile to spend more effort to fix 8.0 and 8.1. Per report and suggestion by Mikael Krantz	2009-05-07 11:25:25 +00:00
Heikki Linnakangas	bae8102f52	After archive recovery, mark the last WAL segment from the parent timeline ready for archival. It was marked at the next checkpoint anyway, but waiting for the next checkpoint is an unnecessary delay. Fujii Masao	2009-04-22 19:51:12 +00:00
Tom Lane	387060951e	Add an optional parameter to pg_start_backup() that specifies whether to do the checkpoint in immediate or lazy mode. This is to address complaints that pg_start_backup() takes a long time even when there's no need to minimize its I/O consumption.	2009-04-07 00:31:26 +00:00
Tom Lane	e04810e8c4	Code review for dtrace probes added (so far) to 8.4. Adjust placement of some bufmgr probes, take out redundant and memory-leak-inducing path arguments to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to recalculate space used in sort__done, clean up formatting in places where I'm not sure pgindent will do a nice job by itself.	2009-03-11 23:19:25 +00:00
Heikki Linnakangas	fb7df896fc	Reload config file in startup process on SIGHUP. Fujii Masao	2009-03-04 13:56:40 +00:00
Heikki Linnakangas	bc134d7a51	Change the signaling of end-of-recovery. Startup process now indicates end of recovery by exiting with exit code 0, like in previous releases. Per Tom's suggestion.	2009-02-23 09:28:50 +00:00
Heikki Linnakangas	cdd46c7654	Start background writer during archive recovery. Background writer now performs its usual buffer cleaning duties during archive recovery, and it's responsible for performing restartpoints. This requires some changes in postmaster. When the startup process has done all the initialization and is ready to start WAL redo, it signals the postmaster to launch the background writer. The postmaster is signaled again when the point in recovery is reached where we know that the database is in consistent state. Postmaster isn't interested in that at the moment, but that's the point where we could let other backends in to perform read-only queries. The postmaster is signaled a third time when the recovery has ended, so that postmaster knows that it's safe to start accepting connections. The startup process now traps SIGTERM, and performs a "clean" shutdown. If you do a fast shutdown during recovery, a shutdown restartpoint is performed, like a shutdown checkpoint, and postmaster kills the processes cleanly. You still have to continue the recovery at next startup, though. Currently, the background writer is only launched during archive recovery. We could launch it during crash recovery as well, but it seems better to keep that codepath as simple as possible, for the sake of robustness. And it couldn't do any restartpoints during crash recovery anyway, so it wouldn't be that useful. log_restartpoints is gone. Use log_checkpoints instead. This is yet to be documented. This whole operation is a pre-requisite for Hot Standby, but has some value of its own whether the hot standby patch makes 8.4 or not. Simon Riggs, with lots of modifications by me.	2009-02-18 15:58:41 +00:00
Heikki Linnakangas	b75b66332a	Fix obsolete comment. Zdenek Kotala	2009-02-07 10:49:36 +00:00
Heikki Linnakangas	9187cedd7c	Put back fast-path for the case that there's no backup blocks in RestoreBkpBlocks. Went missing in my recent refactoring patch, as pointed out by Simon's hot standby patch.	2009-01-23 11:19:34 +00:00
Heikki Linnakangas	b2a667b9ee	Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should be used instead of the normal exclusive lock, and make WAL redo functions responsible for calling RestoreBkpBlocks(). They know better what kind of a lock they need. At the moment, this just moves things around with no functional change, but makes the hot standby patch that's under review cleaner.	2009-01-20 18:59:37 +00:00
Tom Lane	1a37056a74	Re-enable the old code in xlog.c that tried to use posix_fadvise(), so that we can get some buildfarm feedback about whether that function is still problematic. (Note that the planned async-preread patch will not really prove anything one way or the other in buildfarm testing, since it will be inactive with default GUC settings.)	2009-01-11 18:02:17 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Bruce Momjian	4ee79fd20d	Change the name of dtrace wal tracepoints: TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY Robert Lor	2008-12-24 20:41:29 +00:00
Bruce Momjian	5a90bc1fbe	The attached patch contains a couple of fixes in the existing probes and includes a few new ones. - Fixed compilation errors on OS X for probes that use typedefs - Fixed a number of probes to pass ForkNumber per the relation forks patch - The new probes are those that were taken out from the previous submitted patch and required simple fixes. Will submit the other probes that may require more discussion in a separate patch. Robert Lor	2008-12-17 01:39:04 +00:00
Heikki Linnakangas	b457b2a24e	If pg_stop_backup() is called just after switching to a new xlog file, wait for the previous instead of the new file to be archived. Based on patch by Simon Riggs.	2008-12-03 08:20:11 +00:00
Tom Lane	1d577f5e49	Add a startup check that pg_xlog and pg_xlog/archive_status exist. If the latter doesn't exist, automatically recreate it. (We don't do this for pg_xlog, though, per discussion.) Jonah Harris	2008-11-09 17:51:15 +00:00
Heikki Linnakangas	19c8dc839b	Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function, that takes the strategy and mode as arguments. There are three modes: RBM_NORMAL, which is the default used by plain ReadBuffer(); RBM_ZERO, which replaces ZeroOrReadBuffer; and a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happen if we crash after extending an FSM file, and the new page is "torn". Add fork number to some error messages in bufmgr.c, that still lacked it.	2008-10-31 15:05:00 +00:00
Tom Lane	2314baef38	Fix recoveryLastXTime logic so that it actually does what one would expect. Per gripe from Kevin Grittner. Backpatch to 8.3, where the bug was introduced.	2008-10-30 04:06:16 +00:00
Heikki Linnakangas	61d9674988	Make LC_COLLATE and LC_CTYPE database-level settings. Collation and ctype are now more like encoding, stored in new datcollate and datctype columns in pg_database. This is a stripped-down version of Radek Strnad's patch, with further changes by me.	2008-09-23 09:20:39 +00:00
Tom Lane	ead21631e8	Fix a couple of problems pointed out by Fujii Masao in the 2008-Apr-05 patch for pg_stop_backup. First, it is possible that the history file name is not alphabetically later than the last WAL file name, so we should explicitly check that both have been archived. Second, the previous coding would wait forever if a checkpoint had managed to remove the WAL file before we look for it. Simon Riggs, plus some code cleanup by me.	2008-09-08 16:42:15 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00

766 Commits