Commit Graph

594 Commits

Author SHA1 Message Date
Alvaro Herrera b64f18c583 Mark some untranslatable messages with errmsg_internal 2011-09-05 17:48:07 -03:00
Heikki Linnakangas 1d0392b245 Fix comment about which version had BACKUP METHOD line in backup_lable, again.
It was invalidated again by Fujii's patch to 9.1.
2011-08-17 12:31:23 +03:00
Heikki Linnakangas 2877c67bc2 Fix bogus comment that claimed that the new BACKUP METHOD line in
backup_label was new in 9.0. Spotted by Fujii Masao.
2011-08-16 12:23:51 +03:00
Tom Lane 4dab3d5ae1 Change the autovacuum launcher to use WaitLatch instead of a poll loop.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens.  This will allow WaitLatch callers to be
wakened properly by these signals.

In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno.  Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch.  Much of
this has to be back-patched into 9.1.

Peter Geoghegan, with additional work by Tom
2011-08-10 12:22:21 -04:00
Heikki Linnakangas 41f9ffd928 If backup-end record is not seen, and we reach end of recovery from a
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).

To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).

This requires re-initdb, because of a new field added to the control file.
2011-08-10 09:22:49 +03:00
Tom Lane 9f17ffd866 Measure WaitLatch's timeout parameter in milliseconds, not microseconds.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines.  Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.

Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.

Peter Geoghegan
2011-08-09 18:52:29 -04:00
Simon Riggs 5286105800 Cascading replication feature for streaming log-based replication.
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.

Fujii Masao, review, rework and doc rewrite by Simon Riggs
2011-07-19 03:40:03 +01:00
Heikki Linnakangas 89fd72cbf2 Introduce a pipe between postmaster and each backend, which can be used to
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.

This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.

The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.

Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
2011-07-08 18:44:07 +03:00
Peter Eisentraut 21f1e15aaf Unify spelling of "canceled", "canceling", "cancellation"
We had previously (af26857a27)
established the U.S. spellings as standard.
2011-06-29 09:28:46 +03:00
Simon Riggs 465883b0a2 Introduce compact WAL record for the common case of commit (non-DDL).
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.

Leonardo Francalanci and Simon Riggs
2011-06-28 22:58:17 +01:00
Robert Haas 503c7305a1 Make the visibility map crash-safe.
This involves two main changes from the previous behavior.  First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE.  Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit.  Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear.  Making this work requires a bit of interface
refactoring.

In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur.  Also, remove duplicate definitions of
InvalidXLogRecPtr.

Patch by me, review by Noah Misch.
2011-06-21 23:04:40 -04:00
Tom Lane c2ba0121c7 Work around gcc 4.6.0 bug that breaks WAL replay.
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules.  Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references.  Improve the comments
while at it.

Back-patch to all supported versions, in case they get built with 4.6.0.

Tom Lane, with some cosmetic suggestions from Alex Hunsaker
2011-06-10 17:04:29 -04:00
Bruce Momjian 6560407c7d Pgindent run before 9.1 beta2. 2011-06-09 14:32:50 -04:00
Heikki Linnakangas a0c8514149 Shut down WAL receiver if it's still running at end of recovery. We used to
just check that it's not running and PANIC if it was, but that can rightfully
happen if recovery stops at recovery target.
2011-05-11 12:46:08 +03:00
Robert Haas aea1f24c2c recoveryStopsHere() must check the resource manager ID.
Before commit c016ce7281, this wasn't
needed, but now that multiple resource manager IDs can percolate down
through here, we have to make sure we know which one we've got.
Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record
with an XLOG_CHECKPOINT_SHUTDOWN record.

Review by Jaime Casanova
2011-04-18 08:27:19 -04:00
Heikki Linnakangas 54685b1c2b Revert the patch to check if we've reached end-of-backup also when doing
crash recovery, and throw an error if not. hubert depesz lubaczewski pointed
out that that situation also happens in the crash recovery following a
system crash that happens during an online backup.

We might want to do something smarter in 9.1, like put the check back for
backups taken with pg_basebackup, but that's for another patch.
2011-04-13 22:05:40 +03:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Tom Lane 2594cf0e8c Revise the API for GUC variable assign hooks.
The previous functions of assign hooks are now split between check hooks
and assign hooks, where the former can fail but the latter shouldn't.
Aside from being conceptually clearer, this approach exposes the
"canonicalized" form of the variable value to guc.c without having to do
an actual assignment.  And that lets us fix the problem recently noted by
Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus
log messages about "parameter "wal_buffers" cannot be changed without
restarting the server".  There may be some speed advantage too, because
this design lets hook functions avoid re-parsing variable values when
restoring a previous state after a rollback (they can store a pre-parsed
representation of the value instead).  This patch also resolves a
longstanding annoyance about custom error messages from variable assign
hooks: they should modify, not appear separately from, guc.c's own message
about "invalid parameter value".
2011-04-07 00:12:02 -04:00
Heikki Linnakangas 1f0bab8494 Improve error message when WAL ends before reaching end of online backup. 2011-03-31 10:09:49 +03:00
Heikki Linnakangas acf4740132 Check that we've reached end-of-backup also when we're not performing
archive recovery.

It's possible to restore an online backup without recovery.conf, by simply
copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that
too. That's the use case where this cross-check is useful.

Backpatch to 9.0. We used to do this in earlier versins, but in 9.0 the code
was inadvertently changed so that the check is only performed after archive
recovery.

Fujii Masao.
2011-03-30 10:53:28 +03:00
Simon Riggs b5f2f2a712 Minor changes to recovery pause behaviour.
Change location LOG message so it works each time we pause, not
just for final pause.
Ensure that we pause only if we are in Hot Standby and can connect
to allow us to run resume function. This change supercedes the
code to override parameter recoveryPauseAtTarget to false if not
attempting to enter Hot Standby, which is now removed.
2011-03-23 19:35:53 +00:00
Simon Riggs b98ac467f5 Prevent intermittent hang in recovery from bgwriter interaction.
Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.
2011-03-23 13:30:05 +00:00
Heikki Linnakangas 6d8096e2f3 When two base backups are started at the same time with pg_basebackup,
ensure that they use different checkpoints as the starting point. We use
the checkpoint redo location as a unique identifier for the base backup in
the end-of-backup record, and in the backup history file name.

Bug spotted by Fujii Masao.
2011-03-21 11:25:25 +02:00
Robert Haas 777e8c0015 Remove bogus semicolons in recoveryPausesHere.
Without this, the startup process goes into a tight loop, consuming
100% of one CPU and failing to respond to interrupts.
2011-03-18 08:09:09 -04:00
Bruce Momjian 5ca543fb2e Clarify C comment that O_SYNC/O_FSYNC are really the same settting, as
opposed to O_DSYNC.
2011-03-10 20:02:52 -05:00
Robert Haas d16e290a8a Emit a LOG message when pausing at the recovery target.
Fujii Masao
2011-03-10 14:37:14 -05:00
Heikki Linnakangas 4cd3fb6e12 Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.

This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
2011-03-08 12:12:54 +02:00
Heikki Linnakangas 1a4ab9ec23 If recovery_target_timeline is set to 'latest' and standby mode is enabled,
periodically rescan the archive for new timelines, while waiting for new WAL
segments to arrive. This allows you to set up a standby server that follows
the TLI change if another standby server is promoted to master. Before this,
you had to restart the standby server to make it notice the new timeline.

This patch only scans the archive for TLI changes, it won't follow a TLI
change in streaming replication. That is much needed too, but it would be a
much bigger patch than I dare to sneak in this late in the release cycle.

There was discussion on improving the sanity checking of the WAL segments so
that the system would notice more reliably if the new timeline isn't an
ancestor of the current one, but that is not included in this patch.

Reviewed by Fujii Masao.
2011-03-07 21:14:47 +02:00
Robert Haas 79ad8fc5f8 Named restore point improvements.
Emit a log message when creating a named restore point, and improve
documentation for pg_create_restore_point().

Euler Taveira de Oliveira, 	per suggestions from Thom Brown, with some
additional wordsmithing by me.
2011-02-24 19:02:00 -05:00
Simon Riggs bca8b7f16a Hot Standby feedback for avoidance of cleanup conflicts on standby.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.

Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
2011-02-16 19:29:37 +00:00
Robert Haas 4695da5ae9 pg_ctl promote
Fujii Masao, reviewed by Robert Haas, Stephen Frost, and Magnus Hagander.
2011-02-15 21:30:23 -05:00
Simon Riggs 5c588be729 PITR can stop at a named restore point when recovery target = time
though must not update the last transaction timestamp.
Plus comment and message cleanup for recent named restore point.

Fujii Masao, minor changes by me
2011-02-15 00:51:39 +00:00
Heikki Linnakangas b186523fd9 Send status updates back from standby server to master, indicating how far
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.

Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
2011-02-10 21:04:02 +02:00
Magnus Hagander 3144c33a2f Implement NOWAIT option for BASE_BACKUP command
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.

This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
2011-02-09 10:59:53 +01:00
Simon Riggs c016ce7281 Named restore points in recovery. Users can record named points, then
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.

Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
2011-02-08 19:39:08 +00:00
Simon Riggs 8c6e3adbf7 Basic Recovery Control functions for use in Hot Standby. Pause, Resume,
Status check functions only. Also, new recovery.conf parameter to
pause_at_recovery_target, default on.

Simon Riggs, reviewed by Fujii Masao
2011-02-08 18:30:22 +00:00
Simon Riggs faa0550572 Remove rare corner case for data loss when triggering standby server.
If the standby was streaming when trigger file arrives, check also in the
archive for additional WAL files. This is a corner case since it is
unlikely that we would trigger a failover while the master is still
available and sending data to standby, while at the same time running in
archive mode and also while the streaming standby has fallen behind archive.
Someone would eventually be unlucky; we must plug all gaps however small.

Fujii Masao
2011-02-08 14:38:02 +00:00
Robert Haas 0af695fd43 Log restartpoints in the same fashion as checkpoints.
Prior to 9.0, restartpoints never created, deleted, or recycled WAL
files, but now they can.  This code makes log_checkpoints treat
checkpoints and restartpoints symmetrically.  It also adjusts up
the documentation of the parameter to mention restartpoints.

Fujii Masao.  Docs by me, as suggested by Itagaki Takahiro.
2011-02-02 21:08:53 -05:00
Heikki Linnakangas 997b48ed96 Support multiple concurrent pg_basebackup backups.
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.

Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
2011-01-31 18:25:39 +02:00
Tom Lane 0f73aae13d Allow the wal_buffers setting to be auto-tuned to a reasonable value.
If wal_buffers is initially set to -1 (which is now the default), it's
replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default)
and a maximum of the XLOG segment size.  The allowed range for manual
settings is still from 4 up to whatever will fit in shared memory.

Greg Smith, with implementation correction by me.
2011-01-22 20:31:24 -05:00
Magnus Hagander 4448917d51 Split pg_start_backup() and pg_stop_backup() into two pieces
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.

Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
2011-01-09 21:00:28 +01:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas 53dbc27c62 Support unlogged tables.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Magnus Hagander 9b8aff8c19 Add REPLICATION privilege for ROLEs
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.

Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
2010-12-29 11:05:03 +01:00
Robert Haas 34c70c7ac4 Instrument checkpoint sync calls.
Greg Smith, reviewed by Jeff Janes
2010-12-14 09:26:19 -05:00
Tom Lane 04f4e10cfc Use symbolic names not octal constants for file permission flags.
Purely cosmetic patch to make our coding standards more consistent ---
we were doing symbolic some places and octal other places.  This patch
fixes all C-coded uses of mkdir, chmod, and umask.  There might be some
other calls I missed.  Inconsistency noted while researching tablespace
directory permissions issue.
2010-12-10 17:35:33 -05:00
Heikki Linnakangas 5a031a5556 Fix bugs in the hot standby known-assigned-xids tracking logic. If there's
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.

In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.

In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
2010-12-07 09:23:30 +01:00
Heikki Linnakangas 95e42a2c29 Fix two typos, by Fujii Masao. 2010-12-06 12:38:05 +01:00
Robert Haas 970a18687f Use GUC lexer for recovery.conf parsing.
This eliminates some crufty, special-purpose code and, as a non-trivial
side benefit, allows recovery.conf parameters to be unquoted.

Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki
Takahiro, and me.
2010-12-03 08:56:44 -05:00
Peter Eisentraut fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Heikki Linnakangas 542bdb2146 Fix bug introduced by the recent patch to check that the checkpoint redo
location read from backup label file can be found: wasShutdown was set
incorrectly when a backup label file was found.

Jeff Davis, with a little tweaking by me.
2010-11-11 19:32:11 +02:00
Robert Haas 7ba6e4f0e0 Add monitoring function pg_last_xact_replay_timestamp.
Fujii Masao, with a little wordsmithing by me.
2010-11-09 22:52:19 -05:00
Heikki Linnakangas 8c843fff2d Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001)
rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a
more future-proof fix for the corner-case bug in streaming replication that
was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in
February as well. Avoiding 0/0 as a valid value should prevent bugs like that
in the future. Per Tom Lane's idea.

Back-patch to 9.0. Since this only affects bootstrapping, it makes no
difference to existing installations. We don't need to worry about the
bug in existing installations, because if you've managed to get past the
initial base backup already, you won't hit the bug in the future either.
2010-11-02 11:39:48 +02:00
Heikki Linnakangas 931b6db39b Fix corner-case bug in tracking of latest removed WAL segment during
streaming replication. We used log/seg 0/0 to indicate that no WAL segments
have been removed since startup, but 0/0 is a valid value for the very first
WAL segment after initdb. To make that disambiguous, store
(latest removed WAL segment + 1) in the global variable.

Per report from Matt Chesler, also reproduced by Greg Smith.
2010-11-01 10:05:15 +02:00
Heikki Linnakangas 0c6293dd03 Before removing backup_label and irrevocably changing pg_control file, check
that WAL file containing the checkpoint redo-location can be found. This
avoids making the cluster irrecoverable if the redo location is in an earlie
WAL file than the checkpoint record.

Report, analysis and patch by Jeff Davis, with small changes by me.
2010-10-26 21:43:52 +03:00
Simon Riggs 3bbcc5c999 Make startup process respond to signals to cancel waiting on latch.
A tidy up for recently committed changes to startup latch.

Fujii Masao
2010-10-14 19:15:26 +01:00
Simon Riggs 45cd9199c2 Fix bug in comment of timeline history file.
Fujii Masao
2010-10-14 19:06:06 +01:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Heikki Linnakangas 79b54816db Fix two typos in comments, spotted by Fujii Masao and Thom Brown 2010-09-15 13:58:22 +00:00
Heikki Linnakangas 723d0184e2 Use a latch to make startup process wake up and replay immediately when
new WAL arrives via streaming replication. This reduces the latency, and
also allows us to use a longer polling interval, which is good for energy
efficiency.

We still need to poll to check for the appearance of a trigger file, but
the interval is now 5 seconds (instead of 100ms), like when waiting for
a new WAL segment to appear in WAL archive.
2010-09-15 10:35:05 +00:00
Simon Riggs ac791d3ca1 Fix misleading DEBUG2 issued during RemoveOldXlogFiles() 2010-08-30 15:37:41 +00:00
Simon Riggs e72f15ed60 Truncate subtrans after each restartpoint.
Issue reported by Harald Kolb, patch by Fujii Masao, review by me.
2010-08-30 14:22:05 +00:00
Alvaro Herrera 3a1b51de19 Remove duplicate translatable phrase 2010-08-26 19:23:41 +00:00
Simon Riggs 5b8bd0529e Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0.
Transaction aborts now record their LSN to avoid corner case
behaviour in SR/HS, hence change of name of variables and functions.
As pointed out by Fujii Masao. Cosmetic changes only.
2010-07-29 22:27:27 +00:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane 8771634666 Don't set recoveryLastXTime when replaying a checkpoint --- that was a bogus
idea from the start since the variable is only meant to track commit/abort
events.  This patch reverts the logic around the variable to what it was in
8.4, except that the value is now kept in shared memory rather than a static
variable, so that it can be reported correctly by CreateRestartPoint (which is
executed in the bgwriter).
2010-07-03 22:15:45 +00:00
Tom Lane e76c1a0f4d Replace max_standby_delay with two parameters, max_standby_archive_delay and
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server.  Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.

Do some desultory editing on the hot standby documentation, too.
2010-07-03 20:43:58 +00:00
Robert Haas 400916b6d7 emode_for_corrupt_record shouldn't reduce LOG messages to WARNING.
In non-interactive sessions, WARNING sorts below LOG.
2010-06-28 19:46:19 +00:00
Tom Lane 09698bb5fb Make RemoveOldXlogFiles's debug printout match style used elsewhere:
log and seg aren't an XLogRecPtr and shouldn't be printed like one.
Fujii Masao
2010-06-17 17:37:23 +00:00
Tom Lane 07e8b6aabc Don't allow walsender to send WAL data until it's been safely fsync'd on the
master.  Otherwise a subsequent crash could cause the master to lose WAL that
has already been applied on the slave, resulting in the slave being out of
sync and soon corrupt.  Per recent discussion and an example from Robert Haas.

Fujii Masao
2010-06-17 16:41:25 +00:00
Heikki Linnakangas 6da07cd80d If a corrupt WAL record is received by streaming replication, disconnect
and retry. If the record is genuinely corrupt in the master database,
there's little hope of recovering, but it's better than simply retrying
to apply the corrupt WAL record in a tight loop without even trying to
retransmit it, which is what we used to do.
2010-06-14 06:04:21 +00:00
Peter Eisentraut c86efdde5f Fix typo/bug, found by Clang compiler 2010-06-12 09:14:52 +00:00
Itagaki Takahiro 56834fc759 Rename restartpoint_command to archive_cleanup_command. 2010-06-10 08:13:50 +00:00
Heikki Linnakangas 0a7cb85531 Make TriggerFile variable static. It's not used outside xlog.c.
Fujii Masao
2010-06-10 07:49:23 +00:00
Heikki Linnakangas 346d7cd7fa Return NULL instead of 0/0 in pg_last_xlog_receive_location() and
pg_last_xlog_replay_location(). Per Robert Haas's suggestion, after
Itagaki Takahiro pointed out an issue in the docs. Also, some wording
changes in the docs by me.
2010-06-10 07:00:27 +00:00
Heikki Linnakangas 71815306e9 In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is large
and checkpoint_segments small.

Patch by Fujii Masao, with some minor changes by me.
2010-06-09 15:04:07 +00:00
Magnus Hagander 8c873bbfa7 Make the walwriter close it's handle to an old xlog segment if it's no longer
the current one. Not doing this would leave the walwriter with a handle to a
deleted file if there was nothing for it to do for a long period of time,
preventing the file from  being completely removed.

Reported by Tollef Fog Heen, and thanks to Heikki for some hand-holding with
the patch.
2010-06-09 10:54:45 +00:00
Peter Eisentraut cb6038c168 Fix some inconsistent quoting of wal_level values in messages
When referring to postgresql.conf syntax, then it's without quotes
(wal_level=archive); in narrative it's with double quotes.  But never
single quotes.
2010-06-03 21:02:12 +00:00
Robert Haas d561430b66 On clean shutdown during recovery, don't warn about possible corruption.
Fujii Masao.  Review by Heikki Linnakangas and myself.
2010-06-03 03:20:00 +00:00
Heikki Linnakangas 6b24036365 Fix obsolete comments that I neglected to update in a previous patch.
Fujii Masao
2010-06-02 09:28:44 +00:00
Heikki Linnakangas c5bd8feac6 Adjust comment to reflect that we now have Hot Standby. Pointed out by
Robert Haas.
2010-05-27 00:38:39 +00:00
Robert Haas ea9968c331 Rename PM_RECOVERY_CONSISTENT and PMSIGNAL_RECOVERY_CONSISTENT.
The new names PM_HOT_STANDBY and PMSIGNAL_BEGIN_HOT_STANDBY more accurately
reflect their actual function.
2010-05-15 20:01:32 +00:00
Simon Riggs 4a24c9a063 Fix bug in processing of checkpoint time for max_standby_delay. Latest
log time was incorrectly set, typically leading to dates in the past,
which would cause more cancellations in Hot Standby on a quiet server.
2010-05-15 07:14:43 +00:00
Simon Riggs fd34374b17 Add many new Asserts in code and fix simple bug that slipped through
without them, related to previous commit. Report by Bruce Momjian.
2010-05-14 07:11:49 +00:00
Simon Riggs 8431e296ea Cleanup initialization of Hot Standby. Clarify working with reanalysis
of requirements and documentation on LogStandbySnapshot(). Fixes
two minor bugs reported by Tom Lane that would lead to an incorrect
snapshot after transaction wraparound. Also fix two other problems
discovered that would give incorrect snapshots in certain cases.
ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor
refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().
2010-05-13 11:15:38 +00:00
Heikki Linnakangas ffe8c7c677 Need to hold ControlFileLock while updating control file. Update
minRecoveryPoint in control file when replaying a parameter change record,
to ensure that we don't allow hot standby on WAL generated without
wal_level='hot_standby' after a standby restart.
2010-05-03 11:17:52 +00:00
Tom Lane f9ed327f76 Clean up some awkward, inaccurate, and inefficient processing around
MaxStandbyDelay.  Use the GUC units mechanism for the value, and choose more
appropriate timestamp functions for performing tests with it.  Make the
ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have
behavior similar to ps_activity code elsewhere, notably not updating the
display when update_process_title is off and not truncating the display
contents at an arbitrarily-chosen length.  Improve the docs to be explicit
about what MaxStandbyDelay actually measures, viz the difference between
primary and standby servers' clocks, and the possible hazards if their clocks
aren't in sync.
2010-05-02 02:10:33 +00:00
Tom Lane 69f7a4d8e3 Adjust error checks in pg_start_backup and pg_stop_backup to make it possible
to perform a backup without archive_mode being enabled.  This gives up some
user-error protection in order to improve usefulness for streaming-replication
scenarios.  Per discussion.
2010-04-29 21:49:03 +00:00
Tom Lane f0488bd57c Rename the parameter recovery_connections to hot_standby, to reduce possible
confusion with streaming-replication settings.  Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation.  Per discussion.

In passing do some minor editing of related documentation.
2010-04-29 21:36:19 +00:00
Heikki Linnakangas 9b8a73326e Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.

Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.

Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 16:10:43 +00:00
Heikki Linnakangas 3efba16d56 If a base backup is cancelled by server shutdown or crash, throw an error
in WAL recovery when it sees the shutdown checkpoint record. It's more
user-friendly to find out about it at that point than at the end of
recovery, and you're not left wondering why your hot standby server never
opens up for read-only connections.
2010-04-27 09:25:18 +00:00
Simon Riggs 491d1ea5b3 Previous patch revoked following objections. 2010-04-23 20:21:31 +00:00
Simon Riggs 6ca23b1a29 Make CheckRequiredParameterValues() depend upon correct combination
of parameters. Fix bug report by Robert Haas that error message and
hint was incorrect if wrong mode parameters specified on master.
Internal changes only. Proposals for parameter simplification on
master/primary still under way.
2010-04-23 19:57:19 +00:00
Robert Haas 481cb5d9b5 Rename standby_keep_segments to wal_keep_segments.
Also, make the name of the GUC and the name of the backing variable match.
Alnong the way, clean up a couple of slight typographical errors in the
related docs.
2010-04-20 11:15:06 +00:00
Simon Riggs d38603bd97 Improve sequence and sense of messages from pg_stop_backup().
Now doesn't report it is waiting until it actually is waiting,
plus message doesn't appear until at least 5 seconds wait, so
we avoid reporting the wait before we've given the archiver
a reasonable time to wake up and archive the file we just
created earlier in the function.
Also add new unconditional message to confirm safe completion.
Now a normal, healthy execution does not report waiting at
all, just safe completion.
2010-04-18 18:44:53 +00:00
Simon Riggs 2847de9df2 Remove some additional changes in previous commit that belong elsewhere. 2010-04-18 18:17:12 +00:00
Simon Riggs 21d6a6a128 Tune GetSnapshotData() during Hot Standby by avoiding loop
through normal backends. Makes code clearer also, since we
avoid various Assert()s. Performance of snapshots taken
during recovery no longer depends upon number of read-only
backends.
2010-04-18 18:06:07 +00:00
Heikki Linnakangas 78974cfb9b In standby mode, suppress repeated LOG messages about a corrupt record,
which just indicates that we've reached the end of valid WAL found in
the standby.
2010-04-16 08:58:16 +00:00
Bruce Momjian ec4b9bcc3d Doc change: effect -> affect, per Robert Haas 2010-04-15 03:05:59 +00:00
Simon Riggs 55d7556a4d Fix minor typo in comment in xlog.c 2010-04-14 10:29:07 +00:00
Heikki Linnakangas 361bd1662e Allow Hot Standby to begin from a shutdown checkpoint.
Patch by Simon Riggs & me
2010-04-13 14:17:46 +00:00
Heikki Linnakangas 30556568f5 Update the location of last removed WAL segment in shared memory only
after actually removing one, so that if we can't remove segments because
WAL archiving is lagging behind, we don't unnecessarily forbid streaming
the old not-yet-archived segments that are still perfectly valid. Per
suggestion from Fujii Masao.
2010-04-12 10:40:43 +00:00
Heikki Linnakangas e57cd7f0a1 Change the logic to decide when to delete old WAL segments, so that it
doesn't take into account how far the WAL senders are. This way a hung
WAL sender doesn't prevent old WAL segments from being recycled/removed
in the primary, ultimately causing the disk to fill up. Instead add
standby_keep_segments setting to control how many old WAL segments are
kept in the primary. This also makes it more reliable to use streaming
replication without WAL archiving, assuming that you set
standby_keep_segments high enough.
2010-04-12 09:52:29 +00:00
Heikki Linnakangas 0f11ed5886 Allow quotes to be escaped in recovery.conf, by doubling them. This patch
also makes the parsing a little bit stricter, rejecting garbage after the
parameter value and values with missing ending quotes, for example.
2010-04-07 10:58:49 +00:00
Heikki Linnakangas 370f770c15 Forbid using pg_xlogfile_name() and pg_xlogfile_name_offset() during
recovery. We might want to relax this in the future, but ThisTimeLineID
isn't currently correct in backends during recovery, so the filename
returned was wrong.
2010-04-07 06:12:52 +00:00
Simon Riggs 89c5008158 Further message changes when recovery.conf parameters missing. 2010-04-06 17:51:58 +00:00
Simon Riggs cf2575b8c4 Check compulsory parameters in recovery.conf in standby_mode, per docs. 2010-04-02 21:50:40 +00:00
Simon Riggs 31f00d163b Move system startup message prior to any calls out of data directory.
This allows us to see what mode the server is in before it starts to
perform actions that can block or hang. Otherwise server messages
may not appear until after messages that say FATAL the database
server is starting up.
2010-04-02 13:10:56 +00:00
Robert Haas 54943734f8 Refer to max_wal_senders in a more consistent fashion.
The error message now makes explicit reference to the GUC that must be changed
to fix the problem, using wording suggested by Tom Lane.  Along the way,
rename the GUC from MaxWalSenders to max_wal_senders for consistency and
grep-ability.
2010-04-01 00:43:29 +00:00
Heikki Linnakangas 2a77355ea1 Change the retry-loop in standby mode to also try restoring files from
pg_xlog directory. This is essential for replaying WAL records that
were streamed from the master, after a standby server restart.

If a corrupt record is seen in a file restored from the archive or
streamed from the master, log it as a WARNING and keep retrying. If the
corruption is permanent, and not just a glitch in the whatever copies the
files to the archive or a network error not caught by CRC checks in TCP
for example, we will keep retrying and logging the WARNING indefinitely.
But that's better than shutting down completely, the standby is still
useful for running read-only queries. In PITR the recovery ends at such a
corrupt record, which is a bit questionable, but that's the behavior we
had in previous releases and we don't feel like chaning it now. It does
make sense for tools like pg_standby.
2010-03-30 16:23:57 +00:00
Peter Eisentraut c248d17120 Message tuning 2010-03-21 00:17:59 +00:00
Simon Riggs 3cdafe40e7 Adjust comment in .history file to match recovery target specified. Comment
present since 8.0 was never fully meaningful, since two recovery targets
cannot be specified. Refactor recovery target type to make this change
and associated code easier to understand. No change in function.

Bug report arising from internal support question.
2010-03-19 11:05:15 +00:00
Heikki Linnakangas c21ac0b58e Add restartpoint_command option to recovery.conf. Fix bug in %r handling
in recovery_end_command, it always came out as 0 because InRedo was
cleared before recovery_end_command was executed. Also, always take
ControlFileLock when reading checkpoint location for %r.

The recovery_end_command bug and the missing locking was present in 8.4
as well, that part of this patch will be backported separately.
2010-03-18 09:17:18 +00:00
Simon Riggs 1a163a0c68 Remove incorrect comment from GetWriteRecPtr(): the return value is always
correct, as described in comments at start of xlog.c
2010-03-15 18:49:17 +00:00
Itagaki Takahiro 17d8de0e61 pg_start_backup() can use a share lock to lock ControlFileLock
instead of an exclusive lock.

The change is almost for code cleanup. Since there seems to be no
performance benefits from it, backports should not be needed.

Fujii Masao
2010-03-10 02:04:48 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane a2239b96e0 Make pg_stop_backup's reporting a bit more verbose in hopes of making
error cases less intimidating for novices.  Per discussion.

Greg Smith
2010-02-25 02:17:50 +00:00
Heikki Linnakangas ad458cfe81 Don't use O_DIRECT when writing WAL files if archiving or streaming is
enabled. Bypassing the kernel cache is counter-productive in that case,
because the archiver/walsender process will read from the WAL file
soon after it's written, and if it's not cached the read will cause
a physical read, eating I/O bandwidth available on the WAL drive.

Also, walreceiver process does unaligned writes, so disable O_DIRECT
in walreceiver process for that reason too.
2010-02-19 10:51:04 +00:00
Itagaki Takahiro 3230fd056a Fix STOP WAL LOCATION in backup history files no to return the next
segment of XLOG_BACKUP_END record even if the the record is placed
at a segment boundary. Furthermore the previous implementation could
return nonexistent segment file name when the boundary is in segments
that has "FE" suffix; We never use segments with "FF" suffix.

Backpatch to 8.0, where hot backup was introduced.

Reported by Fujii Masao.
2010-02-19 01:04:03 +00:00
Tom Lane 50a90fac40 Stamp HEAD as 9.0devel, and update various places that were referring to 8.5
(hope I got 'em all).  Per discussion, this release will be 9.0 not 8.5.
2010-02-17 04:19:41 +00:00
Tom Lane c64339face When updating ShmemVariableCache from a checkpoint record, be sure to set
all the values derived from oldestXid, not just that field.  Brain fade in
one of my patches associated with flat file removal, exposed by a report
from Fujii Masao.

With this change, xidVacLimit should always be valid, so remove a couple of
bits of complexity associated with the previous assumption that sometimes
it wouldn't get set right away.
2010-02-17 03:10:33 +00:00
Heikki Linnakangas e465390d03 Reduce the chatter to the log when starting a standby server. Don't
echo all the recovery.conf options. Don't emit the "initializing
recovery connections" message, which doesn't mean anything to a user.
Remove the "starting archive recovery" message and replace the
"automatic recovery in progress" message with a more informative message
saying whether the server is doing PITR, normal archive recovery, or
standby mode.
2010-02-12 09:49:08 +00:00
Heikki Linnakangas 54cbd1757e If primary_conninfo is not set, don't try to establish streaming
connection.
2010-02-12 07:56:36 +00:00
Heikki Linnakangas 9fa01f6c8a Check for partial WAL files in standby mode. If restore_command restores
a partial WAL file, assume it's because the file is just being copied to
the archive and treat it the same as "file not found" in standby mode.
pg_standby has a similar check, so it seems reasonable to have the same
level of protection in the built-in standby mode.
2010-02-12 07:36:44 +00:00
Heikki Linnakangas 161d9d51b3 Now that streaming replication switches between streaming mode and
restoring from archive, the last WAL segment is not necessarily open at
the end of recovery. Fix assertion that assumed that.

Fujii Masao, fixing the assertion failure reported by Martin Pihlak.
2010-02-10 08:25:25 +00:00
Heikki Linnakangas 4cea603128 Remove piece of code to zero out minRecoveryPoint when starting crash
recovery. It's zeroed out whenever a checkpoint is written, so the only
scenario where the removed code did anything is when you kill archive
recovery, remove recovery.conf, and start up the server, so that it goes
into crash recovery instead. That's a "don't do that" scenario, but it
seems better to not clear minRecoveryPoint but instead update it like we
do in archive recovery, which is what will now happen.
2010-02-08 09:08:51 +00:00
Tom Lane 0a469c8769 Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
2010-02-08 04:33:55 +00:00
Tom Lane b9b8831ad6 Create a "relation mapping" infrastructure to support changing the relfilenodes
of shared or nailed system catalogs.  This has two key benefits:

* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.

* We no longer have to use an unsafe reindex-in-place approach for reindexing
  shared catalogs.

CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.

Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.

This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch.  As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
2010-02-07 20:48:13 +00:00
Simon Riggs 296578feb4 Revoke augmentation of WAL records for btree delete, per discussion. 2010-02-01 13:40:28 +00:00
Simon Riggs 6d2bc0a6cf Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
2010-01-29 18:39:05 +00:00
Heikki Linnakangas b0509ef601 Fix crashing bug at the end of recovery in Streaming Replication, when
restore_command is not given. Fujii Masao.
2010-01-28 19:17:22 +00:00
Heikki Linnakangas 83cb7da7dc Fix bug in wasender's xlogid boundary handling, reported by Erik Rijkers.
LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't
try to send that in XLogSend().

Also fix similar bug in ReadRecord(), which I just introduced in the
ReadRecord() refactoring patch.
2010-01-27 16:41:09 +00:00
Heikki Linnakangas 1bb2558046 Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.

This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.

To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.

This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.
2010-01-27 15:27:51 +00:00
Simon Riggs aed1a0121a Fix longstanding gripe that we check for 0000000001.history at start of
archive recovery, even when we know it is never present.
2010-01-26 00:07:13 +00:00
Simon Riggs 959ac58c04 In HS, Startup process sets SIGALRM when waiting for buffer pin. If
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
2010-01-23 16:37:12 +00:00
Heikki Linnakangas 09b115f706 Write a WAL record whenever we perform an operation without WAL-logging
that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.

Original patch by Fujii Masao, with changes by me.
2010-01-20 19:43:40 +00:00
Heikki Linnakangas 40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00
Heikki Linnakangas 06f82b2961 Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at
recovery instead of reading the backup history file. This is more robust,
as it stops you from prematurely starting up an inconsisten cluster if the
backup history file is lost for some reason, or if the base backup was
never finished with pg_stop_backup().

This also paves the way for a simpler streaming replication patch, which
doesn't need to care about backup history files anymore.

The backup history file is still created and archived as before, but it's
not used by the system anymore. It's just for informational purposes now.

Bump PG_CONTROL_VERSION as the location of the backup startpoint is now
written to a new field in pg_control, and catversion because initdb is
required

Original patch by Fujii Masao per Simon's idea, with further fixes by me.
2010-01-04 12:50:50 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Heikki Linnakangas ff1e1e45b9 Reset minRecoveryPoint at checkpoints, so that we don't uselessly update
it in the control file at crash recovery following an archive recovery.

Per Fujii Masao and subsequent discussion.
2009-12-30 08:37:21 +00:00
Simon Riggs efc16ea520 Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00
Heikki Linnakangas 7f2a10fecd Don't error out if recycling or removing an old WAL segment fails at the end
of checkpoint. Although the checkpoint has been written to WAL at that point
already, so that all data is safe, and we'll retry removing the WAL segment at
the next checkpoint, if such a failure persists we won't be able to remove any
other old WAL segments either and will eventually run out of disk space. It's
better to treat the failure as non-fatal, and move on to clean any other WAL
segment and continue with any other end-of-checkpoint cleanup.

We don't normally expect any such failures, but on Windows it can happen with
some anti-virus or backup software that lock files without FILE_SHARE_DELETE
flag.

Also, the loop in pgrename() to retry when the file is locked was broken. If a
file is locked on Windows, you get ERROR_SHARE_VIOLATION, not
ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left
the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct
in some environment), and added ERROR_LOCK_VIOLATION to be consistent with
similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to
10s, on the grounds that since it's been broken, we've effectively had a
timeout of 0s and no-one has complained, so a smaller timeout is actually
closer to the old behavior. A longer timeout would mean that if recycling a
WAL file fails because it's locked for some reason, InstallXLogFileSegment()
will hold ControlFileLock for longer, potentially blocking other backends, so
a long timeout isn't totally harmless.

While we're at it, set errno correctly in pgrename().

Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c
changes would make sense on other platforms and thus on older versions as
well, but since there's no such locking issues on other platforms, it's not
worth it.
2009-09-13 18:32:08 +00:00
Heikki Linnakangas 4e2d5efc6a On Windows, when a file is deleted and another process still has an open
file handle on it, the file goes into "pending deletion" state where it
still shows up in directory listing, but isn't accessible otherwise. That
confuses RemoveOldXLogFiles(), making it think that the file hasn't been
archived yet, while it actually was, and it was deleted along with the .done
file.

Fix that by renaming the file with ".deleted" extension before deleting it.
Also check the return value of rename() and unlink(), so that if the removal
fails for any reason (e.g another process is holding the file locked), we
don't delete the .done file until the WAL file is really gone.

Backpatch to 8.2, which is the oldest version supported on Windows.
2009-09-10 09:42:10 +00:00
Alvaro Herrera a8bb8eb583 Remove flatfiles.c, which is now obsolete.
Recent commits have removed the various uses it was supporting.  It was a
performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems
it slowed down user creation after a billion users.
2009-09-01 02:54:52 +00:00
Tom Lane 25ec228ef7 Track the current XID wrap limit (or more accurately, the oldest unfrozen
XID) in checkpoint records.  This eliminates the need to recompute the value
from scratch during database startup, which is one of the two remaining
reasons for the flatfile code to exist.  It should also simplify life for
hot-standby operation.

To avoid bloating the checkpoint records unreasonably, I switched from
tracking the oldest database by name to tracking it by OID.  This turns
out to save cycles in general (everywhere but the warning-generating
paths, which we hardly care about) and also helps us deal with the case
that the oldest database got dropped instead of being vacuumed.  The prior
coding might go for a long time without updating the wrap limit in that case,
which is bad because it might result in a lot of useless autovacuum activity.
2009-08-31 02:23:23 +00:00
Heikki Linnakangas 9cd6685f91 In the checkpoint written at the end of archive recovery, the WAL page header
was incorrectly initialized with timeline ID 0. That rendered the WAL page
unrecoverable, making a subsequent archive recovery stop at that point.
ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer().

This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug
was introduced by the changes to use of bgwriter for writing the
end-of-archive-recovery checkpoint. Patch by Tom Lane.
2009-08-27 07:15:41 +00:00
Tom Lane 04011cc970 Allow backends to start up without use of the flat-file copy of pg_database.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c.  This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h.  When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.

In the normal case, we are able to do an indexscan to find the database's row
by name.  This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database).  A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.

This change should effectively eliminate number of databases as a factor
in backend startup time, even with large numbers of databases.  However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether.  There are still several other sub-projects
to be tackled before that can happen.
2009-08-12 20:53:31 +00:00
Tom Lane 97e14f6e93 Document that LocalSetXLogInsertAllowed can be re-executed.
Per comment from Simon.
2009-08-08 16:39:17 +00:00
Tom Lane 87740caa01 rm_cleanup functions need to be allowed to write WAL entries. This oversight
appears to explain the recent reports of "PANIC: cannot make new WAL entries
during recovery".
2009-08-07 19:29:49 +00:00
Tom Lane 2de48a83e6 Cleanup and code review for the patch that made bgwriter active during
archive recovery.  Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy.  Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process).  Clarify handling of bad LSNs passed
to XLogFlush during recovery.  Use a spinlock for setting/testing
SharedRecoveryInProgress.  Improve quite a lot of comments.

Heikki and Tom
2009-06-26 20:29:04 +00:00