Commit Graph

1123 Commits

Author SHA1 Message Date
Magnus Hagander bc5ac36865 Add function pg_xlog_location_diff to help comparisons
Comparing two xlog locations are useful for example when calculating
replication lag.

Euler Taveira de Oliveira, reviewed by Fujii Masao, and some cleanups
from me
2012-03-04 12:22:38 +01:00
Heikki Linnakangas 1a01560cbb Rename LWLockWaitUntilFree to LWLockAcquireOrWait.
LWLockAcquireOrWait makes it more clear that the lock is acquired if it's
free.
2012-02-08 09:17:13 +02:00
Tom Lane c6d76d7c82 Add locking around WAL-replay modification of shared-memory variables.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed.  That
assumption fails in Hot Standby.  While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot.  Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock.  We were following that coding rule in some
places but not all.  This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.

In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.

The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it.  The NEXTOID
fix will be back-patched separately.
2012-02-06 12:34:10 -05:00
Tom Lane 17118825b8 Fix transient clobbering of shared buffers during WAL replay.
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation.  The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer.  Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200.  It most likely explains the
original report as well, though we don't yet have confirmation of that.

To fix, change the code so that only bytes that are supposed to change will
change, even transiently.  This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.

Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.

So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.

Back-patch to 9.0 where Hot Standby was added.
2012-02-05 15:49:17 -05:00
Heikki Linnakangas 9b38d46d9f Make group commit more effective.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.

This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.

Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
2012-01-30 16:53:48 +02:00
Tom Lane ad10853b30 Assorted comment fixes, mostly just typos, but some obsolete statements.
YAMAMOTO Takashi
2012-01-29 19:23:56 -05:00
Simon Riggs 8366c7803e Allow pg_basebackup from standby node with safety checking.
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.

Jun Ishizuka and Fujii Masao, with minor modifications
2012-01-25 18:02:04 +00:00
Simon Riggs 5530623d03 Correctly initialise shared recoveryLastRecPtr in recovery.
Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.

Fujii Masao
2012-01-13 13:02:44 +00:00
Heikki Linnakangas 1b9dea04b5 Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always
passed as 'true'.
2012-01-11 11:01:47 +02:00
Heikki Linnakangas 9c808f89c2 Refactor XLogInsert a bit. The rdata entries for backup blocks are now
constructed before acquiring WALInsertLock, which slightly reduces the time
the lock is held. Although I could not measure any benefit in benchmarks,
the code is more readable this way.
2012-01-11 11:01:47 +02:00
Robert Haas 33aaa139e6 Make the number of CLOG buffers adaptive, based on shared_buffers.
Previously, this was hardcoded: we always had 8.  Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory.  On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.

Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
2012-01-06 14:32:18 -05:00
Bruce Momjian e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Simon Riggs 64233902d2 Send new protocol keepalive messages to standby servers.
Allows streaming replication users to calculate transfer latency
and apply delay via internal functions. No external functions yet.
2011-12-31 13:30:26 +00:00
Tom Lane d0024cd188 Avoid crashing when we have problems unlinking files post-commit.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file.  If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.

Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.

Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
2011-12-20 15:00:36 -05:00
Tom Lane dd45d3ad33 Fix some long-obsolete references to XLogOpenRelation.
These were missed in commit a213f1ee6c,
which removed that function.
2011-12-17 18:26:52 -05:00
Tom Lane 8daeb5ddd6 Add SP-GiST (space-partitioned GiST) index access method.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees.  As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.

There are a number of areas that could still use improvement, but at this
point the code seems committable.

Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
2011-12-17 16:42:30 -05:00
Tom Lane 2dd9322ba6 Move BKP_REMOVABLE bit from individual WAL records to WAL page headers.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.

When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks.  The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit.  In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.

Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
2011-12-12 16:22:14 -05:00
Heikki Linnakangas 9f0d2bdc88 Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.

In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
2011-12-09 15:21:12 +02:00
Heikki Linnakangas 1e616f6391 During recovery, if we reach consistent state and still have entries in the
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.

Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.

Fujii Masao
2011-12-02 10:49:54 +02:00
Robert Haas ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Simon Riggs 4de82f7d7c Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
2011-11-13 09:00:57 +00:00
Simon Riggs a030bfa6e4 Move user functions related to WAL into xlogfuncs.c 2011-11-04 09:37:17 +00:00
Simon Riggs 750f70b0fe Update more comments about checkpoints being done by bgwriter 2011-11-02 17:15:35 +00:00
Simon Riggs 18fb9d8d21 Reduce checkpoints and WAL traffic on low activity database server
Previously, we skipped a checkpoint if no WAL had been written since
last checkpoint, though this does not appear in user documentation.
As of now, we skip a checkpoint until we have written at least one
enough WAL to switch the next WAL file. This greatly reduces the
level of activity and number of WAL messages generated by a very
low activity server. This is safe because the purpose of a checkpoint
is to act as a starting place for a recovery, in case of crash.
This patch maintains minimal WAL volume for replay in case of crash,
thus maintaining very low crash recovery time.
2011-11-02 15:26:33 +00:00
Simon Riggs 9aceb6ab3c Refactor xlog.c to create src/backend/postmaster/startup.c
Startup process now has its own dedicated file, just like all other
special/background processes. Reduces role and size of xlog.c
2011-11-02 14:25:01 +00:00
Simon Riggs 86e3364899 Derive oldestActiveXid at correct time for Hot Standby.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.

Bug report by Chris Redekop
2011-11-02 08:54:56 +00:00
Simon Riggs f8409b39d1 Fix timing of Startup CLOG and MultiXact during Hot Standby
Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
2011-11-02 08:07:44 +00:00
Simon Riggs f3ebaad45b Comment changes to show bgwriter no longer performs checkpoints. 2011-11-01 18:48:47 +00:00
Tom Lane bb446b689b Support synchronization of snapshots through an export/import procedure.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT.  The data does not
leave the server so there are not security issues.  A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.

I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.

Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
2011-10-22 18:23:30 -04:00
Tom Lane aa90e148ca Suppress -Wunused-result warnings about write() and fwrite().
This is merely an exercise in satisfying pedants, not a bug fix, because
in every case we were checking for failure later with ferror(), or else
there was nothing useful to be done about a failure anyway.  Document
the latter cases.
2011-10-18 21:37:51 -04:00
Tom Lane fa56a0c3e0 Fix uninitialized-variable bug. 2011-10-04 17:08:18 -04:00
Alvaro Herrera 09e196e453 Use callbacks in SlruScanDirectory for the actual action
Previously, the code assumed that the only possible action to take was
to delete files behind a certain cutoff point.  The async notify code
was already a crock: it used a different "pagePrecedes" function for
truncation than for regular operation.  By allowing it to pass a
callback to SlruScanDirectory it can do cleanly exactly what it needs to
do.

The clog.c code also had its own use for SlruScanDirectory, which is
made a bit simpler with this.
2011-10-04 14:03:23 -03:00
Tom Lane d56b3afc03 Restructure error handling in reading of postgresql.conf.
This patch has two distinct purposes: to report multiple problems in
postgresql.conf rather than always bailing out after the first one,
and to change the policy for whether changes are applied when there are
unrelated errors in postgresql.conf.

Formerly the policy was to apply no changes if any errors could be
detected, but that had a significant consistency problem, because in some
cases specific values might be seen as valid by some processes but invalid
by others.  This meant that the latter processes would fail to adopt
changes in other parameters even though the former processes had done so.

The new policy is that during SIGHUP, the file is rejected as a whole
if there are any errors in the "name = value" syntax, or if any lines
attempt to set nonexistent built-in parameters, or if any lines attempt
to set custom parameters whose prefix is not listed in (the new value of)
custom_variable_classes.  These tests should always give the same results
in all processes, and provide what seems a reasonably robust defense
against loading values from badly corrupted config files.  If these tests
pass, all processes will apply all settings that they individually see as
good, ignoring (but logging) any they don't.

In addition, the postmaster does not abandon reading a configuration file
after the first syntax error, but continues to read the file and report
syntax errors (up to a maximum of 100 syntax errors per file).

The postmaster will still refuse to start up if the configuration file
contains any errors at startup time, but these changes allow multiple
errors to be detected and reported before quitting.

Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?)
with some additional hacking by Tom Lane
2011-10-02 16:50:04 -04:00
Tom Lane 57eb009092 Allow snapshot references to still work during transaction abort.
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction.  This is the root cause of bug #6218 from Yamamoto
Takashi.  While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources.  However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner.  Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward.  predicate.c's dependencies on TopTransactionResourceOwner
go away too.

I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
2011-09-26 22:25:28 -04:00
Tom Lane a7801b62f2 Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.

No actual code changes here, just header refactoring.
2011-09-09 13:23:41 -04:00
Simon Riggs df383b03e6 Partially revoke attempt to improve performance with many savepoints.
Maintain difference between subtransaction release and commit introduced
by earlier patch.
2011-09-07 12:11:26 +01:00
Alvaro Herrera 56a9ed92b6 Adjust translator comment format to xgettext expectations 2011-09-05 19:04:30 -03:00
Alvaro Herrera b64f18c583 Mark some untranslatable messages with errmsg_internal 2011-09-05 17:48:07 -03:00
Tom Lane 1609797c25 Clean up the #include mess a little.
walsender.h should depend on xlog.h, not vice versa.  (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.)  Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h.  Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.

This episode reinforces my feeling that pgrminclude should not be run
without adult supervision.  Inclusion changes in header files in particular
need to be reviewed with great care.  More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
2011-09-04 01:13:16 -04:00
Peter Eisentraut f1e4f3d44f Whitespace adjustment for consistency in the file 2011-09-03 01:28:05 +03:00
Bruce Momjian 6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Robert Haas eab2ef6164 Remove some tabs from README file.
Some of the ASCII art expected 8-space tab stops, and some of it
expected 4-space tab stops.

Per report from YAMAMOTO Takashi.
2011-08-29 22:26:29 -04:00
Bruce Momjian f261deb4b4 Add missing includes after pgrminclude run. 2011-08-26 18:15:14 -04:00
Heikki Linnakangas 1d0392b245 Fix comment about which version had BACKUP METHOD line in backup_lable, again.
It was invalidated again by Fujii's patch to 9.1.
2011-08-17 12:31:23 +03:00
Tom Lane 2ada6779c5 Fix race condition in relcache init file invalidation.
The previous code tried to synchronize by unlinking the init file twice,
but that doesn't actually work: it leaves a window wherein a third process
could read the already-stale init file but miss the SI messages that would
tell it the data is stale.  The result would be bizarre failures in catalog
accesses, typically "could not read block 0 in file ..." later during
startup.

Instead, hold RelCacheInitLock across both the unlink and the sending of
the SI messages.  This is more straightforward, and might even be a bit
faster since only one unlink call is needed.

This has been wrong since it was put in (in 2002!), so back-patch to all
supported releases.
2011-08-16 13:11:54 -04:00
Heikki Linnakangas 2877c67bc2 Fix bogus comment that claimed that the new BACKUP METHOD line in
backup_label was new in 9.0. Spotted by Fujii Masao.
2011-08-16 12:23:51 +03:00
Tom Lane 4dab3d5ae1 Change the autovacuum launcher to use WaitLatch instead of a poll loop.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens.  This will allow WaitLatch callers to be
wakened properly by these signals.

In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno.  Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch.  Much of
this has to be back-patched into 9.1.

Peter Geoghegan, with additional work by Tom
2011-08-10 12:22:21 -04:00
Heikki Linnakangas 41f9ffd928 If backup-end record is not seen, and we reach end of recovery from a
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).

To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).

This requires re-initdb, because of a new field added to the control file.
2011-08-10 09:22:49 +03:00
Tom Lane 9f17ffd866 Measure WaitLatch's timeout parameter in milliseconds, not microseconds.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines.  Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.

Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.

Peter Geoghegan
2011-08-09 18:52:29 -04:00
Simon Riggs 7cb7122800 Remove O(N^2) performance issue with multiple SAVEPOINTs.
Subtransaction locks now released en masse at main commit, rather than
repeatedly re-scanning for locks as we ascend the nested transaction tree.
Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT
and TBLOCK_SUBRELEASE to allow the commit path to be optimised using
the existing code in ResourceOwnerRelease() which appears to have been
intended for this usage, judging from comments therein.
2011-07-19 17:21:24 +01:00
Simon Riggs 5286105800 Cascading replication feature for streaming log-based replication.
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders terminated when promote to master.

Fujii Masao, review, rework and doc rewrite by Simon Riggs
2011-07-19 03:40:03 +01:00
Heikki Linnakangas 89fd72cbf2 Introduce a pipe between postmaster and each backend, which can be used to
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.

This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.

The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.

Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
2011-07-08 18:44:07 +03:00
Peter Eisentraut 21f1e15aaf Unify spelling of "canceled", "canceling", "cancellation"
We had previously (af26857a27)
established the U.S. spellings as standard.
2011-06-29 09:28:46 +03:00
Simon Riggs 465883b0a2 Introduce compact WAL record for the common case of commit (non-DDL).
XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes,
saving considerable space for the vast majority of transaction commits.
XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier.

Leonardo Francalanci and Simon Riggs
2011-06-28 22:58:17 +01:00
Robert Haas 503c7305a1 Make the visibility map crash-safe.
This involves two main changes from the previous behavior.  First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE.  Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit.  Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear.  Making this work requires a bit of interface
refactoring.

In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur.  Also, remove duplicate definitions of
InvalidXLogRecPtr.

Patch by me, review by Noah Misch.
2011-06-21 23:04:40 -04:00
Heikki Linnakangas cb94db91b2 pgindent run of recent SSI changes. Also, remove an unnecessary #include.
Kevin Grittner
2011-06-16 16:17:22 +03:00
Heikki Linnakangas 85ea93384a Oops, forgot to change the order of entries in 2PC callback arrays when I
renumbered the resource managers. This should fix the buildfarm..
2011-06-14 15:16:36 +03:00
Tom Lane c2ba0121c7 Work around gcc 4.6.0 bug that breaks WAL replay.
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules.  Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references.  Improve the comments
while at it.

Back-patch to all supported versions, in case they get built with 4.6.0.

Tom Lane, with some cosmetic suggestions from Alex Hunsaker
2011-06-10 17:04:29 -04:00
Bruce Momjian 6560407c7d Pgindent run before 9.1 beta2. 2011-06-09 14:32:50 -04:00
Alvaro Herrera c6eb5740b3 Fix assorted typos 2011-05-12 08:52:56 -04:00
Heikki Linnakangas a0c8514149 Shut down WAL receiver if it's still running at end of recovery. We used to
just check that it's not running and PANIC if it was, but that can rightfully
happen if recovery stops at recovery target.
2011-05-11 12:46:08 +03:00
Tom Lane d2088ae949 Move RegisterPredicateLockingXid() call to a safer place.
The SSI patch inserted a call of RegisterPredicateLockingXid into
GetNewTransactionId, which was a bad idea on a couple of grounds.  First,
it's not necessary to hold XidGenLock while manipulating that shared
memory, and doing so is bad because XidGenLock is a high-contention lock
that should be held for as short a time as possible.  (Not to mention that
it adds an entirely unnecessary deadlock hazard, since we must take
SerializableXactHashLock as well.)  Second, the specific place where it was
put was between extending CLOG and advancing nextXid, which could result in
unpleasant behavior in case of a failure there.  Pull the call out to
AssignTransactionId, which is much safer and arguably better from a
modularity standpoint too.

There is more work to do to clean up the failure-before-advancing-nextXid
issue, but that is a separate change that will need to be back-patched.
So for the moment I just want to make GetNewTransactionId look the same as
it did in prior versions.
2011-05-06 12:57:28 -04:00
Robert Haas aea1f24c2c recoveryStopsHere() must check the resource manager ID.
Before commit c016ce7281, this wasn't
needed, but now that multiple resource manager IDs can percolate down
through here, we have to make sure we know which one we've got.
Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record
with an XLOG_CHECKPOINT_SHUTDOWN record.

Review by Jaime Casanova
2011-04-18 08:27:19 -04:00
Heikki Linnakangas 54685b1c2b Revert the patch to check if we've reached end-of-backup also when doing
crash recovery, and throw an error if not. hubert depesz lubaczewski pointed
out that that situation also happens in the crash recovery following a
system crash that happens during an online backup.

We might want to do something smarter in 9.1, like put the check back for
backups taken with pg_basebackup, but that's for another patch.
2011-04-13 22:05:40 +03:00
Bruce Momjian bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Tom Lane 2594cf0e8c Revise the API for GUC variable assign hooks.
The previous functions of assign hooks are now split between check hooks
and assign hooks, where the former can fail but the latter shouldn't.
Aside from being conceptually clearer, this approach exposes the
"canonicalized" form of the variable value to guc.c without having to do
an actual assignment.  And that lets us fix the problem recently noted by
Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus
log messages about "parameter "wal_buffers" cannot be changed without
restarting the server".  There may be some speed advantage too, because
this design lets hook functions avoid re-parsing variable values when
restoring a previous state after a rollback (they can store a pre-parsed
representation of the value instead).  This patch also resolves a
longstanding annoyance about custom error messages from variable assign
hooks: they should modify, not appear separately from, guc.c's own message
about "invalid parameter value".
2011-04-07 00:12:02 -04:00
Simon Riggs 88f32b7ca2 Avoid assuming there will be only 3 states for synchronous_commit.
Also avoid hardcoding the current default state by giving it the name
"on" and replace with a meaningful name that reflects its behaviour.
Coding only, no change in behaviour.
2011-04-04 23:23:13 +01:00
Robert Haas 240067b3b0 Merge synchronous_replication setting into synchronous_commit.
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.

Fujii Masao, with additional hacking by me.
2011-04-04 16:25:52 -04:00
Heikki Linnakangas 1f0bab8494 Improve error message when WAL ends before reaching end of online backup. 2011-03-31 10:09:49 +03:00
Heikki Linnakangas acf4740132 Check that we've reached end-of-backup also when we're not performing
archive recovery.

It's possible to restore an online backup without recovery.conf, by simply
copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that
too. That's the use case where this cross-check is useful.

Backpatch to 9.0. We used to do this in earlier versins, but in 9.0 the code
was inadvertently changed so that the check is only performed after archive
recovery.

Fujii Masao.
2011-03-30 10:53:28 +03:00
Simon Riggs b5f2f2a712 Minor changes to recovery pause behaviour.
Change location LOG message so it works each time we pause, not
just for final pause.
Ensure that we pause only if we are in Hot Standby and can connect
to allow us to run resume function. This change supercedes the
code to override parameter recoveryPauseAtTarget to false if not
attempting to enter Hot Standby, which is now removed.
2011-03-23 19:35:53 +00:00
Simon Riggs b98ac467f5 Prevent intermittent hang in recovery from bgwriter interaction.
Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.
2011-03-23 13:30:05 +00:00
Heikki Linnakangas 6d8096e2f3 When two base backups are started at the same time with pg_basebackup,
ensure that they use different checkpoints as the starting point. We use
the checkpoint redo location as a unique identifier for the base backup in
the end-of-backup record, and in the backup history file name.

Bug spotted by Fujii Masao.
2011-03-21 11:25:25 +02:00
Robert Haas 777e8c0015 Remove bogus semicolons in recoveryPausesHere.
Without this, the startup process goes into a tight loop, consuming
100% of one CPU and failing to respond to interrupts.
2011-03-18 08:09:09 -04:00
Robert Haas 84abea76f6 Add pause_at_recovery_target to recovery.conf.sample; improve docs.
Fujii Masao, but with the proposed behavior change reverted, and the
rest adjusted accordingly.
2011-03-17 14:04:11 -04:00
Bruce Momjian 5ca543fb2e Clarify C comment that O_SYNC/O_FSYNC are really the same settting, as
opposed to O_DSYNC.
2011-03-10 20:02:52 -05:00
Robert Haas d16e290a8a Emit a LOG message when pausing at the recovery target.
Fujii Masao
2011-03-10 14:37:14 -05:00
Heikki Linnakangas 4cd3fb6e12 Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.

This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
2011-03-08 12:12:54 +02:00
Heikki Linnakangas 1a4ab9ec23 If recovery_target_timeline is set to 'latest' and standby mode is enabled,
periodically rescan the archive for new timelines, while waiting for new WAL
segments to arrive. This allows you to set up a standby server that follows
the TLI change if another standby server is promoted to master. Before this,
you had to restart the standby server to make it notice the new timeline.

This patch only scans the archive for TLI changes, it won't follow a TLI
change in streaming replication. That is much needed too, but it would be a
much bigger patch than I dare to sneak in this late in the release cycle.

There was discussion on improving the sanity checking of the WAL segments so
that the system would notice more reliably if the new timeline isn't an
ancestor of the current one, but that is not included in this patch.

Reviewed by Fujii Masao.
2011-03-07 21:14:47 +02:00
Simon Riggs a8a8a3e096 Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
2011-03-06 22:49:16 +00:00
Tom Lane a874fe7b4c Refactor the executor's API to support data-modifying CTEs better.
The originally committed patch for modifying CTEs didn't interact well
with EXPLAIN, as noted by myself, and also had corner-case problems with
triggers, as noted by Dean Rasheed.  Those problems show it is really not
practical for ExecutorEnd to call any user-defined code; so split the
cleanup duties out into a new function ExecutorFinish, which must be called
between the last ExecutorRun call and ExecutorEnd.  Some Asserts have been
added to these functions to help verify correct usage.

It is no longer necessary for callers of the executor to call
AfterTriggerBeginQuery/AfterTriggerEndQuery for themselves, as this is now
done by ExecutorStart/ExecutorFinish respectively.  If you really need to
suppress that and do it for yourself, pass EXEC_FLAG_SKIP_TRIGGERS to
ExecutorStart.

Also, refactor portal commit processing to allow for the possibility that
PortalDrop will invoke user-defined code.  I think this is not actually
necessary just yet, since the portal-execution-strategy logic forces any
non-pure-SELECT query to be run to completion before we will consider
committing.  But it seems like good future-proofing.
2011-02-27 13:44:12 -05:00
Robert Haas 79ad8fc5f8 Named restore point improvements.
Emit a log message when creating a named restore point, and improve
documentation for pg_create_restore_point().

Euler Taveira de Oliveira, 	per suggestions from Thom Brown, with some
additional wordsmithing by me.
2011-02-24 19:02:00 -05:00
Simon Riggs bca8b7f16a Hot Standby feedback for avoidance of cleanup conflicts on standby.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.

Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
2011-02-16 19:29:37 +00:00
Robert Haas 4695da5ae9 pg_ctl promote
Fujii Masao, reviewed by Robert Haas, Stephen Frost, and Magnus Hagander.
2011-02-15 21:30:23 -05:00
Simon Riggs 5c588be729 PITR can stop at a named restore point when recovery target = time
though must not update the last transaction timestamp.
Plus comment and message cleanup for recent named restore point.

Fujii Masao, minor changes by me
2011-02-15 00:51:39 +00:00
Heikki Linnakangas b186523fd9 Send status updates back from standby server to master, indicating how far
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.

Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
2011-02-10 21:04:02 +02:00
Magnus Hagander 3144c33a2f Implement NOWAIT option for BASE_BACKUP command
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.

This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
2011-02-09 10:59:53 +01:00
Simon Riggs c016ce7281 Named restore points in recovery. Users can record named points, then
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.

Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
2011-02-08 19:39:08 +00:00
Simon Riggs 8c6e3adbf7 Basic Recovery Control functions for use in Hot Standby. Pause, Resume,
Status check functions only. Also, new recovery.conf parameter to
pause_at_recovery_target, default on.

Simon Riggs, reviewed by Fujii Masao
2011-02-08 18:30:22 +00:00
Simon Riggs faa0550572 Remove rare corner case for data loss when triggering standby server.
If the standby was streaming when trigger file arrives, check also in the
archive for additional WAL files. This is a corner case since it is
unlikely that we would trigger a failover while the master is still
available and sending data to standby, while at the same time running in
archive mode and also while the streaming standby has fallen behind archive.
Someone would eventually be unlucky; we must plug all gaps however small.

Fujii Masao
2011-02-08 14:38:02 +00:00
Heikki Linnakangas dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.

A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00
Robert Haas 0af695fd43 Log restartpoints in the same fashion as checkpoints.
Prior to 9.0, restartpoints never created, deleted, or recycled WAL
files, but now they can.  This code makes log_checkpoints treat
checkpoints and restartpoints symmetrically.  It also adjusts up
the documentation of the parameter to mention restartpoints.

Fujii Masao.  Docs by me, as suggested by Itagaki Takahiro.
2011-02-02 21:08:53 -05:00
Heikki Linnakangas 997b48ed96 Support multiple concurrent pg_basebackup backups.
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.

Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
2011-01-31 18:25:39 +02:00
Tom Lane 0f73aae13d Allow the wal_buffers setting to be auto-tuned to a reasonable value.
If wal_buffers is initially set to -1 (which is now the default), it's
replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default)
and a maximum of the XLOG segment size.  The allowed range for manual
settings is still from 4 up to whatever will fit in shared memory.

Greg Smith, with implementation correction by me.
2011-01-22 20:31:24 -05:00
Magnus Hagander 4448917d51 Split pg_start_backup() and pg_stop_backup() into two pieces
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.

Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
2011-01-09 21:00:28 +01:00
Robert Haas a9f72b4083 Improve recovery.conf.sample comments.
Jehan-Guillaume de Rorthais, with some additional wordsmithing by me.
2011-01-07 11:01:25 -05:00
Robert Haas dc8a14311a Update comments in RecordTransactionCommit() to mention unlogged tables. 2011-01-03 10:29:22 -05:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Alvaro Herrera 55573990ca Avoid unnecessary public struct declaration in slru.h
Instead, declare a public wrapper of the sole function using it for
external callers, so that they don't have to always pass a NULL
argument.

Author: Kevin Grittner
2010-12-30 12:09:17 -03:00
Robert Haas 53dbc27c62 Support unlogged tables.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Magnus Hagander 9b8aff8c19 Add REPLICATION privilege for ROLEs
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.

Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
2010-12-29 11:05:03 +01:00
Bruce Momjian 5000472112 Remove quotes from boolean recovery.conf.sample parameters, now that the
quotes are not required.  This now matches postgresql.conf's
specification of booleans.
2010-12-24 11:51:51 -05:00
Heikki Linnakangas 9de3aa65f0 Rewrite the GiST insertion logic so that we don't need the post-recovery
cleanup stage to finish incomplete inserts or splits anymore. There was two
reasons for the cleanup step:

1. When a new tuple was inserted to a leaf page, the downlink in the parent
needed to be updated to contain (ie. to be consistent with) the new key.
Updating the parent in turn might require recursively updating the parent of
the parent. We now handle that by updating the parent while traversing down
the tree, so that when we insert the leaf tuple, all the parents are already
consistent with the new key, and the tree is consistent at every step.

2. When a page is split, we need to insert the downlink for the new right
page(s), and update the downlink for the original page to not include keys
that moved to the right page(s). We now handle that by setting a new flag,
F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is
set, scans always follow the rightlink, regardless of the NSN mechanism used
to detect concurrent page splits. That way the tree is consistent right after
split, even though the downlink is still missing. This is very similar to the
way B-tree splits are handled. When the downlink is inserted in the parent,
the flag is cleared. To keep the insertion algorithm simple, when an
insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it
finishes the split before doing anything else.

These changes allow removing the whole "invalid tuple" mechanism, but I
retained the scan code to still follow invalid tuples correctly. While we
don't create any such tuples anymore, we want to handle them gracefully in
case you pg_upgrade a GiST index that has them. If we encounter any on an
insert, though, we just throw an error saying that you need to REINDEX.

The issue that got me into doing this is that if you did a checkpoint while
an insert or split was in progress, and the checkpoint finishes quickly so
that there is no WAL record related to the insert between RedoRecPtr and the
checkpoint record, recovery from that checkpoint would not know to finish
the incomplete insert. IOW, we have the same issue we solved with the
rm_safe_restartpoint mechanism during normal operation too. It's highly
unlikely to happen in practice, and this fix is far too large to backpatch,
so we're just going to live with in previous versions, but this refactoring
fixes it going forward.

With this patch, you don't get the annoying
'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices
anymore if you crash at an unfortunate moment.
2010-12-23 16:21:47 +02:00
Robert Haas f6a0863e3c Allow transactions that don't write WAL to commit asynchronously.
This case can arise if a transaction has written data, but only to
temporary tables.  Loss of the commit record in case of a crash won't
matter, because the temporary tables will be lost anyway.

Reviewed by Heikki Linnakangas and Simon Riggs.
2010-12-20 12:59:33 -05:00
Robert Haas 34c70c7ac4 Instrument checkpoint sync calls.
Greg Smith, reviewed by Jeff Janes
2010-12-14 09:26:19 -05:00
Tom Lane 04f4e10cfc Use symbolic names not octal constants for file permission flags.
Purely cosmetic patch to make our coding standards more consistent ---
we were doing symbolic some places and octal other places.  This patch
fixes all C-coded uses of mkdir, chmod, and umask.  There might be some
other calls I missed.  Inconsistency noted while researching tablespace
directory permissions issue.
2010-12-10 17:35:33 -05:00
Simon Riggs e620ee35b2 Optimize commit_siblings in two ways to improve group commit.
First, avoid scanning the whole ProcArray once we know there
are at least commit_siblings active; second, skip the check
altogether if commit_siblings = 0.

Greg Smith
2010-12-08 18:48:03 +00:00
Heikki Linnakangas 5a031a5556 Fix bugs in the hot standby known-assigned-xids tracking logic. If there's
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.

In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.

In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
2010-12-07 09:23:30 +01:00
Heikki Linnakangas 95e42a2c29 Fix two typos, by Fujii Masao. 2010-12-06 12:38:05 +01:00
Robert Haas 5ef6c91383 Remove now-outdated mention of quotes being required in recovery.conf.
Noted by Itagaki Takahiro.
2010-12-03 09:00:18 -05:00
Robert Haas 970a18687f Use GUC lexer for recovery.conf parsing.
This eliminates some crufty, special-purpose code and, as a non-trivial
side benefit, allows recovery.conf parameters to be unquoted.

Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki
Takahiro, and me.
2010-12-03 08:56:44 -05:00
Peter Eisentraut fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Heikki Linnakangas 542bdb2146 Fix bug introduced by the recent patch to check that the checkpoint redo
location read from backup label file can be found: wasShutdown was set
incorrectly when a backup label file was found.

Jeff Davis, with a little tweaking by me.
2010-11-11 19:32:11 +02:00
Robert Haas 7ba6e4f0e0 Add monitoring function pg_last_xact_replay_timestamp.
Fujii Masao, with a little wordsmithing by me.
2010-11-09 22:52:19 -05:00
Heikki Linnakangas 8c843fff2d Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001)
rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a
more future-proof fix for the corner-case bug in streaming replication that
was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in
February as well. Avoiding 0/0 as a valid value should prevent bugs like that
in the future. Per Tom Lane's idea.

Back-patch to 9.0. Since this only affects bootstrapping, it makes no
difference to existing installations. We don't need to worry about the
bug in existing installations, because if you've managed to get past the
initial base backup already, you won't hit the bug in the future either.
2010-11-02 11:39:48 +02:00
Heikki Linnakangas 931b6db39b Fix corner-case bug in tracking of latest removed WAL segment during
streaming replication. We used log/seg 0/0 to indicate that no WAL segments
have been removed since startup, but 0/0 is a valid value for the very first
WAL segment after initdb. To make that disambiguous, store
(latest removed WAL segment + 1) in the global variable.

Per report from Matt Chesler, also reproduced by Greg Smith.
2010-11-01 10:05:15 +02:00
Heikki Linnakangas 0c6293dd03 Before removing backup_label and irrevocably changing pg_control file, check
that WAL file containing the checkpoint redo-location can be found. This
avoids making the cluster irrecoverable if the redo location is in an earlie
WAL file than the checkpoint record.

Report, analysis and patch by Jeff Davis, with small changes by me.
2010-10-26 21:43:52 +03:00
Tom Lane def30e84c4 Don't try to fetch database name when SetTransactionIdLimit() is executed
outside a transaction.

This repairs brain fade in my patch of 2009-08-30: the reason we had been
storing oldest-database name, not OID, in ShmemVariableCache was of course
to avoid having to do a catalog lookup at times when it might be unsafe.

This error explains why Aleksandr Dushein is having trouble getting out of
an XID wraparound state in bug #5718, though not how he got into that state
in the first place.  I suspect pg_upgrade is at fault there.
2010-10-20 12:48:51 -04:00
Alvaro Herrera 17a16663d0 Remove AtStart_Cache() call in CommandCounterIncrement().
This call was present in the aboriginal code from Berkeley, and has
never been touched; it may very well be that it was there to mask
effects of bugs in other places and it may no longer be necessary.
The removal has been foreseen in a code comment since 2007; this seems
to be a good time to test this hypothesis.
2010-10-20 11:33:57 -03:00
Simon Riggs 3bbcc5c999 Make startup process respond to signals to cancel waiting on latch.
A tidy up for recently committed changes to startup latch.

Fujii Masao
2010-10-14 19:15:26 +01:00
Simon Riggs 45cd9199c2 Fix bug in comment of timeline history file.
Fujii Masao
2010-10-14 19:06:06 +01:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane 54d0e2886a Add some documentation about how we WAL-log filesystem actions.
Per a question from Robert Haas.
2010-09-17 00:42:39 +00:00
Heikki Linnakangas 79b54816db Fix two typos in comments, spotted by Fujii Masao and Thom Brown 2010-09-15 13:58:22 +00:00
Heikki Linnakangas 723d0184e2 Use a latch to make startup process wake up and replay immediately when
new WAL arrives via streaming replication. This reduces the latency, and
also allows us to use a longer polling interval, which is good for energy
efficiency.

We still need to poll to check for the appearance of a trigger file, but
the interval is now 5 seconds (instead of 100ms), like when waiting for
a new WAL segment to appear in WAL archive.
2010-09-15 10:35:05 +00:00
Heikki Linnakangas 2746e5f21d Introduce latches. A latch is a boolean variable, with the capability to
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.

On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.

Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.

Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
2010-09-11 15:48:04 +00:00
Tom Lane eb36d1ad51 Fix oversight in RelFileNodeBackend patch: CreateFakeRelcacheEntry needs to
initialize the rd_backend field of a fake Relation entry correctly.
Fortunately, that is easy, since only non-temp relations should ever be
mentioned in the WAL stream.
2010-08-30 16:46:23 +00:00
Simon Riggs ac791d3ca1 Fix misleading DEBUG2 issued during RemoveOldXlogFiles() 2010-08-30 15:37:41 +00:00
Simon Riggs e72f15ed60 Truncate subtrans after each restartpoint.
Issue reported by Harald Kolb, patch by Fujii Masao, review by me.
2010-08-30 14:22:05 +00:00
Alvaro Herrera 3a1b51de19 Remove duplicate translatable phrase 2010-08-26 19:23:41 +00:00
Robert Haas debcec7dc3 Include the backend ID in the relpath of temporary relations.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.

Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1.  It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.

Review by Jaime Casanova and Tom Lane.
2010-08-13 20:10:54 +00:00
Robert Haas 95ef7cd40d Make RecordTransactionCommit() respect wal_level.
Since the only purpose of WAL-loggin SharedInvalidationMessages is to support
Hot Standby operation, they needn't be included when wal_level < hot_standby.

Back-patch to 9.0.

Review by Heikki Linnakanagas and Fujii Masao.
2010-08-13 15:42:21 +00:00
Robert Haas 30c22eb8fc Correct sundry errors in Hot Standby-related comments.
Fujii Masao
2010-08-12 23:24:54 +00:00
Simon Riggs 5b8bd0529e Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0.
Transaction aborts now record their LSN to avoid corner case
behaviour in SR/HS, hence change of name of variables and functions.
As pointed out by Fujii Masao. Cosmetic changes only.
2010-07-29 22:27:27 +00:00
Robert Haas 7be8946c78 Avoid deep recursion when assigning XIDs to multiple levels of subxacts.
Backpatch to 8.0.

Andres Freund, with cleanup and adjustment for older branches by me.
2010-07-23 00:43:00 +00:00
Tom Lane 672efc0865 Update obsolete comment. Noted by Josh Tolley. 2010-07-08 16:08:30 +00:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane 8771634666 Don't set recoveryLastXTime when replaying a checkpoint --- that was a bogus
idea from the start since the variable is only meant to track commit/abort
events.  This patch reverts the logic around the variable to what it was in
8.4, except that the value is now kept in shared memory rather than a static
variable, so that it can be reported correctly by CreateRestartPoint (which is
executed in the bgwriter).
2010-07-03 22:15:45 +00:00
Tom Lane e76c1a0f4d Replace max_standby_delay with two parameters, max_standby_archive_delay and
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server.  Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.

Do some desultory editing on the hot standby documentation, too.
2010-07-03 20:43:58 +00:00
Bruce Momjian b57ddccf05 Add C comment about why synchronous_commit=off behavior can lose
committed transactions in a postmaster crash.
2010-06-29 18:44:58 +00:00
Robert Haas 400916b6d7 emode_for_corrupt_record shouldn't reduce LOG messages to WARNING.
In non-interactive sessions, WARNING sorts below LOG.
2010-06-28 19:46:19 +00:00
Tom Lane 09698bb5fb Make RemoveOldXlogFiles's debug printout match style used elsewhere:
log and seg aren't an XLogRecPtr and shouldn't be printed like one.
Fujii Masao
2010-06-17 17:37:23 +00:00
Tom Lane 07e8b6aabc Don't allow walsender to send WAL data until it's been safely fsync'd on the
master.  Otherwise a subsequent crash could cause the master to lose WAL that
has already been applied on the slave, resulting in the slave being out of
sync and soon corrupt.  Per recent discussion and an example from Robert Haas.

Fujii Masao
2010-06-17 16:41:25 +00:00
Heikki Linnakangas 6da07cd80d If a corrupt WAL record is received by streaming replication, disconnect
and retry. If the record is genuinely corrupt in the master database,
there's little hope of recovering, but it's better than simply retrying
to apply the corrupt WAL record in a tight loop without even trying to
retransmit it, which is what we used to do.
2010-06-14 06:04:21 +00:00
Peter Eisentraut c86efdde5f Fix typo/bug, found by Clang compiler 2010-06-12 09:14:52 +00:00
Itagaki Takahiro 56834fc759 Rename restartpoint_command to archive_cleanup_command. 2010-06-10 08:13:50 +00:00
Heikki Linnakangas 0a7cb85531 Make TriggerFile variable static. It's not used outside xlog.c.
Fujii Masao
2010-06-10 07:49:23 +00:00
Heikki Linnakangas 346d7cd7fa Return NULL instead of 0/0 in pg_last_xlog_receive_location() and
pg_last_xlog_replay_location(). Per Robert Haas's suggestion, after
Itagaki Takahiro pointed out an issue in the docs. Also, some wording
changes in the docs by me.
2010-06-10 07:00:27 +00:00
Heikki Linnakangas 71815306e9 In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is large
and checkpoint_segments small.

Patch by Fujii Masao, with some minor changes by me.
2010-06-09 15:04:07 +00:00
Magnus Hagander 8c873bbfa7 Make the walwriter close it's handle to an old xlog segment if it's no longer
the current one. Not doing this would leave the walwriter with a handle to a
deleted file if there was nothing for it to do for a long period of time,
preventing the file from  being completely removed.

Reported by Tollef Fog Heen, and thanks to Heikki for some hand-holding with
the patch.
2010-06-09 10:54:45 +00:00
Peter Eisentraut cb6038c168 Fix some inconsistent quoting of wal_level values in messages
When referring to postgresql.conf syntax, then it's without quotes
(wal_level=archive); in narrative it's with double quotes.  But never
single quotes.
2010-06-03 21:02:12 +00:00
Robert Haas d561430b66 On clean shutdown during recovery, don't warn about possible corruption.
Fujii Masao.  Review by Heikki Linnakangas and myself.
2010-06-03 03:20:00 +00:00
Heikki Linnakangas 6b24036365 Fix obsolete comments that I neglected to update in a previous patch.
Fujii Masao
2010-06-02 09:28:44 +00:00
Heikki Linnakangas c5bd8feac6 Adjust comment to reflect that we now have Hot Standby. Pointed out by
Robert Haas.
2010-05-27 00:38:39 +00:00
Robert Haas ea9968c331 Rename PM_RECOVERY_CONSISTENT and PMSIGNAL_RECOVERY_CONSISTENT.
The new names PM_HOT_STANDBY and PMSIGNAL_BEGIN_HOT_STANDBY more accurately
reflect their actual function.
2010-05-15 20:01:32 +00:00
Simon Riggs 4a24c9a063 Fix bug in processing of checkpoint time for max_standby_delay. Latest
log time was incorrectly set, typically leading to dates in the past,
which would cause more cancellations in Hot Standby on a quiet server.
2010-05-15 07:14:43 +00:00
Simon Riggs fd34374b17 Add many new Asserts in code and fix simple bug that slipped through
without them, related to previous commit. Report by Bruce Momjian.
2010-05-14 07:11:49 +00:00
Simon Riggs 463f151a23 Ensure that top level aborts call XLogSetAsyncCommit(). Not doing
so simply leads to data waiting in wal_buffers which then causes
later commits to potentially do emergency writes and for all forms
of replication to be potentially delayed without need or benefit.
Issue pointed out exactly by Fujii Masao, following bug report
by Robert Haas on a separate though related topic.
2010-05-13 11:39:30 +00:00
Simon Riggs 8431e296ea Cleanup initialization of Hot Standby. Clarify working with reanalysis
of requirements and documentation on LogStandbySnapshot(). Fixes
two minor bugs reported by Tom Lane that would lead to an incorrect
snapshot after transaction wraparound. Also fix two other problems
discovered that would give incorrect snapshots in certain cases.
ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor
refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().
2010-05-13 11:15:38 +00:00
Heikki Linnakangas ffe8c7c677 Need to hold ControlFileLock while updating control file. Update
minRecoveryPoint in control file when replaying a parameter change record,
to ensure that we don't allow hot standby on WAL generated without
wal_level='hot_standby' after a standby restart.
2010-05-03 11:17:52 +00:00
Tom Lane f9ed327f76 Clean up some awkward, inaccurate, and inefficient processing around
MaxStandbyDelay.  Use the GUC units mechanism for the value, and choose more
appropriate timestamp functions for performing tests with it.  Make the
ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have
behavior similar to ps_activity code elsewhere, notably not updating the
display when update_process_title is off and not truncating the display
contents at an arbitrarily-chosen length.  Improve the docs to be explicit
about what MaxStandbyDelay actually measures, viz the difference between
primary and standby servers' clocks, and the possible hazards if their clocks
aren't in sync.
2010-05-02 02:10:33 +00:00
Tom Lane 69f7a4d8e3 Adjust error checks in pg_start_backup and pg_stop_backup to make it possible
to perform a backup without archive_mode being enabled.  This gives up some
user-error protection in order to improve usefulness for streaming-replication
scenarios.  Per discussion.
2010-04-29 21:49:03 +00:00
Tom Lane f0488bd57c Rename the parameter recovery_connections to hot_standby, to reduce possible
confusion with streaming-replication settings.  Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation.  Per discussion.

In passing do some minor editing of related documentation.
2010-04-29 21:36:19 +00:00
Tom Lane 77acab75df Modify ShmemInitStruct and ShmemInitHash to throw errors internally,
rather than returning NULL for some-but-not-all failures as they used to.
Remove now-redundant tests for NULL from call sites.

We had to do something about this because many call sites were failing to
check for NULL; and changing it like this seems a lot more useful and
mistake-proof than adding checks to the call sites without them.
2010-04-28 16:54:16 +00:00
Heikki Linnakangas 9b8a73326e Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.

Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.

Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 16:10:43 +00:00
Tom Lane 2871b4618a Replace the KnownAssignedXids hash table with a sorted-array data structure,
and be more tense about the locking requirements for it, to improve performance
in Hot Standby mode.  In passing fix a few bugs and improve a number of
comments in the existing HS code.

Simon Riggs, with some editorialization by Tom
2010-04-28 00:09:05 +00:00
Heikki Linnakangas 3efba16d56 If a base backup is cancelled by server shutdown or crash, throw an error
in WAL recovery when it sees the shutdown checkpoint record. It's more
user-friendly to find out about it at that point than at the end of
recovery, and you're not left wondering why your hot standby server never
opens up for read-only connections.
2010-04-27 09:25:18 +00:00
Simon Riggs 491d1ea5b3 Previous patch revoked following objections. 2010-04-23 20:21:31 +00:00
Simon Riggs 6ca23b1a29 Make CheckRequiredParameterValues() depend upon correct combination
of parameters. Fix bug report by Robert Haas that error message and
hint was incorrect if wrong mode parameters specified on master.
Internal changes only. Proposals for parameter simplification on
master/primary still under way.
2010-04-23 19:57:19 +00:00
Robert Haas 481cb5d9b5 Rename standby_keep_segments to wal_keep_segments.
Also, make the name of the GUC and the name of the backing variable match.
Alnong the way, clean up a couple of slight typographical errors in the
related docs.
2010-04-20 11:15:06 +00:00
Simon Riggs d38603bd97 Improve sequence and sense of messages from pg_stop_backup().
Now doesn't report it is waiting until it actually is waiting,
plus message doesn't appear until at least 5 seconds wait, so
we avoid reporting the wait before we've given the archiver
a reasonable time to wake up and archive the file we just
created earlier in the function.
Also add new unconditional message to confirm safe completion.
Now a normal, healthy execution does not report waiting at
all, just safe completion.
2010-04-18 18:44:53 +00:00
Simon Riggs 2847de9df2 Remove some additional changes in previous commit that belong elsewhere. 2010-04-18 18:17:12 +00:00
Simon Riggs 21d6a6a128 Tune GetSnapshotData() during Hot Standby by avoiding loop
through normal backends. Makes code clearer also, since we
avoid various Assert()s. Performance of snapshots taken
during recovery no longer depends upon number of read-only
backends.
2010-04-18 18:06:07 +00:00
Heikki Linnakangas 78974cfb9b In standby mode, suppress repeated LOG messages about a corrupt record,
which just indicates that we've reached the end of valid WAL found in
the standby.
2010-04-16 08:58:16 +00:00
Bruce Momjian ec4b9bcc3d Doc change: effect -> affect, per Robert Haas 2010-04-15 03:05:59 +00:00
Simon Riggs 55d7556a4d Fix minor typo in comment in xlog.c 2010-04-14 10:29:07 +00:00
Heikki Linnakangas 361bd1662e Allow Hot Standby to begin from a shutdown checkpoint.
Patch by Simon Riggs & me
2010-04-13 14:17:46 +00:00
Heikki Linnakangas 30556568f5 Update the location of last removed WAL segment in shared memory only
after actually removing one, so that if we can't remove segments because
WAL archiving is lagging behind, we don't unnecessarily forbid streaming
the old not-yet-archived segments that are still perfectly valid. Per
suggestion from Fujii Masao.
2010-04-12 10:40:43 +00:00
Heikki Linnakangas e57cd7f0a1 Change the logic to decide when to delete old WAL segments, so that it
doesn't take into account how far the WAL senders are. This way a hung
WAL sender doesn't prevent old WAL segments from being recycled/removed
in the primary, ultimately causing the disk to fill up. Instead add
standby_keep_segments setting to control how many old WAL segments are
kept in the primary. This also makes it more reliable to use streaming
replication without WAL archiving, assuming that you set
standby_keep_segments high enough.
2010-04-12 09:52:29 +00:00
Heikki Linnakangas 0f11ed5886 Allow quotes to be escaped in recovery.conf, by doubling them. This patch
also makes the parsing a little bit stricter, rejecting garbage after the
parameter value and values with missing ending quotes, for example.
2010-04-07 10:58:49 +00:00
Heikki Linnakangas 370f770c15 Forbid using pg_xlogfile_name() and pg_xlogfile_name_offset() during
recovery. We might want to relax this in the future, but ThisTimeLineID
isn't currently correct in backends during recovery, so the filename
returned was wrong.
2010-04-07 06:12:52 +00:00
Simon Riggs 89c5008158 Further message changes when recovery.conf parameters missing. 2010-04-06 17:51:58 +00:00
Heikki Linnakangas 492d9f2309 Rename "Log-streaming replication parameters" header to "Standby server
parameters" in recovery.conf, to match the grouping in the documentation.

Fujii Masao
2010-04-06 14:53:20 +00:00
Simon Riggs cf2575b8c4 Check compulsory parameters in recovery.conf in standby_mode, per docs. 2010-04-02 21:50:40 +00:00
Simon Riggs 31f00d163b Move system startup message prior to any calls out of data directory.
This allows us to see what mode the server is in before it starts to
perform actions that can block or hang. Otherwise server messages
may not appear until after messages that say FATAL the database
server is starting up.
2010-04-02 13:10:56 +00:00
Robert Haas 54943734f8 Refer to max_wal_senders in a more consistent fashion.
The error message now makes explicit reference to the GUC that must be changed
to fix the problem, using wording suggested by Tom Lane.  Along the way,
rename the GUC from MaxWalSenders to max_wal_senders for consistency and
grep-ability.
2010-04-01 00:43:29 +00:00
Bruce Momjian 55a01b4c0a Change recovery.conf.sample to match postgresql.conf by showing only
default values, with example comments.
2010-03-31 14:18:45 +00:00
Heikki Linnakangas 2a77355ea1 Change the retry-loop in standby mode to also try restoring files from
pg_xlog directory. This is essential for replaying WAL records that
were streamed from the master, after a standby server restart.

If a corrupt record is seen in a file restored from the archive or
streamed from the master, log it as a WARNING and keep retrying. If the
corruption is permanent, and not just a glitch in the whatever copies the
files to the archive or a network error not caught by CRC checks in TCP
for example, we will keep retrying and logging the WARNING indefinitely.
But that's better than shutting down completely, the standby is still
useful for running read-only queries. In PITR the recovery ends at such a
corrupt record, which is a bit questionable, but that's the behavior we
had in previous releases and we don't feel like chaning it now. It does
make sense for tools like pg_standby.
2010-03-30 16:23:57 +00:00
Simon Riggs de66effede Edit recovery.conf.sample so it matches docs. Change standby_mode
example to 'on or 'off' rather than 'true' or 'false', as shown
in docs. Add restartpoint_command. Add section header for recovery
target parameters, matching docs.
2010-03-29 18:50:36 +00:00
Peter Eisentraut c248d17120 Message tuning 2010-03-21 00:17:59 +00:00
Simon Riggs 3cdafe40e7 Adjust comment in .history file to match recovery target specified. Comment
present since 8.0 was never fully meaningful, since two recovery targets
cannot be specified. Refactor recovery target type to make this change
and associated code easier to understand. No change in function.

Bug report arising from internal support question.
2010-03-19 11:05:15 +00:00
Heikki Linnakangas c21ac0b58e Add restartpoint_command option to recovery.conf. Fix bug in %r handling
in recovery_end_command, it always came out as 0 because InRedo was
cleared before recovery_end_command was executed. Also, always take
ControlFileLock when reading checkpoint location for %r.

The recovery_end_command bug and the missing locking was present in 8.4
as well, that part of this patch will be backported separately.
2010-03-18 09:17:18 +00:00
Simon Riggs 1a163a0c68 Remove incorrect comment from GetWriteRecPtr(): the return value is always
correct, as described in comments at start of xlog.c
2010-03-15 18:49:17 +00:00
Itagaki Takahiro 17d8de0e61 pg_start_backup() can use a share lock to lock ControlFileLock
instead of an exclusive lock.

The change is almost for code cleanup. Since there seems to be no
performance benefits from it, backports should not be needed.

Fujii Masao
2010-03-10 02:04:48 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane a2239b96e0 Make pg_stop_backup's reporting a bit more verbose in hopes of making
error cases less intimidating for novices.  Per discussion.

Greg Smith
2010-02-25 02:17:50 +00:00
Tom Lane 05d8a561ff Clean up handling of XactReadOnly and RecoveryInProgress checks.
Add some checks that seem logically necessary, in particular let's make
real sure that HS slave sessions cannot create temp tables.  (If they did
they would think that temp tables belonging to the master's session with
the same BackendId were theirs.  We *must* not allow myTempNamespace to
become set in a slave session.)

Change setval() and nextval() so that they are only allowed on temp sequences
in a read-only transaction.  This seems consistent with what we allow for
table modifications in read-only transactions.  Since an HS slave can't have a
temp sequence, this also provides a nicer cure for the setval PANIC reported
by Erik Rijkers.

Make the error messages more uniform, and have them mention the specific
command being complained of.  This seems worth the trifling amount of extra
code, since people are likely to see such messages a lot more than before.
2010-02-20 21:24:02 +00:00
Heikki Linnakangas ad458cfe81 Don't use O_DIRECT when writing WAL files if archiving or streaming is
enabled. Bypassing the kernel cache is counter-productive in that case,
because the archiver/walsender process will read from the WAL file
soon after it's written, and if it's not cached the read will cause
a physical read, eating I/O bandwidth available on the WAL drive.

Also, walreceiver process does unaligned writes, so disable O_DIRECT
in walreceiver process for that reason too.
2010-02-19 10:51:04 +00:00
Itagaki Takahiro 3230fd056a Fix STOP WAL LOCATION in backup history files no to return the next
segment of XLOG_BACKUP_END record even if the the record is placed
at a segment boundary. Furthermore the previous implementation could
return nonexistent segment file name when the boundary is in segments
that has "FE" suffix; We never use segments with "FF" suffix.

Backpatch to 8.0, where hot backup was introduced.

Reported by Fujii Masao.
2010-02-19 01:04:03 +00:00
Tom Lane 50a90fac40 Stamp HEAD as 9.0devel, and update various places that were referring to 8.5
(hope I got 'em all).  Per discussion, this release will be 9.0 not 8.5.
2010-02-17 04:19:41 +00:00
Tom Lane c64339face When updating ShmemVariableCache from a checkpoint record, be sure to set
all the values derived from oldestXid, not just that field.  Brain fade in
one of my patches associated with flat file removal, exposed by a report
from Fujii Masao.

With this change, xidVacLimit should always be valid, so remove a couple of
bits of complexity associated with the previous assumption that sometimes
it wouldn't get set right away.
2010-02-17 03:10:33 +00:00
Tom Lane d1e027221d Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.

This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage.  There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.

Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
2010-02-16 22:34:57 +00:00
Robert Haas e26c539e9f Wrap calls to SearchSysCache and related functions using macros.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4).  This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.

Design and review by Tom Lane.
2010-02-14 18:42:19 +00:00
Simon Riggs dd428c79a4 Fix relcache init file invalidation during Hot Standby for the case
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
2010-02-13 16:15:48 +00:00
Heikki Linnakangas e465390d03 Reduce the chatter to the log when starting a standby server. Don't
echo all the recovery.conf options. Don't emit the "initializing
recovery connections" message, which doesn't mean anything to a user.
Remove the "starting archive recovery" message and replace the
"automatic recovery in progress" message with a more informative message
saying whether the server is doing PITR, normal archive recovery, or
standby mode.
2010-02-12 09:49:08 +00:00
Heikki Linnakangas 54cbd1757e If primary_conninfo is not set, don't try to establish streaming
connection.
2010-02-12 07:56:36 +00:00
Heikki Linnakangas 9fa01f6c8a Check for partial WAL files in standby mode. If restore_command restores
a partial WAL file, assume it's because the file is just being copied to
the archive and treat it the same as "file not found" in standby mode.
pg_standby has a similar check, so it seems reasonable to have the same
level of protection in the built-in standby mode.
2010-02-12 07:36:44 +00:00
Heikki Linnakangas 161d9d51b3 Now that streaming replication switches between streaming mode and
restoring from archive, the last WAL segment is not necessarily open at
the end of recovery. Fix assertion that assumed that.

Fujii Masao, fixing the assertion failure reported by Martin Pihlak.
2010-02-10 08:25:25 +00:00
Tom Lane cbe9d6beb4 Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens.  Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation.  We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation.  This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday.  We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.

In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
2010-02-09 21:43:30 +00:00
Heikki Linnakangas 4cea603128 Remove piece of code to zero out minRecoveryPoint when starting crash
recovery. It's zeroed out whenever a checkpoint is written, so the only
scenario where the removed code did anything is when you kill archive
recovery, remove recovery.conf, and start up the server, so that it goes
into crash recovery instead. That's a "don't do that" scenario, but it
seems better to not clear minRecoveryPoint but instead update it like we
do in archive recovery, which is what will now happen.
2010-02-08 09:08:51 +00:00
Tom Lane 0a469c8769 Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
2010-02-08 04:33:55 +00:00
Tom Lane b9b8831ad6 Create a "relation mapping" infrastructure to support changing the relfilenodes
of shared or nailed system catalogs.  This has two key benefits:

* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.

* We no longer have to use an unsafe reindex-in-place approach for reindexing
  shared catalogs.

CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.

Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.

This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch.  As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
2010-02-07 20:48:13 +00:00
Simon Riggs 296578feb4 Revoke augmentation of WAL records for btree delete, per discussion. 2010-02-01 13:40:28 +00:00
Simon Riggs 6d2bc0a6cf Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
2010-01-29 18:39:05 +00:00
Heikki Linnakangas b0509ef601 Fix crashing bug at the end of recovery in Streaming Replication, when
restore_command is not given. Fujii Masao.
2010-01-28 19:17:22 +00:00
Heikki Linnakangas 83cb7da7dc Fix bug in wasender's xlogid boundary handling, reported by Erik Rijkers.
LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't
try to send that in XLogSend().

Also fix similar bug in ReadRecord(), which I just introduced in the
ReadRecord() refactoring patch.
2010-01-27 16:41:09 +00:00
Heikki Linnakangas 1bb2558046 Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.

This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.

To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.

This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.
2010-01-27 15:27:51 +00:00
Simon Riggs aed1a0121a Fix longstanding gripe that we check for 0000000001.history at start of
archive recovery, even when we know it is never present.
2010-01-26 00:07:13 +00:00
Tom Lane 875353b99f Fix assorted core dumps and Assert failures that could occur during
AbortTransaction or AbortSubTransaction, when trying to clean up after an
error that prevented (sub)transaction start from completing:
* access to TopTransactionResourceOwner that might not exist
* assert failure in AtEOXact_GUC, if AtStart_GUC not called yet
* assert failure or core dump in AfterTriggerEndSubXact, if
  AfterTriggerBeginSubXact not called yet

Per testing by injecting elog(ERROR) at successive steps in StartTransaction
and StartSubTransaction.  It's not clear whether all of these cases could
really occur in the field, but at least one of them is easily exposed by
simple stress testing, as per my accidental discovery yesterday.
2010-01-24 21:49:17 +00:00
Simon Riggs 959ac58c04 In HS, Startup process sets SIGALRM when waiting for buffer pin. If
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
2010-01-23 16:37:12 +00:00
Heikki Linnakangas 09b115f706 Write a WAL record whenever we perform an operation without WAL-logging
that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.

Original patch by Fujii Masao, with changes by me.
2010-01-20 19:43:40 +00:00
Simon Riggs a8ce974cdd Teach standby conflict resolution to use SIGUSR1
Conflict reason is passed through directly to the backend, so we can
take decisions about the effect of the conflict based upon the local
state. No specific changes, as yet, though this prepares for later work.
CancelVirtualTransaction() sends signals while holding ProcArrayLock.
Introduce errdetail_abort() to give message detail explaining that the
abort was caused by conflict processing. Remove CONFLICT_MODE states
in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.
2010-01-16 10:05:59 +00:00
Heikki Linnakangas 40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00
Simon Riggs 42edbd16fb During Hot Standby, set DatabasePath correctly during relcache init file
deletion, so that we attempt to unlink the correct filepath. unlink()
errors are ignorable there, so lack of a DatabasePath initialization step
did not cause visible problems until a related bug showed up on Solaris.

Code refactored from xact_redo_commit() to
ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay
shared invalidation messages for many databases, so we cannot
SetDatabasePath() once as we do in normal backends. Read the databaseid
from the shared invalidation messages, then set DatabasePath
temporarily before calling RelationCacheInitFileInvalidate().

Problem report by Robert Treat, analysis and fix by me.
2010-01-09 16:49:27 +00:00
Heikki Linnakangas 06f82b2961 Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at
recovery instead of reading the backup history file. This is more robust,
as it stops you from prematurely starting up an inconsisten cluster if the
backup history file is lost for some reason, or if the base backup was
never finished with pg_stop_backup().

This also paves the way for a simpler streaming replication patch, which
doesn't need to care about backup history files anymore.

The backup history file is still created and archived as before, but it's
not used by the system anymore. It's just for informational purposes now.

Bump PG_CONTROL_VERSION as the location of the backup startpoint is now
written to a new field in pg_control, and catversion because initdb is
required

Original patch by Fujii Masao per Simon's idea, with further fixes by me.
2010-01-04 12:50:50 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Heikki Linnakangas ff1e1e45b9 Reset minRecoveryPoint at checkpoints, so that we don't uselessly update
it in the control file at crash recovery following an archive recovery.

Per Fujii Masao and subsequent discussion.
2009-12-30 08:37:21 +00:00
Simon Riggs efc16ea520 Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00
Tom Lane 62aba76568 Prevent indirect security attacks via changing session-local state within
an allegedly immutable index function.  It was previously recognized that
we had to prevent such a function from executing SET/RESET ROLE/SESSION
AUTHORIZATION, or it could trivially obtain the privileges of the session
user.  However, since there is in general no privilege checking for changes
of session-local state, it is also possible for such a function to change
settings in a way that might subvert later operations in the same session.
Examples include changing search_path to cause an unexpected function to
be called, or replacing an existing prepared statement with another one
that will execute a function of the attacker's choosing.

The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
these threats, which are the same places previously deemed to need protection
against the SET ROLE issue.  GUC changes are still allowed, since there are
many useful cases for that, but we prevent security problems by forcing a
rollback of any GUC change after completing the operation.  Other cases are
handled by throwing an error if any change is attempted; these include temp
table creation, closing a cursor, and creating or deleting a prepared
statement.  (In 7.4, the infrastructure to roll back GUC changes doesn't
exist, so we settle for rejecting changes of "search_path" in these contexts.)

Original report and patch by Gurjeet Singh, additional analysis by
Tom Lane.

Security: CVE-2009-4136
2009-12-09 21:57:51 +00:00
Heikki Linnakangas cd87b6f8a5 Fix an old bug in multixact and two-phase commit. Prepared transactions can
be part of multixacts, so allocate a slot for each prepared transaction in
the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer
the oldest member value from the current backends slot to the prepared xact
slot. Also save and recover the value from the 2pc state file.

The symptom of the bug was that after a transaction prepared, a shared lock
still held by the prepared transaction was sometimes ignored by other
transactions.

Fix back to 8.1, where both 2PC and multixact were introduced.
2009-11-23 09:58:36 +00:00
Heikki Linnakangas 7f2a10fecd Don't error out if recycling or removing an old WAL segment fails at the end
of checkpoint. Although the checkpoint has been written to WAL at that point
already, so that all data is safe, and we'll retry removing the WAL segment at
the next checkpoint, if such a failure persists we won't be able to remove any
other old WAL segments either and will eventually run out of disk space. It's
better to treat the failure as non-fatal, and move on to clean any other WAL
segment and continue with any other end-of-checkpoint cleanup.

We don't normally expect any such failures, but on Windows it can happen with
some anti-virus or backup software that lock files without FILE_SHARE_DELETE
flag.

Also, the loop in pgrename() to retry when the file is locked was broken. If a
file is locked on Windows, you get ERROR_SHARE_VIOLATION, not
ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left
the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct
in some environment), and added ERROR_LOCK_VIOLATION to be consistent with
similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to
10s, on the grounds that since it's been broken, we've effectively had a
timeout of 0s and no-one has complained, so a smaller timeout is actually
closer to the old behavior. A longer timeout would mean that if recycling a
WAL file fails because it's locked for some reason, InstallXLogFileSegment()
will hold ControlFileLock for longer, potentially blocking other backends, so
a long timeout isn't totally harmless.

While we're at it, set errno correctly in pgrename().

Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c
changes would make sense on other platforms and thus on older versions as
well, but since there's no such locking issues on other platforms, it's not
worth it.
2009-09-13 18:32:08 +00:00
Heikki Linnakangas 4e2d5efc6a On Windows, when a file is deleted and another process still has an open
file handle on it, the file goes into "pending deletion" state where it
still shows up in directory listing, but isn't accessible otherwise. That
confuses RemoveOldXLogFiles(), making it think that the file hasn't been
archived yet, while it actually was, and it was deleted along with the .done
file.

Fix that by renaming the file with ".deleted" extension before deleting it.
Also check the return value of rename() and unlink(), so that if the removal
fails for any reason (e.g another process is holding the file locked), we
don't delete the .done file until the WAL file is really gone.

Backpatch to 8.2, which is the oldest version supported on Windows.
2009-09-10 09:42:10 +00:00
Tom Lane 794e3e81a0 Force VACUUM to recalculate oldestXmin even when we haven't changed our
own database's datfrozenxid, if the current value is old enough to be
forcing autovacuums or warning messages.  This ensures that a bogus
value is replaced as soon as possible.  Per a comment from Heikki.
2009-09-01 04:46:49 +00:00
Tom Lane 14f445fccf Actually, we need to bump the format identifier on twophase files
because of readjustment of 2PC rmgr IDs for flatfile removal.
2009-09-01 04:15:45 +00:00
Alvaro Herrera a8bb8eb583 Remove flatfiles.c, which is now obsolete.
Recent commits have removed the various uses it was supporting.  It was a
performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems
it slowed down user creation after a billion users.
2009-09-01 02:54:52 +00:00
Tom Lane 25ec228ef7 Track the current XID wrap limit (or more accurately, the oldest unfrozen
XID) in checkpoint records.  This eliminates the need to recompute the value
from scratch during database startup, which is one of the two remaining
reasons for the flatfile code to exist.  It should also simplify life for
hot-standby operation.

To avoid bloating the checkpoint records unreasonably, I switched from
tracking the oldest database by name to tracking it by OID.  This turns
out to save cycles in general (everywhere but the warning-generating
paths, which we hardly care about) and also helps us deal with the case
that the oldest database got dropped instead of being vacuumed.  The prior
coding might go for a long time without updating the wrap limit in that case,
which is bad because it might result in a lot of useless autovacuum activity.
2009-08-31 02:23:23 +00:00
Heikki Linnakangas 9cd6685f91 In the checkpoint written at the end of archive recovery, the WAL page header
was incorrectly initialized with timeline ID 0. That rendered the WAL page
unrecoverable, making a subsequent archive recovery stop at that point.
ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer().

This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug
was introduced by the changes to use of bgwriter for writing the
end-of-archive-recovery checkpoint. Patch by Tom Lane.
2009-08-27 07:15:41 +00:00
Tom Lane 04011cc970 Allow backends to start up without use of the flat-file copy of pg_database.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c.  This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h.  When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.

In the normal case, we are able to do an indexscan to find the database's row
by name.  This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database).  A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.

This change should effectively eliminate number of databases as a factor
in backend startup time, even with large numbers of databases.  However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether.  There are still several other sub-projects
to be tackled before that can happen.
2009-08-12 20:53:31 +00:00
Tom Lane 97e14f6e93 Document that LocalSetXLogInsertAllowed can be re-executed.
Per comment from Simon.
2009-08-08 16:39:17 +00:00
Tom Lane 87740caa01 rm_cleanup functions need to be allowed to write WAL entries. This oversight
appears to explain the recent reports of "PANIC: cannot make new WAL entries
during recovery".
2009-08-07 19:29:49 +00:00
Tom Lane 2de48a83e6 Cleanup and code review for the patch that made bgwriter active during
archive recovery.  Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy.  Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process).  Clarify handling of bad LSNs passed
to XLogFlush during recovery.  Use a spinlock for setting/testing
SharedRecoveryInProgress.  Improve quite a lot of comments.

Heikki and Tom
2009-06-26 20:29:04 +00:00
Heikki Linnakangas 7e48b77b1c Fix some serious bugs in archive recovery, now that bgwriter is active
during it:

When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.

When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.

Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.

This fixes bug #4879 reported by Fujii Masao.
2009-06-25 21:36:00 +00:00
Heikki Linnakangas ebaa1952f1 The code to unlink dropped relations in FinishPreparedTransaction() was
acting like runs inside WAL recovery, but it doesn't. I must've copy-pasted
this from a redo-function in the relation forks patch. Noticed by Tom Lane
while he was looking through callers of smgrdounlink().
2009-06-25 19:05:52 +00:00
Bruce Momjian d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Heikki Linnakangas 7c8d7a2eec Only recycle normal files in pg_xlog as WAL segments. pg_standby creates
symbolic links with the -l option, and as Fujii Masao pointed out we ended up
overwriting files in the archive directory before this patch. Patch by
Aidan Van Dyk, Fujii Masao and me.

Backpatch to 8.3, where pg_standby was introduced.
2009-06-02 06:18:06 +00:00
Heikki Linnakangas 2e6107cb62 When archiving is enabled, rotate the last WAL segment at shutdown so that
all transactions are archived.

Original patch by Guillaume Smet.
2009-05-28 11:02:16 +00:00
Tom Lane 4616d57dad Fix all the server-side SIGQUIT handlers (grumble ... why so many identical
copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended.  I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions.  Noted in an example from Dave Page.
2009-05-15 15:56:39 +00:00
Tom Lane bfab3f19e3 Include recovery_end_command in recovery.conf.sample.
Per suggestion of Jaime Casanova.
2009-05-14 22:22:01 +00:00
Tom Lane 284e12c398 Improve a couple of comments. 2009-05-14 21:28:35 +00:00
Heikki Linnakangas 9e403c2587 Add recovery_end_command option to recovery.conf. recovery_end_command
is run at the end of archive recovery, providing a chance to do external
cleanup. Modify pg_standby so that it no longer removes the trigger file,
that is to be done using the recovery_end_command now.

Provide a "smart" failover mode in pg_standby, where we don't fail over
immediately, but only after recovering all unapplied WAL from the archive.
That gives you zero data loss assuming all WAL was archived before
failover, which is what most users of pg_standby actually want.

recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and
myself.
2009-05-14 20:31:09 +00:00