Commit Graph

308 Commits

Author SHA1 Message Date
Simon Riggs a8a8a3e096 Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
2011-03-06 22:49:16 +00:00
Heikki Linnakangas be6668d6ef Increase the default for wal_sender_delay from 200ms to 1s. Now that WAL
sender is immediately woken up by transaction commit, there's no need to
wake up so aggressively.
2011-02-26 23:38:25 +02:00
Simon Riggs bc76695c4c Make a hard state change from catchup to streaming mode.
More useful state change for monitoring purposes, plus a
required change for synchronous replication patch.
2011-02-18 15:07:26 +00:00
Simon Riggs 06828c5feb Separate messages for standby replies and hot standby feedback.
Allow messages to be sent at different times, and greatly reduce
the frequency of hot standby feedback. Refactor to allow additional
message types.
2011-02-18 11:31:49 +00:00
Tom Lane 93016983d1 Fix blatantly uninitialized variable in recent commit.
Doesn't anybody around here pay attention to compiler warnings?
2011-02-16 19:53:20 -05:00
Simon Riggs bca8b7f16a Hot Standby feedback for avoidance of cleanup conflicts on standby.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.

Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
2011-02-16 19:29:37 +00:00
Robert Haas 883a9659fa Assorted corrections to the patch to add WAL receiver replies.
Per reports from Fujii Masao.
2011-02-15 12:05:00 -05:00
Heikki Linnakangas b186523fd9 Send status updates back from standby server to master, indicating how far
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.

Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
2011-02-10 21:04:02 +02:00
Magnus Hagander 76129e7f14 Include more status information in walsender results
Add the current xlog insert location to the response of
IDENTIFY_SYSTEM, and adds result sets containing start
and stop location of backups to BASE_BACKUP responses.
2011-02-03 13:46:23 +01:00
Magnus Hagander 507069de6d Add option to include WAL in base backup
When included, this makes the base backup a complete working
"clone" of the initial database, ready to have a postmaster
started against it without the need to set up any log archiving
or similar.

Magnus Hagander, reviewed by Fujii Masao and Heikki Linnakangas
2011-01-30 21:30:09 +01:00
Magnus Hagander e5487f65fd Make walsender options order-independent
While doing this, also move base backup options into
a struct instead of increasing the number of parameters
to multiple functions for each new option.
2011-01-23 23:39:18 +01:00
Magnus Hagander f88a638199 Only show pg_stat_replication details to superusers 2011-01-23 17:28:19 +01:00
Magnus Hagander 048d148fe6 Add pg_basebackup tool for streaming base backups
This tool makes it possible to do the pg_start_backup/
copy files/pg_stop_backup step in a single command.

There are still some steps to be done before this is a
complete backup solution, such as the ability to stream
the required WAL logs, but it's still usable, and
could do with some buildfarm coverage.

In passing, make the checkpoint request optionally
fast instead of hardcoding it.

Magnus Hagander, reviewed by Fujii Masao and Dimitri Fontaine
2011-01-23 12:21:23 +01:00
Heikki Linnakangas 8f5d65e916 Treat a WAL sender process that hasn't started streaming yet as a regular
backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g the end-of-backup WAL record to be written after the shutdown
checkpoint.
2011-01-15 16:38:21 +02:00
Magnus Hagander fcd810c69a Use a lexer and grammar for parsing walsender commands
Makes it easier to parse mainly the BASE_BACKUP command
with it's options, and avoids having to manually deal
with quoted identifiers in the label (previously broken),
and makes it easier to add new commands and options in
the future.

In passing, refactor the case statement in the walsender
to put each command in it's own function.
2011-01-14 16:30:33 +01:00
Magnus Hagander 688423d004 Exit from base backups when shutdown is requested
When the exit waits until the whole backup completes, it may take
a very long time.

In passing, add back an error check in the main loop so we detect
clients that disconnect much earlier if the backup is large.
2011-01-14 12:36:45 +01:00
Magnus Hagander 9eacd427e8 Make sure walsender state is only read while holding the spinlock
Noted by Robert Haas.
2011-01-13 18:51:13 +01:00
Magnus Hagander 4c8e20f815 Track walsender state in shared memory and expose in pg_stat_replication 2011-01-11 21:25:28 +01:00
Magnus Hagander b7ebda9d8c Reset walsender ps title in the main loop
When in streaming mode we can never get out, so it will never
be required, but after a base backup (or other operations)
we can get back to the loop, so the title needs to be cleared.
2011-01-11 10:04:54 +01:00
Magnus Hagander 0eb59c4591 Backend support for streaming base backups
Add BASE_BACKUP command to walsender, allowing it to stream a
base backup to the client (in tar format). The syntax is still
far from ideal, that will be fixed in the switch to use a proper
grammar for walsender.

No client included yet, will come as a separate commit.

Magnus Hagander and Heikki Linnakangas
2011-01-10 14:04:19 +01:00
Itagaki Takahiro a755ea33ae New system view pg_stat_replication displays activity of wal sender processes.
Itagaki Takahiro and Simon Riggs.
2011-01-07 20:35:38 +09:00
Bruce Momjian 5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas d3d414696f Allow bidirectional copy messages in streaming replication mode.
Fujii Masao.  Review by Alvaro Herrera, Tom Lane, and myself.
2010-12-11 09:27:37 -05:00
Heikki Linnakangas 8c843fff2d Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001)
rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a
more future-proof fix for the corner-case bug in streaming replication that
was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in
February as well. Avoiding 0/0 as a valid value should prevent bugs like that
in the future. Per Tom Lane's idea.

Back-patch to 9.0. Since this only affects bootstrapping, it makes no
difference to existing installations. We don't need to worry about the
bug in existing installations, because if you've managed to get past the
initial base backup already, you won't hit the bug in the future either.
2010-11-02 11:39:48 +02:00
Heikki Linnakangas 931b6db39b Fix corner-case bug in tracking of latest removed WAL segment during
streaming replication. We used log/seg 0/0 to indicate that no WAL segments
have been removed since startup, but 0/0 is a valid value for the very first
WAL segment after initdb. To make that disambiguous, store
(latest removed WAL segment + 1) in the global variable.

Per report from Matt Chesler, also reproduced by Greg Smith.
2010-11-01 10:05:15 +02:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Heikki Linnakangas 1eab7a560d Don't call OwnLatch while holding a spinlock. OwnLatch can elog() under
some "can't happen" scenarios, and spinlocks should only be held for
a few instructions anyway. As pointed out by Fujii Masao.
2010-09-15 06:51:19 +00:00
Heikki Linnakangas 3522217b63 Oops, the timeout argument to WaitLatchOrSocket is in microseconds, not
milliseconds.
2010-09-14 13:35:14 +00:00
Heikki Linnakangas 2746e5f21d Introduce latches. A latch is a boolean variable, with the capability to
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.

On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.

Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.

Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
2010-09-11 15:48:04 +00:00
Robert Haas bca03b12c1 Add missing function prototype.
Fujii Masao
2010-07-22 13:03:11 +00:00
Bruce Momjian 239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane 07e8b6aabc Don't allow walsender to send WAL data until it's been safely fsync'd on the
master.  Otherwise a subsequent crash could cause the master to lose WAL that
has already been applied on the slave, resulting in the slave being out of
sync and soon corrupt.  Per recent discussion and an example from Robert Haas.

Fujii Masao
2010-06-17 16:41:25 +00:00
Tom Lane 3ceb44fa0a Adjust misleading comment in walsender.c. We try to send all WAL data that's
been written out from shared memory, but the previous phrasing might be read
to say that we send only what's been fsync'd.
2010-06-03 23:00:14 +00:00
Tom Lane 0cc59cc1f3 Add current WAL end (as seen by walsender, ie, GetWriteRecPtr() result)
and current server clock time to SR data messages.  These are not currently
used on the slave side but seem likely to be useful in future, and it'd be
better not to change the SR protocol after release.  Per discussion.
Also do some minor code review and cleanup on walsender.c, and improve the
protocol documentation.
2010-06-03 22:17:32 +00:00
Peter Eisentraut cb6038c168 Fix some inconsistent quoting of wal_level values in messages
When referring to postgresql.conf syntax, then it's without quotes
(wal_level=archive); in narrative it's with double quotes.  But never
single quotes.
2010-06-03 21:02:12 +00:00
Heikki Linnakangas e0b581acd2 Send all outstanding WAL before exiting when smart shutdown is requested.
This was broken by my previous patch to send WAL in smaller batches.

Patch by Fujii Masao.
2010-05-31 10:44:37 +00:00
Heikki Linnakangas fbcdff39bd Thinko in previous commit: ensure that MAX_SEND_SIZE is always greater
than XLOG_BLCKSZ, by defining it as 16 * XLOG_BLCKSZ rather than directly
as 128k bytes.
2010-05-26 22:34:49 +00:00
Heikki Linnakangas ea5516081d In walsender, don't sleep if there's outstanding WAL waiting to be sent,
otherwise we effectively rate-limit the streaming as pointed out by
Simon Riggs. Also, send the WAL in smaller chunks, to respond to signals
more promptly.
2010-05-26 22:21:33 +00:00
Tom Lane 2ea56cbda6 Fix missing static declaration for XLogRead(). 2010-05-09 18:11:55 +00:00
Tom Lane 77acab75df Modify ShmemInitStruct and ShmemInitHash to throw errors internally,
rather than returning NULL for some-but-not-all failures as they used to.
Remove now-redundant tests for NULL from call sites.

We had to do something about this because many call sites were failing to
check for NULL; and changing it like this seems a lot more useful and
mistake-proof than adding checks to the call sites without them.
2010-04-28 16:54:16 +00:00
Heikki Linnakangas 9b8a73326e Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.

Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.

Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 16:10:43 +00:00
Tom Lane a3c6d10575 Move the check for whether walreceiver has authenticated as a superuser
from walsender.c, where it didn't really belong, to postinit.c where it does
belong (and is essentially free, too).
2010-04-21 00:51:57 +00:00
Heikki Linnakangas 258174b462 Need to use the start pointer of a block we read from WAL segment in
the calculation, not the end pointer, as pointed out by Fujii Masao.
2010-04-12 10:18:50 +00:00
Heikki Linnakangas e57cd7f0a1 Change the logic to decide when to delete old WAL segments, so that it
doesn't take into account how far the WAL senders are. This way a hung
WAL sender doesn't prevent old WAL segments from being recycled/removed
in the primary, ultimately causing the disk to fill up. Instead add
standby_keep_segments setting to control how many old WAL segments are
kept in the primary. This also makes it more reliable to use streaming
replication without WAL archiving, assuming that you set
standby_keep_segments high enough.
2010-04-12 09:52:29 +00:00
Robert Haas 54943734f8 Refer to max_wal_senders in a more consistent fashion.
The error message now makes explicit reference to the GUC that must be changed
to fix the problem, using wording suggested by Tom Lane.  Along the way,
rename the GUC from MaxWalSenders to max_wal_senders for consistency and
grep-ability.
2010-04-01 00:43:29 +00:00
Heikki Linnakangas 59292f28ca Flush CopyOutResponse when starting streaming in walsender, so that it's
not delayed until the first WAL record is sent.

Fujii Masao
2010-03-26 12:23:34 +00:00
Simon Riggs 92fc0db99f Additional thoughts on WALSender cpu reduction. Use long type
and alter a comment to reduce confusion.
2010-03-24 21:41:57 +00:00
Simon Riggs 08882ce74c Reduce CPU utilisation of WALSender process. Process was using 10% CPU
doing nothing, caused by naptime specified in milliseconds yet units of
pg_usleep() parameter is microseconds. Correctly specifying units
reduces call frequency by 1000. Reduction in CPU consumption verified.
2010-03-24 20:11:12 +00:00
Heikki Linnakangas a383c55a1d Throw a nicer error message if a standby server attempts to connect while
the master is still in recovery. We don't support cascading slaves yet.

Patch by Fujii Masao, with slightly changed wording.
2010-03-16 09:09:55 +00:00
Bruce Momjian 65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Heikki Linnakangas cd2b7d3c4d Fix streaming replication starting at the very first WAL segment.
Per complaint from Greg Stark.
2010-02-25 07:31:40 +00:00
Heikki Linnakangas 3e87ba6ef7 Fix pq_getbyte_if_available() function. It was confused on what it
returns if no data is immediately available. Patch by me with numerous
fixes from Fujii Masao and Magnus Hagander.
2010-02-18 11:13:46 +00:00
Tom Lane 50a90fac40 Stamp HEAD as 9.0devel, and update various places that were referring to 8.5
(hope I got 'em all).  Per discussion, this release will be 9.0 not 8.5.
2010-02-17 04:19:41 +00:00
Heikki Linnakangas 808969d0e7 Add a message type header to the CopyData messages sent from primary
to standby in streaming replication. While we only have one message type
at the moment, adding a message type header makes this easier to extend.
2010-02-03 09:47:19 +00:00
Heikki Linnakangas 83cb7da7dc Fix bug in wasender's xlogid boundary handling, reported by Erik Rijkers.
LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't
try to send that in XLogSend().

Also fix similar bug in ReadRecord(), which I just introduced in the
ReadRecord() refactoring patch.
2010-01-27 16:41:09 +00:00
Heikki Linnakangas 2d29f5f59a Fix bogus comments. 2010-01-21 08:19:57 +00:00
Heikki Linnakangas d3be71a208 Remove unused (in non-assertion-enabled build) variable. 2010-01-15 11:47:15 +00:00
Heikki Linnakangas 40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00