If a standby server has a cascading standby server connected to it, it's
possible that WAL has already been sent up to the next WAL page boundary,
splitting a WAL record in the middle, when the first standby server is
promoted. Don't throw an assertion failure or error in walsender if that
happens.
Also, fix a variant of the same bug in pg_receivexlog: if it had already
received WAL on previous timeline up to a segment boundary, when the
upstream standby server is promoted so that the timeline switch record falls
on the previous segment, pg_receivexlog would miss the segment containing
the timeline switch. To fix that, have walsender send the position of the
timeline switch at end-of-streaming, in addition to the next timeline's ID.
It was previously assumed that the switch happened exactly where the
streaming stopped.
Note: this is an incompatible change in the streaming protocol. You might
get an error if you try to stream over timeline switches, if the client is
running 9.3beta1 and the server is more recent. It should be fine after a
reconnect, however.
Reported by Fujii Masao.
Previously, libpq and the backend had opposite ideas about whether
it was necessary for the client to send a CopyDone message after
receiving an ErrorResponse, making it impossible to cleanly exit
COPY BOTH mode. Fix libpq so that works correctly, adopting the
backend's notion that an ErrorResponse kills the copy in both
directions.
Adjust receivelog.c to avoid a degradation in the quality of the
resulting error messages. libpqwalreceiver.c is already doing
the right thing, so no adjustment needed there.
Add an explicit statement to the documentation explaining how
this part of the protocol is supposed to work, in the hopes of
avoiding future confusion in this area.
Since the consequences of all this confusion are very limited,
especially in the back-branches where no client ever attempts
to exit COPY BOTH mode without closing the connection entirely,
no back-patch.
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure. It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error. (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)
Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.
Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.
When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.
This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary doesn't exactly match the standby. The
situation where it doesn't match is when you have a master, and two
standbys, and you promote one of the standbys to become new master.
Promoting bumps up the timeline ID, and after that bump, the other standby
would refuse to continue.
There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.
START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of timeline reached, putting the walsender
back into accepting a replication command.
Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
We used to send structs wrapped in CopyData messages, which works as long as
the client and server agree on things like endianess, timestamp format and
alignment. That's good enough for running a standby server, which has to run
on the same platform anyway, but it's useful for tools like pg_receivexlog
to work across platforms.
This breaks protocol compatibility of streaming replication, but we never
promised that to be compatible across versions, anyway.
Both programs got the "magic" string wrong, causing standard-conforming tar
implementations to believe the output was just legacy tar format without
any POSIX extensions. This doesn't actually matter that much, especially
since pg_dump failed to fill the POSIX fields anyway, but still there is
little point in emitting tar format if we can't be compliant with the
standard. In addition, pg_dump failed to write the EOF marker correctly
(there should be 2 blocks of zeroes not just one), pg_basebackup put the
numeric group ID in the wrong place, and both programs had a pretty
brain-dead idea of how to compute the checksum. Fix all that and improve
the comments a bit.
pg_restore is modified to accept either the correct POSIX-compliant "magic"
string or the previous value. This part of the change will need to be
back-patched to avoid an unnecessary compatibility break when a previous
version tries to read tar-format output from 9.3 pg_dump.
Brian Weaver and Tom Lane
Rewrite plancache.c so that a "cached plan" (which is rather a misnomer
at this point) can support generation of custom, parameter-value-dependent
plans, and can make an intelligent choice between using custom plans and
the traditional generic-plan approach. The specific choice algorithm
implemented here can probably be improved in future, but this commit is
all about getting the mechanism in place, not the policy.
In addition, restructure the API to greatly reduce the amount of extraneous
data copying needed. The main compromise needed to make that possible was
to split the initial creation of a CachedPlanSource into two steps. It's
worth noting in particular that SPI_saveplan is now deprecated in favor of
SPI_keepplan, which accomplishes the same end result with zero data
copying, and no need to then spend even more cycles throwing away the
original SPIPlan. The risk of long-term memory leaks while manipulating
SPIPlans has also been greatly reduced. Most of this improvement is based
on use of the recently-added MemoryContextSetParent primitive.
It turns out the reason we hadn't found out about the portability issues
with our credential-control-message code is that almost no modern platforms
use that code at all; the ones that used to need it now offer getpeereid(),
which we choose first. The last holdout was NetBSD, and they added
getpeereid() as of 5.0. So far as I can tell, the only live platform on
which that code was being exercised was Debian/kFreeBSD, ie, FreeBSD kernel
with Linux userland --- since glibc doesn't provide getpeereid(), we fell
back to the control message code. However, the FreeBSD kernel provides a
LOCAL_PEERCRED socket parameter that's functionally equivalent to Linux's
SO_PEERCRED. That is both much simpler to use than control messages, and
superior because it doesn't require receiving a message from the other end
at just the right time.
Therefore, add code to use LOCAL_PEERCRED when necessary, and rip out all
the credential-control-message code in the backend. (libpq still has such
code so that it can still talk to pre-9.1 servers ... but eventually we can
get rid of it there too.) Clean up related autoconf probes, too.
This means that libpq's requirepeer parameter now works on exactly the same
platforms where the backend supports peer authentication, so adjust the
documentation accordingly.
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.
This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
Add the current xlog insert location to the response of
IDENTIFY_SYSTEM, and adds result sets containing start
and stop location of backups to BASE_BACKUP responses.
When included, this makes the base backup a complete working
"clone" of the initial database, ready to have a postmaster
started against it without the need to set up any log archiving
or similar.
Magnus Hagander, reviewed by Fujii Masao and Heikki Linnakangas
This tool makes it possible to do the pg_start_backup/
copy files/pg_stop_backup step in a single command.
There are still some steps to be done before this is a
complete backup solution, such as the ability to stream
the required WAL logs, but it's still usable, and
could do with some buildfarm coverage.
In passing, make the checkpoint request optionally
fast instead of hardcoding it.
Magnus Hagander, reviewed by Fujii Masao and Dimitri Fontaine
Makes it easier to parse mainly the BASE_BACKUP command
with it's options, and avoids having to manually deal
with quoted identifiers in the label (previously broken),
and makes it easier to add new commands and options in
the future.
In passing, refactor the case statement in the walsender
to put each command in it's own function.
Add BASE_BACKUP command to walsender, allowing it to stream a
base backup to the client (in tar format). The syntax is still
far from ideal, that will be fixed in the switch to use a proper
grammar for walsender.
No client included yet, will come as a separate commit.
Magnus Hagander and Heikki Linnakangas
and current server clock time to SR data messages. These are not currently
used on the slave side but seem likely to be useful in future, and it'd be
better not to change the SR protocol after release. Per discussion.
Also do some minor code review and cleanup on walsender.c, and improve the
protocol documentation.
The endterm attribute is mainly useful when the toolchain does not support
automatic link target text generation for a particular situation. In the
past, this was required by the man page tools for all reference page links,
but that is no longer the case, and it now actually gets in the way of
proper automatic link text generation. The only remaining use cases are
currently xrefs to refsects.
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
to the client by the server. This might seem pretty pointless but apparently
it will help pgbouncer, and perhaps other connection poolers. Anyway it's
practically free to do so for the normal use-case where appname is only set
in the startup packet --- we're just adding a few more bytes to the initial
ParameterStatus response packet. Per comments from Marko Kreen.
from DateStyle, and create a new interval style that produces output matching
the SQL standard (at least for interval values that fall within the standard's
restrictions). IntervalStyle is also used to resolve the conflict between the
standard and traditional Postgres rules for interpreting negative interval
input.
Ron Mayer
ParameterStatus message can be sent during COPY OUT: it's definitely
possible, since COPY from a SELECT subquery can trigger any user-defined
function.
we need to be able to swallow NOTICE messages, and potentially also
ParameterStatus messages (although the latter would be a bit weird),
without exiting COPY OUT state. Fix it, and adjust the protocol documentation
to emphasize the need for this. Per off-list report from Alexander Galler.