Commit Graph

2133 Commits

Author SHA1 Message Date
Heikki Linnakangas b669f416ce Fix checkpoint after fast promotion.
The intention was to request a regular online checkpoint immediately after
end of recovery, when performing "fast promotion". However, because the
checkpoint was requested before other backends were allowed to write WAL,
the checkpointer process performed a restartpoint rather than a checkpoint.

Delay the RequestCheckPoint call until after recovery has truly ended, so
that you get a real checkpoint.
2013-02-11 22:22:08 +02:00
Heikki Linnakangas 7803e9327d Include previous TLI in end-of-recovery and shutdown checkpoint records.
This isn't used for anything but a sanity check at the moment, but it could
be highly valuable for debugging purposes. It could also be used to recreate
timeline history by traversing WAL, which seems useful.
2013-02-11 18:16:25 +02:00
Tom Lane c352ea2d74 Further cleanup of gistsplit.c.
After further reflection I was unconvinced that the existing coding is
guaranteed to return valid union datums in every code path for multi-column
indexes.  Fix that by forcing a gistunionsubkey() call at the end of the
recursion.  Having done that, we can remove some clearly-redundant calls
elsewhere.  This should be a little faster for multi-column indexes (since
the previous coding would uselessly do such a call for each column while
unwinding the recursion), as well as much harder to break.

Also, simplify the handling of cases where one side or the other of a
primary split contains only don't-care tuples.  The previous coding used a
very ugly hack in removeDontCares() that essentially forced one random
tuple to be treated as non-don't-care, providing a random initial choice of
seed datum for the secondary split.  It seems unlikely that that method
will give better-than-random splits.  Instead, treat such a split as
degenerate and just let the next column determine the split, the same way
that we handle fully degenerate cases where the two sides produce identical
union datums.
2013-02-10 16:21:26 -05:00
Tom Lane db3d7e9f0d Remove useless picksplit-doesn't-support-secondary-split log spam.
This LOG message was put in over five years ago with the evident
expectation that we'd make all GiST opclasses support secondary split
directly.  However, no such thing ever happened, and indeed the number of
opclasses supporting it decreased to zero in 9.2.  The reason is that
improving on the default implementation isn't that easy --- the
opclass-specific code that did exist, before 9.2, doesn't appear to have
been any improvement over the default.

Hence, remove the message altogether.  There's certainly no point in
nagging users about this in released branches, but I doubt that we'll
ever implement complete opclass-specific support anyway.
2013-02-10 13:07:40 -05:00
Tom Lane dacc185f52 Remove vestigial secondary-split support in gist_box_picksplit().
Not only is this implementation of secondary-split not better than the
default implementation in gistsplit.c, it's actually worse.  The gistsplit.c
code at least looks to see if switching the left and right sides would make
a better merge with the previously-split tuples, while this doesn't.

In any case it's rather useless to support secondary split only in an edge
case.  There used to be more complete support for it here (in chooseLR()),
but that was removed in commit 7f3bd86843.
It appears to me though that the chooseLR() code was really isomorphic to
the default implementation, since it was still based on choosing the cheaper
way of adding two sub-split vectors that had been chosen without regard to
the primary split initially.  I think an implementation of secondary split
that could beat the default implementation would have to be pretty fully
integrated into the split algorithm, not plastered on at the end.

Back-patch to 9.2, but not further; previous branches have the chooseLR()
code which I don't feel a great need to mess with.  This is mainly so we
just have two behaviors and not three among the various branches (IOW, this
patch is cleanup for commit 7f3bd86843e5aad84585a57d3f6b80db3c609916's
incomplete removal of secondary-split support).
2013-02-10 12:40:09 -05:00
Tom Lane 0fd0f3688b Document and clean up gistsplit.c.
Improve comments, rename some variables and functions, slightly simplify
a couple of APIs, in an attempt to make this code readable by people other
than its original author.

Even though this is essentially just cosmetic, back-patch to all active
branches, because otherwise it's going to make back-patching future fixes
in this file very painful.
2013-02-10 11:58:15 -05:00
Tom Lane a187c96d26 Reduce log level of picksplit-doesn't-support-secondary-split whining.
This was agreed to back in 2007, but never actually done.

Josh Hansen
2013-02-09 12:17:55 -05:00
Tom Lane 3c29b196b0 Fix gist_box_same and gist_point_consistent to handle fuzziness correctly.
While there's considerable doubt that we want fuzzy behavior in the
geometric operators at all (let alone as currently implemented), nobody is
stepping forward to redesign that stuff.  In the meantime it behooves us
to make sure that index searches agree with the behavior of the underlying
operators.  This patch fixes two problems in this area.

First, gist_box_same was using fuzzy equality, but it really needs to use
exact equality to prevent not-quite-identical upper index keys from being
treated as identical, which for example would prevent an existing upper
key from being extended by an amount less than epsilon.  This would result
in inconsistent indexes.  (The next release notes will need to recommend
that users reindex GiST indexes on boxes, polygons, circles, and points,
since all four opclasses use gist_box_same.)

Second, gist_point_consistent used exact comparisons for upper-page
comparisons in ~= searches, when it needs to use fuzzy comparisons to
ensure it finds all matches; and it used fuzzy comparisons for point <@ box
searches, when it needs to use exact comparisons because that's what the
<@ operator (rather inconsistently) does.

The added regression test cases illustrate all three misbehaviors.

Back-patch to all active branches.  (8.4 did not have GiST point_ops,
but it still seems prudent to apply the gist_box_same patch to it.)

Alexander Korotkov, reviewed by Noah Misch
2013-02-08 18:03:17 -05:00
Alvaro Herrera 5766228bc6 Fix Xmax freeze conditions
I broke this in 0ac5ad5134; previously, freezing a tuple marked with an
IS_MULTI xmax was not necessary.

Per brokenness report from Jeff Janes.
2013-02-08 12:50:58 -03:00
Tom Lane 166d534fcd Repair bugs in GiST page splitting code for multi-column indexes.
When considering a non-last column in a multi-column GiST index,
gistsplit.c tries to improve on the split chosen by the opclass-specific
pickSplit function by considering penalties for the next column.  However,
there were two bugs in this code: it failed to recompute the union keys for
the leftmost index columns, even though these might well change after
reassigning tuples; and it included the old union keys in the recomputation
for the columns it did recompute, so that those keys couldn't get smaller
even if they should.  The first problem could result in an invalid index
in which searches wouldn't find index entries that are in fact present;
the second would make the index less efficient to search.

Both of these errors were caused by misuse of gistMakeUnionItVec, whose
API was designed in a way that just begged such errors to be made.  There
is no situation in which it's safe or useful to compute the union keys for
a subset of the index columns, and there is no caller that wants any
previous union keys to be included in the computation; so the undocumented
choice to treat the union keys as in/out rather than pure output parameters
is a waste of code as well as being dangerous.

Hence, rather than just making a minimal patch, I've changed the API of
gistMakeUnionItVec to remove the "startkey" parameter (it now always
processes all index columns) and treat the attr/isnull arrays as purely
output parameters.

In passing, also get rid of a couple of unnecessary and dangerous uses
of static variables in gistutil.c.  It's remarkable that the one in
gistMakeUnionKey hasn't given us portability troubles before now, because
in addition to posing a re-entrancy hazard, it was unsafely assuming that
a static char[] array would have at least Datum alignment.

Per investigation of a trouble report from Tomas Vondra.  (There are also
some bugs in contrib/btree_gist to be fixed, but that seems like material
for a separate patch.)  Back-patch to all supported branches.
2013-02-07 17:44:02 -05:00
Simon Riggs 072521b8c8 Rely only on checkpoint 1 at end of recovery.
Searching for checkpoint 2 (previous) is not
correct in all cases.

Bug report from Heikki Linnakangas
2013-02-07 16:33:05 +00:00
Alvaro Herrera 5a1cd89f8f Split out list of XLog resource managers
The new rmgrlist.h header, containing all necessary data
about built-in resource managers, allows other pieces of code to
access them.

In particular, this allows a future pg_xlogdump program to extract
rm_desc function pointers, without having to keep a duplicate list of
them.
2013-02-06 08:47:28 -03:00
Alvaro Herrera 9ee00ef4c7 Fill tuple before HeapSatisfiesHOTAndKeyUpdate
Failing to do this results in almost all updates to system catalogs
being non-HOT updates, because the OID column would differ (not having
been set for the new tuple), which is an indexed column.

While at it, make sure to set the tableoid early in both old and new
tuples as well.  This isn't of much consequence, since that column is
seldom (never?) indexed.

Report and patch from Andres Freund.
2013-02-01 10:43:09 -03:00
Alvaro Herrera b78647a0e6 Restrict infomask bits to set on multixacts
We must only set the bit(s) for the strongest lock held in the tuple;
otherwise, a multixact containing members with exclusive lock and
key-share lock will behave as though only a share lock is held.

This bug was introduced in commit 0ac5ad5134, somewhere along
development, when we allowed a singleton FOR SHARE lock to be
implemented without a MultiXact by using a multi-bit pattern.
I overlooked that GetMultiXactIdHintBits() needed to be tweaked as well.
Previously, we could have the bits for FOR KEY SHARE and FOR UPDATE
simultaneously set and it wouldn't cause a problem.

Per report from digoal@126.com
2013-01-31 19:35:31 -03:00
Simon Riggs 3f0ab05233 Switch timelines if we crash soon after promotion.
Previous patch to skip checkpoints at end of recovery didn't
correctly perform crash recovery, fumbling the timeline switch.
Now we record the minRecoveryPointTLI of the newly selected
timeline, so that we crash recover to the correct timeline.

Bug report from Fujii Masao, investigated by me.
2013-01-31 19:29:32 +00:00
Tom Lane 991f3e5ab3 Provide database object names as separate fields in error messages.
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure.  It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error.  (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)

Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.

Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
2013-01-29 17:08:26 -05:00
Simon Riggs fd4ced5230 Fast promote mode skips checkpoint at end of recovery.
pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we
can achieve very fast failover when the apply delay is low. Write new WAL record
XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log
readers. If we skip synchronous end of recovery checkpoint we request a normal
spread checkpoint so that the window of re-recovery is low.

Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao.
Review by Heikki Linnakangas
2013-01-29 00:06:15 +00:00
Heikki Linnakangas ba1cc6501e Add some randomness to the choice of which GiST page to insert to.
When descending the tree for an insert, and there are multiple equally good
pages we could insert to, make the choice in random. Previously, we would
always choose the tuple with lowest offset number. That meant that when two
non-leaf pages overlap - in the extreme case they might have exactly the same
key - all but the first such page went unused. That wasn't optimal for space
usage; if you deleted some tuples from the non-first pages, the space would
never be reused.

With this patch, the other pages are sometimes chosen too, although there's
still a heavy bias towards low-offset tuples, so that we don't lose cache
locality when doing a lot of inserts with similar keys.

Original idea by Alexander Korotkov, although this patch version was written
by me and copy-edited by Tom Lane.
2013-01-25 16:58:38 +02:00
Simon Riggs 5c54f63fd6 Fix rare missing cancellations in Hot Standby.
The machinery around XLOG_HEAP2_CLEANUP_INFO failed
to correctly pass through the necessary information
on latestRemovedXid, avoiding cancellations in some
infrequent concurrent update/cleanup scenarios.

Backpatchable fix to 9.0

Detailed bug report and fix by Noah Misch,
backpatchable version by me.
2013-01-24 14:19:29 +00:00
Alvaro Herrera 0ac5ad5134 Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE".  UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.

Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.

The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid.  Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates.  This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed.  pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.

Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header.  This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.

Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)

With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.

As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.

Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane.  There's probably room for several more tests.

There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it.  Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.

This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
	1290721684-sup-3951@alvh.no-ip.org
	1294953201-sup-2099@alvh.no-ip.org
	1320343602-sup-2290@alvh.no-ip.org
	1339690386-sup-8927@alvh.no-ip.org
	4FE5FF020200002500048A3D@gw.wicourts.gov
	4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 12:04:59 -03:00
Heikki Linnakangas 990fe3c4ed Fix more issues with cascading replication and timeline switches.
When a standby server follows the master using WAL archive, and it chooses
a new timeline (recovery_target_timeline='latest'), it only fetches the
timeline history file for the chosen target timeline, not any other history
files that might be missing from pg_xlog. For example, if the current
timeline is 2, and we choose 4 as the new recovery target timeline, the
history file for timeline 3 is not fetched, even if it's part of this
server's history. That's enough for the standby itself - the history file
for timeline 4 includes timeline 3 as well - but if a cascading standby
server wants to recover to timeline 3, it needs the history file. To fix,
when a new recovery target timeline is chosen, try to copy any missing
history files from the archive to pg_xlog between the old and new target
timeline.

A second similar issue was with the WAL files. When a standby recovers from
archive, and it reaches a segment that contains a switch to a new timeline,
recovery fetches only the WAL file labelled with the new timeline's ID. The
file from the new timeline contains a copy of the WAL from the old timeline
up to the point where the switch happened, and recovery recovers it from the
new file. But in streaming replication, walsender only tries to read it
from the old timeline's file. To fix, change walsender to read it from the
new file, so that it behaves the same as recovery in that sense, and doesn't
try to open the possibly nonexistent file with the old timeline's ID.
2013-01-23 10:19:20 +02:00
Alvaro Herrera 8c17144c75 Fix off-by-one bug in xlog reading logic
Bug reported by Michael Paquier

Author: Andres Freund
2013-01-18 11:19:53 -03:00
Heikki Linnakangas 2ff6555313 Use the right timeline when beginning to stream from master.
The xlogreader refactoring broke the logic to decide which timeline to start
streaming from. XLogPageRead() uses the timeline history to check which
timeline the requested WAL position falls into. However, after the
refactoring, XLogPageRead() is always first called with the first page in
the segment, to verify the segment header, and only then with the actual WAL
position we're interested in. That first read of the segment's header made
XLogPageRead() to always start streaming from the old timeline containing
the segment header, not the timeline containing the actual record, if there
was a timeline switch within the segment.

I thought I fixed this yesterday, but that fix was too narrow and only fixed
this for the corner-case that the timeline switch happened in the first page
of the segment. To fix this more robustly, pass explicitly the position of
the record we're actually interested in to XLogPageRead, and use that to
decide which timeline to read from, rather than deduce it from the page and
offset.

Per report from Fujii Masao.
2013-01-18 11:46:49 +02:00
Heikki Linnakangas 88228e6f1d When xlogreader asks the callback function to read a page, make sure we
get a large enough part of the page to include the beginning of the next
record we're interested in. The XLogPageRead callback uses the requested
length to decide which timeline to stream WAL from, and if the first call
is short, and the page contains a timeline switch, we'll repeatedly try
to stream that page from the old timeline, and never get across the
timeline switch.
2013-01-17 23:46:33 +02:00
Heikki Linnakangas 0b6329130e Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.

When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.

This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
2013-01-17 20:23:00 +02:00
Heikki Linnakangas 1296d5c53c Fix a couple of error-handling bugs in the xlogreader patch.
XLogReadRecord should reset its state on every error, to make sure it
re-reads the page on next call. It was inconsistent in that some errors did
that, but some did not.

In ReadRecord(), don't give up on an error if we're in standby mode. The
loop was set up to retry, but the checks within the loop broke out of the
loop on any error.

Andres Freund, with some tweaking by me.
2013-01-17 19:27:04 +02:00
Heikki Linnakangas 9ee4d06f3f Make GiST indexes on-disk compatible with 9.2 again.
The patch that turned XLogRecPtr into a uint64 inadvertently changed the
on-disk format of GiST indexes, because the NSN field in the GiST page
opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
field back to the two-field struct that XLogRecPtr was before. This is the
same we did to LSNs in the page header to avoid changing on-disk format.

Bump catversion, as this invalidates any existing GiST indexes built on
9.3devel.
2013-01-17 16:46:16 +02:00
Alvaro Herrera 7fcbf6a405 Split out XLog reading as an independent facility
This new facility can not only be used by xlog.c to carry out crash
recovery, but also by external programs.  By supplying a function to
read XLog pages from somewhere, all the WAL reading can be used for
completely different purposes.

For the standard backend use, the behavior should be pretty much the
same as previously.  As for non-backend programs, an hypothetical
pg_xlogdump program is now closer to reality, but some more backend
support is still necessary.

This patch was originally submitted by Andres Freund in a different
form, but Heikki Linnakangas opted for and authored another design of
the concept.  Andres has advanced the patch since Heikki's initial
version.  Review and some (mostly cosmetics) changes by me.
2013-01-16 16:12:53 -03:00
Alvaro Herrera 692079e5dc Remove spurious space
Andres Freund
2013-01-14 12:08:59 -03:00
Tom Lane 31f38f28b0 Redesign the planner's handling of index-descent cost estimation.
Historically we've used a couple of very ad-hoc fudge factors to try to
get the right results when indexes of different sizes would satisfy a
query with the same number of index leaf tuples being visited.  In
commit 21a39de580 I tweaked one of these
fudge factors, with results that proved disastrous for larger indexes.
Commit bf01e34b55 fudged it some more,
but still with not a lot of principle behind it.

What seems like a better way to address these issues is to explicitly model
index-descent costs, since that's what's really at stake when considering
diferent indexes with similar leaf-page-level costs.  We tried that once
long ago, and found that charging random_page_cost per page descended
through was way too much, because upper btree levels tend to stay in cache
in real-world workloads.  However, there's still CPU costs to think about,
and the previous fudge factors can be seen as a crude attempt to account
for those costs.  So this patch replaces those fudge factors with explicit
charges for the number of tuple comparisons needed to descend the index
tree, plus a small charge per page touched in the descent.  The cost
multipliers are chosen so that the resulting charges are in the vicinity of
the historical (pre-9.2) fudge factors for indexes of up to about a million
tuples, while not ballooning unreasonably beyond that, as the old fudge
factor did (even more so in 9.2).

To make this work accurately for btree indexes, add some code that allows
extraction of the known root-page height from a btree.  There's no
equivalent number readily available for other index types, but we can use
the log of the number of index pages as an approximate substitute.

This seems like too much of a behavioral change to risk back-patching,
but it should improve matters going forward.  In 9.2 I'll just revert
the fudge-factor change.
2013-01-11 12:56:58 -05:00
Heikki Linnakangas b0daba57bb Tolerate timeline switches while "pg_basebackup -X fetch" is running.
If you take a base backup from a standby server with "pg_basebackup -X
fetch", and the timeline switches while the backup is being taken, the
backup used to fail with an error "requested WAL segment %s has already
been removed". This is because the server-side code that sends over the
required WAL files would not construct the WAL filename with the correct
timeline after a switch.

Fix that by using readdir() to scan pg_xlog for all the WAL segments in the
range, regardless of timeline.

Also, include all timeline history files in the backup, if taken with
"-X fetch". That fixes another related bug: If a timeline switch happened
just before the backup was initiated in a standby, the WAL segment
containing the initial checkpoint record contains WAL from the older
timeline too. Recovery will not accept that without a timeline history file
that lists the older timeline.

Backpatch to 9.2. Versions prior to that were not affected as you could not
take a base backup from a standby before 9.2.
2013-01-03 19:51:00 +02:00
Heikki Linnakangas ee994272ca Delay reading timeline history file until it's fetched from master.
Streaming replication can fetch any missing timeline history files from the
master, but recovery would read the timeline history file for the target
timeline before reading the checkpoint record, and before walreceiver has
had a chance to fetch it from the master. Delay reading it, and the sanity
checks involving timeline history, until after reading the checkpoint
record.

There is at least one scenario where this makes a difference: if you take
a base backup from a standby server right after a timeline switch, the
WAL segment containing the initial checkpoint record will begin with an
older timeline ID. Without the timeline history file, recovering that file
will fail as the older timeline ID is not recognized to be an ancestor of
the target timeline. If you try to recover from such a backup, using only
streaming replication to fetch the WAL, this patch is required for that to
work.
2013-01-03 10:41:58 +02:00
Heikki Linnakangas d194d7a526 Fix bug in streaming replication over multiple tli switches.
After receiving some WAL over streaming replication, try to open the file
from the timeline we're currently recieving, not recoveryTargetTLI. They
are usually the same, which is why wasn't noticed before, but you'd get
an error if there have been more than one timeline switch between the
current point in WAL and the recovery target.
2013-01-02 14:35:15 +02:00
Heikki Linnakangas 4ffd589f44 Fix silly typo in code, which broke the check for reaching consistency. 2013-01-02 13:44:59 +02:00
Bruce Momjian bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Heikki Linnakangas 60df192aea Keep timeline history files restored from archive in pg_xlog.
The cascading standby patch in 9.2 changed the way WAL files are treated
when restored from the archive. Before, they were restored under a temporary
filename, and not kept in pg_xlog, but after the patch, they were copied
under pg_xlog. This is necessary for a cascading standby to find them, but
it also means that if the archive goes offline and a standby is restarted,
it can recover back to where it was using the files in pg_xlog. It also
means that if you take an offline backup from a standby server, it includes
all the required WAL files in pg_xlog.

However, the same change was not made to timeline history files, so if the
WAL segment containing the checkpoint record contains a timeline switch, you
will still get an error if you try to restart recovery without the archive,
or recover from an offline backup taken from the standby.

With this patch, timeline history files restored from archive are copied
into pg_xlog like WAL files are, so that pg_xlog contains all the files
required to recover. This is a corner-case pre-existing issue in 9.2, but
even more important in master where it's possible for a standby to follow a
timeline switch through streaming replication. To make that possible, the
timeline history files must be present in pg_xlog.
2012-12-30 14:29:45 +02:00
Alvaro Herrera 5ab3af46dd Remove obsolete XLogRecPtr macros
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter.  The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway.  The clarity of simpler expressions
makes the change worthwhile.

Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).

Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
2012-12-28 13:06:15 -03:00
Alvaro Herrera 24eca7977e Assign InvalidXLogRecPtr instead of MemSet(0)
For consistency.

Author: Andres Freund
2012-12-27 18:33:03 -03:00
Peter Eisentraut a0bfb7b36e Fix grammatical mistake in error message 2012-12-20 23:36:13 -05:00
Heikki Linnakangas 343ee00b73 Fix recycling of WAL segments after switching timeline during recovery.
This was broken before, we would recycle old WAL segments on wrong timeline
after the recovery target timeline had changed, but my recent commit to
not initialize ThisTimeLineID at all in a standby's checkpointer process
broke this completely.

The problem is that when installing a recycled WAL segment as a future one,
ThisTimeLineID is used to construct the filename. To fix, always update
ThisTimeLineID to the current timeline being recovered, before recycling
WAL segments at a restartpoint.

This still leaves a small window where we might install WAL segments under
wrong timeline ID, if the timeline is changed just as we're about to start
recycling. Also, even if we're replaying timeline X at the momnent, there's
no guarantee that we'll need as many WAL segments on that timeline as we
recycle. We might be just about to reach the point where we switch to next
timeline, so might only need one more WAL segment on the current timeline.
We'll live with the waste in that situation.

Bug pointed out by Fujii Masao. 9.1 and 9.2 had the same issue, when
recovery target timeline was changed, but I committed a slightly different
version of this patch on those branches.
2012-12-20 22:00:58 +02:00
Heikki Linnakangas af275a12df Follow TLI of last replayed record, not recovery target TLI, in walsenders.
Most of the time, the last replayed record comes from the recovery target
timeline, but there is a corner case where it makes a difference. When
the startup process scans for a new timeline, and decides to change recovery
target timeline, there is a window where the recovery target TLI has already
been bumped, but there are no WAL segments from the new timeline in pg_xlog
yet. For example, if we have just replayed up to point 0/30002D8, on
timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog
that contains the WAL up to that point. When recovery switches recovery
target timeline to 2, a walsender can immediately try to read WAL from
0/30002D8, from timeline 2, so it will try to open WAL file
000000020000000000000003. However, that doesn't exist yet - the startup
process hasn't copied that file from the archive yet nor has the walreceiver
streamed it yet, so walsender fails with error "requested WAL segment
000000020000000000000003 has already been removed". That's harmless, in that
the standby will try to reconnect later and by that time the segment is
already created, but error messages that should be ignored are not good.

To fix that, have walsender track the TLI of the last replayed record,
instead of the recovery target timeline. That way walsender will not try to
read anything from timeline 2, until the WAL segment has been created and at
least one record has been replayed from it. The recovery target timeline is
now xlog.c's internal affair, it doesn't need to be exposed in shared memory
anymore.

This fixes the error reported by Thom Brown. depesz the same error message,
but I'm not sure if this fixes his scenario.
2012-12-20 14:39:04 +02:00
Heikki Linnakangas 1a11d4609e Don't set ThisTimeLineID in checkpointer & bgwriter during recovery.
We used to set it to the current recovery target timeline, but the recovery
target timeline can change during recovery, leaving ThisTimeLineID at an
old value. That seems worse than always leaving it at zero to begin with.

AFAICS there was no good reason to set it in the first place. ThisTimeLineID
is not needed in checkpointer or bgwriter process, until it's time to write
the end-of-recovery checkpoint, and at that point ThisTimeLineID is updated
anyway.
2012-12-20 14:39:04 +02:00
Heikki Linnakangas e43f947bf3 Check if we've reached end-of-backup point also if no redo is required.
If you restored from a backup taken from a standby, and the last record in
the backup is the checkpoint record, ie. there is no redo required except
for the checkpoint record, we would fail to notice that we've reached the
end-of-backup point, and the database is consistent. The result was an
error "WAL ends before end of online backup". To fix, move the
have-we-reached-end-of-backup check into CheckRecoveryConsistency(), which
is already responsible for similar checks with minRecoveryPoint, and is
called in the right places.

Backpatch to 9.2, this check and bug did not exist before that.
2012-12-19 14:22:00 +02:00
Robert Haas 75758a6ff0 Update comment in heapgetpage() regarding PD_ALL_VISIBLE vs. Hot Standby.
Pavan Deolasee, slightly modified by me
2012-12-14 15:44:38 -05:00
Heikki Linnakangas abfd192b1b Allow a streaming replication standby to follow a timeline switch.
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary doesn't exactly match the standby. The
situation where it doesn't match is when you have a master, and two
standbys, and you promote one of the standbys to become new master.
Promoting bumps up the timeline ID, and after that bump, the other standby
would refuse to continue.

There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.

START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of timeline reached, putting the walsender
back into accepting a replication command.

Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
2012-12-13 19:17:32 +02:00
Heikki Linnakangas 527668717a Make xlog_internal.h includable in frontend context.
This makes unnecessary the ugly hack used to #include postgres.h in
pg_basebackup.

Based on Alvaro Herrera's patch
2012-12-13 14:59:13 +02:00
Heikki Linnakangas 6264cd3d69 In multi-insert, don't go into infinite loop on a huge tuple and fillfactor.
If a tuple is larger than page size minus space reserved for fillfactor,
heap_multi_insert would never find a page that it fits in and repeatedly ask
for a new page from RelationGetBufferForTuple. If a tuple is too large to
fit on any page, taking fillfactor into account, RelationGetBufferForTuple
will always expand the relation. In a normal insert, heap_insert will accept
that and put the tuple on the new page. heap_multi_insert, however, does a
fillfactor check of its own, and doesn't accept the newly-extended page
RelationGetBufferForTuple returns, even though there is no other choice to
make the tuple fit.

Fix that by making the logic in heap_multi_insert more like the heap_insert
logic. The first tuple is always put on the page RelationGetBufferForTuple
gives us, and the fillfactor check is only applied to the subsequent tuples.

Report from David Gould, although I didn't use his patch.
2012-12-12 13:54:42 +02:00
Heikki Linnakangas 970fb12de1 Consistency check should compare last record replayed, not last record read.
EndRecPtr is the last record that we've read, but not necessarily yet
replayed. CheckRecoveryConsistency should compare minRecoveryPoint with the
last replayed record instead. This caused recovery to think it's reached
consistency too early.

Now that we do the check in CheckRecoveryConsistency correctly, we have to
move the call of that function to after redoing a record. The current place,
after reading a record but before replaying it, is wrong. In particular, if
there are no more records after the one ending at minRecoveryPoint, we don't
enter hot standby until one extra record is generated and read by the
standby, and CheckRecoveryConsistency is called. These two bugs conspired
to make the code appear to work correctly, except for the small window
between reading the last record that reaches minRecoveryPoint, and
replaying it.

In the passing, rename recoveryLastRecPtr, which is the last record
replayed, to lastReplayedEndRecPtr. This makes it slightly less confusing
with replayEndRecPtr, which is the last record read that we're about to
replay.

Original report from Kyotaro HORIGUCHI, further diagnosis by Fujii Masao.
Backpatch to 9.0, where Hot Standby subtly changed the test from
"minRecoveryPoint < EndRecPtr" to "minRecoveryPoint <= EndRecPtr". The
former works because where the test is performed, we have always read one
more record than we've replayed.
2012-12-11 18:54:02 +02:00
Heikki Linnakangas 7bffc9b7bf Update minimum recovery point on truncation.
If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.

Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that,
minRecoveryPoint was not updated during recovery at all.
2012-12-10 16:57:16 +02:00
Heikki Linnakangas 6be799664a Fix the tracking of min recovery point timeline.
Forgot to update it at the right place. Also, consider checkpoint record
that switches to new timelne to be on the new timeline.

This fixes erroneous "requested timeline 2 does not contain minimum recovery
point" errors, pointed out by Amit Kapila while testing another patch.
2012-12-10 16:04:26 +02:00