Commit Graph

776 Commits

Author SHA1 Message Date
Fujii Masao 85d0e661aa Make recovery_target_action = pause work.
Previously even if recovery_target_action was set to pause and
the recovery target was reached, the recovery could never be paused.
Because the setting of pause was *always* overridden with that of
shutdown unexpectedly. This override is valid and intentional
if hot_standby is not enabled because there is no way to resume
the paused recovery in this case and the setting of pause is
completely useless. But not if hot_standby is enabled.

This patch changes the code so that the setting of pause is overridden
with that of shutdown only when hot_standby is not enabled.

Bug reported by Andres Freund
2015-05-21 13:56:17 +09:00
Heikki Linnakangas 4fc72cc7bb Collection of typo fixes.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.

Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.

Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
2015-05-20 16:56:22 +03:00
Simon Riggs f6a54fefc2 Fix spelling in comment 2015-05-19 18:37:46 -04:00
Heikki Linnakangas ffd37740ee Add archive_mode='always' option.
In 'always' mode, the standby independently archives all files it receives
from the primary.

Original patch by Fujii Masao, docs and review by me.
2015-05-15 18:55:24 +03:00
Andrew Dunstan 72d422a522 Map basebackup tablespaces using a tablespace_map file
Windows can't reliably restore symbolic links from a tar format, so
instead during backup start we create a tablespace_map file, which is
used by the restoring postgres to create the correct links in pg_tblspc.
The backup protocol also now has an option to request this file to be
included in the backup stream, and this is used by pg_basebackup when
operating in tar mode.

This is done on all platforms, not just Windows.

This means that pg_basebackup will not not work in tar mode against 9.4
and older servers, as this protocol option isn't implemented there.

Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.
2015-05-12 09:29:10 -04:00
Heikki Linnakangas de7688442f At promotion, archive last segment from old timeline with .partial suffix.
Previously, we would archive the possible-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.

To avoid such confusion, add a .partial suffix to the file. Or to be more
precise, make a copy of the old segment under the .partial suffix, and
archive that instead of the original file. pg_receivexlog also uses the
.partial suffix for the same purpose, to tell apart incompletely streamed
files from complete ones.

There is no automatic mechanism to use the .partial files at recovery, so
they will go unused, unless the administrator manually copies to them to
the pg_xlog directory (and removes the .partial suffix). Recovery won't
normally need the WAL - when recovering to the new timeline, it will find
the same WAL on the first segment on the new timeline instead - but it
nevertheless feels better to archive the file with the .partial suffix, for
debugging purposes if nothing else.
2015-05-08 21:59:01 +03:00
Heikki Linnakangas 179cdd0981 Add macros to check if a filename is a WAL segment or other such file.
We had many instances of the strlen + strspn combination to check for that.
This makes the code a bit easier to read.
2015-05-08 21:58:57 +03:00
Robert Haas 2ce439f337 Recursively fsync() the data directory after a crash.
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk.  This could lead to data corruption.

Back-patch to all supported versions.

Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
2015-05-04 14:13:53 -04:00
Robert Haas 924bcf4f16 Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things.  First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers.  Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes.  Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.

Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 15:02:14 -04:00
Andres Freund 5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consist out of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
Heikki Linnakangas 3d80a1e0e3 Fix logic to skip checkpoint if no records have been inserted.
After the WAL format changes, the calculation of the size of a checkpoint
record became incorrect. Instead of trying to fix the math, check that the
previous record, i.e. the xl_prev value that we'd write for the next
record, matches the last checkpoint's redo pointer. That way it's not
dependent on the size of the checkpoint record at all.

The old logic was actually slightly wrong all along: if the previous
checkpoint record crossed a page boundary, the page headers threw off the
record size calculation, and the checkpoint was not skipped. The new
checkpoint would not cross a page boundary, so this only resulted in at
most one extra checkpoint after the system became idle. The new logic fixes
that. (It's not worth fixing in backbranches).

However, it makes some sense to try to keep the latest checkpoint contained
fully in a page, or at least in a single WAL segment, just on general
robustness grounds. If something goes awfully wrong, it's more likely that
you can recover the latest WAL segment, than the last two WAL segments. So
I added an extra check that the checkpoint is not skipped if the previous
checkpoint crossed a WAL segment.

Reported by Jeff Janes.
2015-04-15 17:21:04 +03:00
Heikki Linnakangas 4f700bcd20 Reorganize our CRC source files again.
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.

Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
2015-04-14 17:03:42 +03:00
Heikki Linnakangas b2a5545bd6 Don't archive bogus recycled or preallocated files after timeline switch.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.

This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.

To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
2015-04-13 16:53:49 +03:00
Fujii Masao 6e4bf4ecd3 Fix error handling of XLogReaderAllocate in case of OOM
Similarly to previous fix 9b8d478, commit 2c03216 has switched
XLogReaderAllocate() to use a set of palloc calls instead of malloc,
causing any callers of this function to fail with an error instead of
receiving a NULL pointer in case of out-of-memory error. Fix this by
using palloc_extended with MCXT_ALLOC_NO_OOM that will safely return
NULL in case of an OOM.

Michael Paquier, slightly modified by me.
2015-04-03 21:55:37 +09:00
Andres Freund 62e2a8dc2c Define integer limits independently from the system definitions.
In 83ff1618 we defined integer limits iff they're not provided by the
system. That turns out not to be the greatest idea because there's
different ways some datatypes can be represented. E.g. on OSX PG's 64bit
datatype will be a 'long int', but OSX unconditionally uses 'long
long'. That disparity then can lead to warnings, e.g. around printf
formats.

One way to fix that would be to back int64 using stdint.h's
int64_t. While a good idea it's not that easy to implement. We would
e.g. need to include stdint.h in our external headers, which we don't
today. Also computing the correct int64 printf formats in that case is
nontrivial.

Instead simply prefix the integer limits with PG_ and define them
unconditionally. I've adjusted all the references to them in code, but
not the ones in comments; the latter seems unnecessary to me.

Discussion: 20150331141423.GK4878@alap3.anarazel.de
2015-04-02 17:43:35 +02:00
Andres Freund 83ff1618bc Centralize definition of integer limits.
Several submitted and even committed patches have run into the problem
that C89, our baseline, does not provide minimum/maximum values for
various integer datatypes. C99's stdint.h does, but we can't rely on
it.

Several parts of the code defined limits locally, so instead centralize
the definitions to c.h.

This patch also changes the more obvious usages of literal limit values;
there's more places that could be changed, but it's less clear whether
it's beneficial to change those.

Author: Andrew Gierth
Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk
2015-03-25 22:39:42 +01:00
Andres Freund 87cec51d3a Don't delay replication for less than recovery_min_apply_delay's resolution.
Recovery delays are implemented by waiting on a latch, and latches take
milliseconds as a parameter. The required amount of waiting was computed
using microsecond resolution though and the wait loop's abort condition
was checking the delay in microseconds as well.  This could lead to
short spurts of busy looping when the overall wait time was below a
millisecond, but above 0 microseconds.

Instead just formulate the wait loop's abort condition in millisecond
granularity as well. Given that that's recovery_min_apply_delay
resolution, it seems harmless to not wait for less than a millisecond.

Backpatch to 9.4 where recovery_min_apply_delay was introduced.

Discussion: 20150323141819.GH26995@alap3.anarazel.de
2015-03-23 16:51:11 +01:00
Andres Freund a1105c3dd4 Fix copy & paste error in 4f1b890b13.
Due to the bug delayed standbys would not delay when applying prepared
transactions.

Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com

Michael Paquier via Coverity.
2015-03-23 15:53:40 +01:00
Andres Freund 4f1b890b13 Merge the various forms of transaction commit & abort records.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.

That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recently additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.

Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.

Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by a bits in the 'xinfo' flag.  That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.

Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.

This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank.  The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.

The motivation for this work are not the space savings and and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.

Discussion: 20150220152150.GD4149@awork2.anarazel.de,
    235610.92468.qm%40web29004.mail.ird.yahoo.com

Reviewed-By: Heikki Linnakangas, Simon Riggs
2015-03-15 17:37:07 +01:00
Andres Freund a0f5954af1 Increase max_wal_size's default from 128MB to 1GB.
The introduction of min_wal_size & max_wal_size in 88e9823026 makes it
feasible to increase the default upper bound in checkpoint
size. Previously raising the default would lead to a increased disk
footprint, even if more segments weren't beneficial.  The low default of
checkpoint size is one of common performance problem users have thus
increasing the default makes sense.  Setups where the increase in
maximum disk usage is a problem will very likely have to run with a
modified configuration anyway.

Discussion: 54F4EFB8.40202@agliodbs.com,
    CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com

Author: Josh Berkus, after a discussion involving lots of people.
2015-03-15 17:37:07 +01:00
Andres Freund 51c11a7025 Remove pause_at_recovery_target recovery.conf setting.
The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4)
replaces it's functionality. Having both seems likely to cause more
confusion than it saves worry due to the incompatibility.

Discussion: 5484FC53.2060903@2ndquadrant.com
Author: Petr Jelinek
2015-03-15 17:37:07 +01:00
Fujii Masao 57aa5b2bb1 Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.

This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.

This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.

Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 15:52:24 +09:00
Alvaro Herrera 4f3924d9cd Keep CommitTs module in sync in standby and master
We allow this module to be turned off on restarts, so a restart time
check is enough to activate or deactivate the module; however, if there
is a standby replaying WAL emitted from a master which is restarted, but
the standby isn't, the state in the standby becomes inconsistent and can
easily be crashed.

Fix by activating and deactivating the module during WAL replay on
parameter change as well as on system start.

Problem reported by Fujii Masao in
http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com

Author: Petr Jelínek
2015-03-09 17:44:00 -03:00
Heikki Linnakangas 88e9823026 Replace checkpoint_segments with min_wal_size and max_wal_size.
Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints, and determines how many checkpoints to recycle, they are now
separate concerns. There is still an internal variable called
CheckpointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and trying to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.

The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckpointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.

Reviewed by Amit Kapila and Venkata Balaji N.
2015-02-23 18:53:02 +02:00
Fujii Masao 5d2b45e3f7 Add GUC to control the time to wait before retrieving WAL after failed attempt.
Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.

This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.

Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
2015-02-23 20:55:17 +09:00
Heikki Linnakangas 49b04188f8 Fix thinko in re-setting wal_log_hints flag from a parameter-change record.
The flag is supposed to be copied from the record. Same issue with
track_commit_timestamps, but that's master-only.

Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was
added.
2015-01-15 20:52:41 +02:00
Heikki Linnakangas 1e78d81e88 Don't open a WAL segment for writing at end of recovery.
Since commit ba94518a, we used XLogFileOpen to open the next segment for
writing, but if the end-of-recovery happens exactly at a segment boundary,
the new segment might not exist yet. (Before ba94518a, XLogFileOpen was
correct, because we would open the previous segment if the switch happened
at the boundary.)

Instead of trying to create it if necessary, it's simpler to not bother
opening the segment at all. XLogWrite() will open or create it soon anyway,
after writing the checkpoint or end-of-recovery record.

Reported by Andres Freund.
2015-01-07 16:20:20 +02:00
Bruce Momjian 4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Tom Lane d6657d2a10 Treat negative values of recovery_min_apply_delay as having no effect.
At one point in the development of this feature, it was claimed that
allowing negative values would be useful to compensate for timezone
differences between master and slave servers.  That was based on a mistaken
assumption that commit timestamps are recorded in local time; but of course
they're in UTC.  Nor is a negative apply delay likely to be a sane way of
coping with server clock skew.  However, the committed patch still treated
negative delays as doing something, and the timezone misapprehension
survived in the user documentation as well.

If recovery_min_apply_delay were a proper GUC we'd just set the minimum
allowed value to be zero; but for the moment it seems better to treat
negative settings as if they were zero.

In passing do some extra wordsmithing on the parameter's documentation,
including correcting a second misstatement that the parameter affects
processing of Restore Point records.

Issue noted by Michael Paquier, who also provided the code patch; doc
changes by me.  Back-patch to 9.4 where the feature was introduced.
2015-01-03 13:14:03 -05:00
Heikki Linnakangas 2ef6c66a2b Fix file descriptor leak at end of recovery.
XLogFileInit() returns a file descriptor, which needs to be closed. The leak
was short-lived, since the startup process exits shortly afterwards, but it
was clearly a bug, nevertheless.

Per Coverity report.
2014-12-21 21:51:59 +02:00
Heikki Linnakangas 5c805d0a81 Fix timestamp in end-of-recovery WAL records.
We used time(null) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.

Backpatch to 9.3 and above, where the fast promotion was introduced.
2014-12-19 17:04:20 +02:00
Heikki Linnakangas ba94518aad Change how first WAL segment on new timeline after promotion is created.
Two changes:

1. When copying a WAL segment from old timeline to create the first segment
on the new timeline, only copy up to the point where the timeline switch
happens, and zero-fill the rest. This avoids corner cases where we might
think that the copied WAL from the previous timeline belong to the new
timeline.

2. If the timeline switch happens at a segment boundary, don't copy the
whole old segment to the new timeline. It's pointless, because it's 100%
identical to the old segment.
2014-12-18 20:23:03 +02:00
Andres Freund c303e9e7e5 Fix (re-)starting from a basebackup taken off a standby after a failure.
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.

That logic had a error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.

That's possible when the a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because inbetween reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.

To fix simply also allow starting when the control file indicates
"shutdown in recovery". There's nicer fixes imaginable, but they'd be
more invasive.

Backpatch to 9.2 where support for taking basebackups from standbys
was added.
2014-12-18 08:47:27 +01:00
Simon Riggs b8e33a85d4 Tweaks for recovery_target_action
Rename parameter action_at_recovery_target to
recovery_target_action suggested by Christoph Berg.

Place into recovery.conf suggested by Fujii Masao,
replacing (deprecating) earlier parameters, per
Michael Paquier.
2014-12-07 21:55:29 +09:00
Alvaro Herrera 73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Heikki Linnakangas afeacd2748 Fix assertion failure at end of PITR.
InitXLogInsert() cannot be called in a critical section, because it
allocates memory. But CreateCheckPoint() did that, when called for the
end-of-recovery checkpoint by the startup process.

In the passing, fix the scratch space allocation in InitXLogInsert to go to
the right memory context. Also update the comment at InitXLOGAccess, which
hasn't been totally accurate since hot standby was introduced (in a hot
standby backend, InitXLOGAccess isn't called at backend startup).

Reported by Michael Paquier
2014-11-28 09:31:53 +02:00
Simon Riggs aedccb1f6f action_at_recovery_target recovery config option
action_at_recovery_target = pause | promote | shutdown

Petr Jelinek

Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs
2014-11-25 20:13:30 +00:00
Heikki Linnakangas 0bd624d63b Distinguish XLOG_FPI records generated for hint-bit updates.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
2014-11-24 11:09:08 +02:00
Heikki Linnakangas 2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Andres Freund d3586fc8aa Ensure unlogged tables are reset even if crash recovery errors out.
Unlogged relations are reset at the end of crash recovery as they're
only synced to disk during a proper shutdown. Unfortunately that and
later steps can fail, e.g. due to running out of space. This reset
was, up to now performed after marking the database as having finished
crash recovery successfully. As out of space errors trigger a crash
restart that could lead to the situation that not all unlogged
relations are reset.

Once that happend usage of unlogged relations could yield errors like
"could not open file "...": No such file or directory". Luckily
clusters that show the problem can be fixed by performing a immediate
shutdown, and starting the database again.

To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT)
earlier, before marking the database as having successfully recovered.

Discussion: 20140912112246.GA4984@alap3.anarazel.de

Backpatch to 9.1 where unlogged tables were introduced.

Abhijit Menon-Sen and Andres Freund
2014-11-15 01:19:26 +01:00
Heikki Linnakangas 7250d8535b Fix building with WAL_DEBUG.
Now that the backup blocks are appended to the WAL record in xloginsert.c,
XLogInsert doesn't see them anymore and cannot remove them from the version
reconstructed for xlog_outdesc. This makes running with wal_debug=on more
expensive, as we now make (unnecessary) temporary copies of the backup
blocks, but it doesn't seem worth convoluting the code to keep that
optimization.

Reported by Alvaro Herrera.
2014-11-07 23:09:31 +02:00
Heikki Linnakangas 2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Heikki Linnakangas 5028f22f6e Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.

Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.

The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 11:39:48 +02:00
Fujii Masao c7371c4a60 Prevent the already-archived WAL file from being archived again.
Previously the archive recovery always created .ready file for
the last WAL file of the old timeline at the end of recovery even when
it's restored from the archive and has .done file. That is, there was
the case where the WAL file had both .ready and .done files.
This caused the already-archived WAL file to be archived again.

This commit prevents the archive recovery from creating .ready file
for the last WAL file if it has .done file, in order to prevent it from
being archived again.

This bug was added when cascading replication feature was introduced,
i.e., the commit 5286105800.
So, back-patch to 9.2, where cascading replication was added.

Reviewed by Michael Paquier
2014-10-23 16:21:27 +09:00
Andres Freund 5e5b65f359 Don't duplicate log_checkpoint messages for both of restart and checkpoints.
The duplication originated in cdd46c765, where restartpoints were
introduced.

In LogCheckpointStart's case the duplication actually lead to the
compiler's format string checking not to be effective because the
format string wasn't constant.

Arguably these messages shouldn't be elog(), but ereport() style
messages. That'd even allow to translate the messages... But as
there's more mistakes of that kind in surrounding code, it seems
better to change that separately.
2014-10-21 01:01:56 +02:00
Andres Freund 11abd6c90f Renumber CHECKPOINT_* flags.
Commit 7dbb606938 added a new CHECKPOINT_FLUSH_ALL flag. As that
commit needed to be backpatched I didn't change the numeric values of
the existing flags as that could lead to nastly problems if any
external code issued checkpoints. That's not a concern on master, so
renumber them there.

Also add a comment about CHECKPOINT_FLUSH_ALL above
CreateCheckPoint().
2014-10-21 00:20:08 +02:00
Andres Freund 7dbb606938 Flush unlogged table's buffers when copying or moving databases.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.

That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.

This was reported in bug #10675 by Maxim Boguk.

Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.

Backpatch to 9.1 where unlogged tables were introduced.
2014-10-20 23:43:46 +02:00
Peter Eisentraut b7a08c8028 Message improvements 2014-10-12 01:06:35 -04:00
Heikki Linnakangas 5fa6c81a43 Remove num_xloginsert_locks GUC, replace with a #define
I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
2014-10-01 16:42:26 +03:00
Andres Freund ef8863844b Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE.
As noted in http://bugs.debian.org/763098 there is a conflict between
postgres' definition of CACHE_LINE_SIZE and the definition by various
*bsd platforms. It's debatable who has the right to define such a
name, but postgres' use was only introduced in 375d8526f2 (9.4), so
it seems like a good idea to rename it.

Discussion: 20140930195756.GC27407@msg.df7cb.de

Per complaint of Christoph Berg in the above email, although he's not
the original bug reporter.

Backpatch to 9.4 where the define was introduced.
2014-10-01 12:17:03 +02:00