Commit Graph

1662 Commits

Author SHA1 Message Date
Andres Freund 450c7defa6 Remove volatiles from {procarray,volatile}.c and fix memory ordering issue.
The use of volatiles in procarray.c largely originated from the time
when postgres did not have reliable compiler and memory
barriers. That's not the case anymore, so we can do better.

Several of the functions in procarray.c can be bottlenecks, and
removal of volatile yields mildly better code.

The new state, with explicit memory barriers, is also more
correct. The previous use of volatile did not actually deliver
sufficient guarantees on weakly ordered machines, in particular the
logic in GetNewTransactionId() does not look safe.  It seems unlikely
to be a problem in practice, but worth fixing.

Thomas and I independently wrote a patch for this.

Reported-By: Andres Freund and Thomas Munro
Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro
Discussion:
    https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de
    https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com
2018-11-10 16:11:57 -08:00
Thomas Munro c24dcd0cfd Use pg_pread() and pg_pwrite() for data files and WAL.
Cut down on system calls by doing random I/O using offset-based OS
routines where available.  Remove the code for tracking the 'virtual'
seek position.  The only reason left to call FileSeek() was to get
the file's size, so provide a new function FileSize() instead.

Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
2018-11-07 09:51:50 +13:00
Magnus Hagander fbec7459aa Fix spelling errors and typos in comments
Author: Daniel Gustafsson <daniel@yesql.se>
2018-11-02 13:56:52 +01:00
Andres Freund c4ab62f9ac Fix typo in xlog.c.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
2018-10-31 07:50:32 -07:00
Michael Paquier 10074651e3 Add pg_promote function
This function is able to promote a standby with this new SQL-callable
function.  Execution access can be granted to non-superusers so that
failover tools can observe the principle of least privilege.

Catalog version is bumped.

Author: Laurenz Albe
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
2018-10-25 09:46:00 +09:00
Tom Lane 13cd7209f7 Simplify use of AllocSetContextCreate() wrapper macro.
We can allow this macro to accept either abbreviated or non-abbreviated
allocation parameters by making use of __VA_ARGS__.  As noted by Andres
Freund, it's unlikely that any compiler would have __builtin_constant_p
but not __VA_ARGS__, so this gives up little or no error checking, and
it avoids a minor but annoying API break for extensions.

With this change, there is no reason for anybody to call
AllocSetContextCreateExtended directly, so in HEAD I renamed it to
AllocSetContextCreateInternal.  It's probably too late for an ABI
break like that in 11, though.

Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
2018-10-12 14:26:56 -04:00
Thomas Munro 212fab9926 Relax transactional restrictions on ALTER TYPE ... ADD VALUE (redux).
Originally committed as 15bc038f (plus some follow-ups), this was
reverted in 28e07270 due to a problem discovered in parallel
workers.  This new version corrects that problem by sending the
list of uncommitted enum values to parallel workers.

Here follows the original commit message describing the change:

To prevent possibly breaking indexes on enum columns, we must keep
uncommitted enum values from getting stored in tables, unless we
can be sure that any such column is new in the current transaction.

Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE
from being executed at all in a transaction block, unless the target
enum type had been created in the current transaction.  This patch
removes that restriction, and instead insists that an uncommitted enum
value can't be referenced unless it belongs to an enum type created
in the same transaction as the value.  Per discussion, this should be
a bit less onerous.  It does require each function that could possibly
return a new enum value to SQL operations to check this restriction,
but there aren't so many of those that this seems unmaintainable.

Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com
Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
2018-10-09 12:51:01 +13:00
Tom Lane 82ff0cc91d Advance transaction timestamp for intra-procedure transactions.
Per discussion, this behavior seems less astonishing than not doing so.

Peter Eisentraut and Tom Lane

Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
2018-10-08 16:16:36 -04:00
Tom Lane 07ee62ce9e Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.
Previously, a worker process would establish values for these based on
its own start time.  In v10 and up, this can trivially be shown to cause
misbehavior of transaction_timestamp(), timestamp_in(), and related
functions which are (perhaps unwisely?) marked parallel-safe.  It seems
likely that other behaviors might diverge from what happens in the parent
as well.

It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure
it's still possible, so back-patch to all branches containing parallel
worker infrastructure.

In HEAD only, mark now() and statement_timestamp() as parallel-safe
(other affected functions already were).  While in theory we could
still squeeze that change into v11, it doesn't seem important enough
to force a last-minute catversion bump.

Konstantin Knizhnik, whacked around a bit by me

Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
2018-10-06 12:00:09 -04:00
Amit Kapila a86bf6057e Fix assertion failure when updating full_page_writes for checkpointer.
When the checkpointer receives a SIGHUP signal to update its configuration,
it may need to update the shared memory for full_page_writes and need to
write a WAL record for it.  Now, it is quite possible that the XLOG
machinery has not been initialized by that time and it will lead to
assertion failure while doing that.  Fix is to allow the initialization of
the XLOG machinery outside critical section.

This bug has been introduced by the commit 2c03216d83 which added the XLOG
machinery initialization in RecoveryInProgress code path.

Reported-by: Dilip Kumar
Author: Dilip Kumar
Reviewed-by: Michael Paquier and Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
2018-09-28 16:40:04 +05:30
Michael Paquier 78ea8b5daa Fix WAL recycling on standbys depending on archive_mode
A restart point or a checkpoint recycling WAL segments treats segments
marked with neither ".done" (archiving is done) or ".ready" (segment is
ready to be archived) in archive_status the same way for archive_mode
being "on" or "always".  While for a primary this is fine, a standby
running a restart point with archive_mode = on would try to mark such a
segment as ready for archiving, which is something that will never
happen except after the standby is promoted.

Note that this problem applies only to WAL segments coming from the
local pg_wal the first time archive recovery is run.  Segments part of a
self-contained base backup are the most common case where this could
happen, however even in this case normally the .done markers would be
most likely part of the backup.  Segments recovered from an archive are
marked as .ready or .done by the startup process, and segments finished
streaming are marked as such by the WAL receiver, so they are handled
already.

Reported-by: Haruka Takatsuka
Author: Michael Paquier
Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org
Backpatch-through: 9.5, where archive_mode = always has been added.
2018-09-28 11:54:38 +09:00
Michael Paquier 8d28bf500f Rework activation of commit timestamps during recovery
The activation and deactivation of commit timestamp tracking has not
been handled consistently for a primary or standbys at recovery.  The
facility can be activated at three different moments of recovery:
- The beginning, where a primary would use the GUC value for the
decision-making, and where a standby relies on the contents of the
control file.
- When replaying a XLOG_PARAMETER_CHANGE record at redo.
- The end, where both primary and standby rely on the GUC value.

Using the GUC value for a primary at the beginning of recovery causes
problems with commit timestamp access when doing crash recovery.
Particularly, when replaying transaction commits, it could be possible
that an attempt to read commit timestamps is done for a transaction
which committed at a moment when track_commit_timestamp was disabled.

A test case is added to reproduce the failure.  The test works down to
v11 as it takes advantage of transaction commits within procedures.

Reported-by: Hailong Li
Author: Masahiko Sawasa, Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com
Backpatch-through: 9.5, where commit timestamps have been introduced.
2018-09-26 10:25:54 +09:00
Thomas Munro 6c3c9d4189 Defer restoration of libraries in parallel workers.
Several users of extensions complained of crashes in parallel workers
that turned out to be due to syscache access from their _PG_init()
functions.  Reorder the initialization of parallel workers so that
libraries are restored after the caches are initialized, and inside a
transaction.

This was reported in bug #15350 and elsewhere.  We don't consider it
to be a bug: extensions shouldn't do that, because then they can't be
used in shared_preload_libraries.  However, it's a fairly obscure
hazard and these extensions worked in practice before parallel query
came along.  So let's make it work.  Later commits might add a warning
message and eventually an error.

Back-patch to 9.6, where parallel query landed.

Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Kieran McCusker, Jimmy
Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
2018-09-20 14:21:18 +12:00
Tom Lane 0d38e4ebb7 Fix minor error message style guide violation.
No periods at the ends of primary error messages, please.

Daniel Gustafsson

Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
2018-09-19 17:06:40 -04:00
Amit Kapila bc153c941d Attach FPI to the first record after full_page_writes is turned on.
XLogInsert fails to attach a required FPI to the first record after
full_page_writes is turned on by the last checkpoint.  This bug got
introduced in 9.5 due to code rearrangement in commits 2c03216d83 and
2076db2aea.  Fix it by ensuring that XLogInsertRecord performs a
recomputation when the given record is generated with FPW as off but
found that the flag has been turned on while actually inserting the
record.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila
Backpatch-through: 9.5 where this problem was introduced
Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
2018-09-13 15:32:50 +05:30
Michael Paquier 9226a3b89b Remove duplicated words split across lines in comments
This has been detected using some interesting tricks with sed, and the
method used is mentioned in details in the discussion below.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
2018-09-08 12:24:19 -07:00
Michael Paquier 8582b4d044 Improve handling of corrupted two-phase state files at recovery
When a corrupted two-phase state file is found by WAL replay, be it for
crash recovery or archive recovery, then the file is simply skipped and
a WARNING is logged to the user, causing the transaction to be silently
lost.  Facing an on-disk WAL file which is corrupted is as likely to
happen as what is stored in WAL records, but WAL records are already
able to fail hard if there is a CRC mismatch.  On-disk two-phase state
files, on the contrary, are simply ignored if corrupted.  Note that when
restoring the initial two-phase data state at recovery, files newer than
the horizon XID are discarded hence no files present in pg_twophase/
should be torned and have been made durable by a previous checkpoint, so
recovery should never see any corrupted two-phase state file by design.

The situation got better since 978b2f6 which has added two-phase state
information directly in WAL instead of using on-disk files, so the risk
is limited to two-phase transactions which live across at least one
checkpoint for long periods.  Backups having legit two-phase state files
on-disk could also lose silently transactions when restored if things
get corrupted.

This behavior exists since two-phase commit has been introduced, no
back-patch is done for now per the lack of complaints about this
problem.

Author: Michael Paquier
Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
2018-09-07 11:00:16 -07:00
Peter Eisentraut 98afa68d93 Use C99 designated initializers for some structs
These are just a few particularly egregious cases that were hard to read
and write, and error prone because of many similar adjacent types.

Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
2018-09-07 11:40:03 +02:00
Tom Lane 44cac93464 Avoid using potentially-under-aligned page buffers.
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd.  However, that policy's
been ignored in an increasing number of places.  We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway.  But this is not
something to rely on.  Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.

To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.

I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster.  I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.

Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.

Patch by me; thanks to Michael Paquier for review.

Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
2018-09-01 15:27:17 -04:00
Michael Paquier c186ba135e Ensure correct minimum consistent point on standbys
Startup process has improved its calculation of incorrect minimum
consistent point in 8d68ee6, which ensures that all WAL available gets
replayed when doing crash recovery, and has introduced an incorrect
calculation of the minimum recovery point for non-startup processes,
which can cause incorrect page references on a standby when for example
the background writer flushed a couple of pages on-disk but was not
updating the control file to let a subsequent crash recovery replay to
where it should have.

The only case where this has been reported to be a problem is when a
standby needs to calculate the latest removed xid when replaying a btree
deletion record, so one would need connections on a standby that happen
just after recovery has thought it reached a consistent point.  Using a
background worker which is started after the consistent point is reached
would be the easiest way to get into problems if it connects to a
database.  Having clients which attempt to connect periodically could
also be a problem, but the odds of seeing this problem are much lower.

The fix used is pretty simple, as the idea is to give access to the
minimum recovery point written in the control file to non-startup
processes so as they use a reference, while the startup process still
initializes its own references of the minimum consistent point so as the
original problem with incorrect page references happening post-promotion
with a crash do not show up.

Reported-by: Alexander Kukushkin
Diagnosed-by: Alexander Kukushkin
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin
Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
Backpatch-through: 9.3
2018-08-31 11:03:40 -07:00
Thomas Munro ca1e64feba Improve comment in GetNewObjectId().
The previous comment gave the impression that skipping OIDs before
FirstNormalObjectId was merely an optimization to avoid likely collisions.
In fact other parts of the system have been relying on this threshold to
detect system-created objects since commit 8e18d04d4d, so adjust the
wording.

Author: Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D33JASACeOayr_W3%3DCSjy2jiPxM-k89axu0akFbHdjnjA%40mail.gmail.com
2018-08-16 17:17:30 +12:00
Michael Paquier 246a6c8f7b Make autovacuum more aggressive to remove orphaned temp tables
Commit dafa084, added in 10, made the removal of temporary orphaned
tables more aggressive.  This commit makes an extra step into the
aggressiveness by adding a flag in each backend's MyProc which tracks
down any temporary namespace currently in use.  The flag is set when the
namespace gets created and can be reset if the temporary namespace has
been created in a transaction or sub-transaction which is aborted.  The
flag value assignment is assumed to be atomic, so this can be done in a
lock-less fashion like other flags already present in PGPROC like
databaseId or backendId, still the fact that the temporary namespace and
table created are still locked until the transaction creating those
commits acts as a barrier for other backends.

This new flag gets used by autovacuum to discard more aggressively
orphaned tables by additionally checking for the database a backend is
connected to as well as its temporary namespace in-use, removing
orphaned temporary relations even if a backend reuses the same slot as
one which created temporary relations in a past session.

The base idea of this patch comes from Robert Haas, has been written in
its first version by Tsunakawa Takayuki, then heavily reviewed by me.

Author: Tsunakawa Takayuki
Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05
Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI
breakages on already released versions.
2018-08-13 11:49:04 +02:00
Peter Geoghegan 4974d7f87e Handle parallel index builds on mapped relations.
Commit 9da0cc3528, which introduced parallel CREATE INDEX, failed to
propagate relmapper.c backend local cache state to parallel worker
processes.  This could result in parallel index builds against mapped
catalog relations where the leader process (participating as a worker)
scans the new, pristine relfilenode, while worker processes scan the
obsolescent relfilenode.  When this happened, the final index structure
was typically not consistent with the owning table's structure.  The
final index structure could contain entries formed from both heap
relfilenodes.  Only rebuilds on mapped catalog relations that occur as
part of a VACUUM FULL or CLUSTER could become corrupt in practice, since
their mapped relation relfilenode swap is what allows the inconsistency
to arise.

On master, fix the problem by propagating the required relmapper.c
backend state as part of standard parallel initialization (Cf. commit
29d58fd3).  On v11, simply disallow builds against mapped catalog
relations by deeming them parallel unsafe.

Author: Peter Geoghegan
Reported-By: "death lock"
Reviewed-By: Tom Lane, Amit Kapila
Bug: #15309
Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org
Backpatch: 11-, where parallel CREATE INDEX was introduced.
2018-08-10 13:01:34 -07:00
Michael Paquier 5a23c74b63 Reset properly errno before calling write()
6cb3372 enforces errno to ENOSPC when less bytes than what is expected
have been written when it is unset, though it forgot to properly reset
errno before doing a system call to write(), causing errno to
potentially come from a previous system call.

Reported-by: Tom Lane
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
2018-08-05 05:31:18 +09:00
Michael Paquier d9fadbf131 Fix calculation for WAL segment recycling and removal
Commit 4b0d28de06 has removed the prior checkpoint and related
facilities but has left WAL recycling based on the LSN of the prior
checkpoint, which causes incorrect calculations for WAL removal and
recycling for max_wal_size and min_wal_size.  This commit changes things
so as the base calculation point is the last checkpoint generated.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20180723.135748.42558387.horiguchi.kyotaro@lab.ntt.co.jp
Backpatch: 11-, where the prior checkpoint has been removed.
2018-07-24 10:32:56 +09:00
Michael Paquier e41d0a1090 Add proper errcodes to new error messages for read() failures
Those would use the default ERRCODE_INTERNAL_ERROR, but for foreseeable
failures an errcode ought to be set, ERRCODE_DATA_CORRUPTED making the
most sense here.

While on the way, fix one errcode_for_file_access missing in origin.c
since the code has been created, and remove one assignment of errno to 0
before calling read(), as this was around to fit with what was present
before 811b6e36 where errno would not be set when not enough bytes are
read.  I have noticed the first one, and Tom has pinged me about the
second one.

Author: Michael Paquier
Reported-by: Tom Lane
Discussion: https://postgr.es/m/27265.1531925836@sss.pgh.pa.us
2018-07-23 09:37:36 +09:00
Michael Paquier 56df07bb9e Make more consistent some error messages for file-related operations
Some error messages which report something about a file operation use
as well context which is already provided within the path being worked
on, making things rather duplicated.  This creates more work for
translators, and does not actually bring clarity.

More could be done, however in a lot of cases the context used is
actually useful, still that patch gets down things with a good cut.

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Tom Lane
Discussion: https://postgr.es/m/20180718044711.GA8565@paquier.xyz
2018-07-23 09:19:12 +09:00
Tom Lane 3cb646264e Use a ResourceOwner to track buffer pins in all cases.
Historically, we've allowed auxiliary processes to take buffer pins without
tracking them in a ResourceOwner.  However, that creates problems for error
recovery.  In particular, we've seen multiple reports of assertion crashes
in the startup process when it gets an error while holding a buffer pin,
as for example if it gets ENOSPC during a write.  In a non-assert build,
the process would simply exit without releasing the pin at all.  We've
gotten away with that so far just because a failure exit of the startup
process translates to a database crash anyhow; but any similar behavior
in other aux processes could result in stuck pins and subsequent problems
in vacuum.

To improve this, institute a policy that we must *always* have a resowner
backing any attempt to pin a buffer, which we can enforce just by removing
the previous special-case code in resowner.c.  Add infrastructure to make
it easy to create a process-lifespan AuxProcessResourceOwner and clear
out its contents at appropriate times.  Replace existing ad-hoc resowner
management in bgwriter.c and other aux processes with that.  (Thus, while
the startup process gains a resowner where it had none at all before, some
other aux process types are replacing an ad-hoc resowner with this code.)
Also use the AuxProcessResourceOwner to manage buffer pins taken during
StartupXLOG and ShutdownXLOG, even when those are being run in a bootstrap
process or a standalone backend rather than a true auxiliary process.

In passing, remove some other ad-hoc resource owner creations that had
gotten cargo-culted into various other places.  As far as I can tell
that was all unnecessary, and if it had been necessary it was incomplete,
due to lacking any provision for clearing those resowners later.
(Also worth noting in this connection is that a process that hasn't called
InitBufferPoolBackend has no business accessing buffers; so there's more
to do than just add the resowner if we want to touch buffers in processes
not covered by this patch.)

Although this fixes a very old bug, no back-patch, because there's no
evidence of any significant problem in non-assert builds.

Patch by me, pursuant to a report from Justin Pryzby.  Thanks to
Robert Haas and Kyotaro Horiguchi for reviews.

Discussion: https://postgr.es/m/20180627233939.GA10276@telsasoft.com
2018-07-18 12:15:16 -04:00
Heikki Linnakangas 6b387179ba Fix misc typos, mostly in comments.
A collection of typos I happened to spot while reading code, as well as
grepping for common mistakes.

Backpatch to all supported versions, as applicable, to avoid conflicts
when backporting other commits in the future.
2018-07-18 16:17:32 +03:00
Michael Paquier 8bd064f0c7 Fix casting in error message for two-phase file
This error from from 811b6e3, which causes compilation warnings with OSX
10.3 and clang.

Reported by Tom Lane, via buildfarm member longfin.
2018-07-18 09:11:34 +09:00
Michael Paquier 811b6e36a9 Rework error messages around file handling
Some error messages related to file handling are using the code path
context to define their state.  For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being worked on.  So simplify all those error
messages by just referring to files with their path used.  In some
cases, like the manipulation of WAL segments, the context is actually
helpful so those are kept.

Calls to the system function read() have also been rather inconsistent
with their error handling sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set.  The in-core functions are using a more consistent pattern
with this patch, which checks for both errno if set or if an
inconsistent read is happening.

So as to care about pluralization when reading an unexpected number of
byte(s), "could not read: read %d of %zu" is used as error message, with
%d field being the output result of read() and %zu the expected size.
This simplifies the work of translators with less variations of the same
message.

Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
2018-07-18 08:01:23 +09:00
Robert Haas 32df1c9afa Add subtransaction handling for table synchronization workers.
Since the old logic was completely unaware of subtransactions, a
change made in a subsequently-aborted subtransaction would still cause
workers to be stopped at toplevel transaction commit.  Fix that by
managing a stack of worker lists rather than just one.

Amit Khandekar and Robert Haas

Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-16 17:33:22 -04:00
Michael Paquier 5fc1008e8a Clean up temporary WAL segments after an instance crash
Temporary WAL segments are created in pg_wal and named as xlogtemp.pid
before being renamed to the real deal when creating a new segment.  If
an instance crashes after the temporary segment is created and before
the rename is done, then the server would finish with unremovable data.

After an instance crash, scan pg_wal and remove any such segments.  With
repetitive unlucky crashes this would contribute to disk bloat and
presents risks of ENOSPC especially with max_wal_size close to the
maximum allowed.

Author: Michael Paquier
Reviewed-by: Yugo Nagata, Heikki Linnakangas
Discussion: https://postgr.es/m/20180514054955.GF1528@paquier.xyz
2018-07-13 06:43:20 +09:00
Peter Eisentraut bcbd940806 Remove dynamic_shared_memory_type=none
PostgreSQL nowadays offers some kind of dynamic shared memory feature on
all supported platforms.  Having the choice of "none" prevents us from
relying on DSM in core features.  So this patch removes the choice of
"none".

Author: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
2018-07-10 18:35:24 +02:00
Alvaro Herrera a22445ff0b Flip argument order in XLogSegNoOffsetToRecPtr
Commit fc49e24fa6 added an input argument after the existing output
argument.  Flip those.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20180708182345.imdgovmkffgtihhk@alvherre.pgsql
2018-07-09 14:33:38 -04:00
Michael Paquier cbc55da556 Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written.  Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf.  With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster.  It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions.  The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.

Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now.  This is left as a future item and this
commit fixes the original complain.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
2018-07-09 10:22:34 +09:00
Michael Paquier 3c64dcb1e3 Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed.  This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed.  When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed.  Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.

Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me.  The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore.  The test gets easily supported down to 10, still it
has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05 10:46:18 +09:00
Fujii Masao b41669118c Improve the performance of relation deletes during recovery.
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.

To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.

This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
2018-07-05 02:23:46 +09:00
Michael Paquier c55de5e512 Add wait event for fsync of WAL segments
This has been visibly a forgotten spot in the first implementation of
wait events for I/O added by 249cf07, and what has been missing is a
fsync call for WAL segments which is a wrapper reacting on the value of
GUC wal_sync_method.

Reported-by: Konstantin Knizhnik
Author: Konstantin Knizhnik
Reviewed-by: Craig Ringer, Michael Paquier
Discussion: https://postgr.es/m/4a243897-0ad8-f471-aa40-242591f2476e@postgrespro.ru
2018-07-02 22:19:46 +09:00
Andrew Dunstan 1e9c858090 pgindent run prior to branching 2018-06-30 12:25:49 -04:00
Michael Paquier 6cb3372411 Address set of issues with errno handling
System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set.  Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.

This patch uses the brute-force approach of correcting all those code
paths.  Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.

Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-06-25 11:19:05 +09:00
Michael Paquier 0aa5e65ab4 Fix typo in comment of commit_ts.c for incorrect reference to CLOG
Author: Shao Bret
2018-06-22 13:30:26 +09:00
Michael Paquier 70b4f82a4b Prevent hard failures of standbys caused by recycled WAL segments
When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having priorily zero'ed
the page currently written to, which can cause the WAL reader to read
junk data from a past recycled segment and then it would try to get a
record from it.  While sanity checks in place provide most of the
protection needed, in some rare circumstances, with chances increasing
when a record header crosses a page boundary, then the startup process
could fail violently on an allocation failure, as follows:
FATAL:  invalid memory alloc request size XXX

This is confusing for the user and also unhelpful as this requires in
the worst case a manual restart of the instance, impacting potentially
the availability of the cluster, and this also makes WAL data look like
it is in a corrupted state.

The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
in the middle.  A couple of approaches have been discussed, like
zero-ing  new WAL pages within the WAL receiver itself but this has the
disadvantage of impacting performance of any existing instances as this
breaks the sequential writes done by the WAL receiver.  This commit
deals with the problem with a more simple approach, which has no
performance impact without reducing the detection of the problem: if a
record is found with a length higher than 1GB for backends, then do not
try any allocation and report a soft failure which will force the
standby to retry reading WAL.  It could be possible that the allocation
call passes and that an unnecessary amount of memory is allocated,
however follow-up checks on records would just fail, making this
allocation short-lived anyway.

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.

Backpatch down to 9.5, which is where palloc_extended has been
introduced.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
2018-06-18 10:43:27 +09:00
Simon Riggs 15378c1a15 Remove AELs from subxids correctly on standby
Issues relate only to subtransactions that hold AccessExclusiveLocks
when replayed on standby.

Prior to PG10, aborting subtransactions that held an
AccessExclusiveLock failed to release the lock until top level commit or
abort. 49bff5300d fixed that.

However, 49bff5300d also introduced a similar bug where subtransaction
commit would fail to release an AccessExclusiveLock, leaving the lock to
be removed sometimes early and sometimes late. This commit fixes
that bug also. Backpatch to PG10 needed.

Tested by observation. Note need for multi-node isolationtester to improve
test coverage for this and other HS cases.

Reported-by: Simon Riggs
Author: Simon Riggs
2018-06-16 14:03:29 +01:00
Alvaro Herrera ff03112bdc Fix off-by-one bug in XactLogCommitRecord
Commit 1eb6d6527a introduced zeroed alignment bytes in the GID field
of commit/abort WAL records.  Fixup commit cf5a189059 later changed
that representation into a regular cstring with a single terminating
zero byte, but it also introduced an off-by-one mistake.  Fix that.

Author: Nikhil Sontakke
Reported-by: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMGcDxey6dG1DP34_tJMoWPcp5sPJUAL4K5CayUUXLQSx2GQpA@mail.gmail.com
2018-06-15 15:00:41 -04:00
Alvaro Herrera eee381ef5e Fix function code in error report
This bug causes a lseek() failure to be reported as a "could not open"
failure in the error message, muddling bug reports.  I introduced this
copy-and-pasteo in commit 78e1220104.

Noticed while reviewing code for bug report #15221, from lily liang.  In
version 10 the affected function is only used by multixact.c and
commit_ts, and only in corner-case circumstances, neither of which are
involved in the reported bug (a pg_subtrans failure.)

Author: Álvaro Herrera
2018-06-06 14:48:08 -04:00
Magnus Hagander cfb758b6d9 Fix error message on short read of pg_control
Instead of saying "error: success", indicate that we got a working read
but it was too short.
2018-05-18 17:54:18 +02:00
Peter Eisentraut 9effb63e0d Message wording and pluralization improvements 2018-05-17 23:05:27 -04:00
Heikki Linnakangas 0668719801 Fix scenario where streaming standby gets stuck at a continuation record.
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.

Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.

Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.

Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
2018-05-05 01:34:53 +03:00
Teodor Sigaev 8f9be261f4 Add HOLD_INTERRUPTS section into FinishPreparedTransaction.
If an interrupt arrives in the middle of FinishPreparedTransaction
and any callback decide to call CHECK_FOR_INTERRUPTS (e.g.
RemoveTwoPhaseFile can write a warning with ereport, which checks for
interrupts) then it's possible to leave current GXact undeleted.

Backpatch to all supported branches

Stas Kelvich

Discussion: ihttps://www.postgresql.org/message-id/3AD85097-A3F3-4EBA-99BD-C38EDF8D2949@postgrespro.ru
2018-05-03 20:08:29 +03:00