Commit Graph

3523 Commits

Author SHA1 Message Date
Magnus Hagander
312bac54cc Fix typo in comment
Author: Masahiko Sawada
2017-05-22 09:10:02 +02:00
Bruce Momjian
a6fd7b7a5f Post-PG 10 beta1 pgindent run
perltidy run not included.
2017-05-17 16:31:56 -04:00
Peter Eisentraut
d496a65790 Standardize "WAL location" terminology
Other previously used terms were "WAL position" or "log position".
2017-05-12 13:51:27 -04:00
Peter Eisentraut
c1a7f64b4a Replace "transaction log" with "write-ahead log"
This makes documentation and error messages match the renaming of "xlog"
to "wal" in APIs and file naming.
2017-05-12 11:52:43 -04:00
Tom Lane
d10c626de4 Rename WAL-related functions and views to use "lsn" not "location".
Per discussion, "location" is a rather vague term that could refer to
multiple concepts.  "LSN" is an unambiguous term for WAL locations and
should be preferred.  Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position".  Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.

David Rowley, minor additional docs hacking by me

Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
2017-05-11 11:49:59 -04:00
Alvaro Herrera
b66adb7b0c Revert "Permit dump/reload of not-too-large >1GB tuples"
This reverts commits fa2fa99552 and 42f50cb8fa.

While the functionality that was intended to be provided by these
commits is desired, the patch didn't actually solve as many of the
problematic situations as we hoped, and it created a bunch of its own
problems.  Since we're going to require more extensive changes soon for
other reasons and users have been working around these problems for a
long time already, there is no point in spending effort in fixing this
halfway measure.

Per complaint from Tom Lane.
Discussion: https://postgr.es/m/21407.1484606922@sss.pgh.pa.us

(Commit fa2fa99552 had already been reverted in branches 9.5 as
f858524ee4 and 9.6 as e9e44a0953, so this touches master only.
Commit 42f50cb8fa was not present in the older branches.)
2017-05-10 18:41:27 -03:00
Robert Haas
a5775991bb Remove no-longer-needed compatibility code for hash indexes.
Because commit ea69a0dead bumped the
HASH_VERSION, we don't need to worry about PostgreSQL 10 seeing
bucket pages from earlier versions.

Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LAo4DGwh+mi-G3U8Pj1WkBBeFL38xdCnUHJv1z4bZFkQ@mail.gmail.com
2017-05-09 23:44:21 -04:00
Peter Eisentraut
086221cf6b Prevent panic during shutdown checkpoint
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so.  At that point, only walsenders are still active, so one
might think this could not happen, but walsenders can also generate WAL,
for instance in BASE_BACKUP and certain variants of
CREATE_REPLICATION_SLOT.  So they can trigger this panic if such a
command is run while the shutdown checkpoint is being written.

To fix this, divide the walsender shutdown into two phases.  First, the
postmaster sends a SIGUSR2 signal to all walsenders.  The walsenders
then put themselves into the "stopping" state.  In this state, they
reject any new commands.  (For simplicity, we reject all new commands,
so that in the future we do not have to track meticulously which
commands might generate WAL.)  The checkpointer waits for all walsenders
to reach this state before proceeding with the shutdown checkpoint.
After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders.  This triggers the
existing shutdown behavior of sending out the shutdown checkpoint record
and then terminating.

Author: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-05 10:31:42 -04:00
Tom Lane
3f074845a8 Fix pfree-of-already-freed-tuple when rescanning a GiST index-only scan.
GiST's getNextNearest() function attempts to pfree the previously-returned
tuple if any (that is, scan->xs_hitup in HEAD, or scan->xs_itup in older
branches).  However, if we are rescanning a plan node after ending a
previous scan early, those tuple pointers could be pointing to garbage,
because they would be pointing into the scan's pageDataCxt or queueCxt
which has been reset.  In a debug build this reliably results in a crash,
although I think it might sometimes accidentally fail to fail in
production builds.

To fix, clear the pointer field anyplace we reset a context it might
be pointing into.  This may be overkill --- I think probably only the
queueCxt case is involved in this bug, so that resetting in gistrescan()
would be sufficient --- but dangling pointers are generally bad news,
so let's avoid them.

Another plausible answer might be to just not bother with the pfree in
getNextNearest().  The reconstructed tuples would go away anyway in the
context resets, and I'm far from convinced that freeing them a bit earlier
really saves anything meaningful.  I'll stick with the original logic in
this patch, but if we find more problems in the same area we should
consider that approach.

Per bug #14641 from Denis Smirnov.  Back-patch to 9.5 where this
logic was introduced.

Discussion: https://postgr.es/m/20170504072034.24366.57688@wrigleys.postgresql.org
2017-05-04 13:59:39 -04:00
Peter Eisentraut
9414e41ea7 Fix logical replication launcher wake up and reset
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well.  So fix that by resetting the flag.

Also, we don't need to wake up anything if the transaction was rolled
back.  Just reset the flag in that case.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-01 10:18:09 -04:00
Simon Riggs
49e9281549 Rework handling of subtransactions in 2PC recovery
The bug fixed by 0874d4f3e1
caused us to question and rework the handling of
subtransactions in 2PC during and at end of recovery.
Patch adds checks and tests to ensure no further bugs.

This effectively removes the temporary measure put in place
by 546c13e11b.

Author: Simon Riggs
Reviewed-by: Tom Lane, Michael Paquier
Discussion: http://postgr.es/m/CANP8+j+vvXmruL_i2buvdhMeVv5TQu0Hm2+C5N+kdVwHJuor8w@mail.gmail.com
2017-04-27 14:41:22 +02:00
Simon Riggs
546c13e11b Workaround for RecoverPreparedTransactions()
Force overwriteOK = true while we investigate deeper fix

Proposed by Tom Lane as temporary measure, accepted by me
2017-04-23 22:12:01 +01:00
Tom Lane
0874d4f3e1 Fix order of arguments to SubTransSetParent().
ProcessTwoPhaseBuffer (formerly StandbyRecoverPreparedTransactions)
mixed up the parent and child XIDs when calling SubTransSetParent to
record the transactions' relationship in pg_subtrans.

Remarkably, analysis by Simon Riggs suggests that this doesn't lead to
visible problems (at least, not in non-Assert builds).  That might
explain why we'd not noticed it before.  Nonetheless, it's surely wrong.

This code was born broken, so back-patch to all supported branches.

Discussion: https://postgr.es/m/20110.1492905318@sss.pgh.pa.us
2017-04-23 13:11:06 -04:00
Simon Riggs
ee01f7092f Exit correctly from PrepareRedoRemove() when not found
Complex crash bug all started with this failure.
Diagnosed and fixed by Nikhil Sontakke, reviewed by me.

Reported-by: Jeff Janes
Author: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMkU=1xBP8cqdS5eK8APHL=X6RHMMM2vG5g+QamduuTsyCwv9g@mail.gmail.com
2017-04-18 11:35:38 +01:00
Simon Riggs
aa203e7600 Don’t push nextid too far forwards in recovery
Doing so allows various crash possibilities. Fix by avoiding
having PrescanPreparedTransactions() increment
ShmemVariableCache->nextXid when it has no 2PC files

Bug found by Jeff Janes, diagnosis and patch by Pavan Deolasee,
then patch re-designed for clarity and full accuracy by
Michael Paquier.

Reported-by: Jeff Janes
Author: Pavan Deolasee, Michael Paquier
Discussion: https://postgr.es/m/CAMkU=1zMLnH_i1-PVQ-biZzvNx7VcuatriquEnh7HNk6K8Ss3Q@mail.gmail.com
2017-04-18 11:14:05 +01:00
Peter Eisentraut
6275f5d28a Fix new warnings from GCC 7
This addresses the new warning types -Wformat-truncation
-Wformat-overflow that are part of -Wall, via -Wformat, in GCC 7.
2017-04-17 13:59:46 -04:00
Tom Lane
b6dd127128 Ensure BackgroundWorker struct contents are well-defined.
Coverity complained because bgw.bgw_extra wasn't being filled in by
ApplyLauncherRegister().  The most future-proof fix is to memset the
whole BackgroundWorker struct to zeroes.  While at it, let's apply the
same coding rule to other places that set up BackgroundWorker structs;
four out of five had the same or related issues.
2017-04-16 23:23:44 -04:00
Peter Eisentraut
c7d225e227 Fix typo in comment
Author: Masahiko Sawada <sawada.mshk@gmail.com>
2017-04-16 19:47:37 -04:00
Tom Lane
083dc95a14 More cleanup of manipulations of hash indexes' hasho_flag field.
Not much point in defining test macros for the flag bits if we
don't use 'em.

Amit Kapila
2017-04-15 14:11:15 -04:00
Tom Lane
32470825d3 Avoid passing function pointers across process boundaries.
We'd already recognized that we can't pass function pointers across process
boundaries for functions in loadable modules, since a shared library could
get loaded at different addresses in different processes.  But actually the
practice doesn't work for functions in the core backend either, if we're
using EXEC_BACKEND.  This is the cause of recent failures on buildfarm
member culicidae.  Switch to passing a string function name in all cases.

Something like this needs to be back-patched into 9.6, but let's see
if the buildfarm likes it first.

Petr Jelinek, with a bunch of basically-cosmetic adjustments by me

Discussion: https://postgr.es/m/548f9c1d-eafa-e3fa-9da8-f0cc2f654e60@2ndquadrant.com
2017-04-14 23:50:16 -04:00
Tom Lane
2040bb4a0b Clean up manipulations of hash indexes' hasho_flag field.
Standardize on testing a hash index page's type by doing
	(opaque->hasho_flag & LH_PAGE_TYPE) == LH_xxx_PAGE
Various places were taking shortcuts like
	opaque->hasho_flag & LH_BUCKET_PAGE
which while not actually wrong, is still bad practice because
it encourages use of
	opaque->hasho_flag & LH_UNUSED_PAGE
which *is* wrong (LH_UNUSED_PAGE == 0, so the above is constant false).
hash_xlog.c's hash_mask() contained such an incorrect test.

This also ensures that we mask out the additional flag bits that
hasho_flag has accreted since 9.6.  pgstattuple's pgstat_hash_page(),
for one, was failing to do that and was thus actively broken.

Also fix assorted comments that hadn't been updated to reflect the
extended usage of hasho_flag, and fix some macros that were testing
just "(hasho_flag & bit)" to use the less dangerous, project-approved
form "((hasho_flag & bit) != 0)".

Coverity found the bug in hash_mask(); I noted the one in
pgstat_hash_page() through code reading.
2017-04-14 17:04:25 -04:00
Alvaro Herrera
8bf74967da Reduce the number of pallocs() in BRIN
Instead of allocating memory in brin_deform_tuple and brin_copy_tuple
over and over during a scan, allow reuse of previously allocated memory.
This is said to make for a measurable performance improvement.

Author: Jinyu Zhang, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/495deb78.4186.1500dacaa63.Coremail.beijing_pg@163.com
2017-04-07 19:08:43 -03:00
Alvaro Herrera
817cb10013 Fix new BRIN desummarize WAL record
The WAL-writing piece was forgetting to set the pages-per-range value.
Also, fix the declared type of struct member heapBlk, which I mistakenly
set as OffsetNumber rather than BlockNumber.

Problem was introduced by commit c655899ba9 (April 1st).  Any system
that tries to replay the new WAL record written before this fix is
likely to die on replay and require pg_resetwal.

Reported by Tom Lane.
Discussion: https://postgr.es/m/20191.1491524824@sss.pgh.pa.us
2017-04-07 17:11:56 -03:00
Tom Lane
3f902354b0 Clean up after insufficiently-researched optimization of tuple conversions.
tupconvert.c's functions formerly considered that an explicit tuple
conversion was necessary if the input and output tupdescs contained
different type OIDs.  The point of that was to make sure that a composite
datum resulting from the conversion would contain the destination rowtype
OID in its composite-datum header.  However, commit 3838074f8 entirely
misunderstood what that check was for, thinking that it had something to do
with presence or absence of an OID column within the tuple.  Removal of the
check broke the no-op conversion path in ExecEvalConvertRowtype, as
reported by Ashutosh Bapat.

It turns out that of the dozen or so call sites for tupconvert.c functions,
ExecEvalConvertRowtype is the only one that cares about the composite-datum
header fields in the output tuple.  In all the rest, we'd much rather avoid
an unnecessary conversion whenever the tuples are physically compatible.
Moreover, the comments in tupconvert.c only promise physical compatibility
not a metadata match.  So, let's accept the removal of the guarantee about
the output tuple's rowtype marking, recognizing that this is a API change
that could conceivably break third-party callers of tupconvert.c.  (So,
let's remember to mention it in the v10 release notes.)

However, commit 3838074f8 did have a bit of a point here, in that two
tuples mustn't be considered physically compatible if one has HEAP_HASOID
set and the other doesn't.  (Some of the callers of tupconvert.c might not
really care about that, but we can't assume it in general.)  The previous
check accidentally covered that issue, because no RECORD types ever have
OIDs, while if two tupdescs have the same named composite type OID then,
a fortiori, they have the same tdhasoid setting.  If we're removing the
type OID match check then we'd better include tdhasoid match as part of
the physical compatibility check.

Without that hack in tupconvert.c, we need ExecEvalConvertRowtype to take
responsibility for inserting the correct rowtype OID label whenever
tupconvert.c decides it need not do anything.  This is easily done with
heap_copy_tuple_as_datum, which will be considerably faster than a tuple
disassembly and reassembly anyway; so from a performance standpoint this
change is a win all around compared to what happened in earlier branches.
It just means a couple more lines of code in ExecEvalConvertRowtype.

Ashutosh Bapat and Tom Lane

Discussion: https://postgr.es/m/CAFjFpRfvHABV6+oVvGcshF8rHn+1LfRUhj7Jz1CDZ4gPUwehBg@mail.gmail.com
2017-04-06 21:10:20 -04:00
Alvaro Herrera
7e534adcdc Fix BRIN cost estimation
The original code was overly optimistic about the cost of scanning a
BRIN index, leading to BRIN indexes being selected when they'd be a
worse choice than some other index.  This complete rewrite should be
more accurate.

Author: David Rowley, based on an earlier patch by Emre Hasegeli
Reviewed-by: Emre Hasegeli
Discussion: https://postgr.es/m/CAKJS1f9n-Wapop5Xz1dtGdpdqmzeGqQK4sV2MK-zZugfC14Xtw@mail.gmail.com
2017-04-06 17:51:53 -03:00
Peter Eisentraut
e6c9a5a9bc Fix mixup of bool and ternary value
Not currently a problem, but could be with stricter bool behavior under
stdbool or C++.

Reviewed-by: Andres Freund <andres@anarazel.de>
2017-04-06 13:09:42 -04:00
Simon Riggs
cd0cebaf7d Always SnapshotResetXmin() during ClearTransaction()
Avoid corner cases during 2PC with 6bad580d9e
2017-04-06 10:30:22 -04:00
Peter Eisentraut
3217327053 Identity columns
This is the SQL standard-conforming variant of PostgreSQL's serial
columns.  It fixes a few usability issues that serial columns have:

- CREATE TABLE / LIKE copies default but refers to same sequence
- cannot add/drop serialness with ALTER TABLE
- dropping default does not drop sequence
- need to grant separate privileges to sequence
- other slight weirdnesses because serial is some kind of special macro

Reviewed-by: Vitaly Burovoy <vitaly.burovoy@gmail.com>
2017-04-06 08:41:37 -04:00
Simon Riggs
6bad580d9e Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin,
so we can simply avoid running SnapshotResetXmin() twice.

During performance tests by Alexander Korotkov, diagnosis
by Andres Freund showed PgXact array as a bottleneck. After
manual analysis by me of the code paths that touch those
memory locations, I was able to identify extraneous code
in the main transaction commit path.

Avoiding touching highly contented shmem improves concurrent
performance slightly on all workloads, confirmed by tests
run by Ashutosh Sharma and Alexander Korotkov.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-04-06 08:31:52 -04:00
Robert Haas
633e15ea0f Fix pageinspect failures on hash indexes.
Make every page in a hash index which isn't all-zeroes have a valid
special space, so that tools like pageinspect don't error out.

Also, make pageinspect cope with all-zeroes pages, because
_hash_alloc_buckets can leave behind large numbers of those until
they're consumed by splits.

Ashutosh Sharma and Robert Haas, reviewed by Amit Kapila.
Original trouble report from Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1y6NjKmqbJ8wLMhr=F74WzcMALYWcVFhEpm7i=mV=XsOg@mail.gmail.com
2017-04-05 14:18:15 -04:00
Robert Haas
75a1cbdc3c hash: Fix write-ahead logging bug.
The size of the data is not the same thing as the size of the size of
the data.

Reported off-list by Tushar Ahuja.  Fix by Ashutosh Sharma, reviewed
by Amit Kapila.

Discussion: http://postgr.es/m/CAE9k0PnmPDXfvf8HDObme7q_Ewc4E26ukHXUBPySoOs0ObqqaQ@mail.gmail.com
2017-04-05 11:45:35 -04:00
Peter Eisentraut
afd79873a0 Capitalize names of PLs consistently
Author: Daniel Gustafsson <daniel@yesql.se>
2017-04-05 00:38:25 -04:00
Simon Riggs
9a3215026b Make min_wal_size/max_wal_size use MB internally
Previously they were defined using multiples of XLogSegSize.
Remove GUC_UNIT_XSEGS. Introduce GUC_UNIT_MB

Extracted from patch series on XLogSegSize infrastructure.

Beena Emerson
2017-04-04 18:00:01 -04:00
Simon Riggs
cd740c0dbf Fix uninitialized variables in twophase.c 2017-04-04 17:50:02 -04:00
Simon Riggs
728bd991c3 Speedup 2PC recovery by skipping two phase state files in normal path
2PC state info held in shmem at PREPARE, then cleaned at COMMIT PREPARED/ABORT PREPARED,
avoiding writing/fsyncing any state information to disk in the normal path, greatly enhancing replay speed.
Prepared transactions that live past one checkpoint redo horizon will be written to disk as now.
Similar conceptually to 978b2f65aa and building upon
the infrastructure created by that commit.

Authors, in equal measure: Stas Kelvich, Nikhil Sontakke and Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=VjoMnRnW5s+fKST3w@mail.gmail.com
2017-04-04 15:56:56 -04:00
Robert Haas
b38006ef6d Fix formula in _hash_spareindex.
This was correct in earlier versions of the patch that lead to
commit ea69a0dead, but somehow got
broken in the last version which I actually committed.

Mithun Cy, per an off-list report from Ashutosh Sharma

Discussion: http://postgr.es/m/CAD__OujbAwNU71v1y-RoQxZ8LZ6-V2UFTkex3v34MK6uZ3Xb5w@mail.gmail.com
2017-04-04 07:45:04 -04:00
Robert Haas
ea69a0dead Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion).  To mitigate this problem to some
degree, divide larger splitpoints into four equal phases.  This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.

This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3.  This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway.  First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10.  Second,
it will let us remove some backward-compatibility code added by commit
293e24e507.

Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me.  Regression
test outputs updated by me.

Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-03 23:46:33 -04:00
Robert Haas
93cd7684ee Properly acquire buffer lock for page-at-a-time hash vacuum.
In a couple of places, _hash_kill_items was mistakenly called with
the buffer lock not held.  Repair.

Ashutosh Sharma, per a report from Andreas Seltenreich

Discussion: http://postgr.es/m/87o9wo8o0j.fsf@credativ.de
2017-04-03 22:26:06 -04:00
Alvaro Herrera
c655899ba9 BRIN de-summarization
When the BRIN summary tuple for a page range becomes too "wide" for the
values actually stored in the table (because the tuples that were
present originally are no longer present due to updates or deletes), it
can be useful to remove the outdated summary tuple, so that a future
summarization can install a tighter summary.

This commit introduces a SQL-callable interface to do so.

Author: Álvaro Herrera
Reviewed-by: Eiji Seki
Discussion: https://postgr.es/m/20170228045643.n2ri74ara4fhhfxf@alvherre.pgsql
2017-04-01 16:10:04 -03:00
Alvaro Herrera
7526e10224 BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur.  To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range.  Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.

This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations.  The only actual effect is to
change the timing, i.e. that it occurs earlier.  For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.

Most of the new code here is in autovacuum, which can now be told about
"work items" to process.  This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.

The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values().  It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.

Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 14:00:53 -03:00
Robert Haas
2113ac4cbb Don't use bgw_main even to specify in-core bgworker entrypoints.
On EXEC_BACKEND builds, this can fail if ASLR is in use.

Backpatch to 9.5.  On master, completely remove the bgw_main field
completely, since there is no situation in which it is safe for an
EXEC_BACKEND build.  On 9.6 and 9.5, leave the field intact to avoid
breaking things for third-party code that doesn't care about working
under EXEC_BACKEND.  Prior to 9.5, there are no in-core bgworker
entrypoints.

Petr Jelinek, reviewed by me.

Discussion: http://postgr.es/m/09d8ad33-4287-a09b-a77f-77f8761adb5e@2ndquadrant.com
2017-03-31 20:43:32 -04:00
Robert Haas
c94e6942ce Don't allocate storage for partitioned tables.
Also, don't allow setting reloptions on them, since that would have no
effect given the lack of storage.  The patch does this by introducing
a new reloption kind for which there are currently no reloptions -- we
might have some in the future -- so it adjusts parseRelOptions to
handle that case correctly.

Bumped catversion.  System catalogs that contained reloptions for
partitioned tables are no longer valid; plus, there are now fewer
physical files on disk, which is not technically a catalog change but
still a good reason to re-initdb.

Amit Langote, reviewed by Maksim Milyutin and Kyotaro Horiguchi and
revised a bit by me.

Discussion: http://postgr.es/m/20170331.173326.212311140.horiguchi.kyotaro@lab.ntt.co.jp
2017-03-31 16:28:51 -04:00
Alvaro Herrera
2fd8685e7f Simplify check of modified attributes in heap_update
The old coding was getting more complicated as new things were added,
and it would be barely tolerable with upcoming WARM updates and other
future features such as indirect indexes.  The new coding incurs a small
performance cost in synthetic benchmark cases, and is barely measurable
in normal cases.  A much larger benefit is expected from WARM, which
could actually bolt its needs on top of the existing coding, but it is
much uglier and bug-prone than doing it on this new code.  Additional
optimization can be applied on top of this, if need be.

Reviewed-by: Pavan Deolasee, Amit Kapila, Mithun CY
Discussion: https://postgr.es/m/20161228232018.4hc66ndrzpz4g4wn@alvherre.pgsql
	https://postgr.es/m/CABOikdMJfz69dBNRTOZcB6s5A0tf8OMCyQVYQyR-WFFdoEwKMQ@mail.gmail.com
2017-03-29 14:01:14 -03:00
Alvaro Herrera
ce96ce60ca Remove direct uses of ItemPointer.{ip_blkid,ip_posid}
There are no functional changes here; this simply encapsulates knowledge
of the ItemPointerData struct so that a future patch can change things
without more breakage.

All direct users of ip_blkid and ip_posid are changed to use existing
macros ItemPointerGetBlockNumber and ItemPointerGetOffsetNumber
respectively.  For callers where that's inappropriate (because they
Assert that the itempointer is is valid-looking), add
ItemPointerGetBlockNumberNoCheck and ItemPointerGetOffsetNumberNoCheck,
which lack the assertion but are otherwise identical.

Author: Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdNnFon4cJiL=h1mZH3bgUeU+sWHuU4Yr8AB=j3A2p1GiA@mail.gmail.com
2017-03-28 19:02:23 -03:00
Simon Riggs
a99f77021f Correct grammar in error message
"could not generate" rather than "could not generation"
from commit 818fd4a67d
2017-03-28 13:24:39 -04:00
Tom Lane
8cfeaecfc7 Suppress implicit-conversion warnings seen with newer clang versions.
We were assigning values near 255 through "char *" pointers.  On machines
where char is signed, that's not entirely kosher, and it's reasonable
for compilers to warn about it.

A better solution would be to change the pointer type to "unsigned char *",
but that would be vastly more invasive.  For the moment, let's just apply
this simple backpatchable solution.

Aleksander Alekseev

Discussion: https://postgr.es/m/20170220141239.GD12278@e733.localdomain
Discussion: https://postgr.es/m/2839.1490714708@sss.pgh.pa.us
2017-03-28 13:16:19 -04:00
Robert Haas
c4c51541e2 Still more code review for single-page hash vacuuming.
Most seriously, fix use of incorrect block ID, per a report from
Jeff Janes that it causes a crash and a diagnosis from Amit Kapila.

Improve consistency between the hash and btree versions of this
code by adding back a PANIC that btree has, and by registering
data in the xlog record in the same way, per complaints from
Jeff Janes and Amit Kapila.

Tidy up some minor cosmetic points, per complaints from Amit
Kapila.

Patch by Ashutosh Sharma, reviewed by Amit Kapila, and tested by
Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1w-9Qe=Ff1o6bSaXpNO9wqpo7_9GL8_CVhw4BoVVHasqg@mail.gmail.com
2017-03-27 12:51:10 -04:00
Teodor Sigaev
1b02be21f2 Fsync directory after creating or unlinking file.
If file was created/deleted just before powerloss it's possible that
file system will miss that. To prevent it, call fsync() where creating/
unlinkg file is critical.

Author: Michael Paquier
Reviewed-by: Ashutosh Bapat, Takayuki Tsunakawa, me
2017-03-27 19:33:01 +03:00
Robert Haas
f0a6046bcb Fix comment.
Cut-and-paste led to something silly.

Ashutosh Sharma, reviewed by Amit Kapila and by me

Discussion: http://postgr.es/m/CAE9k0PmUbvQSBY7kwN_OkuqBYyHRXBX-c1ZkuAgR5vgF0GeWzQ@mail.gmail.com
2017-03-26 22:15:50 -04:00
Tom Lane
2f0903ea19 Improve performance of ExecEvalWholeRowVar.
In commit b8d7f053c, we needed to fix ExecEvalWholeRowVar to not change
the state of the slot it's copying.  The initial quick hack at that
required two rounds of tuple construction, which is not very nice.
To fix, add another primitive to tuptoaster.c that does precisely what
we need.  (I initially tried to do this by refactoring one of the
existing functions into two pieces; but it looked like that might hurt
performance for the existing case, and the amount of code that could
be shared is not very large, so I gave up on that.)

Discussion: https://postgr.es/m/26088.1490315792@sss.pgh.pa.us
2017-03-26 19:14:57 -04:00
Simon Riggs
3428ef7911 Reverting 42b4b0b241
Buildfarm issues and other reported issues
2017-03-24 17:56:17 +00:00
Simon Riggs
42b4b0b241 Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin
Avoiding touching highly contented shmem improves concurrent
performance.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-03-24 14:20:59 +00:00
Teodor Sigaev
78874531ba Fix backup canceling
Assert-enabled build crashes but without asserts it works by wrong way:
it may not reset forcing full page write and preventing from starting
exclusive backup with the same name as cancelled.
Patch replaces pair of booleans
nonexclusive_backup_running/exclusive_backup_running to single enum to
correctly describe backup state.

Backpatch to 9.6 where bug was introduced

Reported-by: David Steele
Authors: Michael Paquier, David Steele
Reviewed-by: Anastasia Lubennikova

https://commitfest.postgresql.org/13/1068/
2017-03-24 13:53:40 +03:00
Robert Haas
ea42cc18c3 Track the oldest XID that can be safely looked up in CLOG.
This provides infrastructure for looking up arbitrary, user-supplied
XIDs without a risk of scary-looking failures from within the clog
module.  Normally, the oldest XID that can be safely looked up in CLOG
is the same as the oldest XID that can reused without causing
wraparound, and the latter is already tracked.  However, while
truncation is in progress, the values are different, so we must
keep track of them separately.

Craig Ringer, reviewed by Simon Riggs and by me.

Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
2017-03-23 14:26:31 -04:00
Teodor Sigaev
218f51584d Reduce page locking in GIN vacuum
GIN vacuum during cleaning posting tree can lock this whole tree for a long
time with by holding LockBufferForCleanup() on root. Patch changes it with
two ways: first, cleanup lock will be taken only if there is an empty page
(which should be deleted) and, second, it tries to lock only subtree, not the
whole posting tree.

Author: Andrey Borodin with minor editorization by me
Reviewed-by: Jeff Davis, me

https://commitfest.postgresql.org/13/896/
2017-03-23 19:38:47 +03:00
Simon Riggs
6912acc04f Replication lag tracking for walsenders
Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

Implements a lag tracker module that reports the lag times based upon
measurements of the time taken for recent WAL to be written, flushed and
replayed and for the sender to hear about it. These times
represent the commit lag that was (or would have been) introduced by each
synchronous commit level, if the remote server was configured as a
synchronous standby.  For an asynchronous standby, the replay_lag column
approximates the delay before recent transactions became visible to queries.
If the standby server has entirely caught up with the sending server and
there is no more WAL activity, the most recently measured lag times will
continue to be displayed for a short time and then show NULL.

Physical replication lag tracking is automatic. Logical replication tracking
is possible but is the responsibility of the logical decoding plugin.
Tracking is a private module operating within each walsender individually,
with values reported to shared memory. Module not used outside of walsender.

Design and code is good enough now to commit - kudos to the author.
In many ways a difficult topic, with important and subtle behaviour so this
shoudl be expected to generate discussion and multiple open items: Test now!

Author: Thomas Munro, following designs by Fujii Masao and Simon Riggs
Review: Simon Riggs, Ian Barwick and Craig Ringer
2017-03-23 14:05:28 +00:00
Stephen Frost
017e4f2588 Expose waitforarchive option through pg_stop_backup()
Internally, we have supported the option to either wait for all of the
WAL associated with a backup to be archived, or to return immediately.
This option is useful to users of pg_stop_backup() as well, when they
are reading the stop backup record position and checking that the WAL
they need has been archived independently.

This patch adds an additional, optional, argument to pg_stop_backup()
which allows the user to indicate if they wish to wait for the WAL to be
archived or not.  The default matches current behavior, which is to
wait.

Author: David Steele, with some minor changes, doc updates by me.
Reviewed by: Takayuki Tsunakawa, Fujii Masao
Discussion: https://postgr.es/m/758e3fd1-45b4-5e28-75cd-e9e7f93a4c02@pgmasters.net
2017-03-22 23:44:58 -04:00
Simon Riggs
af4b1a0869 Refactor GetOldestXmin() to use flags
Replace ignoreVacuum parameter with more flexible flags.

Author: Eiji Seki
Review: Haribabu Kommi
2017-03-22 16:51:01 +00:00
Simon Riggs
9b013dc238 Improve performance of replay of AccessExclusiveLocks
A hot standby replica keeps a list of Access Exclusive locks for a top
level transaction. These locks are released when the top level transaction
ends. Searching of this list is O(N^2), and each transaction had to pay the
price of searching this list for locks, even if it didn't take any AE
locks itself.

This patch optimizes this case by having the master server track which
transactions took AE locks, and passes that along to the standby server in
the commit/abort record. This allows the standby to only try to release
locks for transactions which actually took any, avoiding the majority of
the performance issue.

Refactor MyXactAccessedTempRel into MyXactFlags to allow minimal additional
cruft with this.

Analysis and initial patch by David Rowley
Author: David Rowley and Simon Riggs
2017-03-22 13:09:36 +00:00
Simon Riggs
1148e22a82 Teach xlogreader to follow timeline switches
Uses page-based mechanism to ensure we’re using the correct timeline.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with later features.

Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-22 07:05:12 +00:00
Robert Haas
9abbf4727d Another fix for single-page hash index vacuum.
The WAL consistency checking code needed to be updated for the new
page status bit, but that didn't get done previously.

Ashutosh Sharma, reviewed by Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LP_oz4EfMen14OjJuzN5CqPdfRkFFuA-MfkcfeE8zGyg@mail.gmail.com
2017-03-20 15:55:27 -04:00
Robert Haas
953477ca35 Fixes for single-page hash index vacuum.
Clear LH_PAGE_HAS_DEAD_TUPLES during replay, similar to what gets done
for btree.  Update hashdesc.c for xl_hash_vacuum_one_page.

Oversights in commit 6977b8b7f4 spotted
by Amit Kapila.  Patch by Ashutosh Sharma.

Bump WAL version.  The original patch to make hash indexes write-ahead
logged probably should have done this, and the single page vacuuming
patch probably should have done it again, but better late than never.

Discussion: http://postgr.es/m/CAA4eK1Kd=mJ9xreovcsh0qMiAj-QqCphHVQ_Lfau1DR9oVjASQ@mail.gmail.com
2017-03-20 15:49:09 -04:00
Robert Haas
249cf070e3 Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add and
6f3bd98ebf, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O.  Add wait events for those operations, too.

Rushabh Lathia, with further hacking by me.  Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.

Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 07:43:01 -04:00
Robert Haas
88e66d193f Rename "pg_clog" directory to "pg_xact".
Names containing the letters "log" sometimes confuse users into
believing that only non-critical data is present.  It is hoped
this renaming will discourage ill-considered removals of transaction
status data.

Michael Paquier

Discussion: http://postgr.es/m/CA+Tgmoa9xFQyjRZupbdEFuwUerFTvC6HjZq1ud6GYragGDFFgA@mail.gmail.com
2017-03-17 09:48:38 -04:00
Robert Haas
6977b8b7f4 Port single-page btree vacuum logic to hash indexes.
This is advantageous for hash indexes for the same reasons it's good
for btrees: it accelerates space recycling, reducing bloat.

Ashutosh Sharma, reviewed by Amit Kapila and by me.  A bit of
additional hacking by me.

Discussion: http://postgr.es/m/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw@mail.gmail.com
2017-03-15 22:18:56 -04:00
Robert Haas
f7b711c8bc Cosmetic fixes for hash index write-ahead logging.
Amit Kapila.  One of these was reported by Tom Lane.

Discussion: http://postgr.es/m/5515.1489514099@sss.pgh.pa.us
2017-03-15 07:22:49 -04:00
Robert Haas
bb4a39637a hash: Support WAL consistency checking.
Kuntal Ghosh, reviewed by Amit Kapila and Ashutosh Sharma, with
a few tweaks by me.

Discussion: http://postgr.es/m/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
2017-03-14 14:58:56 -04:00
Robert Haas
c11453ce0a hash: Add write-ahead logging support.
The warning about hash indexes not being write-ahead logged and their
use being discouraged has been removed.  "snapshot too old" is now
supported for tables with hash indexes.  Most importantly, barring
bugs, hash indexes will now be crash-safe and usable on standbys.

This commit doesn't yet add WAL consistency checking for hash
indexes, as we now have for other index types; a separate patch has
been submitted to cure that lack.

Amit Kapila, reviewed and slightly modified by me.  The larger patch
series of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen.

Discussion: http://postgr.es/m/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
2017-03-14 13:27:02 -04:00
Peter Eisentraut
a47b38c9ee Spelling fixes
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Peter Eisentraut
f97a028d8e Spelling fixes in code comments
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Tom Lane
5ed6fff6b7 Make logging about multixact wraparound protection less chatty.
The original messaging design, introduced in commit 068cfadf9, seems too
chatty now that some time has elapsed since the bug fix; most installations
will be in good shape and don't really need a reminder about this on every
postmaster start.

Hence, arrange to suppress the "wraparound protections are now enabled"
message during startup (specifically, during the TrimMultiXact() call).
The message will still appear if protection becomes effective at some
later point.

Discussion: https://postgr.es/m/17211.1489189214@sss.pgh.pa.us
2017-03-14 12:47:53 -04:00
Peter Eisentraut
1e6de941e3 Change xlog to WAL in some error messages 2017-03-13 15:42:10 -04:00
Noah Misch
3a0d473192 Use wrappers of PG_DETOAST_DATUM_PACKED() more.
This makes almost all core code follow the policy introduced in the
previous commit.  Specific decisions:

- Text search support functions with char* and length arguments, such as
  prsstart and lexize, may receive unaligned strings.  I doubt
  maintainers of non-core text search code will notice.

- Use plain VARDATA() on values detoasted or synthesized earlier in the
  same function.  Use VARDATA_ANY() on varlenas sourced outside the
  function, even if they happen to always have four-byte headers.  As an
  exception, retain the universal practice of using VARDATA() on return
  values of SendFunctionCall().

- Retain PG_GETARG_BYTEA_P() in pageinspect.  (Page images are too large
  for a one-byte header, so this misses no optimization.)  Sites that do
  not call get_page_from_raw() typically need the four-byte alignment.

- For now, do not change btree_gist.  Its use of four-byte headers in
  memory is partly entangled with storage of 4-byte headers inside
  GBT_VARKEY, on disk.

- For now, do not change gtrgm_consistent() or gtrgm_distance().  They
  incorporate the varlena header into a cache, and there are multiple
  credible implementation strategies to consider.
2017-03-12 19:35:34 -04:00
Noah Misch
2fd26b23b6 Assume deconstruct_array() outputs are untoasted.
In functions that issue a deconstruct_array() call, consistently use
plain VARSIZE()/VARDATA() on the array elements.  Prior practice was
divided between those and VARSIZE_ANY_EXHDR()/VARDATA_ANY().
2017-03-12 19:35:31 -04:00
Robert Haas
390811750d Revert "Use group updates when setting transaction status in clog."
This reverts commit ccce90b398.  This
optimization is unsafe, at least, of rollbacks and rollbacks to
savepoints, but I'm concerned there may be other problematic cases as
well.  Therefore, I've decided to revert this pending further
investigation.
2017-03-10 14:49:56 -05:00
Robert Haas
ccce90b398 Use group updates when setting transaction status in clog.
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around.  Use a similar mechanism to reduce
contention on CLogControlLock.  Testing shows that this very
significantly reduces the amount of time waiting for CLogControlLock
on high-concurrency pgbench tests run on a large multi-socket
machines; whether that translates into a TPS improvement depends on
how much of that contention is simply shifted to some other lock,
particularly WALWriteLock.

Amit Kapila, with some cosmetic changes by me.  Extensively reviewed,
tested, and benchmarked over a period of about 15 months by Simon
Riggs, Robert Haas, Andres Freund, Jesper Pedersen, and especially by
Tomas Vondra and Dilip Kumar.

Discussion: http://postgr.es/m/CAA4eK1L_snxM_JcrzEstNq9P66++F4kKFce=1r5+D1vzPofdtg@mail.gmail.com
Discussion: http://postgr.es/m/CAA4eK1LyR2A+m=RBSZ6rcPEwJ=rVi1ADPSndXHZdjn56yqO6Vg@mail.gmail.com
Discussion: http://postgr.es/m/91d57161-d3ea-0cc2-6066-80713e4f90d7@2ndquadrant.com
2017-03-09 17:49:01 -05:00
Tom Lane
86dbbf20d8 Put back <float.h> in a few files that need it for _isnan().
Further fallout from commit c29aff959: there are some files that need
<float.h>, and were getting it from datatype/timestamp.h, but it was not
apparent in my (tgl's) testing because the requirement for <float.h>
exists only on certain Windows toolchains.

Report and patch by David Rowley.

Discussion: https://postgr.es/m/CAKJS1f-BHceaFzZScFapDV48gUVM2CAOBfhkgffdqXzFb+kwew@mail.gmail.com
2017-03-08 15:38:34 -05:00
Robert Haas
f35742ccb7 Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.

Dilip Kumar, with some corrections and cosmetic changes by me.  The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 12:05:43 -05:00
Robert Haas
98e6e89040 tidbitmap: Support shared iteration.
When a shared iterator is used, each call to tbm_shared_iterate()
returns a result that has not yet been returned to any process
attached to the shared iterator.  In other words, each cooperating
processes gets a disjoint subset of the full result set, but all
results are returned exactly once.

This is infrastructure for parallel bitmap heap scan.

Dilip Kumar.  The larger patch set of which this is a part has been
reviewed and tested by (at least) Andres Freund, Amit Khandekar,
Tushar Ahuja, Rafia Sabih, Haribabu Kommi, and Thomas Munro.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 08:09:38 -05:00
Robert Haas
38305398cd hash: Refactor hash index creation.
The primary goal here is to move all of the related page modifications
to a single section of code, in preparation for adding write-ahead
logging.  In passing, rename _hash_metapinit to _hash_init, since it
initializes more than just the metapage.

Amit Kapila.  The larger patch series of which this is a part has been
reviewed and tested by Álvaro Herrera, Ashutosh Sharma, Mark Kirkwood,
Jeff Janes, and Jesper Pedersen.
2017-03-07 17:03:51 -05:00
Heikki Linnakangas
818fd4a67d Support SCRAM-SHA-256 authentication (RFC 5802 and 7677).
This introduces a new generic SASL authentication method, similar to the
GSS and SSPI methods. The server first tells the client which SASL
authentication mechanism to use, and then the mechanism-specific SASL
messages are exchanged in AuthenticationSASLcontinue and PasswordMessage
messages. Only SCRAM-SHA-256 is supported at the moment, but this allows
adding more SASL mechanisms in the future, without changing the overall
protocol.

Support for channel binding, aka SCRAM-SHA-256-PLUS is left for later.

The SASLPrep algorithm, for pre-processing the password, is not yet
implemented. That could cause trouble, if you use a password with
non-ASCII characters, and a client library that does implement SASLprep.
That will hopefully be added later.

Authorization identities, as specified in the SCRAM-SHA-256 specification,
are ignored. SET SESSION AUTHORIZATION provides more or less the same
functionality, anyway.

If a user doesn't exist, perform a "mock" authentication, by constructing
an authentic-looking challenge on the fly. The challenge is derived from
a new system-wide random value, "mock authentication nonce", which is
created at initdb, and stored in the control file. We go through these
motions, in order to not give away the information on whether the user
exists, to unauthenticated users.

Bumps PG_CONTROL_VERSION, because of the new field in control file.

Patch by Michael Paquier and Heikki Linnakangas, reviewed at different
stages by Robert Haas, Stephen Frost, David Steele, Aleksander Alekseev,
and many others.

Discussion: https://www.postgresql.org/message-id/CAB7nPqRbR3GmFYdedCAhzukfKrgBLTLtMvENOmPrVWREsZkF8g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAB7nPqSMXU35g%3DW9X74HVeQp0uvgJxvYOuA4A-A3M%2B0wfEBv-w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/55192AFE.6080106@iki.fi
2017-03-07 14:25:40 +02:00
Simon Riggs
21d4e2e206 Reduce lock levels for table storage params related to planning
The following parameters are now updateable with ShareUpdateExclusiveLock
effective_io_concurrency
parallel_workers
seq_page_cost
random_page_cost
n_distinct
n_distinct_inherited

Simon Riggs and Fabrízio Mello
2017-03-06 16:04:31 +05:30
Robert Haas
5a73e17317 Improve error reporting for tuple-routing failures.
Currently, the whole row is shown without column names.  Instead,
adopt a style similar to _bt_check_unique() in ExecFindPartition()
and show the failing key: (key1, ...) = (val1, ...).

Amit Langote, per a complaint from Simon Riggs.  Reviewed by me;
I also adjusted the grammar in one of the comments.

Discussion: http://postgr.es/m/9f9dc7ae-14f0-4a25-5485-964d9bfc19bd@lab.ntt.co.jp
2017-03-03 09:09:52 +05:30
Robert Haas
21a3cf4128 hash: Refactor and clean up bucket split code.
As with commit 30df93f698 and commit
b0f18cb77f, the goal here is to move all
of the related page modifications to a single section of code, in
preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series of
which this is a part has been reviewed and tested by Álvaro Herrera,
Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper Pedersen.
2017-03-01 14:43:38 +05:30
Magnus Hagander
016c990834 Fix incorrect variable datatype
Both datatypes map to the same underlying one which is why it still
worked, but we should use the correct type.

Author: Kyotaro HORIGUCHI
2017-02-28 12:20:35 +01:00
Tom Lane
9b88f27cb4 Allow index AMs to return either HeapTuple or IndexTuple format during IOS.
Previously, only IndexTuple format was supported for the output data of
an index-only scan.  This is fine for btree, which is just returning a
verbatim index tuple anyway.  It's not so fine for SP-GiST, which can
return reconstructed data that's much larger than a page.

To fix, extend the index AM API so that index-only scan data can be
returned in either HeapTuple or IndexTuple format.  There's other ways
we could have done it, but this way avoids an API break for index AMs
that aren't concerned with the issue, and it costs little except a couple
more fields in IndexScanDescs.

I changed both GiST and SP-GiST to use the HeapTuple method.  I'm not
very clear on whether GiST can reconstruct data that's too large for an
IndexTuple, but that seems possible, and it's not much of a code change to
fix.

Per a complaint from Vik Fearing.  Reviewed by Jason Li.

Discussion: https://postgr.es/m/49527f79-530d-0bfe-3dad-d183596afa92@2ndquadrant.fr
2017-02-27 17:20:34 -05:00
Robert Haas
30df93f698 hash: Refactor overflow page allocation.
As with commit b0f18cb77f, the goal
here is to move all of the related page modifications to a single
section of code, in preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series
of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen, all of whom should also have been credited in the
previous commit message.
2017-02-27 22:59:55 +05:30
Robert Haas
b0f18cb77f hash: Refactor bucket squeeze code.
In preparation for adding write-ahead logging to hash indexes,
refactor _hash_freeovflpage and _hash_squeezebucket so that all
related page modifications happen in a single section of code.  The
previous coding assumed that it would be fine to move tuples one at a
time, and also that the various operations involved in freeing an
overflow page didn't necessarily all need to be done together, all
of which is true if you don't care about write-ahead logging.

Amit Kapila, with slight changes by me.
2017-02-27 22:34:21 +05:30
Tom Lane
9e3755ecb2 Remove useless duplicate inclusions of system header files.
c.h #includes a number of core libc header files, such as <stdio.h>.
There's no point in re-including these after having read postgres.h,
postgres_fe.h, or c.h; so remove code that did so.

While at it, also fix some places that were ignoring our standard pattern
of "include postgres[_fe].h, then system header files, then other Postgres
header files".  While there's not any great magic in doing it that way
rather than system headers last, it's silly to have just a few files
deviating from the general pattern.  (But I didn't attempt to enforce this
globally, only in files I was touching anyway.)

I'd be the first to say that this is mostly compulsive neatnik-ism,
but over time it might save enough compile cycles to be useful.
2017-02-25 16:12:55 -05:00
Tom Lane
c29aff959d Consistently declare timestamp variables as TimestampTz.
Twiddle the replication-related code so that its timestamp variables
are declared TimestampTz, rather than the uninformative "int64" that
was previously used for meant-to-be-always-integer timestamps.
This resolves the int64-vs-TimestampTz declaration inconsistencies
introduced by commit 7c030783a, though in the opposite direction to
what was originally suggested.

This required including datatype/timestamp.h in a couple more places
than before.  I decided it would be a good idea to slim down that
header by not having it pull in <float.h> etc, as those headers are
no longer at all relevant to its purpose.  Unsurprisingly, a small number
of .c files turn out to have been depending on those inclusions, so add
them back in the .c files as needed.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
Discussion: https://postgr.es/m/27694.1487456324@sss.pgh.pa.us
2017-02-23 15:57:08 -05:00
Tom Lane
d28aafb6dd Remove pg_control's enableIntTimes field.
We don't need it any more.

pg_controldata continues to report that date/time type storage is
"64-bit integers", but that's now a hard-wired behavior not something
it sees in the data.  This avoids breaking pg_upgrade, and perhaps other
utilities that inspect pg_control this way.  Ditto for pg_resetwal.

I chose to remove the "bigint_timestamps" output column of
pg_control_init(), though, as that function hasn't been around long
and probably doesn't have ossified users.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
2017-02-23 12:23:12 -05:00
Robert Haas
5262f7a4fc Add optimizer and executor support for parallel index scans.
In combination with 569174f1be, which
taught the btree AM how to perform parallel index scans, this allows
parallel index scan plans on btree indexes.  This infrastructure
should be general enough to support parallel index scans for other
index AMs as well, if someone updates them to support parallel
scans.

Amit Kapila, reviewed and tested by Anastasia Lubennikova, Tushar
Ahuja, and Haribabu Kommi, and me.
2017-02-15 13:53:24 -05:00
Robert Haas
569174f1be btree: Support parallel index scans.
This isn't exposed to the optimizer or the executor yet; we'll add
support for those things in a separate patch.  But this puts the
basic mechanism in place: several processes can attach to a parallel
btree index scan, and each one will get a subset of the tuples that
would have been produced by a non-parallel scan.  Each index page
becomes the responsibility of a single worker, which then returns
all of the TIDs on that page.

Rahila Syed, Amit Kapila, Robert Haas, reviewed and tested by
Anastasia Lubennikova, Tushar Ahuja, and Haribabu Kommi.
2017-02-15 07:41:14 -05:00
Robert Haas
8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Robert Haas
fb47544d0c Minor fixes for WAL consistency checking.
Michael Paquier, reviewed and slightly revised by me.

Discussion: http://postgr.es/m/CAB7nPqRzCQb=vdfHvMtP0HMLBHU6z1aGdo4GJsUP-HP8jx+Pkw@mail.gmail.com
2017-02-14 12:41:01 -05:00
Robert Haas
3f01fd4ca0 Rename dtrace probes for ongoing xlog -> wal conversion.
xlog-switch becomes wal-switch, and xlog-insert becomes wal-insert.
2017-02-09 16:40:19 -05:00
Robert Haas
85c11324ca Rename user-facing tools with "xlog" in the name to say "wal".
This means pg_receivexlog because pg_receivewal, pg_resetxlog
becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
2017-02-09 16:23:46 -05:00
Robert Haas
806091c96f Remove all references to "xlog" from SQL-callable functions in pg_proc.
Commit f82ec32ac3 renamed the pg_xlog
directory to pg_wal.  To make things consistent, and because "xlog" is
terrible terminology for either "transaction log" or "write-ahead log"
rename all SQL-callable functions that contain "xlog" in the name to
instead contain "wal".  (Note that this may pose an upgrade hazard for
some users.)

Similarly, rename the xlog_position argument of the functions that
create slots to be called wal_position.

Discussion: https://www.postgresql.org/message-id/CA+Tgmob=YmA=H3DbW1YuOXnFVgBheRmyDkWcD9M8f=5bGWYEoQ@mail.gmail.com
2017-02-09 15:10:09 -05:00
Robert Haas
fc8219dc54 pageinspect: Fix hash_bitmap_info not to read the underlying page.
It did that to verify that the page was an overflow page rather than
anything else, but that means that checking the status of all the
overflow bits requires reading the entire index.  So don't do that.
The new code validates that the page is not a primary bucket page
or bitmap page by looking at the metapage, so that using this on
large numbers of pages can be reasonably efficient.

Ashutosh Sharma, per a complaint from me, and with further
modifications by me.
2017-02-09 14:34:34 -05:00
Tom Lane
86d911ec0f Allow index AMs to cache data across aminsert calls within a SQL command.
It's always been possible for index AMs to cache data across successive
amgettuple calls within a single SQL command: the IndexScanDesc.opaque
field is meant for precisely that.  However, no comparable facility
exists for amortizing setup work across successive aminsert calls.
This patch adds such a feature and teaches GIN, GIST, and BRIN to use it
to amortize catalog lookups they'd previously been doing on every call.
(The other standard index AMs keep everything they need in the relcache,
so there's little to improve there.)

For GIN, the overall improvement in a statement that inserts many rows
can be as much as 10%, though it seems a bit less for the other two.
In addition, this makes a really significant difference in runtime
for CLOBBER_CACHE_ALWAYS tests, since in those builds the repeated
catalog lookups are vastly more expensive.

The reason this has been hard up to now is that the aminsert function is
not passed any useful place to cache per-statement data.  What I chose to
do is to add suitable fields to struct IndexInfo and pass that to aminsert.
That's not widening the index AM API very much because IndexInfo is already
within the ken of ambuild; in fact, by passing the same info to aminsert
as to ambuild, this is really removing an inconsistency in the AM API.

Discussion: https://postgr.es/m/27568.1486508680@sss.pgh.pa.us
2017-02-09 11:52:12 -05:00
Robert Haas
a507b86900 Add WAL consistency checking facility.
When the new GUC wal_consistency_checking is set to a non-empty value,
it triggers recording of additional full-page images, which are
compared on the standby against the results of applying the WAL record
(without regard to those full-page images).  Allowable differences
such as hints are masked out, and the resulting pages are compared;
any difference results in a FATAL error on the standby.

Kuntal Ghosh, based on earlier patches by Michael Paquier and Heikki
Linnakangas.  Extensively reviewed and revised by Michael Paquier and
by me, with additional reviews and comments from Amit Kapila, Álvaro
Herrera, Simon Riggs, and Peter Eisentraut.
2017-02-08 15:45:30 -05:00
Robert Haas
94708c0e8c Fix compiler warning.
Mithun Cy, per a report by Erik Rijkers
2017-02-07 15:09:14 -05:00
Robert Haas
293e24e507 Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and
contention when scanning hash indexes, because it's no longer
necessary to lock and pin the metapage for every scan.  We do need
some way of figuring out when the cache is too stale to use any more,
so that when we lock the primary bucket page to which the cached
metapage points us, we can tell whether a split has occurred since we
cached the metapage data.  To do that, we use the hash_prevblkno field
in the primary bucket page, which would otherwise always be set to
InvalidBuffer.

This patch contains code so that it will continue working (although
less efficiently) with hash indexes built before this change, but
perhaps we should consider bumping the hash version and ripping out
the compatibility code.  That decision can be made later, though.

Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me.
Before committing, I made a number of cosmetic changes to the last
posted version of the patch, adjusted _hash_getcachedmetap to be more
careful about order of operation, and made some necessary updates to
the pageinspect documentation and regression tests.
2017-02-07 12:35:45 -05:00
Heikki Linnakangas
181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Robert Haas
38c363adf4 Improve grammar of message about two-phase state files.
When there's only one two-phase state file, there's also only one
long-running prepared transaction.  Adjust the message text
accordingly.

Nikhil Sontakke

Discussion: http://postgr.es/m/CAMGcDxcmR_DWZXXndGoPzVQx=B17A5=RviEA1qNaF=FWLy5Whw@mail.gmail.com
2017-02-03 17:16:54 -05:00
Robert Haas
08bf6e5295 pageinspect: Support hash indexes.
Patch by Jesper Pedersen and Ashutosh Sharma, with some error handling
improvements by me.  Tests from Peter Eisentraut.  Reviewed by Álvaro
Herrera, Michael Paquier, Jesper Pedersen, Jeff Janes, Peter
Eisentraut, Amit Kapila, Mithun Cy, and me.

Discussion: http://postgr.es/m/e2ac6c58-b93f-9dd9-f4e6-d6d30add7fdf@redhat.com
2017-02-02 14:19:32 -05:00
Andrew Dunstan
f1169ab501 Don't count background workers against a user's connection limit.
Doing so doesn't seem to be within the purpose of the per user
connection limits, and has particularly unfortunate effects in
conjunction with parallel queries.

Backpatch to 9.6 where parallel queries were introduced.

David Rowley, reviewed by Robert Haas and Albe Laurenz.
2017-02-01 18:02:43 -05:00
Robert Haas
bbd8550bce Refactor other replication commands to use DestRemoteSimple.
Commit a84069d935 added a new type of
DestReceiver to avoid duplicating the existing code for the SHOW
command, but it turns out we can leverage that new DestReceiver
type in a few more places, saving some code.

Michael Paquier, reviewed by Andres Freund and by me.

Discussion: http://postgr.es/m/CAB7nPqSdFOQC0evc0r1nJeQyGBqjBrR41MC4rcMqUUpoJaZbtQ%40mail.gmail.com
Discussion: http://postgr.es/m/CAB7nPqT2K4XFT1JgqufFBjsOc-NUKXg5qBDucHPMbk6Xi1kYaA@mail.gmail.com
2017-02-01 13:42:41 -05:00
Robert Haas
8a815e3fc3 Move comment about test slightly closer to test.
The addition of a TestForOldSnapshot() call here has made the
referent of this comment slightly less clear, so move the comment
to compensate.

Amit Kapila (as part of the parallel index scan patch)
2017-01-31 17:21:02 -05:00
Robert Haas
3838074f86 Be more aggressive in avoiding tuple conversion.
According to the comments in tupconvert.c, it's necessary to perform
tuple conversion when either table has OIDs, and this was previously
checked by ensuring that the tdtypeid value matched between the tables
in question.  However, that's overly stringent: we have access to
tdhasoid and can test directly whether OIDs are present, which lets us
avoid conversion in cases where the type OIDs are different but the
tuple descriptors are entirely the same (and neither has OIDs).  This
is useful to the partitioning code, which can thereby avoid converting
tuples when inserting into a partition whose columns appear in the
same order as the parent columns, the normal case.  It's possible
for the tuple routing code to avoid some additional overhead in this
case as well, so do that, too.

It's not clear whether it would be OK to skip this when both tables
have OIDs: do callers count on this to build a new tuple (losing the
previous OID) in such instances?  Until we figure it out, leave the
behavior in that case alone.

Amit Langote, reviewed by me.
2017-01-24 21:53:38 -05:00
Robert Haas
d1ecd53947 Add a SHOW command to the replication command language.
This is useful infrastructure for an upcoming proposed patch to
allow the WAL segment size to be changed at initdb time; tools like
pg_basebackup need the ability to interrogate the server setting.
But it also doesn't seem like a bad thing to have independently of
that; it may find other uses in the future.

Robert Haas and Beena Emerson.  (The original patch here was by
Beena, but I rewrote it to such a degree that most of the code
being committed here is mine.)

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 17:04:12 -05:00
Robert Haas
a84069d935 Add a new DestReceiver for printing tuples without catalog access.
If you create a DestReciver of type DestRemote and try to use it from
a replication connection that is not bound to a specific daabase, or
any other hypothetical type of backend that is not bound to a specific
database, it will fail because it doesn't have a pg_proc catalog to
look up properties of the types being printed.  In general, that's
an unavoidable problem, but we can hardwire the properties of a few
builtin types in order to support utility commands.  This new
DestReceiver of type DestRemoteSimple does just that.

Patch by me, reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 16:53:56 -05:00
Robert Haas
7b4ac19982 Extend index AM API for parallel index scans.
This patch doesn't actually make any index AM parallel-aware, but it
provides the necessary functions at the AM layer to do so.

Rahila Syed, Amit Kapila, Robert Haas
2017-01-24 16:42:58 -05:00
Robert Haas
b1ecb9b3fc Fix interaction of partitioned tables with BulkInsertState.
When copying into a partitioned table, the target heap may change from
one tuple to next.  We must ask ReadBufferBI() to get a new buffer
every time such change occurs.  To do that, use new function
ReleaseBulkInsertStatePin().  This fixes the bug that tuples ended up
being inserted into the wrong partition, which occurred exactly
because the wrong buffer was used.

Amit Langote, per a suggestion from Robert Haas.  Some cosmetic
adjustments by me.

Reports by 高增琦 (Gao Zengqi), Venkata B Nagothi, and
Ragnar Ouchterlony.

Discussion: http://postgr.es/m/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com
Discussion: http://postgr.es/m/CAEyp7J9WiX0L3DoiNcRrY-9iyw%3DqP%2Bj%3DDLsAnNFF1xT2J1ggfQ%40mail.gmail.com
Discussion: http://postgr.es/m/16d73804-c9cd-14c5-463e-5caad563ff77%40agama.tv
Discussion: http://postgr.es/m/CA+TgmoaiZpDVUUN8LZ4jv1qFE_QyR+H9ec+79f5vNczYarg5Zg@mail.gmail.com
2017-01-24 08:50:16 -05:00
Peter Eisentraut
f21a563d25 Move some things from builtins.h to new header files
This avoids that builtins.h has to include additional header files.
2017-01-20 20:29:53 -05:00
Peter Eisentraut
665d1fad99 Logical replication
- Add PUBLICATION catalogs and DDL
- Add SUBSCRIPTION catalog and DDL
- Define logical replication protocol and output plugin
- Add logical replication workers

From: Petr Jelinek <petr@2ndquadrant.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Erik Rijkers <er@xs4all.nl>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
2017-01-20 09:04:49 -05:00
Alvaro Herrera
8eace46d34 Fix race condition in reading commit timestamps
If a user requests the commit timestamp for a transaction old enough
that its data is concurrently being truncated away by vacuum at just the
right time, they would receive an ugly internal file-not-found error
message from slru.c rather than the expected NULL return value.

In a primary server, the window for the race is very small: the lookup
has to occur exactly between the two calls by vacuum, and there's not a
lot that happens between them (mostly just a multixact truncate).  In a
standby server, however, the window is larger because the truncation is
executed as soon as the WAL record for it is replayed, but the advance
of the oldest-Xid is not executed until the next checkpoint record.

To fix in the primary, simply reverse the order of operations in
vac_truncate_clog.  To fix in the standby, augment the WAL truncation
record so that the standby is aware of the new oldest-XID value and can
apply the update immediately.  WAL version bumped because of this.

No backpatch, because of the low importance of the bug and its rarity.

Author: Craig Ringer
Reviewed-By: Petr Jelínek, Peter Eisentraut
Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
2017-01-19 18:24:17 -03:00
Robert Haas
e37360d5df Improve comment in hashsearch.c.
Typo fix from Mithun Cy; other improvements by me.
2017-01-18 16:36:48 -05:00
Peter Eisentraut
352a24a1f9 Generate fmgr prototypes automatically
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h.  This avoids
having to manually maintain these prototypes across a random variety of
header files.  It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 14:06:07 -05:00
Fujii Masao
974ece58bb Fix an assertion failure related to an exclusive backup.
Previously multiple sessions could execute pg_start_backup() and
pg_stop_backup() to start and stop an exclusive backup at the same time.
This could trigger the assertion failure of
"FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)".
This happend because, even while pg_start_backup() was starting
an exclusive backup, other session could run pg_stop_backup()
concurrently and mark the backup as not-in-progress unconditionally.

This patch introduces ExclusiveBackupState indicating the state of
an exclusive backup. This state is used to ensure that there is only
one session running pg_start_backup() or pg_stop_backup() at
the same time, to avoid the assertion failure.

Back-patch to all supported versions.

Author: Michael Paquier
Reviewed-By: Kyotaro Horiguchi and me
Reported-By: Andreas Seltenreich
Discussion: <87mvktojme.fsf@credativ.de>
2017-01-17 17:27:32 +09:00
Robert Haas
e898437460 Improve coding in _hash_addovflpage.
Instead of relying on the page contents to know whether we have
advanced from the primary bucket page to an overflow page, track
that explicitly.

Amit Kapila, per a complaint by me.
2017-01-10 08:31:03 -05:00
Alvaro Herrera
3957b58b88 Fix ALTER TABLE / SET TYPE for irregular inheritance
If inherited tables don't have exactly the same schema, the USING clause
in an ALTER TABLE / SET DATA TYPE misbehaves when applied to the
children tables since commit 9550e8348b.  Starting with that commit,
the attribute numbers in the USING expression are fixed during parse
analysis.  This can lead to bogus errors being reported during
execution, such as:
   ERROR:  attribute 2 has wrong type
   DETAIL:  Table has type smallint, but query expects integer.

Since it wouldn't do to revert to the original coding, we now apply a
transformation to map the attribute numbers to the correct ones for each
child.

Reported by Justin Pryzby
Analysis by Tom Lane; patch by me.
Discussion: https://postgr.es/m/20170102225618.GA10071@telsasoft.com
2017-01-09 19:26:58 -03:00
Alvaro Herrera
7403561c0f BRIN revmap pages are not standard pages ...
... and therefore we ought not to tell XLogRegisterBuffer the opposite,
when writing XLog for a brin update that moves the index tuple to a
different page.  Otherwise, xlog insertion would try to "compress the
hole" when producing a full-page image for it; but since we don't update
pd_lower/upper, the hole covers the whole page.  On WAL replay, the
revmap page becomes empty and so the entire portion of the index is
useless and needs to be recomputed.

This is low-probability: a BRIN update only moves an index tuple to a
different page when the summary tuple is larger than the existing one,
which doesn't happen with fixed-width datatypes.  Also, the revmap
page must be first after a checkpoint.

Report and patch: Kuntal Ghosh
Bug is alleged to have detected by a WAL-consistency-checking tool.
Discussion: https://postgr.es/m/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com

I posted a test case demonstrating the problem, but I'm refraining from
adding it to the test suite; if the WAL consistency tool makes it in,
that will be a better way to catch this from regressing.  (We should
definitely have someting that causes not-same-page updates, though.)
2017-01-09 18:19:29 -03:00
Bruce Momjian
1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane
54386f3578 Remove triggerable Assert in hashname().
hashname() asserted that the key string it is given is shorter than
NAMEDATALEN.  That should surely always be true if the input is in fact a
regular value of type "name".  However, for reasons of coding convenience,
we allow plain old C strings to be treated as "name" values in many places.
Some SQL functions accept arbitrary "text" inputs, convert them to C
strings, and pass them otherwise-untransformed to syscache lookups for name
columns, allowing an overlength input value to trigger hashname's Assert.

This would be a DOS problem, except that it only happens in assert-enabled
builds which aren't recommended for production.  In a production build,
you'll just get a name lookup error, since regardless of the hash value
computed by hashname, the later equality comparison checks can't match.
Likewise, if the catalog lookup is done by seqscan or indexscan searches,
there will just be a lookup error, since the name comparison functions
don't contain any similar length checks, and will see an overlength input
as unequal to any stored entry.

After discussion we concluded that we should simply remove this Assert.
It's inessential to hashname's own functionality, and having such an
assertion in only some paths for name lookup is more of a foot-gun than
a useful check.  There may or may not be a case for the affected callers
to do something other than let the name lookup fail, but we'll consider
that separately; in any case we probably don't want to change such
behavior in the back branches.

Per report from Tushar Ahuja.  Back-patch to all supported branches.

Report: https://postgr.es/m/7d0809ee-6f25-c9d6-8e74-5b2967830d49@enterprisedb.com
Discussion: https://postgr.es/m/17691.1482523168@sss.pgh.pa.us
2016-12-26 14:58:02 -05:00
Robert Haas
7819ba1ef6 Remove _hash_chgbufaccess().
This is basically for the same reasons I got rid of _hash_wrtbuf()
in commit 25216c9893: it's not
convenient to have a function which encapsulates MarkBufferDirty(),
especially as we move towards having hash indexes be WAL-logged.

Patch by me, reviewed (but not entirely endorsed) by Amit Kapila.
2016-12-23 07:14:37 -05:00
Andres Freund
6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Robert Haas
097e41439d Fix broken error check in _hash_doinsert.
You can't just cast a HashMetaPage to a Page, because the meta page
data is stored after the page header, not at offset 0.  Fortunately,
this didn't break anything because it happens to find hashm_bsize
at the offset at which it expects to find pd_pagesize_version, and
the values are close enough to the same that this works out.

Still, it's a bug, so back-patch to all supported versions.

Mithun Cy, revised a bit by me.
2016-12-22 13:59:01 -05:00
Robert Haas
dd728826c5 Fix locking problem in _hash_squeezebucket() / _hash_freeovflpage().
A bucket squeeze operation needs to lock each page of the bucket
before releasing the prior page, but the previous coding fumbled the
locking when freeing an overflow page during a bucket squeeze
operation.  Commit 6d46f4783e
introduced this bug.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.  Reviewed by me.  I also
fixed a problem with a comment.
2016-12-19 12:31:50 -05:00
Robert Haas
3761fe3c20 Simplify LWLock tranche machinery by removing array_base/array_stride.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche.  So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached.  For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset.  This is arguably more useful,
and certainly a lot cheaper.  Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.

The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".

The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query.  There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously.  So if we didn't get rid of this mechanism we'd
have to make it even more complicated.  By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.

Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
2016-12-16 11:29:23 -05:00
Robert Haas
6a4fe1127c Fix more hash index bugs around marking buffers dirty.
In _hash_freeovflpage(), if we're freeing the overflow page that
immediate follows the page to which tuples are being moved (the
confusingly-named "write buffer"), don't forget to mark that
page dirty after updating its hasho_nextblkno.

In _hash_squeezebucket(), it's not necessary to mark the primary
bucket page dirty if there are no overflow pages, because there's
nothing to squeeze in that case.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.
2016-12-16 09:55:20 -05:00
Robert Haas
25216c9893 Remove _hash_wrtbuf() in favor of calling MarkBufferDirty().
The whole concept of _hash_wrtbuf() is that we need to know at the
time we're releasing the buffer lock (and pin) whether we dirtied the
buffer, but this is easy to get wrong.  This patch actually fixes one
non-obvious bug of that form: hashbucketcleanup forgot to signal
_hash_squeezebucket, which gets the primary bucket page already
locked, as to whether it had already dirtied the page.  Calling
MarkBufferDirty() at the places where we dirty the buffer is more
intuitive and lets us simplify the code in various places as well.

On top of all that, the ultimate goal here is to make hash indexes
WAL-logged, and as the comments to _hash_wrtbuf() note, it should
go away when that happens.  Making it go away a little earlier than
that seems like a good preparatory step.

Report by Jeff Janes.  Diagnosis by Amit Kapila, Kuntal Ghosh,
and Dilip Kumar.  Patch by me, after studying an alternative patch
submitted by Amit Kapila.

Discussion: http://postgr.es/m/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
2016-12-16 09:37:28 -05:00
Robert Haas
501c7b94bc Fix bug in hashbulkdelete.
Commit 6d46f4783e failed to account for
the possibility that hashbulkdelete() might encounter a bucket that
has been split since it began scanning the bucket array.  Repair.

Extracted from a larger pathc by Amit Kapila; I rewrote the comment.
2016-12-13 12:16:02 -05:00
Robert Haas
3856cf9607 Remove should_free arguments to tuplesort routines.
Since commit e94568ecc1, the answer is
always "false", and we do not need to complicate the API by arranging
to return a constant value.

Peter Geoghegan

Discussion: http://postgr.es/m/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
2016-12-12 15:57:35 -05:00
Robert Haas
fa0f466d53 Log the creation of an init fork unconditionally.
Previously, it was thought that this only needed to be done for the
benefit of possible standbys, so wal_level = minimal skipped it.
But that's not safe, because during crash recovery we might replay
XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record which recursively
removes the directory that contains the new init fork.  So log it
always.

The user-visible effect of this bug is that if you create a database
or tablespace, then create an unlogged table, then crash without
checkpointing, then restart, accessing the table will fail, because
the it won't have been properly reset.  This commit fixes that.

Michael Paquier, per a report from Konstantin Knizhnik.  Wording of
the comments per a suggestion from me.
2016-12-08 14:12:08 -05:00
Robert Haas
f0e44751d7 Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an
expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  Minor revisions by me.
2016-12-07 13:17:55 -05:00
Robert Haas
2f4193c350 Fix race introduced by 6d46f4783e.
It's possible for the metapage contents to change after we release
the lock, so we must read them before releasing the lock.

Amit Kapila.  Submitted in response to a trouble report from
Andreas Seltenreich, though it is not certain this fixes the
problem.
2016-12-05 11:43:37 -05:00
Fujii Masao
5dc851afde Fix incorrect output from gin_desc().
Previously gin_desc() displayed incorrect output "unknown action 0"
for XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE records with
valid actions. The cause of this problem was that gin_desc() wrongly
used XLogRecGetData() to extract data from those records.
Since they were registered by XLogRegisterBufData(), gin_desc() should
have used XLogRecGetBlockData(), instead, like gin_redo().
Also there were other differences about how to treat XLOG_GIN_INSERT
record between gin_desc() and gin_redo().

This commit fixes gin_desc() routine so that it treats those records
in the same way as gin_redo().

Batch-patch to 9.5 where WAL record format was revamped and
XLogRegisterBufData() was added.

Reported-By: Andres Freund
Reviewed-By: Tom Lane
Discussion: <20160509194645.7lewnpw647zegx2m@alap3.anarazel.de>
2016-12-05 20:29:41 +09:00
Robert Haas
b460f5d669 Add max_parallel_workers GUC.
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8.  This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.

Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
2016-12-02 07:42:58 -05:00
Alvaro Herrera
fa2fa99552 Permit dump/reload of not-too-large >1GB tuples
Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB.  However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.

This commit enables dumping and reloading of such tuples, provided two
conditions are met:

1. no single column is larger than 1 GB (in output size -- for bytea
   this includes the formatting overhead)
2. the whole row is not larger than 2 GB

There are three related changes to enable this:

a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained.  We still limit these strings to 2 GB,
though, for reasons explained below.

b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.

c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input.  Note that at this point we do not apply any further limit to the
input tuple size.

The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit.  Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.

Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure.  We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.

This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.

Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
2016-12-02 00:34:01 -03:00
Robert Haas
6d46f4783e Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 15:39:21 -05:00
Tom Lane
dbdfd114f3 Bring some clarity to the defaults for the xxx_flush_after parameters.
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults.  This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.

Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.

Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf.  (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)

Copy-edit related SGML documentation.

Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.

Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
2016-11-25 18:36:10 -05:00
Alvaro Herrera
4aaddf2f00 Fix commit_ts for FrozenXid and BootstrapXid
Previously, requesting commit timestamp for transactions
FrozenTransactionId and BootstrapTransactionId resulted in an error.
But since those values can validly appear in committed tuples' Xmin,
this behavior is unhelpful and error prone: each caller would have to
special-case those values before requesting timestamp data for an Xid.
We already have a perfectly good interface for returning "the Xid you
requested is too old for us to have commit TS data for it", so let's use
that instead.

Backpatch to 9.5, where commit timestamps appeared.

Author: Craig Ringer
Discussion: https://www.postgresql.org/message-id/CAMsr+YFM5Q=+ry3mKvWEqRTxrB0iU3qUSRnS28nz6FJYtBwhJg@mail.gmail.com
2016-11-24 15:39:55 -03:00
Robert Haas
e343dfa42b Remove barrier.h
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.

Thomas Munro
2016-11-22 20:28:24 -05:00
Robert Haas
e8ac886c24 Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true.  In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time.  However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set.  Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.

Robert Haas and Thomas Munro
2016-11-22 14:27:11 -05:00
Robert Haas
a43f1939d5 Remove or reduce verbosity of some debug messages.
The debug messages that merely print StartTransactionCommand,
CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no
additional details seem to be useless.  Get rid of them.

The transaction status messages produced by ShowTransactionState are
occasionally useful, but they are extremely verbose, producing
multiple lines of log output every time they fire, which can happens
multiple times per transaction.  So, reduce the level to DEBUG5; avoid
emitting an extra line just to explain which debug point is at issue;
and tighten up the rest of the message so it doesn't use quite so much
horizontal space.

With these changes, it's possible to run a somewhat busy system with a
log level even as high as DEBUG4, whereas previously anything above
DEBUG2 would flood the log with output that probably wasn't really all
that useful.
2016-11-17 17:05:16 -05:00
Alvaro Herrera
f65b94f639 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

This is a back-patch of commits 687f2cd7a0, 3e4b7d8798, b602842613
by Simon Riggs, to branches 9.4 and 9.5.  No further backpatch is
possible because this depends on catalog scans being MVCC.  I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
2016-11-17 13:31:30 -03:00
Tom Lane
9257f07872 Replace uses of SPI_modifytuple that intend to allocate in current context.
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc.  I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.

Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context.  (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)

This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.

This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.

Discussion: <9633.1478552022@sss.pgh.pa.us>
2016-11-08 15:36:44 -05:00
Robert Haas
f0e72a25b0 Improve handling of dead tuples in hash indexes.
When squeezing a bucket during vacuum, it's not necessary to retain
any tuples already marked as dead, so ignore them when deciding which
tuples must be moved in order to empty a bucket page.  Similarly, when
splitting a bucket, relocating dead tuples to the new bucket is a
waste of effort; instead, just ignore them.

Amit Kapila, reviewed by me.  Testing help provided by Ashutosh
Sharma.
2016-11-08 10:52:51 -05:00
Tom Lane
5485c99e7f Fix silly nil-pointer-dereference bug introduced in commit d5f6f13f8.
Don't fetch record->xl_info before we've verified that record isn't
NULL.  Per Coverity.

Michael Paquier
2016-11-06 11:29:40 -05:00
Tom Lane
d5f6f13f8e Be more consistent about masking xl_info with ~XLR_INFO_MASK.
Generally, WAL resource managers are only supposed to examine the
top 4 bits of a WAL record's xl_info; the rest are reserved for
the WAL mechanism itself.  A few places were not consistent about
doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records.
There's no bug currently, since no additional bits ever get set in
these specific record types, but that might not be true forever.
Let's follow the generic coding rule here too.

Michael Paquier
2016-11-04 13:26:49 -04:00
Robert Haas
33839b5ffb Fix leftover reference to background writer performing checkpoints.
This was changed in PostgreSQL 9.2, but somehow this comment never
got updated.
2016-10-28 09:09:00 -04:00
Robert Haas
f267c1c244 Fix possible pg_basebackup failure on standby with "include WAL".
If a restartpoint flushed no dirty buffers, it could fail to update
the minimum recovery point, leading to a minimum recovery point prior
to the starting REDO location.  perform_base_backup() would interpret
that as meaning that no WAL files at all needed to be included in the
backup, failing an internal sanity check.  To fix, have restartpoints
always update the minimum recovery point to just after the checkpoint
record itself, so that the file (or files) containing the checkpoint
record will always be included in the backup.

Code by Amit Kapila, per a design suggestion by me, with some
additional work on the code comment by me.  Test case by Michael
Paquier.  Report by Kyotaro Horiguchi.
2016-10-27 11:19:51 -04:00
Alvaro Herrera
00f15338b2 Preserve commit timestamps across clean restart
An oversight in setting the boundaries of known commit timestamps during
startup caused old commit timestamps to become inaccessible after a
server restart.

Author and reporter: Julien Rouhaud
Review, test code: Craig Ringer
2016-10-24 09:45:48 -03:00
Robert Haas
919c811ca1 Fix comment formatting. 2016-10-21 12:04:21 -04:00
Robert Haas
f82ec32ac3 Rename "pg_xlog" directory to "pg_wal".
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences.  So, rename the directory to "pg_wal".

This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.

We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal".  However, that's still under discussion, so let's do just this
much for now.

Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com

Michael Paquier
2016-10-20 11:32:18 -04:00
Heikki Linnakangas
917dc7d239 Fix WAL-logging of FSM and VM truncation.
When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.

To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.

Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.

Analysis by Pavan Deolasee.

Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
2016-10-19 14:26:05 +03:00
Tom Lane
5c80642aa8 Remove unnecessary int2vector-specific hash function and equality operator.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys.  However, there are
no catcaches that use such columns.  (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)

Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support.  This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass.  We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.

If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector.  (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.

Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns.  While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible.  Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.

Amit Langote

Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
2016-10-12 14:54:08 -04:00
Tom Lane
2f1eaf87e8 Drop server support for FE/BE protocol version 1.0.
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3.  There's no reason to
believe any other client-side code still uses that protocol, either.

Discussion: <2661.1475849167@sss.pgh.pa.us>
2016-10-11 12:19:18 -04:00
Robert Haas
6f3bd98ebf Extend framework from commit 53be0b1ad to report latch waits.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h.  This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.

Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.

Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity.  We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.

Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
2016-10-04 11:01:42 -04:00
Tom Lane
fdc9186f7e Replace the built-in GIN array opclasses with a single polymorphic opclass.
We had thirty different GIN array opclasses sharing the same operators and
support functions.  That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types.  What we want is a single
polymorphic opclass for "anyarray".  There were two missing features needed
to make this possible:

1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY.  This just takes a few
more lines in index_create().  Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.

2. We have to be able to identify the proper GIN compare function for
the index storage type.  This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it.  Again, that seems pretty generically useful.

Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup.  It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.

Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible.  Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY.  However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.

Tom Lane, reviewed by Enrique Meneses

Discussion: <14436.1470940379@sss.pgh.pa.us>
2016-09-26 14:52:44 -04:00
Tom Lane
959ea7fa76 Remove useless code.
Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.

Aleksander Alekseev

Discussion: <20160923090814.GB1238@e733>
2016-09-23 10:44:50 -04:00
Peter Eisentraut
ebdf5bf7d1 Delay updating control file to "in production"
Move the updating of the control file to "in production" status until
the point where WAL writes are allowed.  Before, there could be a
significant gap between the control file update and write transactions
actually being allowed.  This makes it more reliable to use the control
status to verify the end of a promotion.

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Heikki Linnakangas
45310221a9 Fix outdated comments, GIST search queue is not an RBTree anymore.
The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.
2016-09-20 11:38:25 +03:00
Robert Haas
445a38aba2 Have heapam.h include lockdefs.h rather than lock.h.
lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h.  heapam.h only needs
that smaller subset, so adjust it to include only that.  This requires
some corresponding adjustments elsewhere.

Peter Geoghegan
2016-09-13 09:21:35 -04:00
Tom Lane
24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane
b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Alvaro Herrera
5c609a742f Fix locking a tuple updated by an aborted (sub)transaction
When heap_lock_tuple decides to follow the update chain, it tried to
also lock any version of the tuple that was created by an update that
was subsequently rolled back.  This is pointless, since for all intents
and purposes that tuple exists no more; and moreover it causes
misbehavior, as reported independently by Marko Tiikkaja and Marti
Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return
the tuples, and assertion-enabled builds crash.

Fix by having heap_lock_updated_tuple test the xmin and return success
immediately if the tuple was created by an aborted transaction.

The condition where tuples become invisible occurs when an updated tuple
chain is followed by heap_lock_updated_tuple, which reports the problem
as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn
propagates that code outwards possibly leading the calling code
(ExecLockRows) to believe that the tuple exists no longer.

Backpatch to 9.3.  Only on 9.5 and newer this leads to a visible
failure, because of commit 27846f02c176; before that, heap_lock_tuple
skips the whole dance when the tuple is already locked by the same
transaction, because of the ancient HeapTupleSatisfiesUpdate behavior.
Still, the buggy condition may also exist in more convoluted scenarios
involving concurrent transactions, so it seems safer to fix the bug in
the old branches too.

Discussion:
	https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
	https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to
2016-09-09 15:54:29 -03:00
Simon Riggs
ec253de1fd Fix corruption of 2PC recovery with subxacts
Reading 2PC state files during recovery was borked, causing corruptions during
recovery. Effect limited to servers with 2PC, subtransactions and
recovery/replication.

Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee
2016-09-09 11:55:12 +01:00
Simon Riggs
67c6bd1ca3 Fix minor memory leak in Standby startup
StandbyRecoverPreparedTransactions() leaked the buffer
used for two phase state file. This was leaked once
at startup and at every shutdown checkpoint seen.

Backpatch to 9.6

Stas Kelvich
2016-09-08 10:32:58 +01:00
Peter Eisentraut
49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Tom Lane
60893786d5 Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
computeLeafRecompressWALData() tried to produce a uint16 WAL log field by
memcpy'ing the first two bytes of an int-sized variable.  That accidentally
works on little-endian hardware, but not at all on big-endian.  Replay then
thinks it's looking at an ADDITEMS action with zero entries, and reads the
first two bytes of the first TID therein as the next segno/action,
typically leading to "unexpected GIN leaf action" errors during replay.
Even if replay failed to crash, the resulting GIN index page would surely
be incorrect.  To fix, just declare the variable as uint16 instead.

Per bug #14295 from Spencer Thomason (much thanks to Spencer for turning
his problem into a self-contained test case).  This likely also explains
a previous report of the same symptom from Bernd Helmle.

Back-patch to 9.4 where the problem was introduced (by commit 14d02f0bb).

Discussion: <20160826072658.15676.7628@wrigleys.postgresql.org>
Possible-Report: <2DA7350F7296B2A142272901@eje.land.credativ.lan>
2016-09-03 13:28:53 -04:00
Simon Riggs
35250b6ad7 New recovery target recovery_target_lsn
Michael Paquier
2016-09-03 17:48:01 +01:00
Heikki Linnakangas
9f85784cae Support multiple iterators in the Red-Black Tree implementation.
While we don't need multiple iterators at the moment, the interface is
nicer and less dangerous this way.

Aleksander Alekseev, with some changes by me.
2016-09-02 08:39:39 +03:00
Tom Lane
0e0f43d6fd Prevent starting a standalone backend with standby_mode on.
This can't really work because standby_mode expects there to be more
WAL arriving, which there will not ever be because there's no WAL
receiver process to fetch it.  Moreover, if standby_mode is on then
hot standby might also be turned on, causing even more strangeness
because that expects read-only sessions to be executing in parallel.
Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid
got confused, but rather than band-aiding individual problems it seems
best to prevent getting anywhere near this state in the first place.
Back-patch to all supported branches.

In passing, also fix some omissions of errcodes in other ereport's in
readRecoveryCommandFile().

Michael Paquier (errcode hacking by me)

Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>
2016-08-31 08:52:13 -04:00
Alvaro Herrera
8e1e3f958f Split hash.h → hash_xlog.h
Since the hash AM is going to be revamped to have WAL, this is a good
opportunity to clean up the include file a little bit to avoid including
a lot of extra stuff in the future.

Author: Amit Kapila
2016-08-29 18:55:49 -03:00
Fujii Masao
bd082231ed Fix typos in comments. 2016-08-29 16:06:40 +09:00
Fujii Masao
bab7823a49 Fix pg_xlogdump so that it handles cross-page XLP_FIRST_IS_CONTRECORD record.
Previously pg_xlogdump failed to dump the contents of the WAL file
if the file starts with the continuation WAL record which spans
more than one pages. Since pg_xlogdump assumed that the continuation
record always fits on a page, it could not find the valid WAL record to
start reading from in that case.

This patch changes pg_xlogdump so that it can handle a continuation
WAL record which crosses a page boundary and find the valid record
to start reading from.

Back-patch to 9.3 where pg_xlogdump was introduced.

Author: Pavan Deolasee
Reviewed-By: Michael Paquier and Craig Ringer
Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
2016-08-29 14:34:58 +09:00
Tom Lane
ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Tom Lane
78dcd027e8 Fix potential memory leakage from HandleParallelMessages().
HandleParallelMessages leaked memory into the caller's context.  Since it's
called from ProcessInterrupts, there is basically zero certainty as to what
CurrentMemoryContext is, which means we could be leaking into long-lived
contexts.  Over the processing of many worker messages that would grow to
be a problem.  Things could be even worse than just a leak, if we happened
to service the interrupt while ErrorContext is current: elog.c thinks it
can reset that on its own whim, possibly yanking storage out from under
HandleParallelMessages.

Give HandleParallelMessages its own dedicated context instead, which we can
reset during each call to ensure there's no accumulation of wasted memory.

Discussion: <16610.1472222135@sss.pgh.pa.us>
2016-08-26 15:04:05 -04:00
Tom Lane
fbf28b6b52 Fix logic for adding "parallel worker" context line to worker errors.
The previous coding here was capable of adding a "parallel worker" context
line to errors that were not, in fact, returned from a parallel worker.
Instead of using an errcontext callback to add that annotation, just paste
it onto the message by hand; this looks uglier but is more reliable.

Discussion: <19757.1472151987@sss.pgh.pa.us>
2016-08-26 10:07:28 -04:00
Tom Lane
ae4760d667 Fix small query-lifespan memory leak in bulk updates.
When there is an identifiable REPLICA IDENTITY index on the target table,
heap_update leaks the id_attrs bitmapset.  That's not many bytes, but it
adds up over enough rows, since the code typically runs in a query-lifespan
context.  Bug introduced in commit e55704d8b, which did a rather poor job
of cloning the existing use-pattern for RelationGetIndexAttrBitmap().

Per bug #14293 from Zhou Digoal.  Back-patch to 9.4 where the bug was
introduced.

Report: <20160824114320.15676.45171@wrigleys.postgresql.org>
2016-08-24 22:20:25 -04:00
Tom Lane
71e006f031 Suppress compiler warnings in non-cassert builds.
With Asserts off, these variables are set but never used, resulting
in warnings from pickier compilers.  Fix that with our standard solution.
Per report from Jeff Janes.
2016-08-23 23:21:10 -04:00
Tom Lane
d2ddee63b4 Improve SP-GiST opclass API to better support unlabeled nodes.
Previously, the spgSplitTuple action could only create a new upper tuple
containing a single labeled node.  This made it useless for opclasses
that prefer to work with fixed sets of nodes (labeled or otherwise),
which meant that restrictive prefixes could not be used with such
node definitions.  Change the output field set for the choose() method
to allow it to specify any valid node set for the new upper tuple,
and to specify which of these nodes to place the modified lower tuple in.

In addition to its primary use for fixed node sets, this feature could
allow existing opclasses that use variable node sets to skip a separate
spgAddNode action when splitting a tuple, by setting up the node needed
for the incoming value as part of the spgSplitTuple action.  However, care
would have to be taken to add the extra node only when it would not make
the tuple bigger than before.  (spgAddNode can enlarge the tuple,
spgSplitTuple can't.)

This is a prerequisite for an upcoming SP-GiST inet opclass, but is
being committed separately to increase the visibility of the API change.

In passing, improve the documentation about the traverse-values feature
that was added by commit ccd6eb49a.

Emre Hasegeli, with cosmetic adjustments and documentation rework by me

Discussion: <CAE2gYzxtth9qatW_OAqdOjykS0bxq7AYHLuyAQLPgT7H9ZU0Cw@mail.gmail.com>
2016-08-23 12:10:34 -04:00
Andres Freund
07ef035129 Fix deletion of speculatively inserted TOAST on conflict
INSERT ..  ON CONFLICT runs a pre-check of the possible conflicting
constraints before performing the actual speculative insertion.  In case
the inserted tuple included TOASTed columns the ON CONFLICT condition
would be handled correctly in case the conflict was caught by the
pre-check, but if two transactions entered the speculative insertion
phase at the same time, one would have to re-try, and the code for
aborting a speculative insertion did not handle deleting the
speculatively inserted TOAST datums correctly.

TOAST deletion would fail with "ERROR: attempted to delete invisible
tuple" as we attempted to remove the TOAST tuples using
simple_heap_delete which reasoned that the given tuples should not be
visible to the command that wrote them.

This commit updates the heap_abort_speculative() function which aborts
the conflicting tuple to use itself, via toast_delete, for deleting
associated TOAST datums.  Like before, the inserted toast rows are not
marked as being speculative.

This commit also adds a isolationtester spec test, exercising the
relevant code path. Unfortunately 9.5 cannot handle two waiting
sessions, and thus cannot execute this test.

Reported-By: Viren Negi, Oskari Saarenmaa
Author: Oskari Saarenmaa, edited a bit by me
Bug: #14150
Discussion: <20160519123338.12513.20271@wrigleys.postgresql.org>
Backpatch: 9.5, where ON CONFLICT was introduced
2016-08-17 17:03:36 -07:00
Peter Eisentraut
f0fe1c8f70 Fix typos
From: Alexander Law <exclusion@gmail.com>
2016-08-16 14:52:29 -04:00
Tom Lane
b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Tom Lane
ed0097e4f9 Add SQL-accessible functions for inspecting index AM properties.
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35).  The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.

As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column.  Also provide the ability for an index
AM to override the behavior.

In passing, document pg_am.amtype, overlooked in commit 473b93287.

Andrew Gierth, with kibitzing by me and others

Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
2016-08-13 18:31:14 -04:00
Tom Lane
e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00
Robert Haas
81c766b3fd Change InitToastSnapshot to a macro.
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides.  Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.

Report and review by Tom Lane; patch by me.
2016-08-05 11:58:03 -04:00
Andres Freund
e7caacf733 Fix hard to hit race condition in heapam's tuple locking code.
As mentioned in its commit message, eca0f1db left open a race condition,
where a page could be marked all-visible, after the code checked
PageIsAllVisible() to pin the VM, but before the page is locked.  Plug
that hole.

Reviewed-By: Robert Haas, Andres Freund
Author: Amit Kapila
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-08-04 20:07:16 -07:00
Robert Haas
3e2f3c2e42 Prevent "snapshot too old" from trying to return pruned TOAST tuples.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning.  Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.

Reported by Andres Freund, who also sketched out what the fix should
look like.  Patch by me, reviewed by Amit Kapila.
2016-08-03 16:50:01 -04:00
Tom Lane
b6a97b91ff Block interrupts during HandleParallelMessages().
As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c
functions called by HandleParallelMessages().  I believe they're all
unreachable since we always pass nowait = true, but it doesn't seem like
a great idea to assume that no such call will ever be reachable from
HandleParallelMessages().  If that did happen, there would be a risk of a
recursive call to HandleParallelMessages(), which it does not appear to be
designed for --- for example, there's nothing that would prevent
out-of-order processing of received messages.  And certainly such cases
cannot easily be tested.  So let's prevent it by holding off interrupts for
the duration of the function.  Back-patch to 9.5 which contains identical
code.

Discussion: <14869.1470083848@sss.pgh.pa.us>
2016-08-02 16:39:16 -04:00
Tom Lane
a5fe473ad7 Minor cleanup for access/transam/parallel.c.
ParallelMessagePending *must* be marked volatile, because it's set
by a signal handler.  On the other hand, it's pointless for
HandleParallelMessageInterrupt to save/restore errno; that must be,
and is, done at the outer level of the SIGUSR1 signal handler.

Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself
is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous.
The comment claiming that this is needed to handle the error queue going
away is certainly misguided, in any case.

Improve a couple of error message texts, and use
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker
connection, since that's what's used in e.g. tqueue.c.  (Maybe it would be
worth inventing a dedicated ERRCODE for this type of failure?  But I do not
think ERRCODE_INTERNAL_ERROR is appropriate.)

Minor stylistic cleanups.
2016-08-01 16:12:01 -04:00
Peter Eisentraut
ef5d4a3cfa Message style improvements 2016-07-28 16:34:44 -04:00
Robert Haas
1091402b5a Remove unused structure member.
Michael Paquier
2016-07-21 11:53:44 -04:00
Andres Freund
eca0f1db14 Clear all-frozen visibilitymap status when locking tuples.
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag.  Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.

There's a remaining issue (denoted by XXX): After the PageIsAllVisible()
check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page
status could theoretically change. Practically that currently seems
impossible, because updaters will hold a page level pin already.  Due to
the next beta coming up, it seems better to get the required WAL magic
bump done before resolving this issue.

The added flags field fields to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.

Reviewed-By: Robert Haas, Amit Kapila and Andres Freund
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-07-18 02:01:13 -07:00
Tom Lane
9563d5b5e4 Add regression test case exercising the sorting path for hash index build.
We've broken this code path at least twice in the past, so it's prudent
to have a test case that covers it.  To allow exercising the code path
without creating a very large (and slow to run) test case, redefine the
sort threshold to be bounded by maintenance_work_mem as well as the number
of available buffers.  While at it, fix an ancient oversight that when
building a temp index, the number of available buffers is not NBuffers but
NLocBuffer.  Also, if assertions are enabled, apply a direct test that the
sort actually does return the tuples in the expected order.

Peter Geoghegan

Patch: <CAM3SWZTBAo4hjbBd780+MrOKiKp_TMo1N3A0Rw9_im8gbD7fQA@mail.gmail.com>
2016-07-16 15:30:15 -04:00
Andres Freund
bfa2ab56bb Fix torn-page, unlogged xid and further risks from heap_update().
When heap_update needs to look for a page for the new tuple version,
because the current one doesn't have sufficient free space, or when
columns have to be processed by the tuple toaster, it has to release the
lock on the old page during that. Otherwise there'd be lock ordering and
lock nesting issues.

To avoid concurrent sessions from trying to update / delete / lock the
tuple while the page's content lock is released, the tuple's xmax is set
to the current session's xid.

That unfortunately was done without any WAL logging, thereby violating
the rule that no XIDs may appear on disk, without an according WAL
record.  If the database were to crash / fail over when the page level
lock is released, and some activity lead to the page being written out
to disk, the xid could end up being reused; potentially leading to the
row becoming invisible.

There might be additional risks by not having t_ctid point at the tuple
itself, without having set the appropriate lock infomask fields.

To fix, compute the appropriate xmax/infomask combination for locking
the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK
record. That allows the fix to be backpatched.

This issue has existed for a long time. There appears to have been
partial attempts at preventing dangers, but these never have fully been
implemented, and were removed a long time ago, in
11919160 (cf. HEAP_XMAX_UNLOGGED).

In master / 9.6, there's an additional issue, namely that the
visibilitymap's freeze bit isn't reset at that point yet. Since that's a
new issue, introduced only in a892234f83, that'll be fixed in a
separate commit.

Author: Masahiko Sawada and Andres Freund
Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: 9.1/all supported versions
2016-07-15 17:49:48 -07:00
Andres Freund
a4d357bfbd Make HEAP_LOCK/HEAP2_LOCK_UPDATED replay reset HEAP_XMAX_INVALID.
0ac5ad5 started to compress infomask bits in WAL records. Unfortunately
the replay routines for XLOG_HEAP_LOCK/XLOG_HEAP2_LOCK_UPDATED forgot to
reset the HEAP_XMAX_INVALID (and some other) hint bits.

Luckily that's not problematic in the majority of cases, because after a
crash/on a standby row locks aren't meaningful. Unfortunately that does
not hold true in the presence of prepared transactions. This means that
after a crash, or after promotion, row level locks held by a prepared,
but not yet committed, prepared transaction might not be enforced.

Discussion: 20160715192319.ubfuzim4zv3rqnxv@alap3.anarazel.de
Backpatch: 9.3, the oldest branch on which 0ac5ad5 is present.
2016-07-15 14:45:37 -07:00
Alvaro Herrera
533e9c6b06 Avoid serializability errors when locking a tuple with a committed update
When key-share locking a tuple that has been not-key-updated, and the
update is a committed transaction, in some cases we raised
serializability errors:
    ERROR:  could not serialize access due to concurrent update

Because the key-share doesn't conflict with the update, the error is
unnecessary and inconsistent with the case that the update hasn't
committed yet.  This causes problems for some usage patterns, even if it
can be claimed that it's sufficient to retry the aborted transaction:
given a steady stream of updating transactions and a long locking
transaction, the long transaction can be starved indefinitely despite
multiple retries.

To fix, we recognize that HeapTupleSatisfiesUpdate can return
HeapTupleUpdated when an updating transaction has committed, and that we
need to deal with that case exactly as if it were a non-committed
update: verify whether the two operations conflict, and if not, carry on
normally.  If they do conflict, however, there is a difference: in the
HeapTupleBeingUpdated case we can just sleep until the concurrent
transaction is gone, while in the HeapTupleUpdated case this is not
possible and we must raise an error instead.

Per trouble report from Olivier Dony.

In addition to a couple of test cases that verify the changed behavior,
I added a test case to verify the behavior that remains unchanged,
namely that errors are raised when a update that modifies the key is
used.  That must still generate serializability errors.  One
pre-existing test case changes behavior; per discussion, the new
behavior is actually the desired one.

Discussion: https://www.postgresql.org/message-id/560AA479.4080807@odoo.com
  https://www.postgresql.org/message-id/20151014164844.3019.25750@wrigleys.postgresql.org

Backpatch to 9.3, where the problem appeared.
2016-07-15 14:17:20 -04:00
Tom Lane
1acf757255 Fix GiST index build for NaN values in geometric types.
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values.  This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs.  The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons.  Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs.  And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.

To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN.  Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.

I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data.  It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.

Per bug #14238 from Guang-Dih Lei.  Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.

Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
2016-07-14 18:45:59 -04:00
Magnus Hagander
87d84d67bb Fix start WAL filename for concurrent backups from standby
On a standby, ThisTimelineID is always 0, so we would generate a
filename in timeline 0 even for other timelines. Instead, use starttli
which we have retreived from the controlfile.

Report by: Francesco Canovai in bug #14230
Author: Marco Nenciarini
Reviewed by: Michael Paquier and Amit Kapila
2016-07-11 12:02:31 +02:00
Robert Haas
10c0558ffe Fix several mistakes around parallel workers and client_encoding.
Previously, workers sent data to the leader using the client encoding.
That mostly worked, but the leader the converted the data back to the
server encoding.  Since not all encoding conversions are reversible,
that could provoke failures.  Fix by using the database encoding for
all communication between worker and leader.

Also, while temporary changes to GUC settings, as from the SET clause
of a function, are in general OK for parallel query, changing
client_encoding this way inside of a parallel worker is not OK.
Previously, that would have confused the leader; with these changes,
it would not confuse the leader, but it wouldn't do anything either.
So refuse such changes in parallel workers.

Also, the previous code naively assumed that when it received a
NotifyResonse from the worker, it could pass that directly back to the
user.  But now that worker-to-leader communication always uses the
database encoding, that's clearly no longer correct - though,
actually, the old way was always broken for V2 clients.  So
disassemble and reconstitute the message instead.

Issues reported by Peter Eisentraut.  Patch by me, reviewed by
Peter Eisentraut.
2016-06-30 18:35:32 -04:00
Alvaro Herrera
b78364df18 Remove unused arguments in two GiST subroutines
These arguments became unused in commit 2c03216d83.  Noticed while
skimming code for unrelated development.

This is cosmetic, so no backpatch.
2016-06-28 16:01:13 -04:00
Alvaro Herrera
e3ad3ffa68 Fix handling of multixacts predating pg_upgrade
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore.  In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails.  Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.

It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.

To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding.  This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean.  Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.

To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact.  (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).

This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.

Bug analysis by Andrew Gierth and Álvaro Herrera.

A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
2016-06-24 18:29:28 -04:00
Tom Lane
8cf739de85 Fix building of large (bigger than shared_buffers) hash indexes.
When the index is predicted to need more than NBuffers buckets,
CREATE INDEX attempts to sort the index entries by hash key before
insertion, so as to reduce thrashing.  This code path got broken by
commit 9f03ca9151, which overlooked that _hash_form_tuple() is not
just an alias for index_form_tuple().  The index got built anyway, but
with garbage data, so that searches for pre-existing tuples always
failed.  Fix by refactoring to separate construction of the indexable
data from calling index_form_tuple().

Per bug #14210 from Daniel Newman.  Back-patch to 9.5 where the
bug was introduced.

Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
2016-06-24 16:57:36 -04:00
Alvaro Herrera
1ca4a1b5d2 Finish up XLOG_HINT renaming
Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two
places.

Backpatch to 9.3, like that commit.
2016-06-17 18:05:55 -04:00
Robert Haas
71d05a2c7b pg_visibility: Add pg_truncate_visibility_map function.
This requires some core changes as well so that we can properly
WAL-log the truncation.  Specifically, it changes the format of the
XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC.

Patch by me, reviewed but not fully endorsed by Andres Freund.
2016-06-17 17:37:30 -04:00
Robert Haas
9c188a8454 Fix typo.
Thomas Munro
2016-06-17 12:55:30 -04:00
Robert Haas
292794f82b Remove PID from 'parallel worker' context message.
Discussion: <bfd204ab-ab1a-792a-b345-0274a09a4b5f@2ndquadrant.com>
2016-06-17 09:26:17 -04:00
Tom Lane
bfb937427b Fix fuzzy thinking in ReinitializeParallelDSM().
The fact that no workers were successfully launched in the previous
iteration does not excuse us from setting up properly to try again.
This appears to explain crashes I saw in parallel regression testing
due to error_mqh being NULL when it shouldn't be.

Minor other cosmetic fixes too.
2016-06-16 15:20:29 -04:00
Robert Haas
38e9f90a22 Fix lazy_scan_heap so that it won't mark pages all-frozen too soon.
Commit a892234f83 added a new bit per
page to the visibility map fork indicating whether the page is
all-frozen, but incorrectly assumed that if lazy_scan_heap chose to
freeze a tuple then that tuple would not need to later be frozen
again. This turns out to be false, because xmin and xmax (and
conceivably xvac, if dealing with tuples from very old releases) could
be frozen at separate times.

Thanks to Andres Freund for help in uncovering and tracking down this
issue.
2016-06-15 14:30:06 -04:00
Tom Lane
cae1c788b9 Improve the situation for parallel query versus temp relations.
Transmit the leader's temp-namespace state to workers.  This is important
because without it, the workers do not really have the same search path
as the leader.  For example, there is no good reason (and no extant code
either) to prevent a worker from executing a temp function that the
leader created previously; but as things stood it would fail to find the
temp function, and then either fail or execute the wrong function entirely.

We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.

Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session.  While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway.  We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables.  (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)

Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open().  This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.

Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
2016-06-09 20:16:11 -04:00
Robert Haas
4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Robert Haas
c9ce4a1c61 Eliminate "parallel degree" terminology.
This terminology provoked widespread complaints.  So, instead, rename
the GUC max_parallel_degree to max_parallel_workers_per_gather
(leaving room for a possible future GUC max_parallel_workers that acts
as a system-wide limit), and rename the parallel_degree reloption to
parallel_workers.  Rename structure members to match.

These changes create a dump/restore hazard for users of PostgreSQL
9.6beta1 who have set the reloption (or applied the GUC using ALTER
USER or ALTER DATABASE).
2016-06-09 10:00:26 -04:00
Peter Eisentraut
5c6d2a5e7c Message style and wording fixes 2016-06-07 14:18:55 -04:00
Robert Haas
c6dbf1fe79 Stop the executor if no more tuples can be sent from worker to leader.
If a Gather node has read as many tuples as it needs (for example, due
to Limit) it may detach the queue connecting it to the worker before
reading all of the worker's tuples.  Rather than let the worker
continue to generate and send all of the results, have it stop after
sending the next tuple.

More could be done here to stop the worker even quicker, but this is
about as well as we can hope to do for 9.6.

This is in response to a problem report from Andreas Seltenreich.
Commit 44339b892a should be actually be
sufficient to fix that example even without this change, but it seems
better to do this, too, since we might otherwise waste quite a large
amount of effort in one or more workers.

Discussion: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com

Amit Kapila
2016-06-06 14:52:58 -04:00
Robert Haas
932b97a011 Fix typo.
Jim Nasby
2016-06-06 07:58:50 -04:00
Greg Stark
e1623c3959 Fix various common mispellings.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.

Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
2016-06-03 16:08:45 +01:00
Robert Haas
fdfaccfa79 Cosmetic improvements to freeze map code.
Per post-commit review comments from Andres Freund, improve variable
names, comments, and in one place, slightly improve the code structure.

Masahiko Sawada
2016-06-03 08:43:41 -04:00
Kevin Grittner
7392eed7c2 Fix btree mark/restore bug.
Commit 2ed5b87f96 introduced a bug in
mark/restore, in an attempt to optimize repeated restores to the
same page.  This caused an assertion failure during a merge join
which fed directly from an index scan, although the impact would
not be limited to that case.  Revert the bad chunk of code from
that commit.

While investigating this bug it was discovered that a particular
"paranoia" set of the mark position field would not prevent bad
behavior; it would just make it harder to diagnose.  Change that
into an assertion, which will draw attention to any future problem
in that area more directly.

Backpatch to 9.5, where the bug was introduced.

Bug #14169 reported by Shinta Koyanagi.
Preliminary analysis by Tom Lane identified which commit caused
the bug.
2016-06-02 12:23:01 -05:00
Alvaro Herrera
975ad4e602 Fix PageAddItem BRIN bug
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer.  But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer.  PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
	WARNING:  specified item offset is too large
	PANIC:  failed to add BRIN tuple

To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).

Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if an TID found in the revmap
is not marked as live by the page header.  This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.

Backpatch to 9.5.

Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
2016-05-30 14:47:22 -04:00
Tom Lane
1e0d6512e5 Fix BTREE_BUILD_STATS build.
Commit 65c5fcd353 broke this by removing a
header include directive that is conditionally required.  Add that back
to nbtree.c, with annotation to keep pgrminclude from re-breaking it.

Peter Geoghegan

Report: <CAM3SWZTNjHFYW_UG8bu0BnogqQ2HfsTgkzXLueuUhfTcYbu5HA@mail.gmail.com>
2016-05-23 19:41:11 -04:00
Teodor Sigaev
7c979c95a3 Allocate all page images at once in generic wal interface
That reduces number of allocation.

Per gripe from Michael Paquier and Tom Lane suggestion.
2016-05-17 22:09:22 +03:00
Teodor Sigaev
7c8345f67f Correctly align page's images in generic wal API
Page image should be MAXALIGN'ed because existing code could directly align
pointers in page instead of align offset from beginning of page.

Found during play with indexes as extenstion, Alexander Korotkov and me
2016-05-17 00:01:35 +03:00
Alvaro Herrera
cca2a27860 Fix bogus comments
Some comments mentioned XLogReplayBuffer, but there's no such function:
that was an interim name for a function that got renamed to
XLogReadBufferForRedo, before commit 2c03216d83 was pushed.
2016-05-12 16:07:07 -03:00
Alvaro Herrera
bdb9e3dc1d Fix obsolete comment 2016-05-12 15:36:51 -03:00
Tom Lane
9eb7a0ac6b Fix poorly-worded log message.
Euler Taveira
2016-05-08 01:37:07 -04:00
Robert Haas
c7ea68ff8d Limit maximum parallel degree to 1024.
This new limit affects both the max_parallel_degree GUC and the
parallel_degree reloption.  There may some day be a use case for using
more than 1024 CPUs for a single query, but that's surely not the case
right now.  Not only do not very many people have that many CPUs, but
the code hasn't been tested at that kind of scale and is very unlikely
to perform well, or even work at all, without a lot more work.  The
issue addressed by commit 06bd458cb8 is
probably just one problem of many.

The idea of a more reasonable limit here was suggested by Tom Lane;
the value of 1024 was suggested by Amit Kapila.
2016-05-06 14:50:54 -04:00
Robert Haas
06bd458cb8 Use mul_size when multiplying by the number of parallel workers.
That way, if the result overflows size_t, you'll get an error instead
of undefined behavior, which seems like a plus.  This also has the
effect of casting the number of workers from int to Size, which is
better because it's harder to overflow int than size_t.

Dilip Kumar reported this issue and provided a patch upon which this
patch is based, but his version did use mul_size.
2016-05-06 14:32:58 -04:00
Kevin Grittner
2cc41acd8f Fix hash index vs "snapshot too old" problemms
Hash indexes are not WAL-logged, and so do not maintain the LSN of
index pages.  Since the "snapshot too old" feature counts on
detecting error conditions using the LSN of a table and all indexes
on it, this makes it impossible to safely do early vacuuming on any
table with a hash index, so add this to the tests for whether the
xid used to vacuum a table can be adjusted based on
old_snapshot_threshold.

While at it, add a paragraph to the docs for old_snapshot_threshold
which specifically mentions this and other aspects of the feature
which may otherwise surprise users.

Problem reported and patch reviewed by Amit Kapila
2016-05-06 07:47:12 -05:00
Alvaro Herrera
c1543a81a7 Revert timeline following in replication slots
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.
2016-05-04 17:32:22 -03:00
Teodor Sigaev
f8467f7da8 Prevent to use magic constants
Use macroses for definition amstrategies/amsupport fields instead of
hardcoded values.

Author: Nikolay Shaplov with addition for contrib/bloom
2016-04-28 16:39:25 +03:00
Teodor Sigaev
e2c79e14d9 Prevent multiple cleanup process for pending list in GIN.
Previously, ginInsertCleanup could exit early if it detects that someone else
is cleaning up the pending list, without waiting for that someone else to
finish the job. But in this case vacuum could miss tuples to be deleted.

Cleanup process now locks metapage with a help of heavyweight
LockPage(ExclusiveLock), and it guarantees that there is no another cleanup
process at the same time. Lock is taken differently depending on caller of
cleanup process: any vacuums and gin_clean_pending_list() will be blocked
until lock becomes available, ordinary insert uses conditional lock to
prevent indefinite waiting on lock.

Insert into pending list doesn't use this lock, so insertion isn't blocked.

Also, patch adds stopping of cleanup process when at-start-cleanup-tail is
reached in order to prevent infinite cleanup in case of massive insertion. But
it will stop only for automatic maintenance tasks like autovacuum.

Patch introduces choice of limit of memory to use: autovacuum_work_mem,
maintenance_work_mem or work_mem depending on call path.

Patch for previous releases should be reworked due to changes between 9.6 and
previous ones in this area.

Discover and diagnostics by Jeff Janes and Tomas Vondra

Patch by me with some ideas of Jeff Janes
2016-04-28 16:21:42 +03:00
Andres Freund
c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Tom Lane
bde361fef5 Fix memory leak and other bugs in ginPlaceToPage() & subroutines.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not.  Subsequent patches
band-aided over some of the problems with this design by making things
even messier.

One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud).  This would not typically
be noticeable during retail index updates.  It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.

Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state.  There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.

To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts.  The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page.  The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path.  The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage().  Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)

In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.

Report: <571276DD.5050303@dalibo.com>
2016-04-20 14:25:15 -04:00
Kevin Grittner
a343e223a5 Revert no-op changes to BufferGetPage()
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature.  Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming).  The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.

This change should have little or no effect on generated executable
code.

Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-20 08:31:19 -05:00
Tom Lane
f0e766bd7f Fix memory leak in GIN index scans.
The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.

Julien Rouhaud
2016-04-15 00:02:26 -04:00
Tom Lane
5713f03973 Improve API of GenericXLogRegister().
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:42:06 -04:00
Tom Lane
bdf7db8192 In generic WAL application and replay, ensure page "hole" is always zero.
The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied.  This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers.  So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:14:00 -04:00
Stephen Frost
cd13471f2e Correct copyright for newly added genericdesc.c
It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote
2016-04-12 08:45:09 -04:00
Tom Lane
660d5fb856 Further minor improvement in generic_xlog.c: always say REGBUF_STANDARD.
Since we're requiring pages handled by generic_xlog.c to be standard
format, specify REGBUF_STANDARD when doing a full-page image, so that
xloginsert.c can compress out the "hole" between pd_lower and pd_upper.
Given the current API in which this path will be taken only for a newly
initialized page, the hole is likely to be particularly large in such
cases, so that this oversight could easily be performance-significant.
I don't notice any particular change in the runtime of contrib/bloom's
regression test, though.
2016-04-10 00:24:28 -04:00
Tom Lane
68689c66ef Micro-optimize GenericXLogFinish().
Make the inner comparison loops of computeDelta() as tight as possible by
pulling considerations of valid and invalid ranges out of the inner loops,
and extending a match or non-match detection as far as possible before
deciding what to do next.  To keep this tractable, give up the possibility
of merging fragments across the pd_lower to pd_upper gap.  The fraction of
pages where that could happen (ie, there are 4 or fewer bytes in the gap,
*and* data changes immediately adjacent to it on both sides) is too small
to be worth spending cycles on.

Also, avoid two BLCKSZ-length memcpy()s by computing the delta before
moving data into the target buffer, instead of after.  This doesn't save
nearly as many cycles as being tenser about computeDelta(), but it still
seems worth doing.

On my machine, this patch cuts a full 40% off the runtime of
contrib/bloom's regression test.
2016-04-09 19:30:56 -04:00
Tom Lane
08e785436f Get rid of GenericXLogUnregister().
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.

Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.

While at it, do some wordsmithing on the SGML docs for generic WAL.
2016-04-09 16:39:30 -04:00
Tom Lane
db03cf375d Code review/prettification for generic_xlog.c.
Improve commentary, use more specific names for the delta fields,
const-ify pointer arguments where possible, avoid assuming that
initializing only the first element of a local array will guarantee
that the remaining elements end up as we need them.  (I think that
code in generic_redo actually worked, but only because InvalidBuffer
is zero; this is a particularly ugly way of depending on that ...)
2016-04-09 15:02:19 -04:00
Tom Lane
2dd318d277 Run pgindent on generic_xlog.c.
This code desperately needs some micro-optimization, and I'd like it
to be formatted a bit more nicely while I work on it.
2016-04-09 13:33:33 -04:00
Kevin Grittner
848ef42bb8 Add the "snapshot too old" feature
This feature is controlled by a new old_snapshot_threshold GUC.  A
value of -1 disables the feature, and that is the default.  The
value of 0 is just intended for testing.  Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect.  The xmin associated with a transaction ID does
still protect dead tuples.  A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.

This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
2016-04-08 14:36:30 -05:00
Kevin Grittner
8b65cf4c5e Modify BufferGetPage() to prepare for "snapshot too old" feature
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in.  It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point.  This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third.  The follow-on patch will change the places where the test
needs to be made.
2016-04-08 14:30:10 -05:00
Teodor Sigaev
8b99edefca Revert CREATE INDEX ... INCLUDING ...
It's not ready yet, revert two commits
690c543550 - unstable test output
386e3d7609 - patch itself
2016-04-08 21:52:13 +03:00
Teodor Sigaev
386e3d7609 CREATE INDEX ... INCLUDING (column[, ...])
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.

Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
2016-04-08 19:45:59 +03:00
Andres Freund
5364b357fb Increase maximum number of clog buffers.
Benchmarking has shown that the current number of clog buffers limits
scalability. We've previously increased the number in 33aaa139, but
that's not sufficient with a large number of clients.

We've benchmarked the cost of increasing the limit by benchmarking worst
case scenarios; testing showed that 128 buffers don't cause a
regression, even in contrived scenarios, whereas 256 does

There are a number of more complex patches flying around to address
various clog scalability problems, but this is simple enough that we can
get it into 9.6; and is beneficial even after those patches have been
applied.

It is a bit unsatisfactory to increase this in small steps every few
releases, but a better solution seems to require a rewrite of slru.c;
not something done quickly.

Author: Amit Kapila and Andres Freund
Discussion: CAA4eK1+-=18HOrdqtLXqOMwZDbC_15WTyHiFruz7BvVArZPaAw@mail.gmail.com
2016-04-08 08:25:59 -07:00
Robert Haas
25fe8b5f1a Add a 'parallel_degree' reloption.
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.

Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me.  Some further hacking by me.
2016-04-08 11:14:56 -04:00
Robert Haas
719c84c1be Extend relations multiple blocks at a time to improve scalability.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate.  Experimentation shows the extending the relation
multiple blocks at a time improves scalability.

Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
2016-04-08 02:04:46 -04:00
Kevin Grittner
fcff8a5751 Detect SSI conflicts before reporting constraint violations
While prior to this patch the user-visible effect on the database
of any set of successfully committed serializable transactions was
always consistent with some one-at-a-time order of execution of
those transactions, the presence of declarative constraints could
allow errors to occur which were not possible in any such ordering,
and developers had no good workarounds to prevent user-facing
errors where they were not necessary or desired.  This patch adds
a check for serialization failure ahead of duplicate key checking
so that if a developer explicitly (redundantly) checks for the
pre-existing value they will get the desired serialization failure
where the problem is caused by a concurrent serializable
transaction; otherwise they will get a duplicate key error.

While it would be better if the reads performed by the constraints
could count as part of the work of the transaction for
serialization failure checking, and we will hopefully get there
some day, this patch allows a clean and reliable way for developers
to work around the issue.  In many cases existing code will already
be doing the right thing for this to "just work".

Author: Thomas Munro, with minor editing of docs by me
Reviewed-by: Marko Tiikkaja, Kevin Grittner
2016-04-07 11:12:35 -05:00
Stephen Frost
1574783b4c Use GRANT system to manage access to sensitive functions
Now that pg_dump will properly dump out any ACL changes made to
functions which exist in pg_catalog, switch to using the GRANT system
to manage access to those functions.

This means removing 'if (!superuser()) ereport()' checks from the
functions themselves and then REVOKEing EXECUTE right from 'public' for
these functions in system_views.sql.

Reviews by Alexander Korotkov, Jose Luis Tallon
2016-04-06 21:45:32 -04:00
Simon Riggs
3fe3511d05 Generic Messages for Logical Decoding
API and mechanism to allow generic messages to be inserted into WAL that are
intended to be read by logical decoding plugins. This commit adds an optional
new callback to the logical decoding API.

Messages are either text or bytea. Messages can be transactional, or not, and
are identified by a prefix to allow multiple concurrent decoding plugins.

(Not to be confused with Generic WAL records, which are intended to allow crash
recovery of extensible objects.)

Author: Petr Jelinek and Andres Freund
Reviewers: Artur Zakirov, Tomas Vondra, Simon Riggs
Discussion: 5685F999.6010202@2ndquadrant.com
2016-04-06 10:05:41 +01:00
Magnus Hagander
7117685461 Implement backup API functions for non-exclusive backups
Previously non-exclusive backups had to be done using the replication protocol
and pg_basebackup. With this commit it's now possible to make them using
pg_start_backup/pg_stop_backup as well, as long as the backup program can
maintain a persistent connection to the database.

Doing this, backup_label and tablespace_map are returned as results from
pg_stop_backup() instead of being written to the data directory. This makes
the server safe from a crash during an ongoing backup, which can be a problem
with exclusive backups.

The old syntax of the functions remain and work exactly as before, but since the
new syntax is safer this should eventually be deprecated and removed.

Only reference documentation is included. The main section on backup still needs
to be rewritten to cover this, but since that is already scheduled for a separate
large rewrite, it's not included in this patch.

Reviewed by David Steele and Amit Kapila
2016-04-05 20:03:49 +02:00
Alvaro Herrera
890614d2b3 Display WAL pointer in rm_redo error callback
This makes it easier to identify the source of a recovery problem
in case of a bug or data corruption.
2016-04-04 18:12:12 -03:00
Alvaro Herrera
c9ff752a85 Silence compiler warning
Reported by Peter Eisentraut to occur on 32bit systems
2016-04-04 17:07:23 -03:00
Simon Riggs
3e4b7d8798 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the pin scan that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin
scan is skipped only in replay; the VACUUM code path on master is not
touched here and WAL is identical.

The current commit applies in all cases, effectively replacing commit
687f2cd7a0.
2016-04-03 17:46:09 +01:00
Teodor Sigaev
65578341af Add Generic WAL interface
This interface is designed to give an access to WAL for extensions which
could implement new access method, for example. Previously it was
impossible because restoring from custom WAL would need to access system
catalog to find a redo custom function. This patch suggests generic way
to describe changes on page with standart layout.

Bump XLOG_PAGE_MAGIC because of new record type.

Author: Alexander Korotkov with a help of Petr Jelinek, Markus Nullmeier and
	minor editorization by my
Reviewers: Petr Jelinek, Alvaro Herrera, Teodor Sigaev, Jim Nasby,
	Michael Paquier
2016-04-01 12:21:48 +03:00
Alvaro Herrera
24c5f1a103 Enable logical slots to follow timeline switches
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline.  This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
2016-03-30 20:07:05 -03:00
Alvaro Herrera
3b02ea4f07 XLogReader general code cleanup
Some minor tweaks and comment additions, for cleanliness sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.

Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
2016-03-30 18:56:13 -03:00
Teodor Sigaev
ccd6eb49a4 Introduce traversalValue for SP-GiST scan
During scan sometimes it would be very helpful to know some information about
parent node or all 	ancestor nodes. Right now reconstructedValue could be used
but it's not a right usage of it (range opclass uses that).

traversalValue is arbitrary piece of memory in separate MemoryContext while
reconstructedVale should have the same type as indexed column.

Subsequent patches for range opclass and quad4d tree will use it.

Author: Alexander Lebedev, Teodor Sigaev
2016-03-30 18:29:28 +03:00
Robert Haas
314cbfc5da Add new replication mode synchronous_commit = 'remote_apply'.
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk.  That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.

To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby.  But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.

Thomas Munro, reviewed by Michael Paquier and by me.  Some additional
tweaks by me.
2016-03-29 21:29:49 -04:00
Andres Freund
1a7a43672b Don't use !! but != 0/NULL to force boolean evaluation.
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.

Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
2016-03-27 18:10:19 +02:00
Alvaro Herrera
473b932870 Support CREATE ACCESS METHOD
This enables external code to create access methods.  This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly.  Also, having
explicit support makes pg_dump work correctly.

Currently only index AMs are supported, but we expect different types to
be added in the future.

Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
2016-03-23 23:01:35 -03:00
Peter Eisentraut
b555ed8102 Merge wal_level "archive" and "hot_standby" into new name "replica"
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
2016-03-18 23:56:03 +01:00
Robert Haas
3aff33aa68 Fix typos.
Oskari Saarenmaa
2016-03-15 18:06:11 -04:00
Alvaro Herrera
5bcc413f80 Fix typos in comments 2016-03-15 17:57:17 -03:00
Tom Lane
ab4ff2889d Fix memory leak in repeated GIN index searches.
Commit d88976cfa1 removed this code from ginFreeScanKeys():
-		if (entry->list)
-			pfree(entry->list);
evidently in the belief that that ItemPointer array is allocated in the
keyCtx and so would be reclaimed by the following MemoryContextReset.
Unfortunately, it isn't and it won't.  It'd likely be a good idea for
that to become so, but as a simple and back-patchable fix in the
meantime, restore this code to ginFreeScanKeys().

Also, add a similar pfree to where startScanEntry() is about to zero out
entry->list.  I am not sure if there are any code paths where this
change prevents a leak today, but it seems like cheap future-proofing.

In passing, make the initial allocation of so->entries[] use palloc
not palloc0.  The code doesn't depend on unused entries being zero;
if it did, the array-enlargement code in ginFillScanEntry() would be
wrong.  So using palloc0 initially can only serve to confuse readers
about what the invariant is.

Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in
<CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>
2016-03-13 16:44:31 -04:00
Robert Haas
481c76abf4 Fix a typo, and remove unnecessary pgstat_report_wait_end().
Per Amit Kapila.
2016-03-11 07:34:00 -05:00
Robert Haas
53be0b1add Provide much better wait information in pg_stat_activity.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting.  Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.

Amit Kapila, Ildus Kurbangaliev, reviewed by me.  Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
2016-03-10 12:44:09 -05:00
Simon Riggs
e0694cf9c7 Reduce size of two phase file header
Previously 2PC header was fixed at 200 bytes, which in most cases wasted
WAL space for a workload using 2PC heavily.

Pavan Deolasee, reviewed by Petr Jelinek
2016-03-10 12:51:46 +00:00
Simon Riggs
fcb4bfddb6 Reduce lock level for altering fillfactor
Fabrízio de Royes Mello and Simon Riggs
2016-03-10 12:07:33 +00:00
Andres Freund
1d4a0ab19a Avoid unlikely data-loss scenarios due to rename() without fsync.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.

Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability.  The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.

Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
2016-03-09 18:53:53 -08:00
Tom Lane
a298a1e06f Fix incorrect handling of NULL index entries in indexed ROW() comparisons.
An index search using a row comparison such as ROW(a, b) > ROW('x', 'y')
would stop upon reaching a NULL entry in the "b" column, ignoring the
fact that there might be non-NULL "b" values associated with later values
of "a".  This happens because _bt_mark_scankey_required() marks the
subsidiary scankey for "b" as required, which is just wrong: it's for
a column after the one with the first inequality key (namely "a"), and
thus can't be considered a required match.

This bit of brain fade dates back to the very beginnings of our support
for indexed ROW() comparisons, in 2006.  Kind of astonishing that no one
came across it before Glen Takahashi, in bug #14010.

Back-patch to all supported versions.

Note: the given test case doesn't actually fail in unpatched 9.1, evidently
because the fix for bug #6278 (i.e., stopping at nulls in either scan
direction) is required to make it fail.  I'm sure I could devise a case
that fails in 9.1 as well, perhaps with something involving making a cursor
back up; but it doesn't seem worth the trouble.
2016-03-09 14:51:22 -05:00
Robert Haas
b6fb6471f6 Add a generic command progress reporting facility.
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.

As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else.  A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting.  But this gets the basic framework in place.

Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
2016-03-09 12:08:58 -05:00
Robert Haas
734f86d50d Add new flags argument for xl_heap_visible to heap2_desc.
Masahiko Sawada
2016-03-08 13:28:22 -05:00
Robert Haas
77a1d1e798 Department of second thoughts: remove PD_ALL_FROZEN.
Commit a892234f83 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set.  That no longer seems like a
clever plan, because we don't really need that bit for anything.  We
always clear both bits when the page is modified anyway.

Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
2016-03-08 08:46:48 -05:00
Fujii Masao
d34794f7d5 Ignore recovery_min_apply_delay until recovery has reached consistent state
Previously recovery_min_apply_delay was applied even before recovery
had reached consistency. This could cause us to wait a long time
unexpectedly for read-only connections to be allowed. It's problematic
because the standby was useless during that wait time.

This patch changes recovery_min_apply_delay so that it's applied once
the database has reached the consistent state. That is, even if the delay
is set, the standby tries to replay WAL records as fast as possible until
it has reached consistency.

Author: Michael Paquier
Reviewed-By: Julien Rouhaud
Reported-By: Greg Clough
Backpatch: 9.4, where recovery_min_apply_delay was added
Bug: #13770
Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org
2016-03-06 02:29:04 +09:00
Robert Haas
6fcde8a5c8 Minor improvements to transaction manager README.
A simple SELECT is handled by PortalRunSelect, not ProcessQuery.  Also,
the previous indentation was unclear: change it so that a deeper level
of indentation indicates that the outer function calls the inner one.

Stas Kelvich
2016-03-04 14:12:28 -05:00
Robert Haas
df4685fb0c Minor optimizations based on ParallelContext having nworkers_launched.
Originally, we didn't have nworkers_launched, so code that used parallel
contexts had to be preprared for the possibility that not all of the
workers requested actually got launched.  But now we can count on knowing
the number of workers that were successfully launched, which can shave
off a few cycles and simplify some code slightly.

Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter
Geoghegan.
2016-03-04 12:59:10 -05:00
Simon Riggs
c7111d11b1 Revert buggy optimization of index scans
606c0123d6 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.

Revert whole patch until we find a safe optimization.
2016-03-03 09:53:43 +00:00
Robert Haas
a892234f83 Change the format of the VM fork to add a second bit per page.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.

A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations.  A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be.  This commit does not attempt
that optimization, although that optimization is the goal here.  It
seems better to get the basic infrastructure in place first.

Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format.  That,
too, will be handled in a follow-on patch.

Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
2016-03-01 21:49:41 -05:00
Simon Riggs
481725c0ba Correct StartupSUBTRANS for page wraparound
StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans
data structure, which in some cases could lead to errors in startup for Hot
Standby.
This patch wraps the pageids correctly, avoiding any such errors.
Identified by exhaustive crash testing by Jeff Janes.

Jeff Janes
2016-02-19 08:31:12 +00:00
Andres Freund
7975c5e0a9 Allow the WAL writer to flush WAL at a reduced rate.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al.  But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4.  The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.

This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.

To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.

Discussion: 20160118163908.GW10941@awork2.anarazel.de
2016-02-16 00:56:34 +01:00
Joe Conway
59a884e985 Change delimiter used for display of NextXID
NextXID has been rendered in the form of a pg_lsn even though it
really is not. This can cause confusion, so change the format from
%u/%u to %u:%u, per discussion on hackers.

Complaint by me, patch by me and Bruce, reviewed by Michael Paquier
and Alvaro. Applied to HEAD only.

Author: Joe Conway, Bruce Momjian
Reviewed-by: Michael Paquier, Alvaro Herrera
Backpatch-through: master
2016-02-12 14:23:59 -08:00
Robert Haas
63461a63f9 Make builtin lwlock tranche names consistent.
Previously, we had a mix of styles.

Amit Kapila
2016-02-12 08:07:11 -05:00
Tom Lane
d18643c4a6 Shift the responsibility for emitting "database system is shut down".
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc.  Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit.  That's fairly confusing, both for testing
purposes and for DBAs.  Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile.  That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example.  In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271 for instance.  So this seems
like a reasonably future-proof approach.
2016-02-11 14:14:22 -05:00
Tom Lane
c5e9b77127 Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier."
This reverts commit 3971f64843 and a
couple of followon debugging commits; I think we've learned what we can
from them.
2016-02-10 16:01:04 -05:00
Tom Lane
7351e18286 Add more chattiness in server shutdown.
Early returns from the buildfarm show that there's a bit of a gap in the
logging I added in 3971f64843: the portion of CreateCheckPoint()
after CheckPointGuts() can take a fair amount of time.  Add a few more
log messages in that section of code.  This too shall be reverted later.
2016-02-09 11:21:46 -05:00
Tom Lane
3971f64843 Temporarily make pg_ctl and server shutdown a whole lot chattier.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints.  Maybe they
are just really overloaded, but maybe something else is going on.  Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.

No attempt has been made to make this pretty.  I'm not even totally sure
if it will build on Windows, but we'll soon find out.
2016-02-08 18:43:11 -05:00
Robert Haas
7c944bd903 Introduce a new GUC force_parallel_mode for testing purposes.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe.  For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process.  When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.

Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.

Robert Haas, with help from Amit Kapila and Rushabh Lathia.
2016-02-07 11:41:33 -05:00
Robert Haas
a1c1af2a1f Introduce group locking to prevent parallel processes from deadlocking.
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes.  There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach.  Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.

Robert Haas, with help from Amit Kapila.  Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
2016-02-07 10:16:13 -05:00
Teodor Sigaev
f25d07d99f Fix lossy KNN GiST when ordering operator returns non-float8 value.
KNN GiST with recheck flag should return to executor the same type as ordering
operator, GiST detects this type by looking to return type of function which
implements ordering operator. But occasionally detecting code works after
replacing ordering operator function to distance support function.
Distance support function always returns float8, so, detecting code get float8
instead of actual return type of ordering operator.

Built-in opclasses don't have ordering operator which doesn't return
non-float8 value, so, tests are impossible here, at least now.

Backpatch to 9.5 where lozzy KNN was introduced.

Author: Alexander Korotkov
Report by: Artur Zakirov
2016-02-02 15:20:33 +03:00
Robert Haas
7191ce8bea Make all built-in lwlock tranche IDs fixed.
This makes the values more stable, which seems like a good thing for
anybody who needs to look at at them.

Alexander Korotkov and Amit Kapila
2016-02-02 06:45:55 -05:00
Heikki Linnakangas
61ce1e8f15 Fix misspelled function name in comment. 2016-02-01 10:10:24 +02:00
Peter Eisentraut
9217bf3961 Fix whitespace 2016-01-30 15:58:20 -05:00
Fujii Masao
7f46eaf035 Add gin_clean_pending_list function to clean up GIN pending list
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.

This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
the search using GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.

Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.

Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
2016-01-28 12:57:52 +09:00
Tom Lane
cc988fbb0b Improve ResourceOwners' behavior for large numbers of owned objects.
The original coding was quite fast so long as objects were always
released in reverse order of addition; otherwise, it degenerated into
O(N^2) behavior due to searching for the array element to delete.
Improve matters by switching to hashed storage when the number of
objects of a given type exceeds 64.  (The cutover point is open to
discussion, of course, but some simple performance testing suggests
that hashing has enough overhead to be a loser below there.)

Also, refactor resowner.c so that we don't need N copies of the array
management code.  Since all the resource IDs the code currently needs
to deal with are either pointers or integers, it seems sufficient to
create a one-size-fits-all infrastructure in which everything is
converted to a Datum for storage.

Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me
2016-01-26 15:20:30 -05:00
Tom Lane
d9b9289c83 Suppress compiler warning.
Given the limited range of i, these shifts should not cause any
problem, but that apparently doesn't stop some compilers from
whining about them.

David Rowley
2016-01-21 21:14:07 -05:00
Tom Lane
be44ed27b8 Improve index AMs' opclass validation procedures.
The amvalidate functions added in commit 65c5fcd353 were on the
crude side.  Improve them in a few ways:

* Perform signature checking for operators and support functions.

* Apply more thorough checks for missing operators and functions,
where possible.

* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE.  This allows
more than one problem to be discovered per run.

* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.

Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
2016-01-21 19:47:15 -05:00
Fujii Masao
38710a374e Remove unused argument from ginInsertCleanup()
It's an oversight in commit dc943ad.
2016-01-22 01:22:56 +09:00
Simon Riggs
c80b31d557 Refactor headers to split out standby defs
Jeff Janes
2016-01-20 18:51:34 -08:00
Simon Riggs
978b2f65aa Speedup 2PC by skipping two phase state files in normal path
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.

Stas Kelvich and heavily edited by Simon Riggs

Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.

Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
2016-01-20 18:40:44 -08:00
Simon Riggs
422a55a687 Refactor to create generic WAL page read callback
Previously we didn’t have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.

As requested by Michael Paquier, Alvaro Herrera and Andres Freund
2016-01-20 17:18:58 -08:00
Tom Lane
9ff60273e3 Fix assorted inconsistencies in GiST opclass support function declarations.
The conventions specified by the GiST SGML documentation were widely
ignored.  For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum.  None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures.  So let's try to instill
some consistency here.

Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types).  Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same.  So I've specified a convention
that that's what to do always.

Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that.  Standardize
on telling the truth, instead.

Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".

Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.

Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.

So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion.  Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
2016-01-19 12:04:36 -05:00
Tom Lane
65c5fcd353 Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function.  All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function.  This is similar to
the designs we've adopted for FDWs and tablesample methods.  There
are multiple advantages.  For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.

A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL.  We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.

Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-17 19:36:59 -05:00
Simon Riggs
e63bb4549a Add new user fn pg_current_xlog_flush_location()
Tomas Vondra, reviewed by Michael Paquier and Amit Kapila
Minor edits by me
2016-01-12 07:54:52 +00:00
Simon Riggs
1e29e6324c Maintain local LogwrtResult consistently
Teach GetFlushRecPtr() to update LogwrtResult cache as performed by all other
functions in xlog.c
2016-01-12 07:33:20 +00:00
Simon Riggs
b602842613 Revoke change to rmgr desc of btree vacuum
Per discussion with Andres Freund
2016-01-09 18:31:08 +00:00
Simon Riggs
687f2cd7a0 Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
2016-01-09 10:10:08 +00:00
Tom Lane
7157fe80f4 Fix overly-strict assertions in spgtextproc.c.
spg_text_inner_consistent is capable of reconstructing an empty string
to pass down to the next index level; this happens if we have an empty
string coming in, no prefix, and a dummy node label.  (In practice, what
is needed to trigger that is insertion of a whole bunch of empty-string
values.)  Then, we will arrive at the next level with in->level == 0
and a non-NULL (but zero length) in->reconstructedValue, which is valid
but the Assert tests weren't expecting it.

Per report from Andreas Seltenreich.  This has no impact in non-Assert
builds, so should not be a problem in production, but back-patch to
all affected branches anyway.

In passing, remove a couple of useless variable initializations and
shorten the code by not duplicating DatumGetPointer() calls.
2016-01-02 16:24:50 -05:00
Bruce Momjian
ee94300446 Update copyright for 2016
Backpatch certain files through 9.1
2016-01-02 13:33:40 -05:00
Noah Misch
3cd1ba147e Fix comments about WAL rule "write xlog before data" versus pg_multixact.
Recovery does not achieve its goal of zeroing all pg_multixact entries
whose accompanying WAL records never reached disk.  Remove that claim
and justify its expendability.  Detail the need for TrimMultiXact(),
which has little in common with the TrimCLOG() rationale.  Merge two
tightly-related comments.  Stop presenting pg_multixact as specific to
heap_lock_tuple(); PostgreSQL 9.3 extended its use to heap_update().

Noticed while investigating a report from Andres Freund.
2016-01-01 01:46:46 -05:00
Joe Conway
241448b23a Rename (new|old)estCommitTs to (new|old)estCommitTsXid
The variables newestCommitTs and oldestCommitTs sound as if they are
timestamps, but in fact they are the transaction Ids that correspond
to the newest and oldest timestamps rather than the actual timestamps.
Rename these variables to reflect that they are actually xids: to wit
newestCommitTsXid and oldestCommitTsXid respectively. Also modify
related code in a similar fashion, particularly the user facing output
emitted by pg_controldata and pg_resetxlog.

Complaint and patch by me, review by Tom Lane and Alvaro Herrera.
Backpatch to 9.5 where these variables were first introduced.
2015-12-28 12:34:11 -08:00
Tom Lane
3d2b31e30e Fix brin_summarize_new_values() to check index type and ownership.
brin_summarize_new_values() did not check that the passed OID was for
an index at all, much less that it was a BRIN index, and would fail in
obscure ways if it wasn't (possibly damaging data first?).  It also
lacked any permissions test; by analogy to VACUUM, we should only allow
the table's owner to summarize.

Noted by Jeff Janes, fix by Michael Paquier and me
2015-12-26 12:56:09 -05:00
Robert Haas
9a51698bae Fix typo in comment.
Amit Langote
2015-12-18 12:03:15 -05:00
Robert Haas
3fed417452 Provide a way to predefine LWLock tranche IDs.
It's a bit cumbersome to use LWLockNewTrancheId(), because the returned
value needs to be shared between backends so that each backend can call
LWLockRegisterTranche() with the correct ID.  So, for built-in tranches,
use a hard-coded value instead.

This is motivated by an upcoming patch adding further built-in tranches.

Andres Freund and Robert Haas
2015-12-15 11:48:19 -05:00
Andres Freund
cca705a5d9 Fix bug in SetOffsetVacuumLimit() triggered by find_multixact_start() failure.
Previously, if find_multixact_start() failed, SetOffsetVacuumLimit() would
install 0 into MultiXactState->offsetStopLimit if it previously succeeded.
Luckily, there are no known cases where find_multixact_start() will return
an error in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad: GetNewMultiXactId()
could continue allocating mxids even if close to a wraparound, or it could
erroneously stop allocating mxids, even if no wraparound is looming.  The
wrong value would be corrected the next time SetOffsetVacuumLimit() is
called, or by a restart.

Reported-By: Noah Misch, although this is not his preferred fix
Discussion: 20151210140450.GA22278@alap3.anarazel.de
Backpatch: 9.5, where the bug was introduced as part of 4f627f
2015-12-14 11:34:16 +01:00
Alvaro Herrera
69e7235c93 Fix commit timestamp initialization
This module needs explicit initialization in order to replay WAL records
in recovery, but we had broken this recently following changes to make
other (stranger) scenarios work correctly.  To fix, rework the
initialization sequence so that it always takes place before WAL replay
commences for both master and standby.

I could have gone for a more localized fix that just added a "startup"
call for the master server, but it seemed better to restructure the
existing callers as well so that the whole thing made more sense.  As a
drawback, there is more control logic in xlog.c now than previously, but
doing otherwise meant passing down the ControlFile flag, which seemed
uglier as a whole.

This also meant adding a check to not re-execute ActivateCommitTs if it
had already been called.

Reported by Fujii Masao.

Backpatch to 9.5.
2015-12-11 14:30:43 -03:00
Peter Eisentraut
a351705d8a Improve some messages 2015-12-10 22:05:27 -05:00
Andres Freund
e3f4cfc7aa Fix bug leading to restoring unlogged relations from empty files.
At the end of crash recovery, unlogged relations are reset to the empty
state, using their init fork as the template. The init fork is copied to
the main fork without going through shared buffers. Unfortunately WAL
replay so far has not necessarily flushed writes from shared buffers to
disk at that point. In normal crash recovery, and before the
introduction of 'fast promotions' in fd4ced523 / 9.3, the
END_OF_RECOVERY checkpoint flushes the buffers out in time. But with
fast promotions that's not the case anymore.

To fix, force WAL writes targeting the init fork to be flushed
immediately (using the new FlushOneBuffer() function). In 9.5+ that
flush can centrally be triggered from the code dealing with restoring
full page writes (XLogReadBufferForRedoExtended), in earlier releases
that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay
function.

Backpatch to 9.1, even if this currently is only known to trigger in
9.3+. Flushing earlier is more robust, and it is advantageous to keep
the branches similar.

Typical symptoms of this bug are errors like
'ERROR:  index "..." contains unexpected zero page at block 0'
shortly after promoting a node.

Reported-By: Thom Brown
Author: Andres Freund and Michael Paquier
Discussion: 20150326175024.GJ451@alap3.anarazel.de
Backpatch: 9.1-
2015-12-10 16:29:26 +01:00
Alvaro Herrera
820ddb2c2f Further tweak commit_timestamp behavior
As pointed out by Fujii Masao, we weren't quite there on a standby
behaving sanely: first because we were failing to acquire the correct
state in the case where no XLOG_PARAMETER_CHANGE message was sent
(because a checkpoint had already happened after the setting was changed
in the master, and then the standby was restarted); and second because
promoting the standby with the feature enabled failed to activate it if
the master had the feature disabled.

This patch fixes both those misbehaviors hopefully without
re-introducing any old problems.

Also change the hint emitted in a standby together with the error
message about the feature being disabled, to make it point out that the
place to chance the setting is the master.  Otherwise, if the setting is
already enabled in the standby, it is very confusing to have it say that
the setting must be enabled ...

Authors: Álvaro Herrera, Petr Jelínek.
Backpatch to 9.5.
2015-12-03 19:22:31 -03:00
Robert Haas
74d0d5f3eb Fix typo in comment.
Amit Langote
2015-11-19 16:45:39 -05:00
Andres Freund
d3c8ac114f Remove function names from some elog() calls in heapam.c.
At least one of the names was, due to a function renaming late in the
development of ON CONFLICT, wrong. Since including function names in
error messages is against the message style guide anyway, remove them
from the messages.

Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com
Backpatch: 9.5, where ON CONFLICT was introduced
2015-11-19 01:37:58 +01:00
Peter Eisentraut
5db837d3f2 Message improvements 2015-11-16 21:39:23 -05:00
Robert Haas
fe702a7b3f Move each SLRU's lwlocks to a separate tranche.
This makes it significantly easier to identify these lwlocks in
LWLOCK_STATS or Trace_lwlocks output.  It's also arguably better
from a modularity standpoint, since lwlock.c no longer needs to
know anything about the LWLock needs of the higher-level SLRU
facility.

Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.
2015-11-12 14:59:09 -05:00
Robert Haas
64b2e7ad91 Pass extra data to bgworkers, and use this to fix parallel contexts.
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems.  That's enough to pass a dsm_handle or an
array index, but not much else.  Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.

Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.

Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition.  However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5.  Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.

Report and patch by me.  Review by Amit Kapila.
2015-11-05 12:13:56 -05:00
Kevin Grittner
585e2a3b1a Fix serialization anomalies due to race conditions on INSERT.
On insert the CheckForSerializableConflictIn() test was performed
before the page(s) which were going to be modified had been locked
(with an exclusive buffer content lock).  If another process
acquired a relation SIReadLock on the heap and scanned to a page on
which an insert was going to occur before the page was so locked,
a rw-conflict would be missed, which could allow a serialization
anomaly to be missed.  The window between the check and the page
lock was small, so the bug was generally not noticed unless there
was high concurrency with multiple processes inserting into the
same table.

This was reported by Peter Bailis as bug #11732, by Sean Chittenden
as bug #13667, and by others.

The race condition was eliminated in heap_insert() by moving the
check down below the acquisition of the buffer lock, which had been
the very next statement.  Because of the loop locking and unlocking
multiple buffers in heap_multi_insert() a check was added after all
inserts were completed.  The check before the start of the inserts
was left because it might avoid a large amount of work to detect a
serialization anomaly before performing the all of the inserts and
the related WAL logging.

While investigating this bug, other SSI bugs which were even harder
to hit in practice were noticed and fixed, an unnecessary check
(covered by another check, so redundant) was removed from
heap_update(), and comments were improved.

Back-patch to all supported branches.

Kevin Grittner and Thomas Munro
2015-10-31 14:43:34 -05:00
Robert Haas
3a1f8611f2 Update parallel executor support to reuse the same DSM.
Commit b0b0d84b3d purported to make it
possible to relaunch workers using the same parallel context, but it had
an unpleasant race condition: we might reinitialize after the workers
have sent their last control message but before they have dettached the
DSM, leaving to crashes.  Repair by introducing a new ParallelContext
operation, ReinitializeParallelDSM.

Adjust execParallel.c to use this new support, so that we can rescan a
Gather node by relaunching workers but without needing to recreate the
DSM.

Amit Kapila, with some adjustments by me.  Extracted from latest parallel
sequential scan patch.
2015-10-30 10:44:54 +01:00
Peter Eisentraut
a8d585c091 Message style improvements
Message style, plurals, quoting, spelling, consistency with similar
messages
2015-10-28 20:38:36 -04:00
Alvaro Herrera
21a4e4a4c9 Fix BRIN free space computations
A bug in the original free space computation made it possible to
return a page which wasn't actually able to fit the item.  Since the
insertion code isn't prepared to deal with PageAddItem failing, a PANIC
resulted ("failed to add BRIN tuple [to new page]").  Add a macro to
encapsulate the correct computation, and use it in
brin_getinsertbuffer's callers before calling that routine, to raise an
early error.

I became aware of the possiblity of a problem in this area while working
on ccc4c07499.  There's no archived discussion about it, but it's
easy to reproduce a problem in the unpatched code with something like

CREATE TABLE t (a text);
CREATE INDEX ti ON t USING brin (a) WITH (pages_per_range=1);

for length in `seq 8000 8196`
do
	psql -f - <<EOF
TRUNCATE TABLE t;
INSERT INTO t VALUES ('z'), (repeat('a', $length));
EOF
done

Backpatch to 9.5, where BRIN was introduced.
2015-10-27 18:17:55 -03:00
Alvaro Herrera
531d21b75f Cleanup commit timestamp module activaction, again
Further tweak commit_ts.c so that on a standby the state is completely
consistent with what that in the master, rather than behaving
differently in the cases that the settings differ.  Now in standby and
master the module should always be active or inactive in lockstep.

Author: Petr Jelínek, with some further tweaks by Álvaro Herrera.

Backpatch to 9.5, where commit timestamps were introduced.

Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com
2015-10-27 15:06:50 -03:00
Robert Haas
31ba62ce32 Fix typos in comments.
CharSyam
2015-10-22 14:52:23 -04:00
Robert Haas
ee7ca559fc Add a C API for parallel heap scans.
Using this API, one backend can set up a ParallelHeapScanDesc to
which multiple backends can then attach.  Each tuple in the relation
will be returned to exactly one of the scanning backends.  Only
forward scans are supported, and rescans must be carefully
coordinated.

This is not exposed to the planner or executor yet.

The original version of this code was written by me.  Amit Kapila
reviewed it, tested it, and improved it, including adding support for
synchronized scans, per review comments from Jeff Davis.  Extensive
testing of this and related patches was performed by Haribabu Kommi.
Final cleanup of this patch by me.
2015-10-16 17:33:18 -04:00
Robert Haas
b0b0d84b3d Allow a parallel context to relaunch workers.
This may allow some callers to avoid the overhead involved in tearing
down a parallel context and then setting up a new one, which means
releasing the DSM and then allocating and populating a new one.  I
suspect we'll want to revise the Gather node to make use of this new
capability, but even if not it may be useful elsewhere and requires
very little additional code.
2015-10-16 17:18:05 -04:00
Robert Haas
a53c06a13e Prohibit parallel query when the isolation level is serializable.
In order for this to be safe, the code which hands true serializability
will need to taught that the SIRead locks taken by a parallel worker
pertain to the same transaction as those taken by the parallel leader.
Some further changes may be needed as well.  Until the necessary
adaptations are made, don't generate parallel plans in serializable
mode, and if a previously-generated parallel plan is used after
serializable mode has been activated, run it serially.

This fixes a bug in commit 7aea8e4f2d.
2015-10-16 11:58:27 -04:00
Robert Haas
82b37765c7 Fix a problem with parallel workers being unable to restore role.
check_role() tries to verify that the user has permission to become the
requested role, but this is inappropriate in a parallel worker, which
needs to exactly recreate the master's authorization settings.  So skip
the check in that case.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:37:19 -04:00
Robert Haas
6de6d96d97 Invalidate caches after cranking up a parallel worker transaction.
Starting a parallel worker transaction changes our notion of which XIDs
are in-progress or committed, and our notion of the current command
counter ID.  Therefore, our view of these caches prior to starting
this transaction may no longer valid.  Defend against that by clearing
them.

This fixes a bug in commit 924bcf4f16.
2015-10-16 11:31:23 -04:00
Robert Haas
94b4f7e2a6 Tighten up application of parallel mode checks.
Commit 924bcf4f16 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot.  Re-establish parallel mode during
parallel worker commit.  Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.

Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
2015-10-16 09:59:57 -04:00
Robert Haas
423ec0877f Transfer current command counter ID to parallel workers.
Commit 924bcf4f16 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers.  This can result in the workers seeing
a different set of tuples from the leader, which is bad.  Repair.
2015-10-16 09:58:42 -04:00
Robert Haas
2ad5c27bb5 Don't send protocol messages to a shm_mq that no longer exists.
Commit 2bd9e412f9 introduced a mechanism
for relaying protocol messages from a background worker to another
backend via a shm_mq.  However, there was no provision for shutting
down the communication channel.  Therefore, a protocol message sent
late in the shutdown sequence, such as a DEBUG message resulting from
cranking up log_min_messages, could crash the server.  To fix, install
an on_dsm_detach callback that disables sending messages to the shm_mq
when the associated DSM is detached.
2015-10-16 09:42:33 -04:00
Andres Freund
2596d705bd Re-Align *_freeze_max_age reloption limits with corresponding GUC limits.
In 020235a575 I lowered the autovacuum_*freeze_max_age minimums to
allow for easier testing of wraparounds. I did not touch the
corresponding per-table limits. While those don't matter for the purpose
of wraparound, it seems more consistent to lower them as well.

It's noteworthy that the previous reloption lower limit for
autovacuum_multixact_freeze_max_age was too high by one magnitude, even
before 020235a575.

Discussion: 26377.1443105453@sss.pgh.pa.us
Backpatch: back to 9.0 (in parts), like the prior patch
2015-10-05 11:53:43 +02:00
Alvaro Herrera
e06b2e1d2e Don't disable commit_ts in standby if enabled locally
Bug noticed by Fujii Masao
2015-10-02 12:49:01 -03:00
Peter Eisentraut
87c2b517ac Fix message punctuation according to style guide 2015-10-01 21:39:56 -04:00
Alvaro Herrera
f12e814b88 Fix commit_ts for standby
Module initialization was still not completely correct after commit
6b61955135, per crash report from Takashi Ohnishi.  To fix, instead of
trying to monkey around with the value of the GUC setting directly, add
a separate boolean flag that enables the feature on a standby, but only
for the startup (recovery) process, when it sees that its master server
has the feature enabled.
Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp

Also change the deactivation routine to delete all segment files rather
than leaving the last one around.  (This doesn't need separate
WAL-logging, because on recovery we execute the same deactivation
routine anyway.)

In passing, clean up the code structure somewhat, particularly so that
xlog.c doesn't know so much about when to activate/deactivate the
feature.

Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion.

Back-patch to 9.5, where commit_ts was introduced.
2015-10-01 15:06:55 -03:00
Robert Haas
227d57f358 Don't dump core when destroying an unused ParallelContext.
If a transaction or subtransaction creates a ParallelContext but ends
without calling InitializeParallelDSM, the previous code would
seg fault.  Fix that.
2015-09-30 18:36:31 -04:00
Alvaro Herrera
6b61955135 Code review for transaction commit timestamps
There are three main changes here:

1. No longer cause a start failure in a standby if the feature is
disabled in postgresql.conf but enabled in the master.  This reverts one
part of commit 4f3924d9cd43; what we keep is the ability of the standby
to activate/deactivate the module (which includes creating and removing
segments as appropriate) during replay of such actions in the master.

2. Replay WAL records affecting commitTS even if the feature is
disabled.  This means the standby will always have the same state as the
master after replay.

3. Have COMMIT PREPARE record the transaction commit time as well.  We
were previously only applying it in the normal transaction commit path.

Author: Petr Jelínek
Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com

Additionally, I cleaned up nearby code related to replication origins,
which I found a bit hard to follow, and fixed a couple of typos.

Backpatch to 9.5, where this code was introduced.

Per bug reports from Fujii Masao and subsequent discussion.
2015-09-29 14:40:56 -03:00
Alvaro Herrera
17f5831c81 Fix "sesssion" typo
It was introduced alongside replication origins, by commit
5aa2350426, so backpatch to 9.5.

Pointed out by Fujii Masao
2015-09-28 19:13:42 -03:00
Andres Freund
aa29c1ccd9 Remove legacy multixact truncation support.
In 9.5 and master there is no need to support legacy truncation. This is
just committed separately to make it easier to backpatch the WAL logged
multixact truncation to 9.3 and 9.4 if we later decide to do so.

I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to
0xD087 to avoid 9.5 reusing a value that has been in use on master while
keeping the numbers increasing between major versions.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Backpatch: 9.5
2015-09-26 19:04:25 +02:00
Andres Freund
4f627f8973 Rework the way multixact truncations work.
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for multixact truncations.

This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
   we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
   this has gotten much easier

During the course of fixing this a bunch of additional bugs had to be
fixed:
1) Data was not purged from memory the member's SLRU before deleting
   segments. This happened to be hard or impossible to hit due to the
   interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Add code to flush the SLRUs to fix. Not pretty, but it feels
   slightly safer to only make decisions based on actual on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

For now this is going to only get applied to 9.5+, leaving the issues in
the older branches in place. It is quite possible that we need to
backpatch at a later point though.

For the case this gets backpatched we need to handle that an updated
standby may be replaying WAL from a not-yet upgraded primary. We have to
recognize that situation and use "old style" truncation (i.e. looking at
the SLRUs) during WAL replay. In contrast to before, this now happens in
the startup process, when replaying a checkpoint record, instead of the
checkpointer. Doing truncation in the restartpoint is incorrect, they
can happen much later than the original checkpoint, thereby leading to
wraparound.  To avoid "multixact_redo: unknown op code 48" errors
standbys would have to be upgraded before primaries.

A later patch will bump the WAL page magic, and remove the legacy
truncation codepaths. Legacy truncation support is just included to make
a possible future backpatch easier.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.5 for now
2015-09-26 19:04:25 +02:00
Teodor Sigaev
dc943ad952 Allow autoanalyze to add pages deleted from pending list to FSM
Commit e956808328 introduces adding pages
to FSM for ordinary insert, but autoanalyze was able just cleanup
pending list without adding to FSM.

Also fix double call of IndexFreeSpaceMapVacuum() during ginvacuumcleanup()

Report from Fujii Masao
Patch by me
Review by Jeff Janes
2015-09-23 15:33:51 +03:00
Peter Eisentraut
4a1e15e4a9 Add missing serial comma 2015-09-18 22:41:42 -04:00
Teodor Sigaev
22f519c92a Fix bug introduced by microvacuum for GiST
Commit 013ebc0a7b introduces microvacuum for
GiST, deletetion of tuple marked LP_DEAD uses IndexPageMultiDelete while
recovery code uses IndexPageTupleDelete in loop. This causes a difference
in offset numbers of tuples to delete. Patch introduces usage of
IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is
deleted at once. That also slightly improve performance, because
IndexPageMultiDelete is more effective.

Patch changes WAL format, so bump wal page magic.

Bug report from Jeff Janes
Diagnostic and patch by Anastasia Lubennikova and me
2015-09-17 14:22:37 +03:00
Fujii Masao
10fbb79f1a Improve log messages related to tablespace_map file
This patch changes the log message which is logged when the server
successfully renames backup_label file to *.old but fails to rename
tablespace_map file during the shutdown. Previously the WARNING
message "online backup mode was not canceled" was logged in that case.
However this message is confusing because the backup mode is treated
as canceled whenever backup_label is successfully renamed. So this
commit makes the server log the message "online backup mode canceled"
in that case.

Also this commit changes errdetail messages so that they follow the
error message style guide.

Back-patch to 9.5 where tablespace_map file is introduced.

Original patch by Amit Kapila, heavily modified by me.
2015-09-15 23:21:51 +09:00
Alvaro Herrera
5cd6538345 Add missing ReleaseBuffer call in BRIN revmap code
I think this particular branch is actually dead, but the analysis to
prove that is not trivial, so instead take the weasel way.

Reported by Jinyu Zhang
Backpatch to 9.5, where BRIN was introduced.
2015-09-11 15:29:46 -03:00
Teodor Sigaev
223936e226 Fix oversight in 013ebc0a7b commit
Declaration of varibale inside ÓÝ×Õ
2015-09-09 19:21:16 +03:00
Teodor Sigaev
013ebc0a7b Microvacuum for GIST
Mark index tuple as dead if it's pointed by kill_prior_tuple during
ordinary (search) scan and remove it during insert process if there is no
enough space for new tuple to insert. This improves select performance
because index will not return tuple marked as dead and improves insert
performance because it reduces number of page split.

Anastasia Lubennikova <a.lubennikova@postgrespro.ru> with
 minor editorialization by me
2015-09-09 18:43:37 +03:00
Fujii Masao
96f6a0cb41 Remove files signaling a standby promotion request at postmaster startup
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.

This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.

Back-patch to 9.1 where promote signal file was introduced.

Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.

Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
2015-09-09 22:51:44 +09:00
Alvaro Herrera
1aba62ec63 Allow per-tablespace effective_io_concurrency
Per discussion, nowadays it is possible to have tablespaces that have
wildly different I/O characteristics from others.  Setting different
effective_io_concurrency parameters for those has been measured to
improve performance.

Author: Julien Rouhaud
Reviewed by: Andres Freund
2015-09-08 12:51:42 -03:00
Teodor Sigaev
e26692248a Make GIN's cleanup pending list process interruptable
Cleanup process could be called by ordinary insert/update and could take a lot
of time. Add vacuum_delay_point() to make this process interruptable. Under
vacuum this call will also throttle a vacuum process to decrease system load,
called from insert/update it will not throttle, and that reduces a latency.

Backpatch for all supported branches.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 17:16:29 +03:00
Teodor Sigaev
e956808328 Add pages deleted from pending list to FSM
Add pages deleted from GIN's pending list during cleanup to free space map
immediately. Clean up process could be initiated by ordinary insert but adding
page to FSM might occur only at vacuum. On some workload like never-vacuumed
insert-only tables it could cause a huge bloat.

Jeff Janes <jeff.janes@gmail.com>
2015-09-07 16:24:01 +03:00
Heikki Linnakangas
c80b5f66c6 Fix misc typos.
Oskari Saarenmaa. Backpatch to stable branches where applicable.
2015-09-05 11:35:49 +03:00
Tatsuo Ishii
c39f5674df Fix brin index summarizing while vacuuming.
If the number of heap blocks is not multiples of pages per range, the
summarizing produces wrong summary information for the last brin index
tuple while vacuuming.

Problem reported by Tatsuo Ishii and fixed by Amit Langote.

Discussion at "[HACKERS] BRIN INDEX value (message id :20150903.174935.1946402199422994347.t-ishii@sraoss.co.jp)
Backpatched to 9.5 in which brin index was added.
2015-09-05 09:19:25 +09:00
Tom Lane
c5454f99c4 Fix subtransaction cleanup after an outer-subtransaction portal fails.
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort.  However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.

To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed.  (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)

In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact.  This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.

The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount.  That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache.  This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.

This has been broken since subtransactions were invented, so back-patch
to all supported branches.

Tom Lane and Michael Paquier
2015-09-04 13:37:14 -04:00
Fujii Masao
1ea5ce5c5f Document that max_worker_processes must be high enough in standby.
The setting values of some parameters including max_worker_processes
must be equal to or higher than the values on the master. However,
previously max_worker_processes was not listed as such parameter
in the document. So this commit adds it to that list.

Back-patch to 9.4 where max_worker_processes was added.
2015-09-03 22:30:16 +09:00
Teodor Sigaev
30bb26b5e0 Allow usage of huge maintenance_work_mem for GIN build.
Currently, in-memory posting list during GIN build process is limited 1GB
because of using repalloc. The patch replaces call of repalloc to repalloc_huge.
It increases limit of posting list from 180 millions
(1GB / sizeof(ItemPointerData)) to 4 billions limited by maxcount/count fields
in GinEntryAccumulator and subsequent calls. Check added.

Also, fix accounting of allocatedMemory during build to prevent integer
overflow with maintenance_work_mem > 4GB.

Robert Abraham <robert.abraham86@googlemail.com> with additions by me
2015-09-02 20:08:58 +03:00
Alvaro Herrera
e68be16b0d Do not allow *timestamp to be passed as NULL
The code had bugs that would cause crashes if NULL was passed as that
argument (originally intended to mean not to bother returning its
value), and after inspection it turns out that nothing seems interested
in the case that *ts is NULL anyway.  Therefore, remove the partial
checks intended to support that case.

Author: Michael Paquier
though I didn't include a proposed Assert.

Backpatch to 9.5.
2015-08-21 14:36:54 -03:00
Andres Freund
e95126cf04 Don't use function definitions looking like old-style ones.
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other functions definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
2015-08-15 17:25:00 +02:00
Simon Riggs
47167b7907 Reduce lock levels for ALTER TABLE SET autovacuum storage options
Reduce lock levels down to ShareUpdateExclusiveLock for all autovacuum-related
relation options when setting them using ALTER TABLE.

Add infrastructure to allow varying lock levels for relation options in later
patches. Setting multiple options together uses the highest lock level required
for any option. Works for both main and toast tables.

Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional regression
tests from myself
2015-08-14 14:19:28 +01:00
Alvaro Herrera
fcbf455842 Fix unitialized variables
As complained by clang, reported by Andres Freund.  Brown paper bag bug
in ccc4c07499.

Add some comments, too.

Backpatch to 9.5, like that one.
2015-08-13 00:12:07 -03:00
Alvaro Herrera
ccc4c07499 Close some holes in BRIN page assignment
In some corner cases, it is possible for the BRIN index relation to be
extended by brin_getinsertbuffer but the new page not be used
immediately for anything by its callers; when this happens, the page is
initialized and the FSM is updated (by brin_getinsertbuffer) with the
info about that page, but these actions are not WAL-logged.  A later
index insert/update can use the page, but since the page is already
initialized, the initialization itself is not WAL-logged then either.
Replay of this sequence of events causes recovery to fail altogether.

There is a related corner case within brin_getinsertbuffer itself, in
which we extend the relation to put a new index tuple there, but later
find out that we cannot do so, and do not return the buffer; the page
obtained from extension is not even initialized.  The resulting page is
lost forever.

To fix, shuffle the code so that initialization is not the
responsibility of brin_getinsertbuffer anymore, in normal cases;
instead, the initialization is done by its callers (brin_doinsert and
brin_doupdate) once they're certain that the page is going to be used.
When either those functions determine that the new page cannot be used,
before bailing out they initialize the page as an empty regular page,
enter it in FSM and WAL-log all this.  This way, the page is usable for
future index insertions, and WAL replay doesn't find trying to insert
tuples in pages whose initialization didn't make it to the WAL.  The
same strategy is used in brin_getinsertbuffer when it cannot return the
new page.

Additionally, add a new step to vacuuming so that all pages of the index
are scanned; whenever an uninitialized page is found, it is initialized
as empty and WAL-logged.  This closes the hole that the relation is
extended but the system crashes before anything is WAL-logged about it.
We also take this opportunity to update the FSM, in case it has gotten
out of date.

Thanks to Heikki Linnakangas for finding the problem that kicked some
additional analysis of BRIN page assignment code.

Backpatch to 9.5, where BRIN was introduced.

Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org
2015-08-12 14:20:38 -03:00
Andres Freund
18e8613564 Address points made in post-commit review of replication origins.
Amit reviewed the replication origins patch and made some good
points. Address them. This fixes typos in error messages, docs and
comments and adds a missing error check (although in a
should-never-happen scenario).

Discussion: CAA4eK1JqUBVeWWKwUmBPryFaje4190ug0y-OAUHWQ6tD83V4xg@mail.gmail.com
Backpatch: 9.5, where replication origins were introduced.
2015-08-07 15:09:05 +02:00
Robert Haas
0e141c0fbb Reduce ProcArrayLock contention by removing backends in batches.
When a write transaction commits, it must clear its XID advertised via
the ProcArray, which requires that we hold ProcArrayLock in exclusive
mode in order to prevent concurrent processes running GetSnapshotData
from seeing inconsistent results.  When many processes try to commit
at once, ProcArrayLock must change hands repeatedly, with each
concurrent process trying to commit waking up to acquire the lock in
turn.  To make things more efficient, when more than one backend is
trying to commit a write transaction at the same time, have just one
of them acquire ProcArrayLock in exclusive mode and clear the XIDs of
all processes in the group.  Benchmarking reveals that this is much
more efficient at very high client counts.

Amit Kapila, heavily revised by me, with some review also from Pavan
Deolasee.
2015-08-06 12:02:12 -04:00
Alvaro Herrera
2834855cb9 Fix BRIN to use SnapshotAny during summarization
For correctness of summarization results, it is critical that the
snapshot used during the summarization scan is able to see all tuples
that are live to all transactions -- including tuples inserted or
deleted by in-progress transactions.  Otherwise, it would be possible
for a transaction to insert a tuple, then idle for a long time while a
concurrent transaction executes summarization of the range: this would
result in the inserted value not being considered in the summary.
Previously we were trying to use a MVCC snapshot in conjunction with
adding a "placeholder" tuple in the index: the snapshot would see all
committed tuples, and the placeholder tuple would catch insertions by
any new inserters.  The hole is that prior insertions by transactions
that are still in progress by the time the MVCC snapshot was taken were
ignored.

Kevin Grittner reported this as a bogus error message during vacuum with
default transaction isolation mode set to repeatable read (because the
error report mentioned a function name not being invoked during), but
the problem is larger than that.

To fix, tweak IndexBuildHeapRangeScan to have a new mode that behaves
the way we need using SnapshotAny visibility rules.  This change
simplifies the BRIN code a bit, mainly by removing large comments that
were mistaken.  Instead, rely on the SnapshotAny semantics to provide
what it needs.  (The business about a placeholder tuple needs to remain:
that covers the case that a transaction inserts a a tuple in a page that
summarization already scanned.)

Discussion: https://www.postgresql.org/message-id/20150731175700.GX2441@postgresql.org

In passing, remove a couple of unused declarations from brin.h and
reword a comment to be proper English.  This part submitted by Kevin
Grittner.

Backpatch to 9.5, where BRIN was introduced.
2015-08-05 16:20:50 -03:00
Fujii Masao
dd85acf0c4 Make recovery rename tablespace_map to *.old if backup_label is not present.
If tablespace_map file is present without backup_label file, there is
no use of such file.  There is no harm in retaining it, but it is better
to get rid of the map file so that we don't have any redundant file
in data directory and it will avoid any sort of confusion. It seems
prudent though to just rename the file out of the way rather than
delete it completely, also we ignore any error that occurs in rename
operation as even if map file is present without backup_label file,
it is harmless.

Back-patch to 9.5 where tablespace_map file was introduced.

Amit Kapila, reviewed by Robert Haas, Alvaro Herrera and me.
2015-08-03 23:04:41 +09:00
Tom Lane
09cecdf285 Fix a number of places that produced XX000 errors in the regression tests.
It's against project policy to use elog() for user-facing errors, or to
omit an errcode() selection for errors that aren't supposed to be "can't
happen" cases.  Fix all the violations of this policy that result in
ERRCODE_INTERNAL_ERROR log entries during the standard regression tests,
as errors that can reliably be triggered from SQL surely should be
considered user-facing.

I also looked through all the files touched by this commit and fixed
other nearby problems of the same ilk.  I do not claim to have fixed
all violations of the policy, just the ones in these files.

In a few places I also changed existing ERRCODE choices that didn't
seem particularly appropriate; mainly replacing ERRCODE_SYNTAX_ERROR
by something more specific.

Back-patch to 9.5, but no further; changing ERRCODE assignments in
stable branches doesn't seem like a good idea.
2015-08-02 23:49:19 -04:00
Heikki Linnakangas
358cde320b Fix race condition that lead to WALInsertLock deadlock with commit_delay.
If a call to WaitForXLogInsertionsToFinish() returned a value in the middle
of a page, and another backend then started to insert a record to the same
page, and then you called WaitXLogInsertionsToFinish() again, the second
call might return a smaller value than the first call. The problem was in
GetXLogBuffer(), which always updated the insertingAt value to the
beginning of the requested page, not the actual requested location. Because
of that, the second call might return a xlog pointer to the beginning of
the page, while the first one returned a later position on the same page.
XLogFlush() performs two calls to WaitXLogInsertionsToFinish() in
succession, and holds WALWriteLock on the second call, which can deadlock
if the second call to WaitXLogInsertionsToFinish() blocks.

Reported by Spiros Ioannou. Backpatch to 9.4, where the more scalable
WALInsertLock mechanism, and this bug, was introduced.
2015-08-02 20:08:10 +03:00
Andres Freund
7039760114 Fix issues around the "variable" support in the lwlock infrastructure.
The lwlock scalability work introduced two race conditions into the
lwlock variable support provided for xlog.c. First, and harmlessly on
most platforms, it set/read the variable without the spinlock in some
places. Secondly, due to the removal of the spinlock, it was possible
that a backend missed changes to the variable's state if it changed in
the wrong moment because checking the lock's state, the variable's state
and the queuing are not protected by a single spinlock acquisition
anymore.

To fix first move resetting the variable's from LWLockAcquireWithVar to
WALInsertLockRelease, via a new function LWLockReleaseClearVar. That
prevents issues around waiting for a variable's value to change when a
new locker has acquired the lock, but not yet set the value. Secondly
re-check that the variable hasn't changed after enqueing, that prevents
the issue that the lock has been released and already re-acquired by the
time the woken up backend checks for the lock's state.

Reported-By: Jeff Janes
Analyzed-By: Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: 5592DB35.2060401@iki.fi
Backpatch: 9.5, where the lwlock scalability went in
2015-08-02 18:41:23 +02:00
Alvaro Herrera
c81276241b Fix broken assertion in BRIN code
The code was assuming that any NULL value in scan keys was due to IS
NULL or IS NOT NULL, but it turns out to be possible to get them with
other operators too, if they are used in contrived-enough ways.  Easiest
way out of the problem seems to check explicitely for the IS NOT NULL
flag, instead of assuming it must be set if the IS NULL flag is not set,
when a null scan key is found; if neither flag is set, follow the lead
of other index AMs and assume that all indexable operators must be
strict, and thus the query is never satisfiable.

Also, add a comment to try and lure some future hacker into improving
analysis of scan keys in brin.

Per report from Andreas Seltenreich; diagnosis by Tom Lane.
Backpatch to 9.5.

Discussion: http://www.postgresql.org/message-id/20646.1437919632@sss.pgh.pa.us
2015-07-30 15:07:19 -03:00
Joe Conway
7b4bfc87d5 Plug RLS related information leak in pg_stats view.
The pg_stats view is supposed to be restricted to only show rows
about tables the user can read. However, it sometimes can leak
information which could not otherwise be seen when row level security
is enabled. Fix that by not showing pg_stats rows to users that would
be subject to RLS on the table the row is related to. This is done
by creating/using the newly introduced SQL visible function,
row_security_active().

Along the way, clean up three call sites of check_enable_rls(). The second
argument of that function should only be specified as other than
InvalidOid when we are checking as a different user than the current one,
as in when querying through a view. These sites were passing GetUserId()
instead of InvalidOid, which can cause the function to return incorrect
results if the current user has the BYPASSRLS privilege and row_security
has been set to OFF.

Additionally fix a bug causing RI Trigger error messages to unintentionally
leak information when RLS is enabled, and other minor cleanup and
improvements. Also add WITH (security_barrier) to the definition of pg_stats.

Bumped CATVERSION due to new SQL functions and pg_stats view definition.

Back-patch to 9.5 where RLS was introduced. Reported by Yaroslav.
Patch by Joe Conway and Dean Rasheed with review and input by
Michael Paquier and Stephen Frost.
2015-07-28 13:21:22 -07:00
Heikki Linnakangas
5e65f45c6e Another attempt at fixing memory leak in xlogreader.
max_block_id is also reset between reading records.

Michael Paquier
2015-07-28 09:09:36 +03:00
Heikki Linnakangas
820d1ced1b Don't assume that PageIsEmpty() returns true on an all-zeros page.
It does currently, and I don't see us changing that any time soon, but we
don't make that assumption anywhere else.

Per Tom Lane's suggestion. Backpatch to 9.2, like the previous patch that
added this assumption.
2015-07-27 18:54:09 +03:00
Heikki Linnakangas
61a65c53bd Fix memory leak in xlogreader facility.
XLogReaderFree failed to free the per-block data buffers, when they
happened to not be used by the latest read WAL record.

Michael Paquier. Backpatch to 9.5, where the per-block buffers were added.
2015-07-27 18:29:31 +03:00
Heikki Linnakangas
334445179c Reuse all-zero pages in GIN.
In GIN, an all-zeros page would be leaked forever, and never reused. Just
add them to the FSM in vacuum, and they will be reinitialized when grabbed
from the FSM. On master and 9.5, attempting to access the page's opaque
struct also caused an assertion failure, although that was otherwise
harmless.

Reported by Jeff Janes. Backpatch to all supported versions.
2015-07-27 12:30:26 +03:00
Heikki Linnakangas
023430abf7 Fix handling of all-zero pages in SP-GiST vacuum.
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe. You might get a torn page write, when it gets
flushed to disk, and end-up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.

Backpatch to 9.2, where SP-GiST indexes were added.
2015-07-27 12:28:21 +03:00
Heikki Linnakangas
65c384c5ab Avoid calling PageGetSpecialPointer() on an all-zeros page.
That was otherwise harmless, but tripped the new assertion in
PageGetSpecialPointer().

Reported by Amit Langote. Backpatch to 9.5, where the assertion was added.
2015-07-27 12:24:27 +03:00
Tom Lane
d9476b8380 Dodge portability issue (apparent compiler bug) in new tablesample code.
Some of the older OS X critters in the buildfarm are failing regression,
with symptoms showing that a request for 100% sampling in BERNOULLI or
SYSTEM methods actually gets only around 50% of the table.  gdb revealed
that the computation of the "cutoff" number was producing 0x7FFFFFFF
rather than the expected 0x100000000.  Inspecting the assembly code,
it looks like gcc is trying to use lrint() instead of rint() and then
fumbling the conversion from long double to uint64.  This seems like a
clear compiler bug, but assigning the intermediate result into a plain
double variable works around it, so let's just do that.  (Another idea
would be to give up one bit of hash width so that we don't need to use
a uint64 cutoff, but let's see if this is enough.)
2015-07-25 19:42:32 -04:00
Tom Lane
dd7a8f66ed Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM.  (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.)  Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type.  This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.

Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.

Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.

Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 14:39:00 -04:00
Heikki Linnakangas
766dcfb16c Fix off-by-one error in calculating subtrans/multixact truncation point.
If there were no subtransactions (or multixacts) active, we would calculate
the oldestxid == next xid. That's correct, but if next XID happens to be
on the next pg_subtrans (pg_multixact) page, the page does not exist yet,
and SimpleLruTruncate will produce an "apparent wraparound" warning. The
warning is harmless in this case, but looks very alarming to users.

Backpatch to all supported versions. Patch and analysis by Thomas Munro.
2015-07-23 01:29:59 +03:00
Tom Lane
434873806a Fix some oversights in BRIN patch.
Remove HeapScanDescData.rs_initblock, which wasn't being used for anything
in the final version of the patch.

Fix IndexBuildHeapScan so that it supports syncscan again; the patch
broke synchronous scanning for index builds by forcing rs_startblk
to zero even when the caller did not care about that and had asked
for syncscan.

Add some commentary and usage defenses to heap_setscanlimits().

Fix heapam so that asking for rs_numblocks == 0 does what you would
reasonably expect.  As coded it amounted to requesting a whole-table
scan, because those "--x <= 0" tests on an unsigned variable would
behave surprisingly.
2015-07-21 13:38:24 -04:00
Heikki Linnakangas
eb11de8ff5 Sanity-check that a page zeroed by redo routine is marked with WILL_INIT.
There was already a sanity-check in the other direction: if a page was
marked with WILL_INIT, it had to be initialized by the redo routine. It's
not strictly necessary for correctness that a page is marked with WILL_INIT
if it's going to be initialized at redo, but it's a missed optimization if
nothing else.

Fix a few instances of this issue in SP-GiST, where a block in WAL record
was not marked with WILL_INIT, but was in fact always initialized at redo.
We were creating a full-page image of the page unnecessarily in those
cases.

Backpatch to 9.5, where the new WILL_INIT flag was added.
2015-07-20 22:34:01 +03:00
Alvaro Herrera
8d90736924 Improve BRIN documentation somewhat
This removes some info about support procedures being used, which was
obsoleted by commit db5f98ab4f, as well as add some more documentation
on how to create new opclasses using the Minmax infrastructure.
(Hopefully we can get something similar for Inclusion as well.)

In passing, fix some obsolete mentions of "mmtuples" in source code
comments.

Backpatch to 9.5, where BRIN was introduced.
2015-07-20 12:16:40 +02:00
Heikki Linnakangas
7261172430 Remove obsolete heap_formtuple/modifytuple/deformtuple functions.
These variants used the old-style 'n'/' ' NULL indicators. The new-style
functions have been available since version 8.1. That should be long enough
that if there is still any old external code using these functions, they
can just switch to the new functions without worrying about backwards
compatibility

Peter Geoghegan
2015-07-02 21:21:23 +03:00
Heikki Linnakangas
f92d6a540a Use appendStringInfoString/Char et al where appropriate.
Patch by David Rowley. Backpatch to 9.5, as some of the calls were new in
9.5, and keeping the code in sync with master makes future backpatching
easier.
2015-07-02 12:36:03 +03:00
Fujii Masao
8217370864 Make XLogFileCopy() look the same as in 9.4.
XLogFileCopy() was changed heavily in commit de76884. However it was
partially reverted in commit 7abc685 and most of those changes to
XLogFileCopy() were no longer needed. Then commit 7cbee7c removed
those unnecessary code, but XLogFileCopy() looked different in master
and 9.4 though the contents are almost the same.

This patch makes XLogFileCopy() look the same in master and back-branches,
which makes back-patching easier, per discussion on pgsql-hackers.
Back-patch to 9.5.

Discussion: 55760844.7090703@iki.fi

Michael Paquier
2015-07-01 10:54:47 +09:00
Tom Lane
131926a52d Remove useless check for NULL subexpression.
Coverity rightly gripes that it's silly to have a test here when
the adjacent ExecEvalExpr() would choke on a NULL expression pointer.

Petr Jelinek
2015-06-30 12:53:54 -04:00
Heikki Linnakangas
fdf28853ae Don't call PageGetSpecialPointer() on page until it's been initialized.
After calling XLogInitBufferForRedo(), the page might be all-zeros if it was
not in page cache already. btree_xlog_unlink_page initialized the page
correctly, but it called PageGetSpecialPointer before initializing it, which
would lead to a corrupt page at WAL replay, if the unlinked page is not in
page cache.

Backpatch to 9.4, the bug came with the rewrite of B-tree page deletion.
2015-06-30 13:41:30 +03:00
Heikki Linnakangas
47fe4d25d5 Initialize GIN metapage correctly when replaying metapage-update WAL record.
I broke this with my WAL format refactoring patch. Before that, the metapage
was read from disk, and modified in-place regardless of the LSN. That was
always a bit silly, as there's no need to read the old page version from
disk disk when we're overwriting it anyway. So that was changed in 9.5, but
I failed to add a GinInitPage call to initialize the page-headers correctly.
Usually you wouldn't notice, because the metapage is already in the page
cache and is not zeroed.

One way to reproduce this is to perform a VACUUM on an already vacuumed
table (so that the vacuum has no real work to do), immediately after a
checkpoint, and then perform an immediate shutdown. After recovery, the
page headers of the metapage will be incorrectly all-zeroes.

Reported by Jeff Janes
2015-06-30 00:06:00 +03:00
Heikki Linnakangas
d661532e27 Also trigger restartpoints based on max_wal_size on standby.
When archive recovery and restartpoints were initially introduced,
checkpoint_segments was ignored on the grounds that the files restored from
archive don't consume any space in the recovery server. That was changed in
later releases, but even then it was arguably a feature rather than a bug,
as performing restartpoints as often as checkpoints during normal operation
might be excessive, but you might nevertheless not want to waste a lot of
space for pre-allocated WAL by setting checkpoint_segments to a high value.
But now that we have separate min_wal_size and max_wal_size settings, you
can bound WAL usage with max_wal_size, and still avoid consuming excessive
space usage by setting min_wal_size to a lower value, so that argument is
moot.

There are still some issues with actually limiting the space usage to
max_wal_size: restartpoints in recovery can only start after seeing the
checkpoint record, while a checkpoint starts flushing buffers as soon as
the redo-pointer is set. Restartpoint is paced to happen at the same
leisurily speed, determined by checkpoint_completion_target, as checkpoints,
but because they are started later, max_wal_size can be exceeded by upto
one checkpoint cycle's worth of WAL, depending on
checkpoint_completion_target. But that seems better than not trying at all,
and max_wal_size is a soft limit anyway.

The documentation already claimed that max_wal_size is obeyed in recovery,
so this just fixes the behaviour to match the docs. However, add some
weasel-words there to mention that max_wal_size may well be exceeded by
some amount in recovery.
2015-06-29 00:09:10 +03:00
Heikki Linnakangas
a32c3ec893 Promote the assertion that XLogBeginInsert() is not called twice into ERROR.
Seems like cheap insurance for WAL bugs. A spurious call to
XLogBeginInsert() in itself would be fairly harmless, but if there is any
data registered and the insertion is not completed/cancelled properly, there
is a risk that the data ends up in a wrong WAL record.

Per Jeff Janes's suggestion.
2015-06-28 22:30:39 +03:00
Heikki Linnakangas
a45c70acf3 Fix double-XLogBeginInsert call in GIN page splits.
If data checksums or wal_log_hints is on, and a GIN page is split, the code
to find a new, empty, block was called after having already called
XLogBeginInsert(). That causes an assertion failure or PANIC, if finding the
new block involves updating a FSM page that had not been modified since last
checkpoint, because that update is WAL-logged, which calls XLogBeginInsert
again. Nested XLogBeginInsert calls are not supported.

To fix, rearrange GIN code so that XLogBeginInsert is called later, after
finding the victim buffers.

Reported by Jeff Janes.
2015-06-28 22:16:21 +03:00
Simon Riggs
66fbcb0d2e Avoid hot standby cancels from VAC FREEZE
VACUUM FREEZE generated false cancelations of standby queries on an
otherwise idle master. Caused by an off-by-one error on cutoff_xid
which goes back to original commit.

Backpatch to all versions 9.0+

Analysis and report by Marco Nenciarini

Bug fix by Simon Riggs
2015-06-27 00:41:47 +01:00
Alvaro Herrera
4028222468 Fix BRIN xlog replay
There was a confusion about which block number to use when storing an
item's pointer in the revmap -- the revmap page's blkno was being used,
not the data page's blkno.

Spotted-by: Jeff Janes
2015-06-26 18:13:05 -03:00
Robert Haas
8f15f74a44 Be more conservative about removing tablespace "symlinks".
Don't apply rmtree(), which will gleefully remove an entire subtree,
and don't even apply unlink() unless it's symlink or a directory,
the only things that we expect to find.

Amit Kapila, with minor tweaks by me, per extensive discussions
involving Andrew Dunstan, Fujii Masao, and Heikki Linnakangas,
at least some of whom also reviewed the code.
2015-06-26 15:53:13 -04:00
Heikki Linnakangas
4b8e24b9ad Fix a couple of bugs with wal_log_hints.
1. Replay of the WAL record for setting a bit in the visibility map
contained an assertion that a full-page image of that record type can only
occur with checksums enabled. But it can also happen with wal_log_hints, so
remove the assertion. Unlike checksums, wal_log_hints can be changed on the
fly, so it would be complicated to figure out if it was enabled at the time
that the WAL record was generated.

2. wal_log_hints has the same effect on the locking needed to read the LSN
of a page as data checksums. BufferGetLSNAtomic() didn't get the memo.

Backpatch to 9.4, where wal_log_hints was added.
2015-06-26 12:38:24 +03:00
Andres Freund
667912aee6 Improve multixact emergency autovacuum logic.
Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. Thats especially a problem if old sessions had been around that
previously prevented the multixact horizon to increase. Secondly the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restart. This is especially a problem because it's a common
panic-reaction to restart the database if it gets slow to
anti-wraparound vacuums.

Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary, as the current member offset increases in irregular values, so
we can't use a simple modulo logic as for offsets.  Add a stopgap for
the second problem, by signalling autovacuum whenver ERRORing out
because of boundaries.

Discussion: 20150608163707.GD20772@alap3.anarazel.de

Backpatch into 9.3, where it became more likely that multixacts wrap
around.
2015-06-21 18:57:28 +02:00
Andres Freund
90231cd518 Add missing check for wal_debug GUC.
9a20a9b2 added a new elog(), enabled when WAL_DEBUG is defined. The
other WAL_DEBUG dependant messages check for the wal_debug GUC, but this
one did not. While at it replace 'upto' with 'up to'.

Discussion: 20150610110253.GF3832@alap3.anarazel.de

Backpatch to 9.4, the first release containing 9a20a9b2.
2015-06-21 18:37:09 +02:00
Robert Haas
ed16f73c57 Fix corner case in autovacuum-forcing logic for multixact wraparound.
Since find_multixact_start() relies on SimpleLruDoesPhysicalPageExist(),
and that function looks only at the on-disk state, it's possible for it
to fail to find a page that exists in the in-memory SLRU that has not
been written yet.  If that happens, SetOffsetVacuumLimit() will
erroneously decide to force emergency autovacuuming immediately.

We should probably fix find_multixact_start() to consider the data
cached in memory as well as on the on-disk state, but that's no excuse
for SetOffsetVacuumLimit() to be stupid about the case where it can
no longer read the value after having previously succeeded in doing so.

Report by Andres Freund.
2015-06-19 11:28:30 -04:00
Alvaro Herrera
94232c909d Fix typos
tablesapce -> tablespace
there -> their

These were introduced in 72d422a52, so no need to backpatch.
2015-06-08 15:37:42 -03:00
Fujii Masao
7abc685974 Refactor WAL segment copying code.
* Remove unused argument "dstfname" and related code from XLogFileCopy().

* Previously XLogFileCopy() returned a pstrdup'd string so that
InstallXLogFileSegment() used it later. Since the pstrdup'd string was never
free'd, there could be a risk of memory leak. It was almost harmless because
the startup process exited just after calling XLogFileCopy(), it existed.
This commit changes XLogFileCopy() so that it directly calls
InstallXLogFileSegment() and doesn't call pstrdup() at all. Which fixes that
memory leak problem.

* Extend InstallXLogFileSegment() so that the caller can specify the log level.
Which allows us to emit an error when InstallXLogFileSegment() fails a disk
file access like link() and rename(). Previously it was always logged with
LOG level and additionally needed to be logged with ERROR when we wanted
to treat it as an error.

Michael Paquier
2015-06-09 03:03:24 +09:00
Andres Freund
d1b958218a Allow HotStandbyActiveInReplay() to be called in single user mode.
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.

This buglet is fairly harmless as it only causes problems when single
user mode in an assertion enabled build is used to replay a btree vacuum
record.

Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
2015-06-08 14:09:27 +02:00
Robert Haas
068cfadf9e Cope with possible failure of the oldest MultiXact to exist.
Recent commits, mainly b69bf30b9b and
53bb309d2d, introduced mechanisms to
protect against wraparound of the MultiXact member space: the number
of multixacts that can exist at one time is limited to 2^32, but the
total number of members in those multixacts is also limited to 2^32,
and older code did not take care to enforce the second limit,
potentially allowing old data to be overwritten while it was still
needed.

Unfortunately, these new mechanisms failed to account for the fact
that the code paths in which they run might be executed during
recovery or while the cluster was in an inconsistent state.  Also,
they failed to account for the fact that users who used pg_upgrade
to upgrade a PostgreSQL version between 9.3.0 and 9.3.4 might have
might oldestMultiXid = 1 in the control file despite the true value
being larger.

To fix these problems, first, avoid unnecessarily examining the
mmembers of MultiXacts when the cluster is not known to be consistent.
TruncateMultiXact has done this for a long time, and this patch does
not fix that.  But the new calls used to prevent member wraparound
are not needed until we reach normal running, so avoid calling them
earlier.  (SetMultiXactIdLimit is actually called before InRecovery
is set, so we can't rely on that; we invent our own multixact-specific
flag instead.)

Second, make failure to look up the members of a MultiXact a non-fatal
error.  Instead, if we're unable to determine the member offset at
which wraparound would occur, postpone arming the member wraparound
defenses until we are able to do so.  If we're unable to determine the
member offset that should force autovacuum, force it continuously
until we are able to do so.  If we're unable to deterine the member
offset at which we should truncate the members SLRU, log a message and
skip truncation.

An important consequence of these changes is that anyone who does have
a bogus oldestMultiXid = 1 value in pg_control will experience
immediate emergency autovacuuming when upgrading to a release that
contains this fix.  The release notes should highlight this fact.  If
a user has no pg_multixact/offsets/0000 file, but has oldestMultiXid = 1
in the control file, they may wish to vacuum any tables with
relminmxid = 1 prior to upgrading in order to avoid an immediate
emergency autovacuum after the upgrade.  This must be done with a
PostgreSQL version 9.3.5 or newer and with vacuum_multixact_freeze_min_age
and vacuum_multixact_freeze_table_age set to 0.

This patch also adds an additional log message at each database server
startup, indicating either that protections against member wraparound
have been engaged, or that they have not.  In the latter case, once
autovacuum has advanced oldestMultiXid to a sane value, the message
indicating that the guards have been engaged will appear at the next
checkpoint.  A few additional messages have also been added at the DEBUG1
level so that the correct operation of this code can be properly audited.

Along the way, this patch fixes another, related bug in TruncateMultiXact
that has existed since PostgreSQL 9.3.0: when no MultiXacts exist at
all, the truncation code looks up NextMultiXactId, which doesn't exist
yet.  This can lead to TruncateMultiXact removing every file in
pg_multixact/offsets instead of keeping one around, as it should.
This in turn will cause the database server to refuse to start
afterwards.

Patch by me.  Review by Álvaro Herrera, Andres Freund, Noah Misch, and
Thomas Munro.
2015-06-05 09:31:57 -04:00
Tom Lane
d8179b001a Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.

After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.

This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.

I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().

The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.

Back-patch to all active branches, like the prior commit.

Abhijit Menon-Sen and Tom Lane
2015-05-28 17:33:03 -04:00
Tom Lane
79f2b5d583 Fix valgrind's "unaddressable bytes" whining about BRIN code.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much.  Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert.  If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.

AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory.  Valgrind,
however, is picky to the byte level not the MAXALIGN level.

Fix it by pushing the MAXALIGN step back to brin_form_tuple.  (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)

In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
2015-05-25 21:56:19 -04:00
Alvaro Herrera
cdbdc43827 Update README.tuplock
Multixact truncation is now handled differently, and this file hadn't
gotten the memo.

Per note from Amit Langote.  I didn't use his patch, though.

Also update the description of infomask bits, which weren't completely up
to date either.  This commit also propagates b01a4f6838 back to 9.3 and
9.4, which apparently I failed to do back then.
2015-05-25 15:09:05 -03:00
Tom Lane
2aa0476dc3 Manual cleanup of pgindent results.
Fix some places where pgindent did silly stuff, often because project
style wasn't followed to begin with.  (I've not touched the atomics
headers, though.)
2015-05-24 15:04:10 -04:00
Bruce Momjian
807b9e0dff pgindent run for 9.5 2015-05-23 21:35:49 -04:00
Tom Lane
72809480d6 Fix incorrect snprintf() limit.
Typo in commit 7cbee7c0a.  No practical effect since the buffer should
never actually be overrun, but various compilers and static analyzers will
whine about it.

Petr Jelinek
2015-05-23 16:05:52 -04:00
Tom Lane
821b821a24 Still more fixes for lossy-GiST-distance-functions patch.
Fix confusion in documentation, substantial memory leakage if float8 or
float4 are pass-by-reference, and assorted comments that were obsoleted
by commit 98edd617f3.
2015-05-23 15:22:25 -04:00
Heikki Linnakangas
7cbee7c0a1 At promotion, don't leave behind a partial segment on the old timeline.
With commit de768844, a copy of the partial segment was archived with the
.partial suffix, but the original file was still left in pg_xlog, so it
didn't actually solve the problems with archiving the partial segment that
it was supposed to solve. With this patch, the partial segment is renamed
rather than copied, so we only archive it with the .partial suffix.

Also be more robust in detecting if the last segment is already being
archived. Previously I used XLogArchiveIsBusy() for that, but that's not
quite right. With archive_mode='always', there might be a .ready file for
it, and we don't want to rename it to .partial in that case.

The old segment is needed until we're fully committed to the new timeline,
i.e. until we've written the end-of-recovery WAL record and updated the
min recovery point and timeline in the control file. So move the renaming
later in the startup sequence, after all that's been done.
2015-05-22 11:04:33 +03:00
Fujii Masao
85d0e661aa Make recovery_target_action = pause work.
Previously even if recovery_target_action was set to pause and
the recovery target was reached, the recovery could never be paused.
Because the setting of pause was *always* overridden with that of
shutdown unexpectedly. This override is valid and intentional
if hot_standby is not enabled because there is no way to resume
the paused recovery in this case and the setting of pause is
completely useless. But not if hot_standby is enabled.

This patch changes the code so that the setting of pause is overridden
with that of shutdown only when hot_standby is not enabled.

Bug reported by Andres Freund
2015-05-21 13:56:17 +09:00
Heikki Linnakangas
fa60fb63e5 Fix more typos in comments.
Patch by CharSyam, plus a few more I spotted with grep.
2015-05-20 19:45:43 +03:00
Heikki Linnakangas
4fc72cc7bb Collection of typo fixes.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.

Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.

Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
2015-05-20 16:56:22 +03:00
Simon Riggs
f6a54fefc2 Fix spelling in comment 2015-05-19 18:37:46 -04:00
Magnus Hagander
3b075e9d7b Fix typos in comments
Dmitriy Olshevskiy
2015-05-17 14:58:04 +02:00
Peter Eisentraut
e6dc503445 Fix whitespace 2015-05-16 20:43:32 -04:00
Alvaro Herrera
b0b7be6133 Add BRIN infrastructure for "inclusion" opclasses
This lets BRIN be used with R-Tree-like indexing strategies.

Also provided are operator classes for range types, box and inet/cidr.
The infrastructure provided here should be sufficient to create operator
classes for similar datatypes; for instance, opclasses for PostGIS
geometries should be doable, though we didn't try to implement one.

(A box/point opclass was also submitted, but we ripped it out before
commit because the handling of floating point comparisons in existing
code is inconsistent and would generate corrupt indexes.)

Author: Emre Hasegeli.  Cosmetic changes by me
Review: Andreas Karlsson
2015-05-15 18:05:22 -03:00
Alvaro Herrera
26df7066cc Move strategy numbers to include/access/stratnum.h
For upcoming BRIN opclasses, it's convenient to have strategy numbers
defined in a single place.  Since there's nothing appropriate, create
it.  The StrategyNumber typedef now lives there, as well as existing
strategy numbers for B-trees (from skey.h) and R-tree-and-friends (from
gist.h).  skey.h is forced to include stratnum.h because of the
StrategyNumber typedef, but gist.h is not; extensions that currently
rely on gist.h for rtree strategy numbers might need to add a new

A few .c files can stop including skey.h and/or gist.h, which is a nice
side benefit.

Per discussion:
https://www.postgresql.org/message-id/20150514232132.GZ2523@alvh.no-ip.org

Authored by Emre Hasegeli and Álvaro.

(It's not clear to me why bootscanner.l has any #include lines at all.)
2015-05-15 17:03:16 -03:00
Simon Riggs
f6d208d6e5 TABLESAMPLE, SQL Standard and extensible
Add a TABLESAMPLE clause to SELECT statements that allows
user to specify random BERNOULLI sampling or block level
SYSTEM sampling. Implementation allows for extensible
sampling functions to be written, using a standard API.
Basic version follows SQLStandard exactly. Usable
concrete use cases for the sampling API follow in later
commits.

Petr Jelinek

Reviewed by Michael Paquier and Simon Riggs
2015-05-15 14:37:10 -04:00
Heikki Linnakangas
ffd37740ee Add archive_mode='always' option.
In 'always' mode, the standby independently archives all files it receives
from the primary.

Original patch by Fujii Masao, docs and review by me.
2015-05-15 18:55:24 +03:00
Heikki Linnakangas
98edd617f3 Fix datatype confusion with the new lossy GiST distance functions.
We can only support a lossy distance function when the distance function's
datatype is comparable with the original ordering operator's datatype.
The distance function always returns a float8, so we are limited to float8,
and float4 (by a hard-coded cast of the float8 to float4).

In light of this limitation, it seems like a good idea to have a separate
'recheck' flag for the ORDER BY expressions, so that if you have a non-lossy
distance function, it still works with lossy quals. There are cases like
that with the build-in or contrib opclasses, but it's plausible.

There was a hidden assumption that the ORDER BY values returned by GiST
match the original ordering operator's return type, but there are plenty
of examples where that's not true, e.g. in btree_gist and pg_trgm. As long
as the distance function is not lossy, we can tolerate that and just not
return the distance to the executor (or rather, always return NULL). The
executor doesn't need the distances if there are no lossy results.

There was another little bug: the recheck variable was not initialized
before calling the distance function. That revealed the bigger issue,
as the executor tried to reorder tuples that didn't need reordering, and
that failed because of the datatype mismatch.
2015-05-15 18:09:31 +03:00
Heikki Linnakangas
35fcb1b3d0 Allow GiST distance function to return merely a lower-bound.
The distance function can now set *recheck = false, like index quals. The
executor will then re-check the ORDER BY expressions, and use a queue to
reorder the results on the fly.

This makes it possible to do kNN-searches on polygons and circles, which
don't store the exact value in the index, but just a bounding box.

Alexander Korotkov and me
2015-05-15 14:26:51 +03:00
Tom Lane
1dc5ebc907 Support "expanded" objects, particularly arrays, for better performance.
This patch introduces the ability for complex datatypes to have an
in-memory representation that is different from their on-disk format.
On-disk formats are typically optimized for minimal size, and in any case
they can't contain pointers, so they are often not well-suited for
computation.  Now a datatype can invent an "expanded" in-memory format
that is better suited for its operations, and then pass that around among
the C functions that operate on the datatype.  There are also provisions
(rudimentary as yet) to allow an expanded object to be modified in-place
under suitable conditions, so that operations like assignment to an element
of an array need not involve copying the entire array.

The initial application for this feature is arrays, but it is not hard
to foresee using it for other container types like JSON, XML and hstore.
I have hopes that it will be useful to PostGIS as well.

In this initial implementation, a few heuristics have been hard-wired
into plpgsql to improve performance for arrays that are stored in
plpgsql variables.  We would like to generalize those hacks so that
other datatypes can obtain similar improvements, but figuring out some
appropriate APIs is left as a task for future work.  (The heuristics
themselves are probably not optimal yet, either, as they sometimes
force expansion of arrays that would be better left alone.)

Preliminary performance testing shows impressive speed gains for plpgsql
functions that do element-by-element access or update of large arrays.
There are other cases that get a little slower, as a result of added array
format conversions; but we can hope to improve anything that's annoyingly
bad.  In any case most applications should see a net win.

Tom Lane, reviewed by Andres Freund
2015-05-14 12:08:49 -04:00
Tom Lane
0bb8528b5c Fix postgres_fdw to return the right ctid value in EvalPlanQual cases.
If a postgres_fdw foreign table is a non-locked source relation in an
UPDATE, DELETE, or SELECT FOR UPDATE/SHARE, and the query selects its
ctid column, the wrong value would be returned if an EvalPlanQual
recheck occurred.  This happened because the foreign table's result row
was copied via the ROW_MARK_COPY code path, and EvalPlanQualFetchRowMarks
just unconditionally set the reconstructed tuple's t_self to "invalid".

To fix that, we can have EvalPlanQualFetchRowMarks copy the composite
datum's t_ctid field, and be sure to initialize that along with t_self
when postgres_fdw constructs a tuple to return.

If we just did that much then EvalPlanQualFetchRowMarks would start
returning "(0,0)" as ctid for all other ROW_MARK_COPY cases, which perhaps
does not matter much, but then again maybe it might.  The cause of that is
that heap_form_tuple, which is the ultimate source of all composite datums,
simply leaves t_ctid as zeroes in newly constructed tuples.  That seems
like a bad idea on general principles: a field that's really not been
initialized shouldn't appear to have a valid value.  So let's eat the
trivial additional overhead of doing "ItemPointerSetInvalid(&(td->t_ctid))"
in heap_form_tuple.

This closes out our handling of Etsuro Fujita's report that tableoid and
ctid weren't correctly set in postgres_fdw EvalPlanQual cases.  Along the
way we did a great deal of work to improve FDWs' ability to control row
locking behavior; which was not wasted effort by any means, but it didn't
end up being a fix for this problem because that feature would be too
expensive for postgres_fdw to use all the time.

Although the fix for the tableoid misbehavior was back-patched, I'm
hesitant to do so here; it seems far less likely that people would care
about remote ctid than tableoid, and even such a minor behavioral change
as this in heap_form_tuple is perhaps best not back-patched.  So commit
to HEAD only, at least for the moment.

Etsuro Fujita, with some adjustments by me
2015-05-13 14:05:29 -04:00
Andrew Dunstan
72d422a522 Map basebackup tablespaces using a tablespace_map file
Windows can't reliably restore symbolic links from a tar format, so
instead during backup start we create a tablespace_map file, which is
used by the restoring postgres to create the correct links in pg_tblspc.
The backup protocol also now has an option to request this file to be
included in the backup stream, and this is used by pg_basebackup when
operating in tar mode.

This is done on all platforms, not just Windows.

This means that pg_basebackup will not not work in tar mode against 9.4
and older servers, as this protocol option isn't implemented there.

Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.
2015-05-12 09:29:10 -04:00
Peter Eisentraut
d02f16470f Replace some appendStringInfo* calls with more appropriate variants
Author: David Rowley <dgrowleyml@gmail.com>
2015-05-11 20:38:55 -04:00
Robert Haas
b4d4ce1d50 Increase threshold for multixact member emergency autovac to 50%.
Analysis by Noah Misch shows that the 25% threshold set by commit
53bb309d2d is lower than any other,
similar autovac threshold.  While we don't know exactly what value
will be optimal for all users, it is better to err a little on the
high side than on the low side.  A higher value increases the risk
that users might exhaust the available space and start seeing errors
before autovacuum can clean things up sufficiently, but a user who
hits that problem can compensate for it by reducing
autovacuum_multixact_freeze_max_age to a value dependent on their
average multixact size.  On the flip side, if the emergency cap
imposed by that patch kicks in too early, the user will experience
excessive wraparound scanning and will be unable to mitigate that
problem by configuration.  The new value will hopefully reduce the
risk of such bad experiences while still providing enough headroom
to avoid multixact member exhaustion for most users.

Along the way, adjust the documentation to reflect the effects of
commit 04e6d3b877, which taught
autovacuum to run for multixact wraparound even when autovacuum
is configured off.
2015-05-11 12:15:50 -04:00
Robert Haas
04e6d3b877 Even when autovacuum=off, force it for members as we do in other cases.
Thomas Munro, with some adjustments by me.
2015-05-11 10:51:14 -04:00
Robert Haas
f6a6c46d7f Advance the stop point for multixact offset creation only at checkpoint.
Commit b69bf30b9b advanced the stop point
at vacuum time, but this has subsequently been shown to be unsafe as a
result of analysis by myself and Thomas Munro and testing by Thomas
Munro.  The crux of the problem is that the SLRU deletion logic may
get confused about what to remove if, at exactly the right time during
the checkpoint process, the head of the SLRU crosses what used to be
the tail.

This patch, by me, fixes the problem by advancing the stop point only
following a checkpoint.  This has the additional advantage of making
the removal logic work during recovery more like the way it works during
normal running, which is probably good.

At least one of the calls to DetermineSafeOldestOffset which this patch
removes was already dead, because MultiXactAdvanceOldest is called only
during recovery and DetermineSafeOldestOffset was set up to do nothing
during recovery.  That, however, is inconsistent with the principle that
recovery and normal running should work similarly, and was confusing to
boot.

Along the way, fix some comments that previous patches in this area
neglected to update.  It's not clear to me whether there's any
concrete basis for the decision to use only half of the multixact ID
space, but it's neither necessary nor sufficient to prevent multixact
member wraparound, so the comments should not say otherwise.
2015-05-10 22:21:20 -04:00
Robert Haas
312747c224 Fix DetermineSafeOldestOffset for the case where there are no mxacts.
Commit b69bf30b9b failed to take into
account the possibility that there might be no multixacts in existence
at all.

Report by Thomas Munro; patch by me.
2015-05-10 21:34:26 -04:00
Tom Lane
c594c75078 Add missing "static" marker.
Per buildfarm member pademelon.
2015-05-09 23:39:36 -04:00
Heikki Linnakangas
de7688442f At promotion, archive last segment from old timeline with .partial suffix.
Previously, we would archive the possible-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.

To avoid such confusion, add a .partial suffix to the file. Or to be more
precise, make a copy of the old segment under the .partial suffix, and
archive that instead of the original file. pg_receivexlog also uses the
.partial suffix for the same purpose, to tell apart incompletely streamed
files from complete ones.

There is no automatic mechanism to use the .partial files at recovery, so
they will go unused, unless the administrator manually copies to them to
the pg_xlog directory (and removes the .partial suffix). Recovery won't
normally need the WAL - when recovering to the new timeline, it will find
the same WAL on the first segment on the new timeline instead - but it
nevertheless feels better to archive the file with the .partial suffix, for
debugging purposes if nothing else.
2015-05-08 21:59:01 +03:00
Heikki Linnakangas
179cdd0981 Add macros to check if a filename is a WAL segment or other such file.
We had many instances of the strlen + strspn combination to check for that.
This makes the code a bit easier to read.
2015-05-08 21:58:57 +03:00
Peter Eisentraut
16c73e773b Fix whitespace 2015-05-08 14:45:53 -04:00
Andres Freund
e8898e9169 Minor ON CONFLICT related comments and doc fixes.
Geoff Winkless, Stephen Frost, Peter Geoghegan and me.
2015-05-08 19:24:14 +02:00
Robert Haas
53bb309d2d Teach autovacuum about multixact member wraparound.
The logic introduced in commit b69bf30b9b
and repaired in commits 669c7d20e6 and
7be47c56af helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.

To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases.  This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.

This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off.  We'll need to fix
that via a separate patch.

Thomas Munro and Robert Haas
2015-05-08 12:53:00 -04:00
Andres Freund
168d5805e4 Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.

This feature is often referred to as upsert.

This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.

To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.

Bumps catversion as stored rules change.

Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:43:10 +02:00
Alvaro Herrera
db5f98ab4f Improve BRIN infra, minmax opclass and regression test
The minmax opclass was using the wrong support functions when
cross-datatypes queries were run.  Instead of trying to fix the
pg_amproc definitions (which apparently is not possible), use the
already correct pg_amop entries instead.  This requires jumping through
more hoops (read: extra syscache lookups) to obtain the underlying
functions to execute, but it is necessary for correctness.

Author: Emre Hasegeli, tweaked by Álvaro
Review: Andreas Karlsson

Also change BrinOpcInfo to record each stored type's typecache entry
instead of just the OID.  Turns out that the full type cache is
necessary in brin_deform_tuple: the original code used the indexed
type's byval and typlen properties to extract the stored tuple, which is
correct in Minmax; but in other implementations that want to store
something different, that's wrong.  The realization that this is a bug
comes from Emre also, but I did not use his patch.

I also adopted Emre's regression test code (with smallish changes),
which is more complete.
2015-05-07 13:02:22 -03:00
Robert Haas
7be47c56af Fix incorrect math in DetermineSafeOldestOffset.
The old formula didn't have enough parentheses, so it would do the wrong
thing, and it used / rather than % to find a remainder.  The effect of
these oversights is that the stop point chosen by the logic introduced in
commit b69bf30b9b might be rather
meaningless.

Thomas Munro, reviewed by Kevin Grittner, with a whitespace tweak by me.
2015-05-07 11:19:31 -04:00
Robert Haas
1998261034 Avoid using a C++ keyword as a structure member name.
Per request from Peter Eisentraut.
2015-05-05 22:41:03 -04:00
Robert Haas
2ce439f337 Recursively fsync() the data directory after a crash.
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk.  This could lead to data corruption.

Back-patch to all supported versions.

Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
2015-05-04 14:13:53 -04:00
Heikki Linnakangas
ec3d976bce Fix the same-rel optimization when creating WAL records.
prev_regbuf was never set, and therefore the same-rel flag was never set on
WAL records.

Report and fix by Zhanq Zq
2015-05-04 21:03:36 +03:00
Andres Freund
1db12da85b Fix unaligned memory access in xlog parsing due to replication origin patch.
ParseCommitRecord() accessed xl_xact_origin directly. But the chunks in
the commit record's data only have 4 byte alignment, whereas
xl_xact_origin's members require 8 byte alignment on some
platforms. Update comments to make not of that and copy the record to
stack local storage before reading.

With help from Stefan Kaltenbrunner in pinning down the buildfarm and
verifying the fix.
2015-05-01 11:36:14 +02:00
Robert Haas
924bcf4f16 Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things.  First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers.  Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes.  Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.

Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 15:02:14 -04:00
Alvaro Herrera
669c7d20e6 Fix pg_upgrade's multixact handling (again)
We need to create the pg_multixact/offsets file deleted by pg_upgrade
much earlier than we originally were: it was in TrimMultiXact(), which
runs after we exit recovery, but it actually needs to run earlier than
the first call to SetMultiXactIdLimit (before recovery), because that
routine already wants to read the first offset segment.

Per pg_upgrade trouble report from Jeff Janes.

While at it, silence a compiler warning about a pointless assert that an
unsigned variable was being tested non-negative.  This was a signed
constant in Thomas Munro's patch which I changed to unsigned before
commit.  Pointed out by Andres Freund.
2015-04-30 13:55:06 -03:00
Andres Freund
5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consist out of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
Alvaro Herrera
d3821e70c9 Code review for multixact bugfix
Reword messages, rename a confusingly named function.

Per Robert Haas.
2015-04-28 14:52:29 -03:00
Alvaro Herrera
b69bf30b9b Protect against multixact members wraparound
Multixact member files are subject to early wraparound overflow and
removal: if the average multixact size is above a certain threshold (see
note below) the protections against offset overflow are not enough:
during multixact truncation at checkpoint time, some
pg_multixact/members files would be removed because the server considers
them to be old and not needed anymore.  This leads to loss of files that
are critical to interpret existing tuples's Xmax values.

To protect against this, since we don't have enough info in pg_control
and we can't modify it in old branches, we maintain shared memory state
about the oldest value that we need to keep; we use this during new
multixact creation to abort if an old still-needed file would get
overwritten.  This value is kept up to date by checkpoints, which makes
it not completely accurate but should be good enough.  We start emitting
warnings sometime earlier, so that the eventual multixact-shutdown
doesn't take DBAs completely by surprise (more precisely: once 20
members SLRU segments are remaining before shutdown.)

On troublesome average multixact size: The threshold size depends on the
multixact freeze parameters. The oldest age is related to the greater of
multixact_freeze_table_age and multixact_freeze_min_age: anything
older than that should be removed promptly by autovacuum.  If autovacuum
is keeping up with multixact freezing, the troublesome multixact average
size is
	(2^32-1) / Max(freeze table age, freeze min age)
or around 28 members per multixact.  Having an average multixact size
larger than that will eventually cause new multixact data to overwrite
the data area for older multixacts.  (If autovacuum is not able to keep
up, or there are errors in vacuuming, the actual maximum is
multixact_freeeze_max_age instead, at which point multixact generation
is stopped completely.  The default value for this limit is 400 million,
which means that the multixact size that would cause trouble is about 10
members).

Initial bug report by Timothy Garnett, bug #12990
Backpatch to 9.3, where the problem was introduced.

Authors: Álvaro Herrera, Thomas Munro
Reviews: Thomas Munro, Amit Kapila, Robert Haas, Kevin Grittner
2015-04-28 11:32:53 -03:00
Andres Freund
6aab1f45ac Fix various typos and grammar errors in comments.
Author: Dmitriy Olshevskiy
Discussion: 553D00A6.4090205@bk.ru
2015-04-26 18:42:31 +02:00
Heikki Linnakangas
2c47fe16a7 Fix deadlock at startup, if max_prepared_transactions is too small.
When the startup process recovers transactions by scanning pg_twophase
directory, it should clear MyLockedGxact after it's done processing each
transaction. Like we do during normal operation, at PREPARE TRANSACTION.
Otherwise, if the startup process exits due to an error, it will try to
clear the locking_backend field of the last recovered transaction. That's
usually harmless, but if the error happens in MarkAsPreparing, while
holding TwoPhaseStateLock, the shmem-exit hook will try to acquire
TwoPhaseStateLock again, and deadlock with itself.

This fixes bug #13128 reported by Grant McAlister. The bug was introduced
by commit bb38fb0d, so backpatch to all supported versions like that
commit.
2015-04-23 21:39:35 +03:00
Heikki Linnakangas
3d80a1e0e3 Fix logic to skip checkpoint if no records have been inserted.
After the WAL format changes, the calculation of the size of a checkpoint
record became incorrect. Instead of trying to fix the math, check that the
previous record, i.e. the xl_prev value that we'd write for the next
record, matches the last checkpoint's redo pointer. That way it's not
dependent on the size of the checkpoint record at all.

The old logic was actually slightly wrong all along: if the previous
checkpoint record crossed a page boundary, the page headers threw off the
record size calculation, and the checkpoint was not skipped. The new
checkpoint would not cross a page boundary, so this only resulted in at
most one extra checkpoint after the system became idle. The new logic fixes
that. (It's not worth fixing in backbranches).

However, it makes some sense to try to keep the latest checkpoint contained
fully in a page, or at least in a single WAL segment, just on general
robustness grounds. If something goes awfully wrong, it's more likely that
you can recover the latest WAL segment, than the last two WAL segments. So
I added an extra check that the checkpoint is not skipped if the previous
checkpoint crossed a WAL segment.

Reported by Jeff Janes.
2015-04-15 17:21:04 +03:00
Alvaro Herrera
0a52fafce4 Fix typo in comment
SLRU_SEGMENTS_PER_PAGE -> SLRU_PAGES_PER_SEGMENT

I introduced this ancient typo in subtrans.c and later propagated it to
multixact.c.  I fixed the latter in f741300c, but only back to 9.3;
backpatch to all supported branches for consistency.
2015-04-14 12:12:18 -03:00
Heikki Linnakangas
4f700bcd20 Reorganize our CRC source files again.
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.

Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
2015-04-14 17:03:42 +03:00
Alvaro Herrera
b5213e14a4 Remove duplicated word in README 2015-04-13 14:28:21 -03:00
Heikki Linnakangas
b2a5545bd6 Don't archive bogus recycled or preallocated files after timeline switch.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.

This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.

To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
2015-04-13 16:53:49 +03:00
Alvaro Herrera
27846f02c1 Optimize locking a tuple already locked by another subxact
Locking and updating the same tuple repeatedly led to some strange
multixacts being created which had several subtransactions of the same
parent transaction holding locks of the same strength.  However,
once a subxact of the current transaction holds a lock of a given
strength, it's not necessary to acquire the same lock again.  This made
some coding patterns much slower than required.

The fix is twofold.  First we change HeapTupleSatisfiesUpdate to return
HeapTupleBeingUpdated for the case where the current transaction is
already a single-xid locker for the given tuple; it used to return
HeapTupleMayBeUpdated for that case.  The new logic is simpler, and the
change to pgrowlocks is a testament to that: previously we needed to
check for the single-xid locker separately in a very ugly way.  That
test is simpler now.

As fallout from the HTSU change, some of its callers need to be amended
so that tuple-locked-by-own-transaction is taken into account in the
BeingUpdated case rather than the MayBeUpdated case.  For many of them
there is no difference; but heap_delete() and heap_update now check
explicitely and do not grab tuple lock in that case.

The HTSU change also means that routine MultiXactHasRunningRemoteMembers
introduced in commit 11ac4c73cb is no longer necessary and can be
removed; the case that used to require it is now handled naturally as
result of the changes to heap_delete and heap_update.

The second part of the fix to the performance issue is to adjust
heap_lock_tuple to avoid the slowness:

1. Previously we checked for the case that our own transaction already
held a strong enough lock and returned MayBeUpdated, but only in the
multixact case.  Now we do it for the plain Xid case as well, which
saves having to LockTuple.

2. If the current transaction is the only locker of the tuple (but with
a lock not as strong as what we need; otherwise it would have been
caught in the check mentioned above), we can skip sleeping on the
multixact, and instead go straight to create an updated multixact with
the additional lock strength.

3. Most importantly, make sure that both the single-xid-locker case and
the multixact-locker case optimization are applied always.  We do this
by checking both in a single place, rather than them appearing in two
separate portions of the routine -- something that is made possible by
the HeapTupleSatisfiesUpdate API change.  Previously we would only check
for the single-xid case when HTSU returned MayBeUpdated, and only
checked for the multixact case when HTSU returned BeingUpdated.  This
was at odds with what HTSU actually returned in one case: if our own
transaction was locker in a multixact, it returned MayBeUpdated, so the
optimization never applied.  This is what led to the large multixacts in
the first place.

Per bug report #8470 by Oskari Saarenmaa.
2015-04-10 13:47:15 -03:00
Tom Lane
b7e1652d5d Remove unnecessary variables in _hash_splitbucket().
Commit ed9cc2b5df made it unnecessary to pass
start_nblkno to _hash_splitbucket(), and for that matter unnecessary to
have the internal nblkno variable either.  My compiler didn't complain
about that, but some did.  I also rearranged the use of oblkno a bit to
make that case more parallel.

Report and initial patch by Petr Jelinek, rearranged a bit by me.
Back-patch to all branches, like the previous patch.
2015-04-03 16:49:44 -04:00
Alvaro Herrera
4ff695b17d Add log_min_autovacuum_duration per-table option
This is useful to control autovacuum log volume, for situations where
monitoring only a set of tables is necessary.

Author: Michael Paquier
Reviewed by: A team led by Naoya Anzai (also including Akira Kurosawa,
Taiki Kondo, Huong Dangminh), Fujii Masao.
2015-04-03 11:55:50 -03:00
Fujii Masao
6e4bf4ecd3 Fix error handling of XLogReaderAllocate in case of OOM
Similarly to previous fix 9b8d478, commit 2c03216 has switched
XLogReaderAllocate() to use a set of palloc calls instead of malloc,
causing any callers of this function to fail with an error instead of
receiving a NULL pointer in case of out-of-memory error. Fix this by
using palloc_extended with MCXT_ALLOC_NO_OOM that will safely return
NULL in case of an OOM.

Michael Paquier, slightly modified by me.
2015-04-03 21:55:37 +09:00
Fujii Masao
9b8d4782ba Rework handling of OOM when allocating record buffer in XLOG reader.
Commit 2c03216 changed allocate_recordbuf() so that it uses a palloc to
allocate the read buffer and fails immediately when an out-of-memory error
shows up, even though its callers still expect that NULL is returned in that
case. This bug is fixed making allocate_recordbuf() use a palloc_extended
with MCXT_ALLOC_NO_OOM flag and return NULL in OOM case.

Michael Paquier
2015-04-03 18:29:38 +09:00
Andres Freund
62e2a8dc2c Define integer limits independently from the system definitions.
In 83ff1618 we defined integer limits iff they're not provided by the
system. That turns out not to be the greatest idea because there's
different ways some datatypes can be represented. E.g. on OSX PG's 64bit
datatype will be a 'long int', but OSX unconditionally uses 'long
long'. That disparity then can lead to warnings, e.g. around printf
formats.

One way to fix that would be to back int64 using stdint.h's
int64_t. While a good idea it's not that easy to implement. We would
e.g. need to include stdint.h in our external headers, which we don't
today. Also computing the correct int64 printf formats in that case is
nontrivial.

Instead simply prefix the integer limits with PG_ and define them
unconditionally. I've adjusted all the references to them in code, but
not the ones in comments; the latter seems unnecessary to me.

Discussion: 20150331141423.GK4878@alap3.anarazel.de
2015-04-02 17:43:35 +02:00
Tom Lane
ed9cc2b5df Fix bogus concurrent use of _hash_getnewbuf() in bucket split code.
_hash_splitbucket() obtained the base page of the new bucket by calling
_hash_getnewbuf(), but it held no exclusive lock that would prevent some
other process from calling _hash_getnewbuf() at the same time.  This is
contrary to _hash_getnewbuf()'s API spec and could in fact cause failures.
In practice, we must only call that function while holding write lock on
the hash index's metapage.

An additional problem was that we'd already modified the metapage's bucket
mapping data, meaning that failure to extend the index would leave us with
a corrupt index.

Fix both issues by moving the _hash_getnewbuf() call to just before we
modify the metapage in _hash_expandtable().

Unfortunately there's still a large problem here, which is that we could
also incur ENOSPC while trying to get an overflow page for the new bucket.
That would leave the index corrupt in a more subtle way, namely that some
index tuples that should be in the new bucket might still be in the old
one.  Fixing that seems substantially more difficult; even preallocating as
many pages as we could possibly need wouldn't entirely guarantee that the
bucket split would complete successfully.  So for today let's just deal
with the base case.

Per report from Antonin Houska.  Back-patch to all active branches.
2015-03-30 16:40:05 -04:00
Tom Lane
1601830ec2 Make ginbuild's funcCtx be independent of its tmpCtx.
Previously the funcCtx was a child of the tmpCtx, but that was broken
by commit eaa5808e8e, which made
MemoryContextReset() delete, not reset, child contexts.  The behavior of
having a tmpCtx reset also clear the other context seems rather dubious
anyway, so let's just disentangle them.  Per report from Erik Rijkers.

In passing, fix badly-inaccurate comments about these contexts.
2015-03-29 14:02:58 -04:00
Heikki Linnakangas
55b59eda13 Fix GiST index-only scans for opclasses with different storage type.
We cannot use the index's tuple descriptor directly to describe the index
tuples returned in an index-only scan. That's because the index might use
a different datatype for the values stored on disk than the type originally
indexed. As long as they were both pass-by-ref, it worked, but will not work
for pass-by-value types of different sizes. I noticed this as a crash when I
started hacking a patch to add fetch methods to btree_gist.
2015-03-26 23:07:52 +02:00
Tom Lane
785941cdc3 Tweak __attribute__-wrapping macros for better pgindent results.
This improves on commit bbfd7edae5 by
making two simple changes:

* pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn().
Likewise pg_attribute_unused(), pg_attribute_packed().  This reduces
pgindent's tendency to misformat declarations involving them.

* attributes are now always attached to function declarations, not
definitions.  Previously some places were taking creative shortcuts,
which were not merely candidates for bad misformatting by pgindent
but often were outright wrong anyway.  (It does little good to put a
noreturn annotation where callers can't see it.)  In any case, if
we would like to believe that these macros can be used with non-gcc
compilers, we should avoid gratuitous variance in usage patterns.

I also went through and manually improved the formatting of a lot of
declarations, and got rid of excessively repetitive (and now obsolete
anyway) comments informing the reader what pg_attribute_printf is for.
2015-03-26 14:03:25 -04:00
Heikki Linnakangas
d04c8ed904 Add support for index-only scans in GiST.
This adds a new GiST opclass method, 'fetch', which is used to reconstruct
the original Datum from the value stored in the index. Also, the 'canreturn'
index AM interface function gains a new 'attno' argument. That makes it
possible to use index-only scans on a multi-column index where some of the
opclasses support index-only scans but some do not.

This patch adds support in the box and point opclasses. Other opclasses
can added later as follow-on patches (btree_gist would be particularly
interesting).

Anastasia Lubennikova, with additional fixes and modifications by me.
2015-03-26 19:12:00 +02:00
Heikki Linnakangas
8fa393a6d7 Minor cleanup of GiST code, for readability.
Remove the gistcentryinit function, inlining the relevant part of it into
the only caller.
2015-03-26 19:11:54 +02:00
Andres Freund
83ff1618bc Centralize definition of integer limits.
Several submitted and even committed patches have run into the problem
that C89, our baseline, does not provide minimum/maximum values for
various integer datatypes. C99's stdint.h does, but we can't rely on
it.

Several parts of the code defined limits locally, so instead centralize
the definitions to c.h.

This patch also changes the more obvious usages of literal limit values;
there's more places that could be changed, but it's less clear whether
it's beneficial to change those.

Author: Andrew Gierth
Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk
2015-03-25 22:39:42 +01:00
Kevin Grittner
2ed5b87f96 Reduce pinning and buffer content locking for btree scans.
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one.  The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits.  These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged.  If later patches allow  all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
2015-03-25 14:24:43 -05:00
Andres Freund
87cec51d3a Don't delay replication for less than recovery_min_apply_delay's resolution.
Recovery delays are implemented by waiting on a latch, and latches take
milliseconds as a parameter. The required amount of waiting was computed
using microsecond resolution though and the wait loop's abort condition
was checking the delay in microseconds as well.  This could lead to
short spurts of busy looping when the overall wait time was below a
millisecond, but above 0 microseconds.

Instead just formulate the wait loop's abort condition in millisecond
granularity as well. Given that that's recovery_min_apply_delay
resolution, it seems harmless to not wait for less than a millisecond.

Backpatch to 9.4 where recovery_min_apply_delay was introduced.

Discussion: 20150323141819.GH26995@alap3.anarazel.de
2015-03-23 16:51:11 +01:00
Andres Freund
a1105c3dd4 Fix copy & paste error in 4f1b890b13.
Due to the bug delayed standbys would not delay when applying prepared
transactions.

Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com

Michael Paquier via Coverity.
2015-03-23 15:53:40 +01:00
Andres Freund
4f1b890b13 Merge the various forms of transaction commit & abort records.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.

That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recently additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.

Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.

Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by a bits in the 'xinfo' flag.  That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.

Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.

This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank.  The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.

The motivation for this work are not the space savings and and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.

Discussion: 20150220152150.GD4149@awork2.anarazel.de,
    235610.92468.qm%40web29004.mail.ird.yahoo.com

Reviewed-By: Heikki Linnakangas, Simon Riggs
2015-03-15 17:37:07 +01:00
Andres Freund
a0f5954af1 Increase max_wal_size's default from 128MB to 1GB.
The introduction of min_wal_size & max_wal_size in 88e9823026 makes it
feasible to increase the default upper bound in checkpoint
size. Previously raising the default would lead to a increased disk
footprint, even if more segments weren't beneficial.  The low default of
checkpoint size is one of common performance problem users have thus
increasing the default makes sense.  Setups where the increase in
maximum disk usage is a problem will very likely have to run with a
modified configuration anyway.

Discussion: 54F4EFB8.40202@agliodbs.com,
    CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com

Author: Josh Berkus, after a discussion involving lots of people.
2015-03-15 17:37:07 +01:00
Andres Freund
51c11a7025 Remove pause_at_recovery_target recovery.conf setting.
The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4)
replaces it's functionality. Having both seems likely to cause more
confusion than it saves worry due to the incompatibility.

Discussion: 5484FC53.2060903@2ndquadrant.com
Author: Petr Jelinek
2015-03-15 17:37:07 +01:00
Fujii Masao
cd6c45cbee Suppress maybe-uninitialized compiler warnings.
Previously some compilers were thinking that the variables that
57aa5b2 added maybe-uninitialized.

Spotted by Andres Freund
2015-03-15 10:40:43 +09:00
Heikki Linnakangas
26d2c5dc8d Fix memory leaks in GIN index vacuum.
Per bug #12850 by Walter Nordmann. Backpatch to 9.4 where the leak was
introduced.
2015-03-12 15:34:32 +01:00
Andres Freund
bbfd7edae5 Add macros wrapping all usage of gcc's __attribute__.
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability.  It's also just
generally dubious to do so in a header as widely included as c.h.

Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_noreturn and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.

This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurances of that and it's hard to work around...

Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
2015-03-11 14:30:01 +01:00
Fujii Masao
57aa5b2bb1 Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.

This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.

This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.

Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 15:52:24 +09:00
Alvaro Herrera
e491bd2ee3 Move BRIN page type to page's last two bytes
... which is the usual convention among AMs, so that pg_filedump and
similar utilities can tell apart pages of different AMs.  It was also
the intent of the original code, but I failed to realize that alignment
considerations would move the whole thing to the previous-to-last word
in the page.

The new definition of the associated macro makes surrounding code a bit
leaner, too.

Per note from Heikki at
http://www.postgresql.org/message-id/546A16EF.9070005@vmware.com
2015-03-10 12:27:15 -03:00
Alvaro Herrera
4f3924d9cd Keep CommitTs module in sync in standby and master
We allow this module to be turned off on restarts, so a restart time
check is enough to activate or deactivate the module; however, if there
is a standby replaying WAL emitted from a master which is restarted, but
the standby isn't, the state in the standby becomes inconsistent and can
easily be crashed.

Fix by activating and deactivating the module during WAL replay on
parameter change as well as on system start.

Problem reported by Fujii Masao in
http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com

Author: Petr Jelínek
2015-03-09 17:44:00 -03:00
Heikki Linnakangas
f1fd515b39 Move WAL-related definitions from dbcommands.h to separate header file.
This makes it easier to write frontend programs that needs to understand
the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily
be #included in a frontend program, because it pulls in other header files
that need backend stuff, but the new dbcommands_xlog.h header file has
fewer dependencies.
2015-03-09 15:50:49 +02:00
Fujii Masao
c74c04b8aa Add missing "goto err" statements in xlogreader.c.
Spotted by Andres Freund.
2015-03-09 14:31:10 +09:00
Fujii Masao
934d122685 Fix typo in comment. 2015-03-05 20:15:16 +09:00
Andres Freund
fd6a3f3ad4 Reconsider when to wait for WAL flushes/syncrep during commit.
Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have a xid assigned. The primary
reason for that is that sequence's nextval() did not assign a xid, but
are worthwhile to wait for on commit.

This can be problematic because sometimes read only transactions do
write WAL, e.g. HOT page prune records. That then could lead to read only
transactions having to wait during commit. Not something people expect
in a read only transaction.

This lead to such strange symptoms as backends being seemingly stuck
during connection establishment when all synchronous replicas are
down. Especially annoying when said stuck connection is the standby
trying to reconnect to allow syncrep again...

This behavior also is involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. Fix the issue here
doesn't properly solve that issue, merely papers over the problems. In
9.5 catchup interrupts aren't processed out of signal handlers anymore.

To fix all this, make nextval() acquire a top level xid, and only wait for
transaction commit if a transaction both acquired a xid and emitted WAL
records.  If only a xid has been assigned we don't uselessly want to
wait just because of writes to temporary/unlogged tables; if only WAL
has been written we don't want to wait just because of HOT prunes.

The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one it only happens SEQ_LOG_VALS/32 values
anyway, for another only usage of nextval() without using the result in
an insert or similar is affected.

Discussion: 20150223165359.GF30784@awork2.anarazel.de,
    369698E947874884A77849D8FE3680C2@maumau,
    5CF4ABBA67674088B3941894E22A0D25@maumau

Per complaint from maumau and Thom Brown

Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to be consistent behavior across all maintained branches.
2015-02-26 12:50:07 +01:00
Heikki Linnakangas
88e9823026 Replace checkpoint_segments with min_wal_size and max_wal_size.
Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints, and determines how many checkpoints to recycle, they are now
separate concerns. There is still an internal variable called
CheckpointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and trying to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.

The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckpointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.

Reviewed by Amit Kapila and Venkata Balaji N.
2015-02-23 18:53:02 +02:00
Fujii Masao
5d2b45e3f7 Add GUC to control the time to wait before retrieving WAL after failed attempt.
Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.

This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.

Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
2015-02-23 20:55:17 +09:00
Tom Lane
2e211211a7 Use FLEXIBLE_ARRAY_MEMBER in a number of other places.
I think we're about done with this...
2015-02-21 16:12:14 -05:00
Tom Lane
e1a11d9311 Use FLEXIBLE_ARRAY_MEMBER for HeapTupleHeaderData.t_bits[].
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.

Michael Paquier, some adjustments by me
2015-02-21 15:13:06 -05:00
Tom Lane
33a3b03d63 Use FLEXIBLE_ARRAY_MEMBER in some more places.
Fix a batch of structs that are only visible within individual .c files.

Michael Paquier
2015-02-20 17:32:01 -05:00
Tom Lane
e38b1eb098 Use FLEXIBLE_ARRAY_MEMBER in struct varlena.
This forces some minor coding adjustments in tuptoaster.c and inv_api.c,
but the new coding there is cleaner anyway.

Michael Paquier
2015-02-20 16:51:53 -05:00
Heikki Linnakangas
d17b6df239 Fix knn-GiST queue comparison function to return heap tuples first.
The part of the comparison function that was supposed to keep heap tuples
ahead of index items was backwards. It would not lead to incorrect results,
but it is more efficient to return heap tuples first, before scanning more
index pages, when both have the same distance.

Alexander Korotkov
2015-02-17 22:33:38 +02:00
Tom Lane
bc4de01db3 Minor cleanup/code review for "indirect toast" stuff.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer.  Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding.  An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input.  Back-patch to 9.4
just in case that is reachable now.

In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
2015-02-09 12:30:52 -05:00
Fujii Masao
40bede5477 Move pg_lzcompress.c to src/common.
The meta data of PGLZ symbolized by PGLZ_Header is removed, to make
the compression and decompression code independent on the backend-only
varlena facility. PGLZ_Header is being used to store some meta data
related to the data being compressed like the raw length of the uncompressed
record or some varlena-related data, making it unpluggable once PGLZ is
stored in src/common as it contains some backend-only code paths with
the management of varlena structures. The APIs of PGLZ are reworked
at the same time to do only compression and decompression of buffers
without the meta-data layer, simplifying its use for a more general usage.

On-disk format is preserved as well, so there is no incompatibility with
previous major versions of PostgreSQL for TOAST entries.

Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. Especially this commit is required
for upcoming WAL compression feature so that the WAL reader facility can
decompress the WAL data by using pglz_decompress.

Michael Paquier, reviewed by me.
2015-02-09 15:15:24 +09:00
Heikki Linnakangas
d88976cfa1 Use a separate memory context for GIN scan keys.
It was getting tedious to track and release all the different things that
form a scan key. We were leaking at least the queryCategories array, and
possibly more, on a rescan. That was visible if a GIN index was used in a
nested loop join. This also protects from leaks in extractQuery method.

No backpatching, given the lack of complaints from the field. Maybe later,
after this has received more field testing.
2015-02-04 17:40:25 +02:00
Heikki Linnakangas
57fe246890 Fix reference-after-free when waiting for another xact due to constraint.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().

All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).

All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.

Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
2015-02-04 16:00:34 +02:00
Heikki Linnakangas
68fa75f318 Fix query-duration memory leak with GIN rescans.
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.

It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.

(I'm seeing other similar leaks in testing a query that rescans an GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)

Backpatch to 9.4, where GIN fast scan was added.
2015-01-30 17:58:23 +01:00
Stephen Frost
32bf6ee6ab Fix BuildIndexValueDescription for expressions
In 804b6b6db4 we modified
BuildIndexValueDescription to pay attention to which columns are visible
to the user, but unfortunatley that commit neglected to consider indexes
which are built on expressions.

Handle error-reporting of violations of constraint indexes based on
expressions by not returning any detail when the user does not have
table-level SELECT rights.

Backpatch to 9.0, as the prior commit was.

Pointed out by Tom.
2015-01-29 21:59:34 -05:00
Heikki Linnakangas
31ed42b9a3 Fix bug where GIN scan keys were not initialized with gin_fuzzy_search_limit.
When gin_fuzzy_search_limit was used, we could jump out of startScan()
without calling startScanKey(). That was harmless in 9.3 and below, because
startScanKey()() didn't do anything interesting, but in 9.4 it initializes
information needed for skipping entries (aka GIN fast scans), and you
readily get a segfault if it's not done. Nevertheless, it was clearly wrong
all along, so backpatch all the way to 9.1 where the early return was
introduced.

(AFAICS startScanKey() did nothing useful in 9.3 and below, because the
fields it initialized were already initialized in ginFillScanKey(), but I
don't dare to change that in a minor release. ginFillScanKey() is always
called in gingetbitmap() even though there's a check there to see if the
scan keys have already been initialized, because they never are; ginrescan()
free's them.)

In the passing, remove unnecessary if-check from the second inner loop in
startScan(). We already check in the first loop that the condition is true
for all entries.

Reported by Olaf Gawenda, bug #12694, Backpatch to 9.1 and above, although
AFAICS it causes a live bug only in 9.4.
2015-01-29 19:35:55 +02:00
Stephen Frost
804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Heikki Linnakangas
670bf71f65 Remove dead NULL-pointer checks in GiST code.
gist_poly_compress() and gist_circle_compress() checked for a NULL-pointer
key argument, but that was dead code; the gist code never passes a
NULL-pointer to the "compress" method.

This commit also removes a documentation note added in commit a0a3883,
about doing NULL-pointer checks in the "compress" method. It was added
based on the fact that some implementations were doing NULL-pointer
checks, but those checks were unnecessary in the first place.

The NULL-pointer check in gbt_var_same() function was also unnecessary.
The arguments to the "same" method come from the "compress", "union", or
"picksplit" methods, but none of them return a NULL pointer.

None of this is to be confused with SQL NULL values. Those are dealt with
by the gist machinery, and are never passed to the GiST opclass methods.

Michael Paquier
2015-01-28 10:03:58 +02:00
Alvaro Herrera
972bf7d6f1 Tweak BRIN minmax operator class
In the union support proc, we were not checking the hasnulls flag of
value A early enough, so it could be skipped if the "allnulls" flag in
value B is set.  Also, a check on the allnulls flag of value "B" was
redundant, so remove it.

Also change inet_minmax_ops to not be the default opclass for type inet,
as a future inclusion operator class would be more useful and it's
pretty difficult to change default opclass for a datatype later on.
(There is no catversion bump for this catalog change; this shouldn't be
a problem.)

Extracted from a larger patch to add an "inclusion" operator class.

Author: Emre Hasegeli
2015-01-22 17:01:09 -03:00
Robert Haas
4ea51cdfe8 Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.
2015-01-19 15:28:27 -05:00
Robert Haas
9d54b93239 BRIN typo fix.
Amit Langote
2015-01-19 08:34:29 -05:00
Heikki Linnakangas
49b04188f8 Fix thinko in re-setting wal_log_hints flag from a parameter-change record.
The flag is supposed to be copied from the record. Same issue with
track_commit_timestamps, but that's master-only.

Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was
added.
2015-01-15 20:52:41 +02:00
Alvaro Herrera
d126e1e95f Tweak heapam's rmgr desc output slightly
Some spaces were missing, and putting the affected tuple offset first in
the lock cases instead of the locking data makes more sense.

No backpatch since this is cosmetic and surrounding code has changed.
2015-01-12 16:09:16 -03:00
Heikki Linnakangas
1e78d81e88 Don't open a WAL segment for writing at end of recovery.
Since commit ba94518a, we used XLogFileOpen to open the next segment for
writing, but if the end-of-recovery happens exactly at a segment boundary,
the new segment might not exist yet. (Before ba94518a, XLogFileOpen was
correct, because we would open the previous segment if the switch happened
at the boundary.)

Instead of trying to create it if necessary, it's simpler to not bother
opening the segment at all. XLogWrite() will open or create it soon anyway,
after writing the checkpoint or end-of-recovery record.

Reported by Andres Freund.
2015-01-07 16:20:20 +02:00
Bruce Momjian
4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Alvaro Herrera
d5e3d1e969 Fix thinko in lock mode enum
Commit 0e5680f473 contained a thinko
mixing LOCKMODE with LockTupleMode.  This caused misbehavior in the case
where a tuple is marked with a multixact with at most a FOR SHARE lock,
and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock;
this case should block but doesn't.

Include a new isolation tester spec file to explicitely try all the
tuple lock combinations; without the fix it shows the problem:

    starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit
    step s1_begin: BEGIN;
    step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo;
    a

    1
    step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE;
    a

    1
    step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE;
    a

    1
    step s1_commit: COMMIT;

With the fixed code, step s2_tuplock3 blocks until session 1 commits,
which is the correct behavior.

All other cases behave correctly.

Backpatch to 9.3, like the commit that introduced the problem.
2015-01-04 15:48:29 -03:00
Andres Freund
14570c2828 Remove superflous variable from xlogreader's XLogFindNextRecord().
Pointed out by Coverity.

Since this is mere, and debatable, cosmetics I'm not backpatching
this.
2015-01-04 15:35:46 +01:00
Tom Lane
d6657d2a10 Treat negative values of recovery_min_apply_delay as having no effect.
At one point in the development of this feature, it was claimed that
allowing negative values would be useful to compensate for timezone
differences between master and slave servers.  That was based on a mistaken
assumption that commit timestamps are recorded in local time; but of course
they're in UTC.  Nor is a negative apply delay likely to be a sane way of
coping with server clock skew.  However, the committed patch still treated
negative delays as doing something, and the timezone misapprehension
survived in the user documentation as well.

If recovery_min_apply_delay were a proper GUC we'd just set the minimum
allowed value to be zero; but for the moment it seems better to treat
negative settings as if they were zero.

In passing do some extra wordsmithing on the parameter's documentation,
including correcting a second misstatement that the parameter affects
processing of Restore Point records.

Issue noted by Michael Paquier, who also provided the code patch; doc
changes by me.  Back-patch to 9.4 where the feature was introduced.
2015-01-03 13:14:03 -05:00
Alvaro Herrera
0e5680f473 Grab heavyweight tuple lock only before sleeping
We were trying to acquire the lock even when we were subsequently
not sleeping in some other transaction, which opens us up unnecessarily
to deadlocks.  In particular, this is troublesome if an update tries to
lock an updated version of a tuple and finds itself doing EvalPlanQual
update chain walking; more than two sessions doing this concurrently
will find themselves sleeping on each other because the HW tuple lock
acquisition in heap_lock_tuple called from EvalPlanQualFetch races with
the same tuple lock being acquired in heap_update -- one of these
sessions sleeps on the other one to finish while holding the tuple lock,
and the other one sleeps on the tuple lock.

Per trouble report from Andrew Sackville-West in
http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230

His scenario can be simplified down to a relatively simple
isolationtester spec file which I don't include in this commit; the
reason is that the current isolationtester is not able to deal with more
than one blocked session concurrently and it blocks instead of raising
the expected deadlock.  In the future, if we improve isolationtester, it
would be good to include the spec file in the isolation schedule.  I
posted it in
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org

Hat tip to Mark Kirkwood, who helped diagnose the trouble.
2014-12-26 13:52:27 -03:00
Tom Lane
966115c305 Temporarily revert "Move pg_lzcompress.c to src/common."
This reverts commit 60838df922.
That change needs a bit more thought to be workable.  In view of
the potentially machine-dependent stuff that went in today,
we need all of the buildfarm to be testing those other changes.
2014-12-25 13:22:55 -05:00
Andres Freund
7882c3b0b9 Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.
Besides being shorter and much easier to read it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
2014-12-25 17:24:30 +01:00
Fujii Masao
60838df922 Move pg_lzcompress.c to src/common.
Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error instead.

This commit is required for upcoming WAL compression feature so that
the WAL reader facility can decompress the WAL data by using pglz_decompress.

Michael Paquier
2014-12-25 20:46:14 +09:00
Alvaro Herrera
a609d96778 Revert "Use a bitmask to represent role attributes"
This reverts commit 1826987a46.

The overall design was deemed unacceptable, in discussion following the
previous commit message; we might find some parts of it still
salvageable, but I don't want to be on the hook for fixing it, so let's
wait until we have a new patch.
2014-12-23 15:35:49 -03:00
Alvaro Herrera
1826987a46 Use a bitmask to represent role attributes
The previous representation using a boolean column for each attribute
would not scale as well as we want to add further attributes.

Extra auxilliary functions are added to go along with this change, to
make up for the lost convenience of access of the old representation.

Catalog version bumped due to change in catalogs and the new functions.

Author: Adam Brightwell, minor tweaks by Álvaro
Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
2014-12-23 10:22:09 -03:00
Heikki Linnakangas
e7032610f7 Use a pairing heap for the priority queue in kNN-GiST searches.
This performs slightly better, uses less memory, and needs slightly less
code in GiST, than the Red-Black tree previously used.

Reviewed by Peter Geoghegan
2014-12-22 12:05:57 +02:00
Heikki Linnakangas
2ef6c66a2b Fix file descriptor leak at end of recovery.
XLogFileInit() returns a file descriptor, which needs to be closed. The leak
was short-lived, since the startup process exits shortly afterwards, but it
was clearly a bug, nevertheless.

Per Coverity report.
2014-12-21 21:51:59 +02:00
Heikki Linnakangas
5c805d0a81 Fix timestamp in end-of-recovery WAL records.
We used time(null) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.

Backpatch to 9.3 and above, where the fast promotion was introduced.
2014-12-19 17:04:20 +02:00
Tom Lane
4a14f13a0a Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create().  Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate.  Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions.  hash_create() itself will
take care of optimizing when the key size is four bytes.

This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.

In future we could look into offering a similar optimized hashing function
for 8-byte keys.  Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.

For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules.  Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.

Teodor Sigaev and Tom Lane
2014-12-18 13:36:36 -05:00
Heikki Linnakangas
ba94518aad Change how first WAL segment on new timeline after promotion is created.
Two changes:

1. When copying a WAL segment from old timeline to create the first segment
on the new timeline, only copy up to the point where the timeline switch
happens, and zero-fill the rest. This avoids corner cases where we might
think that the copied WAL from the previous timeline belong to the new
timeline.

2. If the timeline switch happens at a segment boundary, don't copy the
whole old segment to the new timeline. It's pointless, because it's 100%
identical to the old segment.
2014-12-18 20:23:03 +02:00
Andres Freund
c303e9e7e5 Fix (re-)starting from a basebackup taken off a standby after a failure.
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.

That logic had a error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.

That's possible when the a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because inbetween reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.

To fix simply also allow starting when the control file indicates
"shutdown in recovery". There's nicer fixes imaginable, but they'd be
more invasive.

Backpatch to 9.2 where support for taking basebackups from standbys
was added.
2014-12-18 08:47:27 +01:00
Tom Lane
06d5803ffa Fix assorted confusion between Oid and int32.
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.

Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).

Mark Dilger identified and patched the type violations; the message
rewordings are mine.
2014-12-11 15:41:15 -05:00
Simon Riggs
c270754719 Remove duplicate code in heap_prune_chain()
No need to set tuple tableOid twice

Jim Nasby
2014-12-08 08:44:37 +09:00
Simon Riggs
b8e33a85d4 Tweaks for recovery_target_action
Rename parameter action_at_recovery_target to
recovery_target_action suggested by Christoph Berg.

Place into recovery.conf suggested by Fujii Masao,
replacing (deprecating) earlier parameters, per
Michael Paquier.
2014-12-07 21:55:29 +09:00
Heikki Linnakangas
326b6f009f Print new track_commit_timestamp in rm_desc of a parameter-change record.
Michael Paquier
2014-12-05 12:11:43 +02:00
Heikki Linnakangas
c846e67c46 Print wal_log_hints in the rm_desc routing of a parameter-change record.
It was an oversight in the original commit.

Also note in the sample config file that changing wal_log_hints requires a
restart.

Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.
2014-12-05 12:00:48 +02:00
Alvaro Herrera
73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Alvaro Herrera
6597ec9be6 Fix typos 2014-12-03 11:52:15 -03:00
Tom Lane
1511521a36 Minor cleanup of function declarations for BRIN.
Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate
for built-in functions (possibly leftovers from testing as a loadable
module?).  Also, fix gratuitous inconsistency between SQL-level and
C-level names of the minmax support functions.
2014-12-02 14:07:54 -05:00
Alvaro Herrera
ae04bf5027 Update transaction README for persistent multixacts
Multixacts are now maintained during recovery, but the README didn't get
the memo.  Backpatch to 9.3, where the divergence was introduced.
2014-11-28 18:06:18 -03:00
Heikki Linnakangas
afeacd2748 Fix assertion failure at end of PITR.
InitXLogInsert() cannot be called in a critical section, because it
allocates memory. But CreateCheckPoint() did that, when called for the
end-of-recovery checkpoint by the startup process.

In the passing, fix the scratch space allocation in InitXLogInsert to go to
the right memory context. Also update the comment at InitXLOGAccess, which
hasn't been totally accurate since hot standby was introduced (in a hot
standby backend, InitXLOGAccess isn't called at backend startup).

Reported by Michael Paquier
2014-11-28 09:31:53 +02:00
Simon Riggs
aedccb1f6f action_at_recovery_target recovery config option
action_at_recovery_target = pause | promote | shutdown

Petr Jelinek

Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs
2014-11-25 20:13:30 +00:00
Heikki Linnakangas
49b86fb1c9 Add a few paragraphs to B-tree README explaining L&Y algorithm.
This gives an overview of what Lehman & Yao's paper is all about, so that
you can understand the rest of the README without having to read the paper.

Per discussion with Peter Geoghegan and others.
2014-11-24 13:43:33 +02:00
Heikki Linnakangas
0bd624d63b Distinguish XLOG_FPI records generated for hint-bit updates.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
2014-11-24 11:09:08 +02:00
Heikki Linnakangas
622983ea69 No need to call XLogEnsureRecordSpace when the relation is unlogged.
Amit Kapila
2014-11-21 15:13:15 +02:00