Commit Graph

3231 Commits

Author SHA1 Message Date
Andres Freund 9206ced1dc Clean up latch related code.
The larger part of this patch replaces usages of MyProc->procLatch
with MyLatch.  The latter works even early during backend startup,
where MyProc->procLatch doesn't yet.  While the affected code
shouldn't run in cases where it's not initialized, it might get copied
into places where it might.  Using MyLatch is simpler and a bit faster
to boot, so there's little point to stick with the previous coding.

While doing so I noticed some weaknesses around newly introduced uses
of latches that could lead to missed events, and an omitted
CHECK_FOR_INTERRUPTS() call in worker_spi.

As all the actual bugs are in v10 code, there doesn't seem to be
sufficient reason to backpatch this.

Author: Andres Freund
Discussion:
    https://postgr.es/m/20170606195321.sjmenrfgl2nu6j63@alap3.anarazel.de
    https://postgr.es/m/20170606210405.sim3yl6vpudhmufo@alap3.anarazel.de
Backpatch: -
2017-06-06 16:13:00 -07:00
Andres Freund c6c3334364 Prevent possibility of panics during shutdown checkpoint.
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so.  At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits).  So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases.  First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders.  If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.  Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.

Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
2017-06-05 19:18:15 -07:00
Andres Freund 703f148e98 Revert "Prevent panic during shutdown checkpoint"
This reverts commit 086221cf6b, which
was made to master only.

The approach implemented in the above commit has some issues.  While
those could easily be fixed incrementally, doing so would make
backpatching considerably harder, so instead first revert this patch.

Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
2017-06-05 19:18:15 -07:00
Tom Lane d466335064 Don't be so trusting that shm_toc_lookup() will always succeed.
Given the possibility of race conditions and so on, it seems entirely
unsafe to just assume that shm_toc_lookup() always finds the key it's
looking for --- but that was exactly what all but one call site were
doing.  To fix, add a "bool noError" argument, similarly to what we
have in many other functions, and throw an error on an unexpected
lookup failure.  Remove now-redundant Asserts that a rather random
subset of call sites had.

I doubt this will throw any light on buildfarm member lorikeet's
recent failures, because if an unnoticed lookup failure were involved,
you'd kind of expect a null-pointer-dereference crash rather than the
observed symptom.  But you never know ... and this is better coding
practice even if it never catches anything.

Discussion: https://postgr.es/m/9697.1496675981@sss.pgh.pa.us
2017-06-05 12:05:42 -04:00
Alvaro Herrera 55a70a023c Assorted translatable string fixes
Mark our rusage reportage string translatable; remove quotes from type
names; unify formatting of very similar messages.
2017-06-04 11:41:16 -04:00
Robert Haas 814573e6c4 Restore accidentally-removed line.
Commit 88e66d193f is to blame.

Masahiko Sawada

Discussion: http://postgr.es/m/CAD21AoAXeb7O4hgg+efs8JT_SxpR4doAH5c5s-Z5WoRLstBZJA@mail.gmail.com
2017-05-31 14:24:22 -04:00
Alvaro Herrera b4da9d0e1e brin: Don't crash on auto-summarization
We were trying to free a pointer into a shared buffer, which never
works; and we were failing to release the buffer lock appropriately.
Fix those omissions.

While at it, improve documentation for brinGetTupleForHeapBlock, the
inadequacy of which evidently caused these bugs in the first place.

Reported independently by Zhou Digoal (bug #14668) and Alexander Sosna.

Discussion: https://postgr.es/m/8c31c11b-6adb-228d-22c2-4ace89fc9209@credativ.de
Discussion: https://postgr.es/m/20170524063323.29941.46339@wrigleys.postgresql.org
2017-05-30 18:17:09 -04:00
Alvaro Herrera e6785a5ca1 Fix wording in amvalidate error messages
Remove some gratuituous message differences by making the AM name
previously embedded in each message be a %s instead.  While at it, get
rid of terminology that's unclear and unnecessary in one message.

Discussion: https://postgr.es/m/20170523001557.bq2hbq7hxyvyw62q@alvherre.pgsql
2017-05-30 15:45:42 -04:00
Magnus Hagander 312bac54cc Fix typo in comment
Author: Masahiko Sawada
2017-05-22 09:10:02 +02:00
Bruce Momjian a6fd7b7a5f Post-PG 10 beta1 pgindent run
perltidy run not included.
2017-05-17 16:31:56 -04:00
Peter Eisentraut d496a65790 Standardize "WAL location" terminology
Other previously used terms were "WAL position" or "log position".
2017-05-12 13:51:27 -04:00
Peter Eisentraut c1a7f64b4a Replace "transaction log" with "write-ahead log"
This makes documentation and error messages match the renaming of "xlog"
to "wal" in APIs and file naming.
2017-05-12 11:52:43 -04:00
Tom Lane d10c626de4 Rename WAL-related functions and views to use "lsn" not "location".
Per discussion, "location" is a rather vague term that could refer to
multiple concepts.  "LSN" is an unambiguous term for WAL locations and
should be preferred.  Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position".  Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.

David Rowley, minor additional docs hacking by me

Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
2017-05-11 11:49:59 -04:00
Alvaro Herrera b66adb7b0c Revert "Permit dump/reload of not-too-large >1GB tuples"
This reverts commits fa2fa99552 and 42f50cb8fa.

While the functionality that was intended to be provided by these
commits is desired, the patch didn't actually solve as many of the
problematic situations as we hoped, and it created a bunch of its own
problems.  Since we're going to require more extensive changes soon for
other reasons and users have been working around these problems for a
long time already, there is no point in spending effort in fixing this
halfway measure.

Per complaint from Tom Lane.
Discussion: https://postgr.es/m/21407.1484606922@sss.pgh.pa.us

(Commit fa2fa99552 had already been reverted in branches 9.5 as
f858524ee4 and 9.6 as e9e44a0953, so this touches master only.
Commit 42f50cb8fa was not present in the older branches.)
2017-05-10 18:41:27 -03:00
Robert Haas a5775991bb Remove no-longer-needed compatibility code for hash indexes.
Because commit ea69a0dead bumped the
HASH_VERSION, we don't need to worry about PostgreSQL 10 seeing
bucket pages from earlier versions.

Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LAo4DGwh+mi-G3U8Pj1WkBBeFL38xdCnUHJv1z4bZFkQ@mail.gmail.com
2017-05-09 23:44:21 -04:00
Peter Eisentraut 086221cf6b Prevent panic during shutdown checkpoint
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so.  At that point, only walsenders are still active, so one
might think this could not happen, but walsenders can also generate WAL,
for instance in BASE_BACKUP and certain variants of
CREATE_REPLICATION_SLOT.  So they can trigger this panic if such a
command is run while the shutdown checkpoint is being written.

To fix this, divide the walsender shutdown into two phases.  First, the
postmaster sends a SIGUSR2 signal to all walsenders.  The walsenders
then put themselves into the "stopping" state.  In this state, they
reject any new commands.  (For simplicity, we reject all new commands,
so that in the future we do not have to track meticulously which
commands might generate WAL.)  The checkpointer waits for all walsenders
to reach this state before proceeding with the shutdown checkpoint.
After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders.  This triggers the
existing shutdown behavior of sending out the shutdown checkpoint record
and then terminating.

Author: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-05 10:31:42 -04:00
Tom Lane 3f074845a8 Fix pfree-of-already-freed-tuple when rescanning a GiST index-only scan.
GiST's getNextNearest() function attempts to pfree the previously-returned
tuple if any (that is, scan->xs_hitup in HEAD, or scan->xs_itup in older
branches).  However, if we are rescanning a plan node after ending a
previous scan early, those tuple pointers could be pointing to garbage,
because they would be pointing into the scan's pageDataCxt or queueCxt
which has been reset.  In a debug build this reliably results in a crash,
although I think it might sometimes accidentally fail to fail in
production builds.

To fix, clear the pointer field anyplace we reset a context it might
be pointing into.  This may be overkill --- I think probably only the
queueCxt case is involved in this bug, so that resetting in gistrescan()
would be sufficient --- but dangling pointers are generally bad news,
so let's avoid them.

Another plausible answer might be to just not bother with the pfree in
getNextNearest().  The reconstructed tuples would go away anyway in the
context resets, and I'm far from convinced that freeing them a bit earlier
really saves anything meaningful.  I'll stick with the original logic in
this patch, but if we find more problems in the same area we should
consider that approach.

Per bug #14641 from Denis Smirnov.  Back-patch to 9.5 where this
logic was introduced.

Discussion: https://postgr.es/m/20170504072034.24366.57688@wrigleys.postgresql.org
2017-05-04 13:59:39 -04:00
Peter Eisentraut 9414e41ea7 Fix logical replication launcher wake up and reset
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well.  So fix that by resetting the flag.

Also, we don't need to wake up anything if the transaction was rolled
back.  Just reset the flag in that case.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
2017-05-01 10:18:09 -04:00
Simon Riggs 49e9281549 Rework handling of subtransactions in 2PC recovery
The bug fixed by 0874d4f3e1
caused us to question and rework the handling of
subtransactions in 2PC during and at end of recovery.
Patch adds checks and tests to ensure no further bugs.

This effectively removes the temporary measure put in place
by 546c13e11b.

Author: Simon Riggs
Reviewed-by: Tom Lane, Michael Paquier
Discussion: http://postgr.es/m/CANP8+j+vvXmruL_i2buvdhMeVv5TQu0Hm2+C5N+kdVwHJuor8w@mail.gmail.com
2017-04-27 14:41:22 +02:00
Simon Riggs 546c13e11b Workaround for RecoverPreparedTransactions()
Force overwriteOK = true while we investigate deeper fix

Proposed by Tom Lane as temporary measure, accepted by me
2017-04-23 22:12:01 +01:00
Tom Lane 0874d4f3e1 Fix order of arguments to SubTransSetParent().
ProcessTwoPhaseBuffer (formerly StandbyRecoverPreparedTransactions)
mixed up the parent and child XIDs when calling SubTransSetParent to
record the transactions' relationship in pg_subtrans.

Remarkably, analysis by Simon Riggs suggests that this doesn't lead to
visible problems (at least, not in non-Assert builds).  That might
explain why we'd not noticed it before.  Nonetheless, it's surely wrong.

This code was born broken, so back-patch to all supported branches.

Discussion: https://postgr.es/m/20110.1492905318@sss.pgh.pa.us
2017-04-23 13:11:06 -04:00
Simon Riggs ee01f7092f Exit correctly from PrepareRedoRemove() when not found
Complex crash bug all started with this failure.
Diagnosed and fixed by Nikhil Sontakke, reviewed by me.

Reported-by: Jeff Janes
Author: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMkU=1xBP8cqdS5eK8APHL=X6RHMMM2vG5g+QamduuTsyCwv9g@mail.gmail.com
2017-04-18 11:35:38 +01:00
Simon Riggs aa203e7600 Don’t push nextid too far forwards in recovery
Doing so allows various crash possibilities. Fix by avoiding
having PrescanPreparedTransactions() increment
ShmemVariableCache->nextXid when it has no 2PC files

Bug found by Jeff Janes, diagnosis and patch by Pavan Deolasee,
then patch re-designed for clarity and full accuracy by
Michael Paquier.

Reported-by: Jeff Janes
Author: Pavan Deolasee, Michael Paquier
Discussion: https://postgr.es/m/CAMkU=1zMLnH_i1-PVQ-biZzvNx7VcuatriquEnh7HNk6K8Ss3Q@mail.gmail.com
2017-04-18 11:14:05 +01:00
Peter Eisentraut 6275f5d28a Fix new warnings from GCC 7
This addresses the new warning types -Wformat-truncation
-Wformat-overflow that are part of -Wall, via -Wformat, in GCC 7.
2017-04-17 13:59:46 -04:00
Tom Lane b6dd127128 Ensure BackgroundWorker struct contents are well-defined.
Coverity complained because bgw.bgw_extra wasn't being filled in by
ApplyLauncherRegister().  The most future-proof fix is to memset the
whole BackgroundWorker struct to zeroes.  While at it, let's apply the
same coding rule to other places that set up BackgroundWorker structs;
four out of five had the same or related issues.
2017-04-16 23:23:44 -04:00
Peter Eisentraut c7d225e227 Fix typo in comment
Author: Masahiko Sawada <sawada.mshk@gmail.com>
2017-04-16 19:47:37 -04:00
Tom Lane 083dc95a14 More cleanup of manipulations of hash indexes' hasho_flag field.
Not much point in defining test macros for the flag bits if we
don't use 'em.

Amit Kapila
2017-04-15 14:11:15 -04:00
Tom Lane 32470825d3 Avoid passing function pointers across process boundaries.
We'd already recognized that we can't pass function pointers across process
boundaries for functions in loadable modules, since a shared library could
get loaded at different addresses in different processes.  But actually the
practice doesn't work for functions in the core backend either, if we're
using EXEC_BACKEND.  This is the cause of recent failures on buildfarm
member culicidae.  Switch to passing a string function name in all cases.

Something like this needs to be back-patched into 9.6, but let's see
if the buildfarm likes it first.

Petr Jelinek, with a bunch of basically-cosmetic adjustments by me

Discussion: https://postgr.es/m/548f9c1d-eafa-e3fa-9da8-f0cc2f654e60@2ndquadrant.com
2017-04-14 23:50:16 -04:00
Tom Lane 2040bb4a0b Clean up manipulations of hash indexes' hasho_flag field.
Standardize on testing a hash index page's type by doing
	(opaque->hasho_flag & LH_PAGE_TYPE) == LH_xxx_PAGE
Various places were taking shortcuts like
	opaque->hasho_flag & LH_BUCKET_PAGE
which while not actually wrong, is still bad practice because
it encourages use of
	opaque->hasho_flag & LH_UNUSED_PAGE
which *is* wrong (LH_UNUSED_PAGE == 0, so the above is constant false).
hash_xlog.c's hash_mask() contained such an incorrect test.

This also ensures that we mask out the additional flag bits that
hasho_flag has accreted since 9.6.  pgstattuple's pgstat_hash_page(),
for one, was failing to do that and was thus actively broken.

Also fix assorted comments that hadn't been updated to reflect the
extended usage of hasho_flag, and fix some macros that were testing
just "(hasho_flag & bit)" to use the less dangerous, project-approved
form "((hasho_flag & bit) != 0)".

Coverity found the bug in hash_mask(); I noted the one in
pgstat_hash_page() through code reading.
2017-04-14 17:04:25 -04:00
Alvaro Herrera 8bf74967da Reduce the number of pallocs() in BRIN
Instead of allocating memory in brin_deform_tuple and brin_copy_tuple
over and over during a scan, allow reuse of previously allocated memory.
This is said to make for a measurable performance improvement.

Author: Jinyu Zhang, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/495deb78.4186.1500dacaa63.Coremail.beijing_pg@163.com
2017-04-07 19:08:43 -03:00
Alvaro Herrera 817cb10013 Fix new BRIN desummarize WAL record
The WAL-writing piece was forgetting to set the pages-per-range value.
Also, fix the declared type of struct member heapBlk, which I mistakenly
set as OffsetNumber rather than BlockNumber.

Problem was introduced by commit c655899ba9 (April 1st).  Any system
that tries to replay the new WAL record written before this fix is
likely to die on replay and require pg_resetwal.

Reported by Tom Lane.
Discussion: https://postgr.es/m/20191.1491524824@sss.pgh.pa.us
2017-04-07 17:11:56 -03:00
Tom Lane 3f902354b0 Clean up after insufficiently-researched optimization of tuple conversions.
tupconvert.c's functions formerly considered that an explicit tuple
conversion was necessary if the input and output tupdescs contained
different type OIDs.  The point of that was to make sure that a composite
datum resulting from the conversion would contain the destination rowtype
OID in its composite-datum header.  However, commit 3838074f8 entirely
misunderstood what that check was for, thinking that it had something to do
with presence or absence of an OID column within the tuple.  Removal of the
check broke the no-op conversion path in ExecEvalConvertRowtype, as
reported by Ashutosh Bapat.

It turns out that of the dozen or so call sites for tupconvert.c functions,
ExecEvalConvertRowtype is the only one that cares about the composite-datum
header fields in the output tuple.  In all the rest, we'd much rather avoid
an unnecessary conversion whenever the tuples are physically compatible.
Moreover, the comments in tupconvert.c only promise physical compatibility
not a metadata match.  So, let's accept the removal of the guarantee about
the output tuple's rowtype marking, recognizing that this is a API change
that could conceivably break third-party callers of tupconvert.c.  (So,
let's remember to mention it in the v10 release notes.)

However, commit 3838074f8 did have a bit of a point here, in that two
tuples mustn't be considered physically compatible if one has HEAP_HASOID
set and the other doesn't.  (Some of the callers of tupconvert.c might not
really care about that, but we can't assume it in general.)  The previous
check accidentally covered that issue, because no RECORD types ever have
OIDs, while if two tupdescs have the same named composite type OID then,
a fortiori, they have the same tdhasoid setting.  If we're removing the
type OID match check then we'd better include tdhasoid match as part of
the physical compatibility check.

Without that hack in tupconvert.c, we need ExecEvalConvertRowtype to take
responsibility for inserting the correct rowtype OID label whenever
tupconvert.c decides it need not do anything.  This is easily done with
heap_copy_tuple_as_datum, which will be considerably faster than a tuple
disassembly and reassembly anyway; so from a performance standpoint this
change is a win all around compared to what happened in earlier branches.
It just means a couple more lines of code in ExecEvalConvertRowtype.

Ashutosh Bapat and Tom Lane

Discussion: https://postgr.es/m/CAFjFpRfvHABV6+oVvGcshF8rHn+1LfRUhj7Jz1CDZ4gPUwehBg@mail.gmail.com
2017-04-06 21:10:20 -04:00
Alvaro Herrera 7e534adcdc Fix BRIN cost estimation
The original code was overly optimistic about the cost of scanning a
BRIN index, leading to BRIN indexes being selected when they'd be a
worse choice than some other index.  This complete rewrite should be
more accurate.

Author: David Rowley, based on an earlier patch by Emre Hasegeli
Reviewed-by: Emre Hasegeli
Discussion: https://postgr.es/m/CAKJS1f9n-Wapop5Xz1dtGdpdqmzeGqQK4sV2MK-zZugfC14Xtw@mail.gmail.com
2017-04-06 17:51:53 -03:00
Peter Eisentraut e6c9a5a9bc Fix mixup of bool and ternary value
Not currently a problem, but could be with stricter bool behavior under
stdbool or C++.

Reviewed-by: Andres Freund <andres@anarazel.de>
2017-04-06 13:09:42 -04:00
Simon Riggs cd0cebaf7d Always SnapshotResetXmin() during ClearTransaction()
Avoid corner cases during 2PC with 6bad580d9e
2017-04-06 10:30:22 -04:00
Peter Eisentraut 3217327053 Identity columns
This is the SQL standard-conforming variant of PostgreSQL's serial
columns.  It fixes a few usability issues that serial columns have:

- CREATE TABLE / LIKE copies default but refers to same sequence
- cannot add/drop serialness with ALTER TABLE
- dropping default does not drop sequence
- need to grant separate privileges to sequence
- other slight weirdnesses because serial is some kind of special macro

Reviewed-by: Vitaly Burovoy <vitaly.burovoy@gmail.com>
2017-04-06 08:41:37 -04:00
Simon Riggs 6bad580d9e Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin,
so we can simply avoid running SnapshotResetXmin() twice.

During performance tests by Alexander Korotkov, diagnosis
by Andres Freund showed PgXact array as a bottleneck. After
manual analysis by me of the code paths that touch those
memory locations, I was able to identify extraneous code
in the main transaction commit path.

Avoiding touching highly contented shmem improves concurrent
performance slightly on all workloads, confirmed by tests
run by Ashutosh Sharma and Alexander Korotkov.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-04-06 08:31:52 -04:00
Robert Haas 633e15ea0f Fix pageinspect failures on hash indexes.
Make every page in a hash index which isn't all-zeroes have a valid
special space, so that tools like pageinspect don't error out.

Also, make pageinspect cope with all-zeroes pages, because
_hash_alloc_buckets can leave behind large numbers of those until
they're consumed by splits.

Ashutosh Sharma and Robert Haas, reviewed by Amit Kapila.
Original trouble report from Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1y6NjKmqbJ8wLMhr=F74WzcMALYWcVFhEpm7i=mV=XsOg@mail.gmail.com
2017-04-05 14:18:15 -04:00
Robert Haas 75a1cbdc3c hash: Fix write-ahead logging bug.
The size of the data is not the same thing as the size of the size of
the data.

Reported off-list by Tushar Ahuja.  Fix by Ashutosh Sharma, reviewed
by Amit Kapila.

Discussion: http://postgr.es/m/CAE9k0PnmPDXfvf8HDObme7q_Ewc4E26ukHXUBPySoOs0ObqqaQ@mail.gmail.com
2017-04-05 11:45:35 -04:00
Peter Eisentraut afd79873a0 Capitalize names of PLs consistently
Author: Daniel Gustafsson <daniel@yesql.se>
2017-04-05 00:38:25 -04:00
Simon Riggs 9a3215026b Make min_wal_size/max_wal_size use MB internally
Previously they were defined using multiples of XLogSegSize.
Remove GUC_UNIT_XSEGS. Introduce GUC_UNIT_MB

Extracted from patch series on XLogSegSize infrastructure.

Beena Emerson
2017-04-04 18:00:01 -04:00
Simon Riggs cd740c0dbf Fix uninitialized variables in twophase.c 2017-04-04 17:50:02 -04:00
Simon Riggs 728bd991c3 Speedup 2PC recovery by skipping two phase state files in normal path
2PC state info held in shmem at PREPARE, then cleaned at COMMIT PREPARED/ABORT PREPARED,
avoiding writing/fsyncing any state information to disk in the normal path, greatly enhancing replay speed.
Prepared transactions that live past one checkpoint redo horizon will be written to disk as now.
Similar conceptually to 978b2f65aa and building upon
the infrastructure created by that commit.

Authors, in equal measure: Stas Kelvich, Nikhil Sontakke and Michael Paquier
Discussion: https://postgr.es/m/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=VjoMnRnW5s+fKST3w@mail.gmail.com
2017-04-04 15:56:56 -04:00
Robert Haas b38006ef6d Fix formula in _hash_spareindex.
This was correct in earlier versions of the patch that lead to
commit ea69a0dead, but somehow got
broken in the last version which I actually committed.

Mithun Cy, per an off-list report from Ashutosh Sharma

Discussion: http://postgr.es/m/CAD__OujbAwNU71v1y-RoQxZ8LZ6-V2UFTkex3v34MK6uZ3Xb5w@mail.gmail.com
2017-04-04 07:45:04 -04:00
Robert Haas ea69a0dead Expand hash indexes more gradually.
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion).  To mitigate this problem to some
degree, divide larger splitpoints into four equal phases.  This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.

This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3.  This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway.  First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10.  Second,
it will let us remove some backward-compatibility code added by commit
293e24e507.

Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me.  Regression
test outputs updated by me.

Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
2017-04-03 23:46:33 -04:00
Robert Haas 93cd7684ee Properly acquire buffer lock for page-at-a-time hash vacuum.
In a couple of places, _hash_kill_items was mistakenly called with
the buffer lock not held.  Repair.

Ashutosh Sharma, per a report from Andreas Seltenreich

Discussion: http://postgr.es/m/87o9wo8o0j.fsf@credativ.de
2017-04-03 22:26:06 -04:00
Alvaro Herrera c655899ba9 BRIN de-summarization
When the BRIN summary tuple for a page range becomes too "wide" for the
values actually stored in the table (because the tuples that were
present originally are no longer present due to updates or deletes), it
can be useful to remove the outdated summary tuple, so that a future
summarization can install a tighter summary.

This commit introduces a SQL-callable interface to do so.

Author: Álvaro Herrera
Reviewed-by: Eiji Seki
Discussion: https://postgr.es/m/20170228045643.n2ri74ara4fhhfxf@alvherre.pgsql
2017-04-01 16:10:04 -03:00
Alvaro Herrera 7526e10224 BRIN auto-summarization
Previously, only VACUUM would cause a page range to get initially
summarized by BRIN indexes, which for some use cases takes too much time
since the inserts occur.  To avoid the delay, have brininsert request a
summarization run for the previous range as soon as the first tuple is
inserted into the first page of the next range.  Autovacuum is in charge
of processing these requests, after doing all the regular vacuuming/
analyzing work on tables.

This doesn't impose any new tasks on autovacuum, because autovacuum was
already in charge of doing summarizations.  The only actual effect is to
change the timing, i.e. that it occurs earlier.  For this reason, we
don't go any great lengths to record these requests very robustly; if
they are lost because of a server crash or restart, they will happen at
a later time anyway.

Most of the new code here is in autovacuum, which can now be told about
"work items" to process.  This can be used for other things such as GIN
pending list cleaning, perhaps visibility map bit setting, both of which
are currently invoked during vacuum, but do not really depend on vacuum
taking place.

The requests are at the page range level, a granularity for which we did
not have SQL-level access; we only had index-level summarization
requests via brin_summarize_new_values().  It seems reasonable to add
SQL-level access to range-level summarization too, so add a function
brin_summarize_range() to do that.

Authors: Álvaro Herrera, based on sketch from Simon Riggs.
Reviewed-by: Thomas Munro.
Discussion: https://postgr.es/m/20170301045823.vneqdqkmsd4as4ds@alvherre.pgsql
2017-04-01 14:00:53 -03:00
Robert Haas 2113ac4cbb Don't use bgw_main even to specify in-core bgworker entrypoints.
On EXEC_BACKEND builds, this can fail if ASLR is in use.

Backpatch to 9.5.  On master, completely remove the bgw_main field
completely, since there is no situation in which it is safe for an
EXEC_BACKEND build.  On 9.6 and 9.5, leave the field intact to avoid
breaking things for third-party code that doesn't care about working
under EXEC_BACKEND.  Prior to 9.5, there are no in-core bgworker
entrypoints.

Petr Jelinek, reviewed by me.

Discussion: http://postgr.es/m/09d8ad33-4287-a09b-a77f-77f8761adb5e@2ndquadrant.com
2017-03-31 20:43:32 -04:00
Robert Haas c94e6942ce Don't allocate storage for partitioned tables.
Also, don't allow setting reloptions on them, since that would have no
effect given the lack of storage.  The patch does this by introducing
a new reloption kind for which there are currently no reloptions -- we
might have some in the future -- so it adjusts parseRelOptions to
handle that case correctly.

Bumped catversion.  System catalogs that contained reloptions for
partitioned tables are no longer valid; plus, there are now fewer
physical files on disk, which is not technically a catalog change but
still a good reason to re-initdb.

Amit Langote, reviewed by Maksim Milyutin and Kyotaro Horiguchi and
revised a bit by me.

Discussion: http://postgr.es/m/20170331.173326.212311140.horiguchi.kyotaro@lab.ntt.co.jp
2017-03-31 16:28:51 -04:00
Alvaro Herrera 2fd8685e7f Simplify check of modified attributes in heap_update
The old coding was getting more complicated as new things were added,
and it would be barely tolerable with upcoming WARM updates and other
future features such as indirect indexes.  The new coding incurs a small
performance cost in synthetic benchmark cases, and is barely measurable
in normal cases.  A much larger benefit is expected from WARM, which
could actually bolt its needs on top of the existing coding, but it is
much uglier and bug-prone than doing it on this new code.  Additional
optimization can be applied on top of this, if need be.

Reviewed-by: Pavan Deolasee, Amit Kapila, Mithun CY
Discussion: https://postgr.es/m/20161228232018.4hc66ndrzpz4g4wn@alvherre.pgsql
	https://postgr.es/m/CABOikdMJfz69dBNRTOZcB6s5A0tf8OMCyQVYQyR-WFFdoEwKMQ@mail.gmail.com
2017-03-29 14:01:14 -03:00
Alvaro Herrera ce96ce60ca Remove direct uses of ItemPointer.{ip_blkid,ip_posid}
There are no functional changes here; this simply encapsulates knowledge
of the ItemPointerData struct so that a future patch can change things
without more breakage.

All direct users of ip_blkid and ip_posid are changed to use existing
macros ItemPointerGetBlockNumber and ItemPointerGetOffsetNumber
respectively.  For callers where that's inappropriate (because they
Assert that the itempointer is is valid-looking), add
ItemPointerGetBlockNumberNoCheck and ItemPointerGetOffsetNumberNoCheck,
which lack the assertion but are otherwise identical.

Author: Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdNnFon4cJiL=h1mZH3bgUeU+sWHuU4Yr8AB=j3A2p1GiA@mail.gmail.com
2017-03-28 19:02:23 -03:00
Simon Riggs a99f77021f Correct grammar in error message
"could not generate" rather than "could not generation"
from commit 818fd4a67d
2017-03-28 13:24:39 -04:00
Tom Lane 8cfeaecfc7 Suppress implicit-conversion warnings seen with newer clang versions.
We were assigning values near 255 through "char *" pointers.  On machines
where char is signed, that's not entirely kosher, and it's reasonable
for compilers to warn about it.

A better solution would be to change the pointer type to "unsigned char *",
but that would be vastly more invasive.  For the moment, let's just apply
this simple backpatchable solution.

Aleksander Alekseev

Discussion: https://postgr.es/m/20170220141239.GD12278@e733.localdomain
Discussion: https://postgr.es/m/2839.1490714708@sss.pgh.pa.us
2017-03-28 13:16:19 -04:00
Robert Haas c4c51541e2 Still more code review for single-page hash vacuuming.
Most seriously, fix use of incorrect block ID, per a report from
Jeff Janes that it causes a crash and a diagnosis from Amit Kapila.

Improve consistency between the hash and btree versions of this
code by adding back a PANIC that btree has, and by registering
data in the xlog record in the same way, per complaints from
Jeff Janes and Amit Kapila.

Tidy up some minor cosmetic points, per complaints from Amit
Kapila.

Patch by Ashutosh Sharma, reviewed by Amit Kapila, and tested by
Jeff Janes.

Discussion: http://postgr.es/m/CAMkU=1w-9Qe=Ff1o6bSaXpNO9wqpo7_9GL8_CVhw4BoVVHasqg@mail.gmail.com
2017-03-27 12:51:10 -04:00
Teodor Sigaev 1b02be21f2 Fsync directory after creating or unlinking file.
If file was created/deleted just before powerloss it's possible that
file system will miss that. To prevent it, call fsync() where creating/
unlinkg file is critical.

Author: Michael Paquier
Reviewed-by: Ashutosh Bapat, Takayuki Tsunakawa, me
2017-03-27 19:33:01 +03:00
Robert Haas f0a6046bcb Fix comment.
Cut-and-paste led to something silly.

Ashutosh Sharma, reviewed by Amit Kapila and by me

Discussion: http://postgr.es/m/CAE9k0PmUbvQSBY7kwN_OkuqBYyHRXBX-c1ZkuAgR5vgF0GeWzQ@mail.gmail.com
2017-03-26 22:15:50 -04:00
Tom Lane 2f0903ea19 Improve performance of ExecEvalWholeRowVar.
In commit b8d7f053c, we needed to fix ExecEvalWholeRowVar to not change
the state of the slot it's copying.  The initial quick hack at that
required two rounds of tuple construction, which is not very nice.
To fix, add another primitive to tuptoaster.c that does precisely what
we need.  (I initially tried to do this by refactoring one of the
existing functions into two pieces; but it looked like that might hurt
performance for the existing case, and the amount of code that could
be shared is not very large, so I gave up on that.)

Discussion: https://postgr.es/m/26088.1490315792@sss.pgh.pa.us
2017-03-26 19:14:57 -04:00
Simon Riggs 3428ef7911 Reverting 42b4b0b241
Buildfarm issues and other reported issues
2017-03-24 17:56:17 +00:00
Simon Riggs 42b4b0b241 Avoid SnapshotResetXmin() during AtEOXact_Snapshot()
For normal commits and aborts we already reset PgXact->xmin
Avoiding touching highly contented shmem improves concurrent
performance.

Simon Riggs

Discussion: CANP8+jJdXE9b+b9F8CQT-LuxxO0PBCB-SZFfMVAdp+akqo4zfg@mail.gmail.com
2017-03-24 14:20:59 +00:00
Teodor Sigaev 78874531ba Fix backup canceling
Assert-enabled build crashes but without asserts it works by wrong way:
it may not reset forcing full page write and preventing from starting
exclusive backup with the same name as cancelled.
Patch replaces pair of booleans
nonexclusive_backup_running/exclusive_backup_running to single enum to
correctly describe backup state.

Backpatch to 9.6 where bug was introduced

Reported-by: David Steele
Authors: Michael Paquier, David Steele
Reviewed-by: Anastasia Lubennikova

https://commitfest.postgresql.org/13/1068/
2017-03-24 13:53:40 +03:00
Robert Haas ea42cc18c3 Track the oldest XID that can be safely looked up in CLOG.
This provides infrastructure for looking up arbitrary, user-supplied
XIDs without a risk of scary-looking failures from within the clog
module.  Normally, the oldest XID that can be safely looked up in CLOG
is the same as the oldest XID that can reused without causing
wraparound, and the latter is already tracked.  However, while
truncation is in progress, the values are different, so we must
keep track of them separately.

Craig Ringer, reviewed by Simon Riggs and by me.

Discussion: http://postgr.es/m/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
2017-03-23 14:26:31 -04:00
Teodor Sigaev 218f51584d Reduce page locking in GIN vacuum
GIN vacuum during cleaning posting tree can lock this whole tree for a long
time with by holding LockBufferForCleanup() on root. Patch changes it with
two ways: first, cleanup lock will be taken only if there is an empty page
(which should be deleted) and, second, it tries to lock only subtree, not the
whole posting tree.

Author: Andrey Borodin with minor editorization by me
Reviewed-by: Jeff Davis, me

https://commitfest.postgresql.org/13/896/
2017-03-23 19:38:47 +03:00
Simon Riggs 6912acc04f Replication lag tracking for walsenders
Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

Implements a lag tracker module that reports the lag times based upon
measurements of the time taken for recent WAL to be written, flushed and
replayed and for the sender to hear about it. These times
represent the commit lag that was (or would have been) introduced by each
synchronous commit level, if the remote server was configured as a
synchronous standby.  For an asynchronous standby, the replay_lag column
approximates the delay before recent transactions became visible to queries.
If the standby server has entirely caught up with the sending server and
there is no more WAL activity, the most recently measured lag times will
continue to be displayed for a short time and then show NULL.

Physical replication lag tracking is automatic. Logical replication tracking
is possible but is the responsibility of the logical decoding plugin.
Tracking is a private module operating within each walsender individually,
with values reported to shared memory. Module not used outside of walsender.

Design and code is good enough now to commit - kudos to the author.
In many ways a difficult topic, with important and subtle behaviour so this
shoudl be expected to generate discussion and multiple open items: Test now!

Author: Thomas Munro, following designs by Fujii Masao and Simon Riggs
Review: Simon Riggs, Ian Barwick and Craig Ringer
2017-03-23 14:05:28 +00:00
Stephen Frost 017e4f2588 Expose waitforarchive option through pg_stop_backup()
Internally, we have supported the option to either wait for all of the
WAL associated with a backup to be archived, or to return immediately.
This option is useful to users of pg_stop_backup() as well, when they
are reading the stop backup record position and checking that the WAL
they need has been archived independently.

This patch adds an additional, optional, argument to pg_stop_backup()
which allows the user to indicate if they wish to wait for the WAL to be
archived or not.  The default matches current behavior, which is to
wait.

Author: David Steele, with some minor changes, doc updates by me.
Reviewed by: Takayuki Tsunakawa, Fujii Masao
Discussion: https://postgr.es/m/758e3fd1-45b4-5e28-75cd-e9e7f93a4c02@pgmasters.net
2017-03-22 23:44:58 -04:00
Simon Riggs af4b1a0869 Refactor GetOldestXmin() to use flags
Replace ignoreVacuum parameter with more flexible flags.

Author: Eiji Seki
Review: Haribabu Kommi
2017-03-22 16:51:01 +00:00
Simon Riggs 9b013dc238 Improve performance of replay of AccessExclusiveLocks
A hot standby replica keeps a list of Access Exclusive locks for a top
level transaction. These locks are released when the top level transaction
ends. Searching of this list is O(N^2), and each transaction had to pay the
price of searching this list for locks, even if it didn't take any AE
locks itself.

This patch optimizes this case by having the master server track which
transactions took AE locks, and passes that along to the standby server in
the commit/abort record. This allows the standby to only try to release
locks for transactions which actually took any, avoiding the majority of
the performance issue.

Refactor MyXactAccessedTempRel into MyXactFlags to allow minimal additional
cruft with this.

Analysis and initial patch by David Rowley
Author: David Rowley and Simon Riggs
2017-03-22 13:09:36 +00:00
Simon Riggs 1148e22a82 Teach xlogreader to follow timeline switches
Uses page-based mechanism to ensure we’re using the correct timeline.

Tests are included to exercise the functionality using a cold disk-level copy
of the master that's started up as a replica with slots intact, but the
intended use of the functionality is with later features.

Craig Ringer, reviewed by Simon Riggs and Andres Freund
2017-03-22 07:05:12 +00:00
Robert Haas 9abbf4727d Another fix for single-page hash index vacuum.
The WAL consistency checking code needed to be updated for the new
page status bit, but that didn't get done previously.

Ashutosh Sharma, reviewed by Amit Kapila

Discussion: http://postgr.es/m/CAA4eK1LP_oz4EfMen14OjJuzN5CqPdfRkFFuA-MfkcfeE8zGyg@mail.gmail.com
2017-03-20 15:55:27 -04:00
Robert Haas 953477ca35 Fixes for single-page hash index vacuum.
Clear LH_PAGE_HAS_DEAD_TUPLES during replay, similar to what gets done
for btree.  Update hashdesc.c for xl_hash_vacuum_one_page.

Oversights in commit 6977b8b7f4 spotted
by Amit Kapila.  Patch by Ashutosh Sharma.

Bump WAL version.  The original patch to make hash indexes write-ahead
logged probably should have done this, and the single page vacuuming
patch probably should have done it again, but better late than never.

Discussion: http://postgr.es/m/CAA4eK1Kd=mJ9xreovcsh0qMiAj-QqCphHVQ_Lfau1DR9oVjASQ@mail.gmail.com
2017-03-20 15:49:09 -04:00
Robert Haas 249cf070e3 Create and use wait events for read, write, and fsync operations.
Previous commits, notably 53be0b1add and
6f3bd98ebf, made it possible to see from
pg_stat_activity when a backend was stuck waiting for another backend,
but it's also fairly common for a backend to be stuck waiting for an
I/O.  Add wait events for those operations, too.

Rushabh Lathia, with further hacking by me.  Reviewed and tested by
Michael Paquier, Amit Kapila, Rajkumar Raghuwanshi, and Rahila Syed.

Discussion: http://postgr.es/m/CAGPqQf0LsYHXREPAZqYGVkDqHSyjf=KsD=k0GTVPAuzyThh-VQ@mail.gmail.com
2017-03-18 07:43:01 -04:00
Robert Haas 88e66d193f Rename "pg_clog" directory to "pg_xact".
Names containing the letters "log" sometimes confuse users into
believing that only non-critical data is present.  It is hoped
this renaming will discourage ill-considered removals of transaction
status data.

Michael Paquier

Discussion: http://postgr.es/m/CA+Tgmoa9xFQyjRZupbdEFuwUerFTvC6HjZq1ud6GYragGDFFgA@mail.gmail.com
2017-03-17 09:48:38 -04:00
Robert Haas 6977b8b7f4 Port single-page btree vacuum logic to hash indexes.
This is advantageous for hash indexes for the same reasons it's good
for btrees: it accelerates space recycling, reducing bloat.

Ashutosh Sharma, reviewed by Amit Kapila and by me.  A bit of
additional hacking by me.

Discussion: http://postgr.es/m/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw@mail.gmail.com
2017-03-15 22:18:56 -04:00
Robert Haas f7b711c8bc Cosmetic fixes for hash index write-ahead logging.
Amit Kapila.  One of these was reported by Tom Lane.

Discussion: http://postgr.es/m/5515.1489514099@sss.pgh.pa.us
2017-03-15 07:22:49 -04:00
Robert Haas bb4a39637a hash: Support WAL consistency checking.
Kuntal Ghosh, reviewed by Amit Kapila and Ashutosh Sharma, with
a few tweaks by me.

Discussion: http://postgr.es/m/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
2017-03-14 14:58:56 -04:00
Robert Haas c11453ce0a hash: Add write-ahead logging support.
The warning about hash indexes not being write-ahead logged and their
use being discouraged has been removed.  "snapshot too old" is now
supported for tables with hash indexes.  Most importantly, barring
bugs, hash indexes will now be crash-safe and usable on standbys.

This commit doesn't yet add WAL consistency checking for hash
indexes, as we now have for other index types; a separate patch has
been submitted to cure that lack.

Amit Kapila, reviewed and slightly modified by me.  The larger patch
series of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen.

Discussion: http://postgr.es/m/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
2017-03-14 13:27:02 -04:00
Peter Eisentraut a47b38c9ee Spelling fixes
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Peter Eisentraut f97a028d8e Spelling fixes in code comments
From: Josh Soref <jsoref@gmail.com>
2017-03-14 12:58:39 -04:00
Tom Lane 5ed6fff6b7 Make logging about multixact wraparound protection less chatty.
The original messaging design, introduced in commit 068cfadf9, seems too
chatty now that some time has elapsed since the bug fix; most installations
will be in good shape and don't really need a reminder about this on every
postmaster start.

Hence, arrange to suppress the "wraparound protections are now enabled"
message during startup (specifically, during the TrimMultiXact() call).
The message will still appear if protection becomes effective at some
later point.

Discussion: https://postgr.es/m/17211.1489189214@sss.pgh.pa.us
2017-03-14 12:47:53 -04:00
Peter Eisentraut 1e6de941e3 Change xlog to WAL in some error messages 2017-03-13 15:42:10 -04:00
Noah Misch 3a0d473192 Use wrappers of PG_DETOAST_DATUM_PACKED() more.
This makes almost all core code follow the policy introduced in the
previous commit.  Specific decisions:

- Text search support functions with char* and length arguments, such as
  prsstart and lexize, may receive unaligned strings.  I doubt
  maintainers of non-core text search code will notice.

- Use plain VARDATA() on values detoasted or synthesized earlier in the
  same function.  Use VARDATA_ANY() on varlenas sourced outside the
  function, even if they happen to always have four-byte headers.  As an
  exception, retain the universal practice of using VARDATA() on return
  values of SendFunctionCall().

- Retain PG_GETARG_BYTEA_P() in pageinspect.  (Page images are too large
  for a one-byte header, so this misses no optimization.)  Sites that do
  not call get_page_from_raw() typically need the four-byte alignment.

- For now, do not change btree_gist.  Its use of four-byte headers in
  memory is partly entangled with storage of 4-byte headers inside
  GBT_VARKEY, on disk.

- For now, do not change gtrgm_consistent() or gtrgm_distance().  They
  incorporate the varlena header into a cache, and there are multiple
  credible implementation strategies to consider.
2017-03-12 19:35:34 -04:00
Noah Misch 2fd26b23b6 Assume deconstruct_array() outputs are untoasted.
In functions that issue a deconstruct_array() call, consistently use
plain VARSIZE()/VARDATA() on the array elements.  Prior practice was
divided between those and VARSIZE_ANY_EXHDR()/VARDATA_ANY().
2017-03-12 19:35:31 -04:00
Robert Haas 390811750d Revert "Use group updates when setting transaction status in clog."
This reverts commit ccce90b398.  This
optimization is unsafe, at least, of rollbacks and rollbacks to
savepoints, but I'm concerned there may be other problematic cases as
well.  Therefore, I've decided to revert this pending further
investigation.
2017-03-10 14:49:56 -05:00
Robert Haas ccce90b398 Use group updates when setting transaction status in clog.
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around.  Use a similar mechanism to reduce
contention on CLogControlLock.  Testing shows that this very
significantly reduces the amount of time waiting for CLogControlLock
on high-concurrency pgbench tests run on a large multi-socket
machines; whether that translates into a TPS improvement depends on
how much of that contention is simply shifted to some other lock,
particularly WALWriteLock.

Amit Kapila, with some cosmetic changes by me.  Extensively reviewed,
tested, and benchmarked over a period of about 15 months by Simon
Riggs, Robert Haas, Andres Freund, Jesper Pedersen, and especially by
Tomas Vondra and Dilip Kumar.

Discussion: http://postgr.es/m/CAA4eK1L_snxM_JcrzEstNq9P66++F4kKFce=1r5+D1vzPofdtg@mail.gmail.com
Discussion: http://postgr.es/m/CAA4eK1LyR2A+m=RBSZ6rcPEwJ=rVi1ADPSndXHZdjn56yqO6Vg@mail.gmail.com
Discussion: http://postgr.es/m/91d57161-d3ea-0cc2-6066-80713e4f90d7@2ndquadrant.com
2017-03-09 17:49:01 -05:00
Tom Lane 86dbbf20d8 Put back <float.h> in a few files that need it for _isnan().
Further fallout from commit c29aff959: there are some files that need
<float.h>, and were getting it from datatype/timestamp.h, but it was not
apparent in my (tgl's) testing because the requirement for <float.h>
exists only on certain Windows toolchains.

Report and patch by David Rowley.

Discussion: https://postgr.es/m/CAKJS1f-BHceaFzZScFapDV48gUVM2CAOBfhkgffdqXzFb+kwew@mail.gmail.com
2017-03-08 15:38:34 -05:00
Robert Haas f35742ccb7 Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.

Dilip Kumar, with some corrections and cosmetic changes by me.  The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 12:05:43 -05:00
Robert Haas 98e6e89040 tidbitmap: Support shared iteration.
When a shared iterator is used, each call to tbm_shared_iterate()
returns a result that has not yet been returned to any process
attached to the shared iterator.  In other words, each cooperating
processes gets a disjoint subset of the full result set, but all
results are returned exactly once.

This is infrastructure for parallel bitmap heap scan.

Dilip Kumar.  The larger patch set of which this is a part has been
reviewed and tested by (at least) Andres Freund, Amit Khandekar,
Tushar Ahuja, Rafia Sabih, Haribabu Kommi, and Thomas Munro.

Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 08:09:38 -05:00
Robert Haas 38305398cd hash: Refactor hash index creation.
The primary goal here is to move all of the related page modifications
to a single section of code, in preparation for adding write-ahead
logging.  In passing, rename _hash_metapinit to _hash_init, since it
initializes more than just the metapage.

Amit Kapila.  The larger patch series of which this is a part has been
reviewed and tested by Álvaro Herrera, Ashutosh Sharma, Mark Kirkwood,
Jeff Janes, and Jesper Pedersen.
2017-03-07 17:03:51 -05:00
Heikki Linnakangas 818fd4a67d Support SCRAM-SHA-256 authentication (RFC 5802 and 7677).
This introduces a new generic SASL authentication method, similar to the
GSS and SSPI methods. The server first tells the client which SASL
authentication mechanism to use, and then the mechanism-specific SASL
messages are exchanged in AuthenticationSASLcontinue and PasswordMessage
messages. Only SCRAM-SHA-256 is supported at the moment, but this allows
adding more SASL mechanisms in the future, without changing the overall
protocol.

Support for channel binding, aka SCRAM-SHA-256-PLUS is left for later.

The SASLPrep algorithm, for pre-processing the password, is not yet
implemented. That could cause trouble, if you use a password with
non-ASCII characters, and a client library that does implement SASLprep.
That will hopefully be added later.

Authorization identities, as specified in the SCRAM-SHA-256 specification,
are ignored. SET SESSION AUTHORIZATION provides more or less the same
functionality, anyway.

If a user doesn't exist, perform a "mock" authentication, by constructing
an authentic-looking challenge on the fly. The challenge is derived from
a new system-wide random value, "mock authentication nonce", which is
created at initdb, and stored in the control file. We go through these
motions, in order to not give away the information on whether the user
exists, to unauthenticated users.

Bumps PG_CONTROL_VERSION, because of the new field in control file.

Patch by Michael Paquier and Heikki Linnakangas, reviewed at different
stages by Robert Haas, Stephen Frost, David Steele, Aleksander Alekseev,
and many others.

Discussion: https://www.postgresql.org/message-id/CAB7nPqRbR3GmFYdedCAhzukfKrgBLTLtMvENOmPrVWREsZkF8g%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAB7nPqSMXU35g%3DW9X74HVeQp0uvgJxvYOuA4A-A3M%2B0wfEBv-w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/55192AFE.6080106@iki.fi
2017-03-07 14:25:40 +02:00
Simon Riggs 21d4e2e206 Reduce lock levels for table storage params related to planning
The following parameters are now updateable with ShareUpdateExclusiveLock
effective_io_concurrency
parallel_workers
seq_page_cost
random_page_cost
n_distinct
n_distinct_inherited

Simon Riggs and Fabrízio Mello
2017-03-06 16:04:31 +05:30
Robert Haas 5a73e17317 Improve error reporting for tuple-routing failures.
Currently, the whole row is shown without column names.  Instead,
adopt a style similar to _bt_check_unique() in ExecFindPartition()
and show the failing key: (key1, ...) = (val1, ...).

Amit Langote, per a complaint from Simon Riggs.  Reviewed by me;
I also adjusted the grammar in one of the comments.

Discussion: http://postgr.es/m/9f9dc7ae-14f0-4a25-5485-964d9bfc19bd@lab.ntt.co.jp
2017-03-03 09:09:52 +05:30
Robert Haas 21a3cf4128 hash: Refactor and clean up bucket split code.
As with commit 30df93f698 and commit
b0f18cb77f, the goal here is to move all
of the related page modifications to a single section of code, in
preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series of
which this is a part has been reviewed and tested by Álvaro Herrera,
Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper Pedersen.
2017-03-01 14:43:38 +05:30
Magnus Hagander 016c990834 Fix incorrect variable datatype
Both datatypes map to the same underlying one which is why it still
worked, but we should use the correct type.

Author: Kyotaro HORIGUCHI
2017-02-28 12:20:35 +01:00
Tom Lane 9b88f27cb4 Allow index AMs to return either HeapTuple or IndexTuple format during IOS.
Previously, only IndexTuple format was supported for the output data of
an index-only scan.  This is fine for btree, which is just returning a
verbatim index tuple anyway.  It's not so fine for SP-GiST, which can
return reconstructed data that's much larger than a page.

To fix, extend the index AM API so that index-only scan data can be
returned in either HeapTuple or IndexTuple format.  There's other ways
we could have done it, but this way avoids an API break for index AMs
that aren't concerned with the issue, and it costs little except a couple
more fields in IndexScanDescs.

I changed both GiST and SP-GiST to use the HeapTuple method.  I'm not
very clear on whether GiST can reconstruct data that's too large for an
IndexTuple, but that seems possible, and it's not much of a code change to
fix.

Per a complaint from Vik Fearing.  Reviewed by Jason Li.

Discussion: https://postgr.es/m/49527f79-530d-0bfe-3dad-d183596afa92@2ndquadrant.fr
2017-02-27 17:20:34 -05:00
Robert Haas 30df93f698 hash: Refactor overflow page allocation.
As with commit b0f18cb77f, the goal
here is to move all of the related page modifications to a single
section of code, in preparation for adding write-ahead logging.

Amit Kapila, with slight changes by me.  The larger patch series
of which this is a part has been reviewed and tested by Álvaro
Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper
Pedersen, all of whom should also have been credited in the
previous commit message.
2017-02-27 22:59:55 +05:30
Robert Haas b0f18cb77f hash: Refactor bucket squeeze code.
In preparation for adding write-ahead logging to hash indexes,
refactor _hash_freeovflpage and _hash_squeezebucket so that all
related page modifications happen in a single section of code.  The
previous coding assumed that it would be fine to move tuples one at a
time, and also that the various operations involved in freeing an
overflow page didn't necessarily all need to be done together, all
of which is true if you don't care about write-ahead logging.

Amit Kapila, with slight changes by me.
2017-02-27 22:34:21 +05:30
Tom Lane 9e3755ecb2 Remove useless duplicate inclusions of system header files.
c.h #includes a number of core libc header files, such as <stdio.h>.
There's no point in re-including these after having read postgres.h,
postgres_fe.h, or c.h; so remove code that did so.

While at it, also fix some places that were ignoring our standard pattern
of "include postgres[_fe].h, then system header files, then other Postgres
header files".  While there's not any great magic in doing it that way
rather than system headers last, it's silly to have just a few files
deviating from the general pattern.  (But I didn't attempt to enforce this
globally, only in files I was touching anyway.)

I'd be the first to say that this is mostly compulsive neatnik-ism,
but over time it might save enough compile cycles to be useful.
2017-02-25 16:12:55 -05:00
Tom Lane c29aff959d Consistently declare timestamp variables as TimestampTz.
Twiddle the replication-related code so that its timestamp variables
are declared TimestampTz, rather than the uninformative "int64" that
was previously used for meant-to-be-always-integer timestamps.
This resolves the int64-vs-TimestampTz declaration inconsistencies
introduced by commit 7c030783a, though in the opposite direction to
what was originally suggested.

This required including datatype/timestamp.h in a couple more places
than before.  I decided it would be a good idea to slim down that
header by not having it pull in <float.h> etc, as those headers are
no longer at all relevant to its purpose.  Unsurprisingly, a small number
of .c files turn out to have been depending on those inclusions, so add
them back in the .c files as needed.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
Discussion: https://postgr.es/m/27694.1487456324@sss.pgh.pa.us
2017-02-23 15:57:08 -05:00
Tom Lane d28aafb6dd Remove pg_control's enableIntTimes field.
We don't need it any more.

pg_controldata continues to report that date/time type storage is
"64-bit integers", but that's now a hard-wired behavior not something
it sees in the data.  This avoids breaking pg_upgrade, and perhaps other
utilities that inspect pg_control this way.  Ditto for pg_resetwal.

I chose to remove the "bigint_timestamps" output column of
pg_control_init(), though, as that function hasn't been around long
and probably doesn't have ossified users.

Discussion: https://postgr.es/m/26788.1487455319@sss.pgh.pa.us
2017-02-23 12:23:12 -05:00
Robert Haas 5262f7a4fc Add optimizer and executor support for parallel index scans.
In combination with 569174f1be, which
taught the btree AM how to perform parallel index scans, this allows
parallel index scan plans on btree indexes.  This infrastructure
should be general enough to support parallel index scans for other
index AMs as well, if someone updates them to support parallel
scans.

Amit Kapila, reviewed and tested by Anastasia Lubennikova, Tushar
Ahuja, and Haribabu Kommi, and me.
2017-02-15 13:53:24 -05:00
Robert Haas 569174f1be btree: Support parallel index scans.
This isn't exposed to the optimizer or the executor yet; we'll add
support for those things in a separate patch.  But this puts the
basic mechanism in place: several processes can attach to a parallel
btree index scan, and each one will get a subset of the tuples that
would have been produced by a non-parallel scan.  Each index page
becomes the responsibility of a single worker, which then returns
all of the TIDs on that page.

Rahila Syed, Amit Kapila, Robert Haas, reviewed and tested by
Anastasia Lubennikova, Tushar Ahuja, and Haribabu Kommi.
2017-02-15 07:41:14 -05:00
Robert Haas 8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Robert Haas fb47544d0c Minor fixes for WAL consistency checking.
Michael Paquier, reviewed and slightly revised by me.

Discussion: http://postgr.es/m/CAB7nPqRzCQb=vdfHvMtP0HMLBHU6z1aGdo4GJsUP-HP8jx+Pkw@mail.gmail.com
2017-02-14 12:41:01 -05:00
Robert Haas 3f01fd4ca0 Rename dtrace probes for ongoing xlog -> wal conversion.
xlog-switch becomes wal-switch, and xlog-insert becomes wal-insert.
2017-02-09 16:40:19 -05:00
Robert Haas 85c11324ca Rename user-facing tools with "xlog" in the name to say "wal".
This means pg_receivexlog because pg_receivewal, pg_resetxlog
becomes pg_resetwal, and pg_xlogdump becomes pg_waldump.
2017-02-09 16:23:46 -05:00
Robert Haas 806091c96f Remove all references to "xlog" from SQL-callable functions in pg_proc.
Commit f82ec32ac3 renamed the pg_xlog
directory to pg_wal.  To make things consistent, and because "xlog" is
terrible terminology for either "transaction log" or "write-ahead log"
rename all SQL-callable functions that contain "xlog" in the name to
instead contain "wal".  (Note that this may pose an upgrade hazard for
some users.)

Similarly, rename the xlog_position argument of the functions that
create slots to be called wal_position.

Discussion: https://www.postgresql.org/message-id/CA+Tgmob=YmA=H3DbW1YuOXnFVgBheRmyDkWcD9M8f=5bGWYEoQ@mail.gmail.com
2017-02-09 15:10:09 -05:00
Robert Haas fc8219dc54 pageinspect: Fix hash_bitmap_info not to read the underlying page.
It did that to verify that the page was an overflow page rather than
anything else, but that means that checking the status of all the
overflow bits requires reading the entire index.  So don't do that.
The new code validates that the page is not a primary bucket page
or bitmap page by looking at the metapage, so that using this on
large numbers of pages can be reasonably efficient.

Ashutosh Sharma, per a complaint from me, and with further
modifications by me.
2017-02-09 14:34:34 -05:00
Tom Lane 86d911ec0f Allow index AMs to cache data across aminsert calls within a SQL command.
It's always been possible for index AMs to cache data across successive
amgettuple calls within a single SQL command: the IndexScanDesc.opaque
field is meant for precisely that.  However, no comparable facility
exists for amortizing setup work across successive aminsert calls.
This patch adds such a feature and teaches GIN, GIST, and BRIN to use it
to amortize catalog lookups they'd previously been doing on every call.
(The other standard index AMs keep everything they need in the relcache,
so there's little to improve there.)

For GIN, the overall improvement in a statement that inserts many rows
can be as much as 10%, though it seems a bit less for the other two.
In addition, this makes a really significant difference in runtime
for CLOBBER_CACHE_ALWAYS tests, since in those builds the repeated
catalog lookups are vastly more expensive.

The reason this has been hard up to now is that the aminsert function is
not passed any useful place to cache per-statement data.  What I chose to
do is to add suitable fields to struct IndexInfo and pass that to aminsert.
That's not widening the index AM API very much because IndexInfo is already
within the ken of ambuild; in fact, by passing the same info to aminsert
as to ambuild, this is really removing an inconsistency in the AM API.

Discussion: https://postgr.es/m/27568.1486508680@sss.pgh.pa.us
2017-02-09 11:52:12 -05:00
Robert Haas a507b86900 Add WAL consistency checking facility.
When the new GUC wal_consistency_checking is set to a non-empty value,
it triggers recording of additional full-page images, which are
compared on the standby against the results of applying the WAL record
(without regard to those full-page images).  Allowable differences
such as hints are masked out, and the resulting pages are compared;
any difference results in a FATAL error on the standby.

Kuntal Ghosh, based on earlier patches by Michael Paquier and Heikki
Linnakangas.  Extensively reviewed and revised by Michael Paquier and
by me, with additional reviews and comments from Amit Kapila, Álvaro
Herrera, Simon Riggs, and Peter Eisentraut.
2017-02-08 15:45:30 -05:00
Robert Haas 94708c0e8c Fix compiler warning.
Mithun Cy, per a report by Erik Rijkers
2017-02-07 15:09:14 -05:00
Robert Haas 293e24e507 Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and
contention when scanning hash indexes, because it's no longer
necessary to lock and pin the metapage for every scan.  We do need
some way of figuring out when the cache is too stale to use any more,
so that when we lock the primary bucket page to which the cached
metapage points us, we can tell whether a split has occurred since we
cached the metapage data.  To do that, we use the hash_prevblkno field
in the primary bucket page, which would otherwise always be set to
InvalidBuffer.

This patch contains code so that it will continue working (although
less efficiently) with hash indexes built before this change, but
perhaps we should consider bumping the hash version and ripping out
the compatibility code.  That decision can be made later, though.

Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me.
Before committing, I made a number of cosmetic changes to the last
posted version of the patch, adjusted _hash_getcachedmetap to be more
careful about order of operation, and made some necessary updates to
the pageinspect documentation and regression tests.
2017-02-07 12:35:45 -05:00
Heikki Linnakangas 181bdb90ba Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:33:58 +02:00
Robert Haas 38c363adf4 Improve grammar of message about two-phase state files.
When there's only one two-phase state file, there's also only one
long-running prepared transaction.  Adjust the message text
accordingly.

Nikhil Sontakke

Discussion: http://postgr.es/m/CAMGcDxcmR_DWZXXndGoPzVQx=B17A5=RviEA1qNaF=FWLy5Whw@mail.gmail.com
2017-02-03 17:16:54 -05:00
Robert Haas 08bf6e5295 pageinspect: Support hash indexes.
Patch by Jesper Pedersen and Ashutosh Sharma, with some error handling
improvements by me.  Tests from Peter Eisentraut.  Reviewed by Álvaro
Herrera, Michael Paquier, Jesper Pedersen, Jeff Janes, Peter
Eisentraut, Amit Kapila, Mithun Cy, and me.

Discussion: http://postgr.es/m/e2ac6c58-b93f-9dd9-f4e6-d6d30add7fdf@redhat.com
2017-02-02 14:19:32 -05:00
Andrew Dunstan f1169ab501 Don't count background workers against a user's connection limit.
Doing so doesn't seem to be within the purpose of the per user
connection limits, and has particularly unfortunate effects in
conjunction with parallel queries.

Backpatch to 9.6 where parallel queries were introduced.

David Rowley, reviewed by Robert Haas and Albe Laurenz.
2017-02-01 18:02:43 -05:00
Robert Haas bbd8550bce Refactor other replication commands to use DestRemoteSimple.
Commit a84069d935 added a new type of
DestReceiver to avoid duplicating the existing code for the SHOW
command, but it turns out we can leverage that new DestReceiver
type in a few more places, saving some code.

Michael Paquier, reviewed by Andres Freund and by me.

Discussion: http://postgr.es/m/CAB7nPqSdFOQC0evc0r1nJeQyGBqjBrR41MC4rcMqUUpoJaZbtQ%40mail.gmail.com
Discussion: http://postgr.es/m/CAB7nPqT2K4XFT1JgqufFBjsOc-NUKXg5qBDucHPMbk6Xi1kYaA@mail.gmail.com
2017-02-01 13:42:41 -05:00
Robert Haas 8a815e3fc3 Move comment about test slightly closer to test.
The addition of a TestForOldSnapshot() call here has made the
referent of this comment slightly less clear, so move the comment
to compensate.

Amit Kapila (as part of the parallel index scan patch)
2017-01-31 17:21:02 -05:00
Robert Haas 3838074f86 Be more aggressive in avoiding tuple conversion.
According to the comments in tupconvert.c, it's necessary to perform
tuple conversion when either table has OIDs, and this was previously
checked by ensuring that the tdtypeid value matched between the tables
in question.  However, that's overly stringent: we have access to
tdhasoid and can test directly whether OIDs are present, which lets us
avoid conversion in cases where the type OIDs are different but the
tuple descriptors are entirely the same (and neither has OIDs).  This
is useful to the partitioning code, which can thereby avoid converting
tuples when inserting into a partition whose columns appear in the
same order as the parent columns, the normal case.  It's possible
for the tuple routing code to avoid some additional overhead in this
case as well, so do that, too.

It's not clear whether it would be OK to skip this when both tables
have OIDs: do callers count on this to build a new tuple (losing the
previous OID) in such instances?  Until we figure it out, leave the
behavior in that case alone.

Amit Langote, reviewed by me.
2017-01-24 21:53:38 -05:00
Robert Haas d1ecd53947 Add a SHOW command to the replication command language.
This is useful infrastructure for an upcoming proposed patch to
allow the WAL segment size to be changed at initdb time; tools like
pg_basebackup need the ability to interrogate the server setting.
But it also doesn't seem like a bad thing to have independently of
that; it may find other uses in the future.

Robert Haas and Beena Emerson.  (The original patch here was by
Beena, but I rewrote it to such a degree that most of the code
being committed here is mine.)

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 17:04:12 -05:00
Robert Haas a84069d935 Add a new DestReceiver for printing tuples without catalog access.
If you create a DestReciver of type DestRemote and try to use it from
a replication connection that is not bound to a specific daabase, or
any other hypothetical type of backend that is not bound to a specific
database, it will fail because it doesn't have a pg_proc catalog to
look up properties of the types being printed.  In general, that's
an unavoidable problem, but we can hardwire the properties of a few
builtin types in order to support utility commands.  This new
DestReceiver of type DestRemoteSimple does just that.

Patch by me, reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
2017-01-24 16:53:56 -05:00
Robert Haas 7b4ac19982 Extend index AM API for parallel index scans.
This patch doesn't actually make any index AM parallel-aware, but it
provides the necessary functions at the AM layer to do so.

Rahila Syed, Amit Kapila, Robert Haas
2017-01-24 16:42:58 -05:00
Robert Haas b1ecb9b3fc Fix interaction of partitioned tables with BulkInsertState.
When copying into a partitioned table, the target heap may change from
one tuple to next.  We must ask ReadBufferBI() to get a new buffer
every time such change occurs.  To do that, use new function
ReleaseBulkInsertStatePin().  This fixes the bug that tuples ended up
being inserted into the wrong partition, which occurred exactly
because the wrong buffer was used.

Amit Langote, per a suggestion from Robert Haas.  Some cosmetic
adjustments by me.

Reports by 高增琦 (Gao Zengqi), Venkata B Nagothi, and
Ragnar Ouchterlony.

Discussion: http://postgr.es/m/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com
Discussion: http://postgr.es/m/CAEyp7J9WiX0L3DoiNcRrY-9iyw%3DqP%2Bj%3DDLsAnNFF1xT2J1ggfQ%40mail.gmail.com
Discussion: http://postgr.es/m/16d73804-c9cd-14c5-463e-5caad563ff77%40agama.tv
Discussion: http://postgr.es/m/CA+TgmoaiZpDVUUN8LZ4jv1qFE_QyR+H9ec+79f5vNczYarg5Zg@mail.gmail.com
2017-01-24 08:50:16 -05:00
Peter Eisentraut f21a563d25 Move some things from builtins.h to new header files
This avoids that builtins.h has to include additional header files.
2017-01-20 20:29:53 -05:00
Peter Eisentraut 665d1fad99 Logical replication
- Add PUBLICATION catalogs and DDL
- Add SUBSCRIPTION catalog and DDL
- Define logical replication protocol and output plugin
- Add logical replication workers

From: Petr Jelinek <petr@2ndquadrant.com>
Reviewed-by: Steve Singer <steve@ssinger.info>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Erik Rijkers <er@xs4all.nl>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
2017-01-20 09:04:49 -05:00
Alvaro Herrera 8eace46d34 Fix race condition in reading commit timestamps
If a user requests the commit timestamp for a transaction old enough
that its data is concurrently being truncated away by vacuum at just the
right time, they would receive an ugly internal file-not-found error
message from slru.c rather than the expected NULL return value.

In a primary server, the window for the race is very small: the lookup
has to occur exactly between the two calls by vacuum, and there's not a
lot that happens between them (mostly just a multixact truncate).  In a
standby server, however, the window is larger because the truncation is
executed as soon as the WAL record for it is replayed, but the advance
of the oldest-Xid is not executed until the next checkpoint record.

To fix in the primary, simply reverse the order of operations in
vac_truncate_clog.  To fix in the standby, augment the WAL truncation
record so that the standby is aware of the new oldest-XID value and can
apply the update immediately.  WAL version bumped because of this.

No backpatch, because of the low importance of the bug and its rarity.

Author: Craig Ringer
Reviewed-By: Petr Jelínek, Peter Eisentraut
Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
2017-01-19 18:24:17 -03:00
Robert Haas e37360d5df Improve comment in hashsearch.c.
Typo fix from Mithun Cy; other improvements by me.
2017-01-18 16:36:48 -05:00
Peter Eisentraut 352a24a1f9 Generate fmgr prototypes automatically
Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains
prototypes for all functions registered in pg_proc.h.  This avoids
having to manually maintain these prototypes across a random variety of
header files.  It also automatically enforces a correct function
signature, and since there are warnings about missing prototypes, it
will detect functions that are defined but not registered in
pg_proc.h (or otherwise used).

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2017-01-17 14:06:07 -05:00
Fujii Masao 974ece58bb Fix an assertion failure related to an exclusive backup.
Previously multiple sessions could execute pg_start_backup() and
pg_stop_backup() to start and stop an exclusive backup at the same time.
This could trigger the assertion failure of
"FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)".
This happend because, even while pg_start_backup() was starting
an exclusive backup, other session could run pg_stop_backup()
concurrently and mark the backup as not-in-progress unconditionally.

This patch introduces ExclusiveBackupState indicating the state of
an exclusive backup. This state is used to ensure that there is only
one session running pg_start_backup() or pg_stop_backup() at
the same time, to avoid the assertion failure.

Back-patch to all supported versions.

Author: Michael Paquier
Reviewed-By: Kyotaro Horiguchi and me
Reported-By: Andreas Seltenreich
Discussion: <87mvktojme.fsf@credativ.de>
2017-01-17 17:27:32 +09:00
Robert Haas e898437460 Improve coding in _hash_addovflpage.
Instead of relying on the page contents to know whether we have
advanced from the primary bucket page to an overflow page, track
that explicitly.

Amit Kapila, per a complaint by me.
2017-01-10 08:31:03 -05:00
Alvaro Herrera 3957b58b88 Fix ALTER TABLE / SET TYPE for irregular inheritance
If inherited tables don't have exactly the same schema, the USING clause
in an ALTER TABLE / SET DATA TYPE misbehaves when applied to the
children tables since commit 9550e8348b.  Starting with that commit,
the attribute numbers in the USING expression are fixed during parse
analysis.  This can lead to bogus errors being reported during
execution, such as:
   ERROR:  attribute 2 has wrong type
   DETAIL:  Table has type smallint, but query expects integer.

Since it wouldn't do to revert to the original coding, we now apply a
transformation to map the attribute numbers to the correct ones for each
child.

Reported by Justin Pryzby
Analysis by Tom Lane; patch by me.
Discussion: https://postgr.es/m/20170102225618.GA10071@telsasoft.com
2017-01-09 19:26:58 -03:00
Alvaro Herrera 7403561c0f BRIN revmap pages are not standard pages ...
... and therefore we ought not to tell XLogRegisterBuffer the opposite,
when writing XLog for a brin update that moves the index tuple to a
different page.  Otherwise, xlog insertion would try to "compress the
hole" when producing a full-page image for it; but since we don't update
pd_lower/upper, the hole covers the whole page.  On WAL replay, the
revmap page becomes empty and so the entire portion of the index is
useless and needs to be recomputed.

This is low-probability: a BRIN update only moves an index tuple to a
different page when the summary tuple is larger than the existing one,
which doesn't happen with fixed-width datatypes.  Also, the revmap
page must be first after a checkpoint.

Report and patch: Kuntal Ghosh
Bug is alleged to have detected by a WAL-consistency-checking tool.
Discussion: https://postgr.es/m/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com

I posted a test case demonstrating the problem, but I'm refraining from
adding it to the test suite; if the WAL consistency tool makes it in,
that will be a better way to catch this from regressing.  (We should
definitely have someting that causes not-same-page updates, though.)
2017-01-09 18:19:29 -03:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane 54386f3578 Remove triggerable Assert in hashname().
hashname() asserted that the key string it is given is shorter than
NAMEDATALEN.  That should surely always be true if the input is in fact a
regular value of type "name".  However, for reasons of coding convenience,
we allow plain old C strings to be treated as "name" values in many places.
Some SQL functions accept arbitrary "text" inputs, convert them to C
strings, and pass them otherwise-untransformed to syscache lookups for name
columns, allowing an overlength input value to trigger hashname's Assert.

This would be a DOS problem, except that it only happens in assert-enabled
builds which aren't recommended for production.  In a production build,
you'll just get a name lookup error, since regardless of the hash value
computed by hashname, the later equality comparison checks can't match.
Likewise, if the catalog lookup is done by seqscan or indexscan searches,
there will just be a lookup error, since the name comparison functions
don't contain any similar length checks, and will see an overlength input
as unequal to any stored entry.

After discussion we concluded that we should simply remove this Assert.
It's inessential to hashname's own functionality, and having such an
assertion in only some paths for name lookup is more of a foot-gun than
a useful check.  There may or may not be a case for the affected callers
to do something other than let the name lookup fail, but we'll consider
that separately; in any case we probably don't want to change such
behavior in the back branches.

Per report from Tushar Ahuja.  Back-patch to all supported branches.

Report: https://postgr.es/m/7d0809ee-6f25-c9d6-8e74-5b2967830d49@enterprisedb.com
Discussion: https://postgr.es/m/17691.1482523168@sss.pgh.pa.us
2016-12-26 14:58:02 -05:00
Robert Haas 7819ba1ef6 Remove _hash_chgbufaccess().
This is basically for the same reasons I got rid of _hash_wrtbuf()
in commit 25216c98938495fd741bf585dcbef45b3a9ffd40: it's not
convenient to have a function which encapsulates MarkBufferDirty(),
especially as we move towards having hash indexes be WAL-logged.

Patch by me, reviewed (but not entirely endorsed) by Amit Kapila.
2016-12-23 07:14:37 -05:00
Andres Freund 6ef2eba3f5 Skip checkpoints, archiving on idle systems.
Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
    https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
    https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: -
2016-12-22 11:31:50 -08:00
Robert Haas 097e41439d Fix broken error check in _hash_doinsert.
You can't just cast a HashMetaPage to a Page, because the meta page
data is stored after the page header, not at offset 0.  Fortunately,
this didn't break anything because it happens to find hashm_bsize
at the offset at which it expects to find pd_pagesize_version, and
the values are close enough to the same that this works out.

Still, it's a bug, so back-patch to all supported versions.

Mithun Cy, revised a bit by me.
2016-12-22 13:59:01 -05:00
Robert Haas dd728826c5 Fix locking problem in _hash_squeezebucket() / _hash_freeovflpage().
A bucket squeeze operation needs to lock each page of the bucket
before releasing the prior page, but the previous coding fumbled the
locking when freeing an overflow page during a bucket squeeze
operation.  Commit 6d46f4783e
introduced this bug.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.  Reviewed by me.  I also
fixed a problem with a comment.
2016-12-19 12:31:50 -05:00
Robert Haas 3761fe3c20 Simplify LWLock tranche machinery by removing array_base/array_stride.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche.  So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached.  For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset.  This is arguably more useful,
and certainly a lot cheaper.  Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.

The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".

The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query.  There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously.  So if we didn't get rid of this mechanism we'd
have to make it even more complicated.  By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.

Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
2016-12-16 11:29:23 -05:00
Robert Haas 6a4fe1127c Fix more hash index bugs around marking buffers dirty.
In _hash_freeovflpage(), if we're freeing the overflow page that
immediate follows the page to which tuples are being moved (the
confusingly-named "write buffer"), don't forget to mark that
page dirty after updating its hasho_nextblkno.

In _hash_squeezebucket(), it's not necessary to mark the primary
bucket page dirty if there are no overflow pages, because there's
nothing to squeeze in that case.

Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after
an initial trouble report by Jeff Janes.
2016-12-16 09:55:20 -05:00
Robert Haas 25216c9893 Remove _hash_wrtbuf() in favor of calling MarkBufferDirty().
The whole concept of _hash_wrtbuf() is that we need to know at the
time we're releasing the buffer lock (and pin) whether we dirtied the
buffer, but this is easy to get wrong.  This patch actually fixes one
non-obvious bug of that form: hashbucketcleanup forgot to signal
_hash_squeezebucket, which gets the primary bucket page already
locked, as to whether it had already dirtied the page.  Calling
MarkBufferDirty() at the places where we dirty the buffer is more
intuitive and lets us simplify the code in various places as well.

On top of all that, the ultimate goal here is to make hash indexes
WAL-logged, and as the comments to _hash_wrtbuf() note, it should
go away when that happens.  Making it go away a little earlier than
that seems like a good preparatory step.

Report by Jeff Janes.  Diagnosis by Amit Kapila, Kuntal Ghosh,
and Dilip Kumar.  Patch by me, after studying an alternative patch
submitted by Amit Kapila.

Discussion: http://postgr.es/m/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
2016-12-16 09:37:28 -05:00
Robert Haas 501c7b94bc Fix bug in hashbulkdelete.
Commit 6d46f4783e failed to account for
the possibility that hashbulkdelete() might encounter a bucket that
has been split since it began scanning the bucket array.  Repair.

Extracted from a larger pathc by Amit Kapila; I rewrote the comment.
2016-12-13 12:16:02 -05:00
Robert Haas 3856cf9607 Remove should_free arguments to tuplesort routines.
Since commit e94568ecc1, the answer is
always "false", and we do not need to complicate the API by arranging
to return a constant value.

Peter Geoghegan

Discussion: http://postgr.es/m/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
2016-12-12 15:57:35 -05:00
Robert Haas fa0f466d53 Log the creation of an init fork unconditionally.
Previously, it was thought that this only needed to be done for the
benefit of possible standbys, so wal_level = minimal skipped it.
But that's not safe, because during crash recovery we might replay
XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record which recursively
removes the directory that contains the new init fork.  So log it
always.

The user-visible effect of this bug is that if you create a database
or tablespace, then create an unlogged table, then crash without
checkpointing, then restart, accessing the table will fail, because
the it won't have been properly reset.  This commit fixes that.

Michael Paquier, per a report from Konstantin Knizhnik.  Wording of
the comments per a suggestion from me.
2016-12-08 14:12:08 -05:00
Robert Haas f0e44751d7 Implement table partitioning.
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an
expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  Minor revisions by me.
2016-12-07 13:17:55 -05:00
Robert Haas 2f4193c350 Fix race introduced by 6d46f4783e.
It's possible for the metapage contents to change after we release
the lock, so we must read them before releasing the lock.

Amit Kapila.  Submitted in response to a trouble report from
Andreas Seltenreich, though it is not certain this fixes the
problem.
2016-12-05 11:43:37 -05:00
Fujii Masao 5dc851afde Fix incorrect output from gin_desc().
Previously gin_desc() displayed incorrect output "unknown action 0"
for XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE records with
valid actions. The cause of this problem was that gin_desc() wrongly
used XLogRecGetData() to extract data from those records.
Since they were registered by XLogRegisterBufData(), gin_desc() should
have used XLogRecGetBlockData(), instead, like gin_redo().
Also there were other differences about how to treat XLOG_GIN_INSERT
record between gin_desc() and gin_redo().

This commit fixes gin_desc() routine so that it treats those records
in the same way as gin_redo().

Batch-patch to 9.5 where WAL record format was revamped and
XLogRegisterBufData() was added.

Reported-By: Andres Freund
Reviewed-By: Tom Lane
Discussion: <20160509194645.7lewnpw647zegx2m@alap3.anarazel.de>
2016-12-05 20:29:41 +09:00
Robert Haas b460f5d669 Add max_parallel_workers GUC.
Increase the default value of the existing max_worker_processes GUC
from 8 to 16, and add a new max_parallel_workers GUC with a maximum
of 8.  This way, even if the maximum amount of parallel query is
happening, there is still room for background workers that do other
things, as originally envisioned when max_worker_processes was added.

Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
2016-12-02 07:42:58 -05:00
Alvaro Herrera fa2fa99552 Permit dump/reload of not-too-large >1GB tuples
Our documentation states that our maximum field size is 1 GB, and that
our maximum row size of 1.6 TB.  However, while this might be attainable
in theory with enough contortions, it is not workable in practice; for
starters, pg_dump fails to dump tables containing rows larger than 1 GB,
even if individual columns are well below the limit; and even if one
does manage to manufacture a dump file containing a row that large, the
server refuses to load it anyway.

This commit enables dumping and reloading of such tuples, provided two
conditions are met:

1. no single column is larger than 1 GB (in output size -- for bytea
   this includes the formatting overhead)
2. the whole row is not larger than 2 GB

There are three related changes to enable this:

a. StringInfo's API now has two additional functions that allow creating
a string that grows beyond the typical 1GB limit (and "long" string).
ABI compatibility is maintained.  We still limit these strings to 2 GB,
though, for reasons explained below.

b. COPY now uses long StringInfos, so that pg_dump doesn't choke
trying to emit rows longer than 1GB.

c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation
for the input tuple, which means that large tuples are accepted on
input.  Note that at this point we do not apply any further limit to the
input tuple size.

The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit
length words to describe each row; and because the documentation is
ambiguous on its signedness and libpq does consider it signed, we cannot
use the highest-order bit.  Additionally, the StringInfo API uses "int"
(which is 4 bytes wide in most platforms) in many places, so we'd need
to change that API too in order to improve, which has lots of fallout.

Backpatch to 9.5, which is the oldest that has
MemoryContextAllocExtended, a necessary piece of infrastructure.  We
could apply to 9.4 with very minimal additional effort, but any further
than that would require backpatching "huge" allocations too.

This is the largest set of changes we could find that can be
back-patched without breaking compatibility with existing systems.
Fixing a bigger set of problems (for example, dumping tuples bigger than
2GB, or dumping fields bigger than 1GB) would require changing the FE/BE
protocol and/or changing the StringInfo API in an ABI-incompatible way,
neither of which would be back-patchable.

Authors: Daniel Vérité, Álvaro Herrera
Reviewed by: Tomas Vondra
Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
2016-12-02 00:34:01 -03:00
Robert Haas 6d46f4783e Improve hash index bucket split behavior.
Previously, the right to split a bucket was represented by a
heavyweight lock on the page number of the primary bucket page.
Unfortunately, this meant that every scan needed to take a heavyweight
lock on that bucket also, which was bad for concurrency.  Instead, use
a cleanup lock on the primary bucket page to indicate the right to
begin a split, so that scans only need to retain a pin on that page,
which is they would have to acquire anyway, and which is also much
cheaper.

In addition to reducing the locking cost, this also avoids locking out
scans and inserts for the entire lifetime of the split: while the new
bucket is being populated with copies of the appropriate tuples from
the old bucket, scans and inserts can happen in parallel.  There are
minor concurrency improvements for vacuum operations as well, though
the situation there is still far from ideal.

This patch also removes the unworldly assumption that a split will
never be interrupted.  With the new code, a split is done in a series
of small steps and the system can pick up where it left off if it is
interrupted prior to completion.  While this patch does not itself add
write-ahead logging for hash indexes, it is clearly a necessary first
step, since one of the things that could interrupt a split is the
removal of electrical power from the machine performing it.

Amit Kapila.  I wrote the original design on which this patch is
based, and did a good bit of work on the comments and README through
multiple rounds of review, but all of the code is Amit's.  Also
reviewed by Jesper Pedersen, Jeff Janes, and others.

Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
2016-11-30 15:39:21 -05:00
Tom Lane dbdfd114f3 Bring some clarity to the defaults for the xxx_flush_after parameters.
Instead of confusingly stating platform-dependent defaults for these
parameters in the comments in postgresql.conf.sample (with the main
entry being a lie on Linux), teach initdb to install the correct
platform-dependent value in postgresql.conf, similarly to the way
we handle other platform-dependent defaults.  This won't do anything
for existing 9.6 installations, but since it's effectively only a
documentation improvement, that seems OK.

Since this requires initdb to have access to the default values,
move the #define's for those to pg_config_manual.h; the original
placement in bufmgr.h is unworkable because that file can't be
included by frontend programs.

Adjust the default value for wal_writer_flush_after so that it is 1MB
regardless of XLOG_BLCKSZ, conforming to what is stated in both the
SGML docs and postgresql.conf.  (We could alternatively make it scale
with XLOG_BLCKSZ, but I'm not sure I see the point.)

Copy-edit related SGML documentation.

Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra.

Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
2016-11-25 18:36:10 -05:00
Alvaro Herrera 4aaddf2f00 Fix commit_ts for FrozenXid and BootstrapXid
Previously, requesting commit timestamp for transactions
FrozenTransactionId and BootstrapTransactionId resulted in an error.
But since those values can validly appear in committed tuples' Xmin,
this behavior is unhelpful and error prone: each caller would have to
special-case those values before requesting timestamp data for an Xid.
We already have a perfectly good interface for returning "the Xid you
requested is too old for us to have commit TS data for it", so let's use
that instead.

Backpatch to 9.5, where commit timestamps appeared.

Author: Craig Ringer
Discussion: https://www.postgresql.org/message-id/CAMsr+YFM5Q=+ry3mKvWEqRTxrB0iU3qUSRnS28nz6FJYtBwhJg@mail.gmail.com
2016-11-24 15:39:55 -03:00
Robert Haas e343dfa42b Remove barrier.h
A new thing also called a "barrier" is proposed, but whether we decide
to take that patch or not, this file seems to have outlived its
usefulness.

Thomas Munro
2016-11-22 20:28:24 -05:00
Robert Haas e8ac886c24 Support condition variables.
Condition variables provide a flexible way to sleep until a
cooperating process causes an arbitrary condition to become true.  In
simple cases, this can be accomplished with a WaitLatch/ResetLatch
loop; the cooperating process can call SetLatch after performing work
that might cause the condition to be satisfied, and the waiting
process can recheck the condition each time.  However, if the process
performing the work doesn't have an easy way to identify which
processes might be waiting, this doesn't work, because it can't
identify which latches to set.  Condition variables solve that problem
by internally maintaining a list of waiters; a process that may have
caused some waiter's condition to be satisfied must "signal" or
"broadcast" on the condition variable.

Robert Haas and Thomas Munro
2016-11-22 14:27:11 -05:00
Robert Haas a43f1939d5 Remove or reduce verbosity of some debug messages.
The debug messages that merely print StartTransactionCommand,
CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no
additional details seem to be useless.  Get rid of them.

The transaction status messages produced by ShowTransactionState are
occasionally useful, but they are extremely verbose, producing
multiple lines of log output every time they fire, which can happens
multiple times per transaction.  So, reduce the level to DEBUG5; avoid
emitting an extra line just to explain which debug point is at issue;
and tighten up the rest of the message so it doesn't use quite so much
horizontal space.

With these changes, it's possible to run a somewhat busy system with a
log level even as high as DEBUG4, whereas previously anything above
DEBUG2 would flood the log with output that probably wasn't really all
that useful.
2016-11-17 17:05:16 -05:00
Alvaro Herrera f65b94f639 Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

This is a back-patch of commits 687f2cd7a0, 3e4b7d8798, b602842613
by Simon Riggs, to branches 9.4 and 9.5.  No further backpatch is
possible because this depends on catalog scans being MVCC.  I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
2016-11-17 13:31:30 -03:00
Tom Lane 9257f07872 Replace uses of SPI_modifytuple that intend to allocate in current context.
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc.  I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.

Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context.  (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)

This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.

This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.

Discussion: <9633.1478552022@sss.pgh.pa.us>
2016-11-08 15:36:44 -05:00
Robert Haas f0e72a25b0 Improve handling of dead tuples in hash indexes.
When squeezing a bucket during vacuum, it's not necessary to retain
any tuples already marked as dead, so ignore them when deciding which
tuples must be moved in order to empty a bucket page.  Similarly, when
splitting a bucket, relocating dead tuples to the new bucket is a
waste of effort; instead, just ignore them.

Amit Kapila, reviewed by me.  Testing help provided by Ashutosh
Sharma.
2016-11-08 10:52:51 -05:00
Tom Lane 5485c99e7f Fix silly nil-pointer-dereference bug introduced in commit d5f6f13f8.
Don't fetch record->xl_info before we've verified that record isn't
NULL.  Per Coverity.

Michael Paquier
2016-11-06 11:29:40 -05:00
Tom Lane d5f6f13f8e Be more consistent about masking xl_info with ~XLR_INFO_MASK.
Generally, WAL resource managers are only supposed to examine the
top 4 bits of a WAL record's xl_info; the rest are reserved for
the WAL mechanism itself.  A few places were not consistent about
doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records.
There's no bug currently, since no additional bits ever get set in
these specific record types, but that might not be true forever.
Let's follow the generic coding rule here too.

Michael Paquier
2016-11-04 13:26:49 -04:00
Robert Haas 33839b5ffb Fix leftover reference to background writer performing checkpoints.
This was changed in PostgreSQL 9.2, but somehow this comment never
got updated.
2016-10-28 09:09:00 -04:00
Robert Haas f267c1c244 Fix possible pg_basebackup failure on standby with "include WAL".
If a restartpoint flushed no dirty buffers, it could fail to update
the minimum recovery point, leading to a minimum recovery point prior
to the starting REDO location.  perform_base_backup() would interpret
that as meaning that no WAL files at all needed to be included in the
backup, failing an internal sanity check.  To fix, have restartpoints
always update the minimum recovery point to just after the checkpoint
record itself, so that the file (or files) containing the checkpoint
record will always be included in the backup.

Code by Amit Kapila, per a design suggestion by me, with some
additional work on the code comment by me.  Test case by Michael
Paquier.  Report by Kyotaro Horiguchi.
2016-10-27 11:19:51 -04:00
Alvaro Herrera 00f15338b2 Preserve commit timestamps across clean restart
An oversight in setting the boundaries of known commit timestamps during
startup caused old commit timestamps to become inaccessible after a
server restart.

Author and reporter: Julien Rouhaud
Review, test code: Craig Ringer
2016-10-24 09:45:48 -03:00
Robert Haas 919c811ca1 Fix comment formatting. 2016-10-21 12:04:21 -04:00
Robert Haas f82ec32ac3 Rename "pg_xlog" directory to "pg_wal".
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believe that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences.  So, rename the directory to "pg_wal".

This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.

We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal".  However, that's still under discussion, so let's do just this
much for now.

Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com

Michael Paquier
2016-10-20 11:32:18 -04:00
Heikki Linnakangas 917dc7d239 Fix WAL-logging of FSM and VM truncation.
When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:

ERROR:  could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.

To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.

Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.

Analysis by Pavan Deolasee.

Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
2016-10-19 14:26:05 +03:00
Tom Lane 5c80642aa8 Remove unnecessary int2vector-specific hash function and equality operator.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys.  However, there are
no catcaches that use such columns.  (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)

Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support.  This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass.  We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.

If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector.  (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.

Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns.  While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible.  Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.

Amit Langote

Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
2016-10-12 14:54:08 -04:00
Tom Lane 2f1eaf87e8 Drop server support for FE/BE protocol version 1.0.
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3.  There's no reason to
believe any other client-side code still uses that protocol, either.

Discussion: <2661.1475849167@sss.pgh.pa.us>
2016-10-11 12:19:18 -04:00
Robert Haas 6f3bd98ebf Extend framework from commit 53be0b1ad to report latch waits.
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h.  This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.

Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.

Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity.  We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.

Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
2016-10-04 11:01:42 -04:00
Tom Lane fdc9186f7e Replace the built-in GIN array opclasses with a single polymorphic opclass.
We had thirty different GIN array opclasses sharing the same operators and
support functions.  That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types.  What we want is a single
polymorphic opclass for "anyarray".  There were two missing features needed
to make this possible:

1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY.  This just takes a few
more lines in index_create().  Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.

2. We have to be able to identify the proper GIN compare function for
the index storage type.  This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it.  Again, that seems pretty generically useful.

Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup.  It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.

Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible.  Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY.  However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.

Tom Lane, reviewed by Enrique Meneses

Discussion: <14436.1470940379@sss.pgh.pa.us>
2016-09-26 14:52:44 -04:00
Tom Lane 959ea7fa76 Remove useless code.
Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.

Aleksander Alekseev

Discussion: <20160923090814.GB1238@e733>
2016-09-23 10:44:50 -04:00
Peter Eisentraut ebdf5bf7d1 Delay updating control file to "in production"
Move the updating of the control file to "in production" status until
the point where WAL writes are allowed.  Before, there could be a
significant gap between the control file update and write transactions
actually being allowed.  This makes it more reliable to use the control
status to verify the end of a promotion.

From: Michael Paquier <michael.paquier@gmail.com>
2016-09-21 12:00:00 -04:00
Heikki Linnakangas 45310221a9 Fix outdated comments, GIST search queue is not an RBTree anymore.
The GiST search queue is implemented as a pairing heap rather than as
Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments
in that commit.
2016-09-20 11:38:25 +03:00
Robert Haas 445a38aba2 Have heapam.h include lockdefs.h rather than lock.h.
lockdefs.h was only split from lock.h relatively recently, and
represents a minimal subset of the old lock.h.  heapam.h only needs
that smaller subset, so adjust it to include only that.  This requires
some corresponding adjustments elsewhere.

Peter Geoghegan
2016-09-13 09:21:35 -04:00
Tom Lane 24992c6db9 Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
The full generality of deleting an arbitrary number of tuples is no longer
needed, so let's save some code and cycles by replacing the original coding
with an implementation based on PageIndexTupleDelete.

We can always get back the old code from git if we need it again for new
callers (though I don't care for its willingness to mess with line pointers
it wasn't told to mess with).

Discussion: <552.1473445163@sss.pgh.pa.us>
2016-09-09 19:00:59 -04:00
Tom Lane b1328d78f8 Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.
PageIndexTupleOverwrite performs approximately the same function as
PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem
targeting the same item pointer offset.  But in the case where the new
tuple is the same size as the old, it avoids shuffling other data around on
the page, because the new tuple is placed where the old one was rather than
being appended to the end of the page.  This has been shown to provide a
substantial speedup for some GiST use-cases.

Also, this change allows some API simplifications: we can get rid of
the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for
PageAddItemExtended, since that was used only to cover a corner case
for BRIN that's better expressed by using PageIndexTupleOverwrite.

Note that this patch causes a rather subtle WAL incompatibility: the
physical page content change represented by certain WAL records is now
different than it was before, because while the tuples have the same
itempointer line numbers, the tuples themselves are in different places.
I have not bumped the WAL version number because I think it doesn't matter
unless you are trying to do bitwise comparisons of original and replayed
pages, and in any case we're early in a devel cycle and there will probably
be more WAL changes before v10 gets out the door.

There is probably room to make use of PageIndexTupleOverwrite in SP-GiST
and GIN too, but that is left for a future patch.

Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit
by me

Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
2016-09-09 18:02:36 -04:00
Alvaro Herrera 5c609a742f Fix locking a tuple updated by an aborted (sub)transaction
When heap_lock_tuple decides to follow the update chain, it tried to
also lock any version of the tuple that was created by an update that
was subsequently rolled back.  This is pointless, since for all intents
and purposes that tuple exists no more; and moreover it causes
misbehavior, as reported independently by Marko Tiikkaja and Marti
Raudsepp: some SELECT FOR UPDATE/SHARE queries may fail to return
the tuples, and assertion-enabled builds crash.

Fix by having heap_lock_updated_tuple test the xmin and return success
immediately if the tuple was created by an aborted transaction.

The condition where tuples become invisible occurs when an updated tuple
chain is followed by heap_lock_updated_tuple, which reports the problem
as HeapTupleSelfUpdated to its caller heap_lock_tuple, which in turn
propagates that code outwards possibly leading the calling code
(ExecLockRows) to believe that the tuple exists no longer.

Backpatch to 9.3.  Only on 9.5 and newer this leads to a visible
failure, because of commit 27846f02c176; before that, heap_lock_tuple
skips the whole dance when the tuple is already locked by the same
transaction, because of the ancient HeapTupleSatisfiesUpdate behavior.
Still, the buggy condition may also exist in more convoluted scenarios
involving concurrent transactions, so it seems safer to fix the bug in
the old branches too.

Discussion:
	https://www.postgresql.org/message-id/CABRT9RC81YUf1=jsmWopcKJEro=VoeG2ou6sPwyOUTx_qteRsg@mail.gmail.com
	https://www.postgresql.org/message-id/48d3eade-98d3-8b9a-477e-1a8dc32a724d@joh.to
2016-09-09 15:54:29 -03:00
Simon Riggs ec253de1fd Fix corruption of 2PC recovery with subxacts
Reading 2PC state files during recovery was borked, causing corruptions during
recovery. Effect limited to servers with 2PC, subtransactions and
recovery/replication.

Stas Kelvich, reviewed by Michael Paquier and Pavan Deolasee
2016-09-09 11:55:12 +01:00
Simon Riggs 67c6bd1ca3 Fix minor memory leak in Standby startup
StandbyRecoverPreparedTransactions() leaked the buffer
used for two phase state file. This was leaked once
at startup and at every shutdown checkpoint seen.

Backpatch to 9.6

Stas Kelvich
2016-09-08 10:32:58 +01:00
Peter Eisentraut 49eb0fd097 Add location field to DefElem
Add a location field to the DefElem struct, used to parse many utility
commands.  Update various error messages to supply error position
information.

To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands.  This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
2016-09-06 12:00:00 -04:00
Tom Lane 60893786d5 Fix corrupt GIN_SEGMENT_ADDITEMS WAL records on big-endian hardware.
computeLeafRecompressWALData() tried to produce a uint16 WAL log field by
memcpy'ing the first two bytes of an int-sized variable.  That accidentally
works on little-endian hardware, but not at all on big-endian.  Replay then
thinks it's looking at an ADDITEMS action with zero entries, and reads the
first two bytes of the first TID therein as the next segno/action,
typically leading to "unexpected GIN leaf action" errors during replay.
Even if replay failed to crash, the resulting GIN index page would surely
be incorrect.  To fix, just declare the variable as uint16 instead.

Per bug #14295 from Spencer Thomason (much thanks to Spencer for turning
his problem into a self-contained test case).  This likely also explains
a previous report of the same symptom from Bernd Helmle.

Back-patch to 9.4 where the problem was introduced (by commit 14d02f0bb).

Discussion: <20160826072658.15676.7628@wrigleys.postgresql.org>
Possible-Report: <2DA7350F7296B2A142272901@eje.land.credativ.lan>
2016-09-03 13:28:53 -04:00
Simon Riggs 35250b6ad7 New recovery target recovery_target_lsn
Michael Paquier
2016-09-03 17:48:01 +01:00
Heikki Linnakangas 9f85784cae Support multiple iterators in the Red-Black Tree implementation.
While we don't need multiple iterators at the moment, the interface is
nicer and less dangerous this way.

Aleksander Alekseev, with some changes by me.
2016-09-02 08:39:39 +03:00
Tom Lane 0e0f43d6fd Prevent starting a standalone backend with standby_mode on.
This can't really work because standby_mode expects there to be more
WAL arriving, which there will not ever be because there's no WAL
receiver process to fetch it.  Moreover, if standby_mode is on then
hot standby might also be turned on, causing even more strangeness
because that expects read-only sessions to be executing in parallel.
Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid
got confused, but rather than band-aiding individual problems it seems
best to prevent getting anywhere near this state in the first place.
Back-patch to all supported branches.

In passing, also fix some omissions of errcodes in other ereport's in
readRecoveryCommandFile().

Michael Paquier (errcode hacking by me)

Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>
2016-08-31 08:52:13 -04:00
Alvaro Herrera 8e1e3f958f Split hash.h → hash_xlog.h
Since the hash AM is going to be revamped to have WAL, this is a good
opportunity to clean up the include file a little bit to avoid including
a lot of extra stuff in the future.

Author: Amit Kapila
2016-08-29 18:55:49 -03:00
Fujii Masao bd082231ed Fix typos in comments. 2016-08-29 16:06:40 +09:00
Fujii Masao bab7823a49 Fix pg_xlogdump so that it handles cross-page XLP_FIRST_IS_CONTRECORD record.
Previously pg_xlogdump failed to dump the contents of the WAL file
if the file starts with the continuation WAL record which spans
more than one pages. Since pg_xlogdump assumed that the continuation
record always fits on a page, it could not find the valid WAL record to
start reading from in that case.

This patch changes pg_xlogdump so that it can handle a continuation
WAL record which crosses a page boundary and find the valid record
to start reading from.

Back-patch to 9.3 where pg_xlogdump was introduced.

Author: Pavan Deolasee
Reviewed-By: Michael Paquier and Craig Ringer
Discussion: CABOikdPsPByMiG6J01DKq6om2+BNkxHTPkOyqHM2a4oYwGKsqQ@mail.gmail.com
2016-08-29 14:34:58 +09:00
Tom Lane ea268cdc9a Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters.  While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea.  Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.

While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.

In passing, change TopMemoryContext to use the default allocation
parameters.  Formerly it could only be extended 8K at a time.  That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there.  There seems no good reason
not to let it use growing blocks like most other contexts.

Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain.  The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.

Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 17:50:38 -04:00
Tom Lane 78dcd027e8 Fix potential memory leakage from HandleParallelMessages().
HandleParallelMessages leaked memory into the caller's context.  Since it's
called from ProcessInterrupts, there is basically zero certainty as to what
CurrentMemoryContext is, which means we could be leaking into long-lived
contexts.  Over the processing of many worker messages that would grow to
be a problem.  Things could be even worse than just a leak, if we happened
to service the interrupt while ErrorContext is current: elog.c thinks it
can reset that on its own whim, possibly yanking storage out from under
HandleParallelMessages.

Give HandleParallelMessages its own dedicated context instead, which we can
reset during each call to ensure there's no accumulation of wasted memory.

Discussion: <16610.1472222135@sss.pgh.pa.us>
2016-08-26 15:04:05 -04:00
Tom Lane fbf28b6b52 Fix logic for adding "parallel worker" context line to worker errors.
The previous coding here was capable of adding a "parallel worker" context
line to errors that were not, in fact, returned from a parallel worker.
Instead of using an errcontext callback to add that annotation, just paste
it onto the message by hand; this looks uglier but is more reliable.

Discussion: <19757.1472151987@sss.pgh.pa.us>
2016-08-26 10:07:28 -04:00
Tom Lane ae4760d667 Fix small query-lifespan memory leak in bulk updates.
When there is an identifiable REPLICA IDENTITY index on the target table,
heap_update leaks the id_attrs bitmapset.  That's not many bytes, but it
adds up over enough rows, since the code typically runs in a query-lifespan
context.  Bug introduced in commit e55704d8b, which did a rather poor job
of cloning the existing use-pattern for RelationGetIndexAttrBitmap().

Per bug #14293 from Zhou Digoal.  Back-patch to 9.4 where the bug was
introduced.

Report: <20160824114320.15676.45171@wrigleys.postgresql.org>
2016-08-24 22:20:25 -04:00
Tom Lane 71e006f031 Suppress compiler warnings in non-cassert builds.
With Asserts off, these variables are set but never used, resulting
in warnings from pickier compilers.  Fix that with our standard solution.
Per report from Jeff Janes.
2016-08-23 23:21:10 -04:00
Tom Lane d2ddee63b4 Improve SP-GiST opclass API to better support unlabeled nodes.
Previously, the spgSplitTuple action could only create a new upper tuple
containing a single labeled node.  This made it useless for opclasses
that prefer to work with fixed sets of nodes (labeled or otherwise),
which meant that restrictive prefixes could not be used with such
node definitions.  Change the output field set for the choose() method
to allow it to specify any valid node set for the new upper tuple,
and to specify which of these nodes to place the modified lower tuple in.

In addition to its primary use for fixed node sets, this feature could
allow existing opclasses that use variable node sets to skip a separate
spgAddNode action when splitting a tuple, by setting up the node needed
for the incoming value as part of the spgSplitTuple action.  However, care
would have to be taken to add the extra node only when it would not make
the tuple bigger than before.  (spgAddNode can enlarge the tuple,
spgSplitTuple can't.)

This is a prerequisite for an upcoming SP-GiST inet opclass, but is
being committed separately to increase the visibility of the API change.

In passing, improve the documentation about the traverse-values feature
that was added by commit ccd6eb49a.

Emre Hasegeli, with cosmetic adjustments and documentation rework by me

Discussion: <CAE2gYzxtth9qatW_OAqdOjykS0bxq7AYHLuyAQLPgT7H9ZU0Cw@mail.gmail.com>
2016-08-23 12:10:34 -04:00
Andres Freund 07ef035129 Fix deletion of speculatively inserted TOAST on conflict
INSERT ..  ON CONFLICT runs a pre-check of the possible conflicting
constraints before performing the actual speculative insertion.  In case
the inserted tuple included TOASTed columns the ON CONFLICT condition
would be handled correctly in case the conflict was caught by the
pre-check, but if two transactions entered the speculative insertion
phase at the same time, one would have to re-try, and the code for
aborting a speculative insertion did not handle deleting the
speculatively inserted TOAST datums correctly.

TOAST deletion would fail with "ERROR: attempted to delete invisible
tuple" as we attempted to remove the TOAST tuples using
simple_heap_delete which reasoned that the given tuples should not be
visible to the command that wrote them.

This commit updates the heap_abort_speculative() function which aborts
the conflicting tuple to use itself, via toast_delete, for deleting
associated TOAST datums.  Like before, the inserted toast rows are not
marked as being speculative.

This commit also adds a isolationtester spec test, exercising the
relevant code path. Unfortunately 9.5 cannot handle two waiting
sessions, and thus cannot execute this test.

Reported-By: Viren Negi, Oskari Saarenmaa
Author: Oskari Saarenmaa, edited a bit by me
Bug: #14150
Discussion: <20160519123338.12513.20271@wrigleys.postgresql.org>
Backpatch: 9.5, where ON CONFLICT was introduced
2016-08-17 17:03:36 -07:00
Peter Eisentraut f0fe1c8f70 Fix typos
From: Alexander Law <exclusion@gmail.com>
2016-08-16 14:52:29 -04:00
Tom Lane b5bce6c1ec Final pgindent + perltidy run for 9.6. 2016-08-15 13:42:51 -04:00
Tom Lane ed0097e4f9 Add SQL-accessible functions for inspecting index AM properties.
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35).  The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.

As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column.  Also provide the ability for an index
AM to override the behavior.

In passing, document pg_am.amtype, overlooked in commit 473b93287.

Andrew Gierth, with kibitzing by me and others

Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
2016-08-13 18:31:14 -04:00
Tom Lane e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00
Robert Haas 81c766b3fd Change InitToastSnapshot to a macro.
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides.  Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.

Report and review by Tom Lane; patch by me.
2016-08-05 11:58:03 -04:00
Andres Freund e7caacf733 Fix hard to hit race condition in heapam's tuple locking code.
As mentioned in its commit message, eca0f1db left open a race condition,
where a page could be marked all-visible, after the code checked
PageIsAllVisible() to pin the VM, but before the page is locked.  Plug
that hole.

Reviewed-By: Robert Haas, Andres Freund
Author: Amit Kapila
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-08-04 20:07:16 -07:00
Robert Haas 3e2f3c2e42 Prevent "snapshot too old" from trying to return pruned TOAST tuples.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning.  Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.

Reported by Andres Freund, who also sketched out what the fix should
look like.  Patch by me, reviewed by Amit Kapila.
2016-08-03 16:50:01 -04:00
Tom Lane b6a97b91ff Block interrupts during HandleParallelMessages().
As noted by Alvaro, there are CHECK_FOR_INTERRUPTS() calls in the shm_mq.c
functions called by HandleParallelMessages().  I believe they're all
unreachable since we always pass nowait = true, but it doesn't seem like
a great idea to assume that no such call will ever be reachable from
HandleParallelMessages().  If that did happen, there would be a risk of a
recursive call to HandleParallelMessages(), which it does not appear to be
designed for --- for example, there's nothing that would prevent
out-of-order processing of received messages.  And certainly such cases
cannot easily be tested.  So let's prevent it by holding off interrupts for
the duration of the function.  Back-patch to 9.5 which contains identical
code.

Discussion: <14869.1470083848@sss.pgh.pa.us>
2016-08-02 16:39:16 -04:00
Tom Lane a5fe473ad7 Minor cleanup for access/transam/parallel.c.
ParallelMessagePending *must* be marked volatile, because it's set
by a signal handler.  On the other hand, it's pointless for
HandleParallelMessageInterrupt to save/restore errno; that must be,
and is, done at the outer level of the SIGUSR1 signal handler.

Calling CHECK_FOR_INTERRUPTS() inside HandleParallelMessages, which itself
is called from CHECK_FOR_INTERRUPTS(), seems both useless and hazardous.
The comment claiming that this is needed to handle the error queue going
away is certainly misguided, in any case.

Improve a couple of error message texts, and use
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE to report loss of parallel worker
connection, since that's what's used in e.g. tqueue.c.  (Maybe it would be
worth inventing a dedicated ERRCODE for this type of failure?  But I do not
think ERRCODE_INTERNAL_ERROR is appropriate.)

Minor stylistic cleanups.
2016-08-01 16:12:01 -04:00
Peter Eisentraut ef5d4a3cfa Message style improvements 2016-07-28 16:34:44 -04:00
Robert Haas 1091402b5a Remove unused structure member.
Michael Paquier
2016-07-21 11:53:44 -04:00
Andres Freund eca0f1db14 Clear all-frozen visibilitymap status when locking tuples.
Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag.  Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear(). To avoid having to use
visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical
section, have visibilitymap_clear() return whether any bits have been
reset.

There's a remaining issue (denoted by XXX): After the PageIsAllVisible()
check in heap_lock_tuple() and heap_lock_updated_tuple_rec() the page
status could theoretically change. Practically that currently seems
impossible, because updaters will hold a page level pin already.  Due to
the next beta coming up, it seems better to get the required WAL magic
bump done before resolving this issue.

The added flags field fields to xl_heap_lock and xl_heap_lock_updated
require bumping the WAL magic. Since there's already been a catversion
bump since the last beta, that's not an issue.

Reviewed-By: Robert Haas, Amit Kapila and Andres Freund
Author: Masahiko Sawada, heavily revised by Andres Freund
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: -
2016-07-18 02:01:13 -07:00
Tom Lane 9563d5b5e4 Add regression test case exercising the sorting path for hash index build.
We've broken this code path at least twice in the past, so it's prudent
to have a test case that covers it.  To allow exercising the code path
without creating a very large (and slow to run) test case, redefine the
sort threshold to be bounded by maintenance_work_mem as well as the number
of available buffers.  While at it, fix an ancient oversight that when
building a temp index, the number of available buffers is not NBuffers but
NLocBuffer.  Also, if assertions are enabled, apply a direct test that the
sort actually does return the tuples in the expected order.

Peter Geoghegan

Patch: <CAM3SWZTBAo4hjbBd780+MrOKiKp_TMo1N3A0Rw9_im8gbD7fQA@mail.gmail.com>
2016-07-16 15:30:15 -04:00
Andres Freund bfa2ab56bb Fix torn-page, unlogged xid and further risks from heap_update().
When heap_update needs to look for a page for the new tuple version,
because the current one doesn't have sufficient free space, or when
columns have to be processed by the tuple toaster, it has to release the
lock on the old page during that. Otherwise there'd be lock ordering and
lock nesting issues.

To avoid concurrent sessions from trying to update / delete / lock the
tuple while the page's content lock is released, the tuple's xmax is set
to the current session's xid.

That unfortunately was done without any WAL logging, thereby violating
the rule that no XIDs may appear on disk, without an according WAL
record.  If the database were to crash / fail over when the page level
lock is released, and some activity lead to the page being written out
to disk, the xid could end up being reused; potentially leading to the
row becoming invisible.

There might be additional risks by not having t_ctid point at the tuple
itself, without having set the appropriate lock infomask fields.

To fix, compute the appropriate xmax/infomask combination for locking
the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK
record. That allows the fix to be backpatched.

This issue has existed for a long time. There appears to have been
partial attempts at preventing dangers, but these never have fully been
implemented, and were removed a long time ago, in
11919160 (cf. HEAP_XMAX_UNLOGGED).

In master / 9.6, there's an additional issue, namely that the
visibilitymap's freeze bit isn't reset at that point yet. Since that's a
new issue, introduced only in a892234f83, that'll be fixed in a
separate commit.

Author: Masahiko Sawada and Andres Freund
Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: 9.1/all supported versions
2016-07-15 17:49:48 -07:00
Andres Freund a4d357bfbd Make HEAP_LOCK/HEAP2_LOCK_UPDATED replay reset HEAP_XMAX_INVALID.
0ac5ad5 started to compress infomask bits in WAL records. Unfortunately
the replay routines for XLOG_HEAP_LOCK/XLOG_HEAP2_LOCK_UPDATED forgot to
reset the HEAP_XMAX_INVALID (and some other) hint bits.

Luckily that's not problematic in the majority of cases, because after a
crash/on a standby row locks aren't meaningful. Unfortunately that does
not hold true in the presence of prepared transactions. This means that
after a crash, or after promotion, row level locks held by a prepared,
but not yet committed, prepared transaction might not be enforced.

Discussion: 20160715192319.ubfuzim4zv3rqnxv@alap3.anarazel.de
Backpatch: 9.3, the oldest branch on which 0ac5ad5 is present.
2016-07-15 14:45:37 -07:00
Alvaro Herrera 533e9c6b06 Avoid serializability errors when locking a tuple with a committed update
When key-share locking a tuple that has been not-key-updated, and the
update is a committed transaction, in some cases we raised
serializability errors:
    ERROR:  could not serialize access due to concurrent update

Because the key-share doesn't conflict with the update, the error is
unnecessary and inconsistent with the case that the update hasn't
committed yet.  This causes problems for some usage patterns, even if it
can be claimed that it's sufficient to retry the aborted transaction:
given a steady stream of updating transactions and a long locking
transaction, the long transaction can be starved indefinitely despite
multiple retries.

To fix, we recognize that HeapTupleSatisfiesUpdate can return
HeapTupleUpdated when an updating transaction has committed, and that we
need to deal with that case exactly as if it were a non-committed
update: verify whether the two operations conflict, and if not, carry on
normally.  If they do conflict, however, there is a difference: in the
HeapTupleBeingUpdated case we can just sleep until the concurrent
transaction is gone, while in the HeapTupleUpdated case this is not
possible and we must raise an error instead.

Per trouble report from Olivier Dony.

In addition to a couple of test cases that verify the changed behavior,
I added a test case to verify the behavior that remains unchanged,
namely that errors are raised when a update that modifies the key is
used.  That must still generate serializability errors.  One
pre-existing test case changes behavior; per discussion, the new
behavior is actually the desired one.

Discussion: https://www.postgresql.org/message-id/560AA479.4080807@odoo.com
  https://www.postgresql.org/message-id/20151014164844.3019.25750@wrigleys.postgresql.org

Backpatch to 9.3, where the problem appeared.
2016-07-15 14:17:20 -04:00
Tom Lane 1acf757255 Fix GiST index build for NaN values in geometric types.
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values.  This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs.  The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons.  Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs.  And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.

To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN.  Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.

I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data.  It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.

Per bug #14238 from Guang-Dih Lei.  Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.

Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
2016-07-14 18:45:59 -04:00
Magnus Hagander 87d84d67bb Fix start WAL filename for concurrent backups from standby
On a standby, ThisTimelineID is always 0, so we would generate a
filename in timeline 0 even for other timelines. Instead, use starttli
which we have retreived from the controlfile.

Report by: Francesco Canovai in bug #14230
Author: Marco Nenciarini
Reviewed by: Michael Paquier and Amit Kapila
2016-07-11 12:02:31 +02:00
Robert Haas 10c0558ffe Fix several mistakes around parallel workers and client_encoding.
Previously, workers sent data to the leader using the client encoding.
That mostly worked, but the leader the converted the data back to the
server encoding.  Since not all encoding conversions are reversible,
that could provoke failures.  Fix by using the database encoding for
all communication between worker and leader.

Also, while temporary changes to GUC settings, as from the SET clause
of a function, are in general OK for parallel query, changing
client_encoding this way inside of a parallel worker is not OK.
Previously, that would have confused the leader; with these changes,
it would not confuse the leader, but it wouldn't do anything either.
So refuse such changes in parallel workers.

Also, the previous code naively assumed that when it received a
NotifyResonse from the worker, it could pass that directly back to the
user.  But now that worker-to-leader communication always uses the
database encoding, that's clearly no longer correct - though,
actually, the old way was always broken for V2 clients.  So
disassemble and reconstitute the message instead.

Issues reported by Peter Eisentraut.  Patch by me, reviewed by
Peter Eisentraut.
2016-06-30 18:35:32 -04:00
Alvaro Herrera b78364df18 Remove unused arguments in two GiST subroutines
These arguments became unused in commit 2c03216d83.  Noticed while
skimming code for unrelated development.

This is cosmetic, so no backpatch.
2016-06-28 16:01:13 -04:00
Alvaro Herrera e3ad3ffa68 Fix handling of multixacts predating pg_upgrade
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore.  In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails.  Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.

It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.

To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding.  This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean.  Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.

To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact.  (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).

This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.

Bug analysis by Andrew Gierth and Álvaro Herrera.

A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
2016-06-24 18:29:28 -04:00
Tom Lane 8cf739de85 Fix building of large (bigger than shared_buffers) hash indexes.
When the index is predicted to need more than NBuffers buckets,
CREATE INDEX attempts to sort the index entries by hash key before
insertion, so as to reduce thrashing.  This code path got broken by
commit 9f03ca9151, which overlooked that _hash_form_tuple() is not
just an alias for index_form_tuple().  The index got built anyway, but
with garbage data, so that searches for pre-existing tuples always
failed.  Fix by refactoring to separate construction of the indexable
data from calling index_form_tuple().

Per bug #14210 from Daniel Newman.  Back-patch to 9.5 where the
bug was introduced.

Report: <20160623162507.17237.39471@wrigleys.postgresql.org>
2016-06-24 16:57:36 -04:00
Alvaro Herrera 1ca4a1b5d2 Finish up XLOG_HINT renaming
Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two
places.

Backpatch to 9.3, like that commit.
2016-06-17 18:05:55 -04:00
Robert Haas 71d05a2c7b pg_visibility: Add pg_truncate_visibility_map function.
This requires some core changes as well so that we can properly
WAL-log the truncation.  Specifically, it changes the format of the
XLOG_SMGR_TRUNCATE WAL record, so bump XLOG_PAGE_MAGIC.

Patch by me, reviewed but not fully endorsed by Andres Freund.
2016-06-17 17:37:30 -04:00
Robert Haas 9c188a8454 Fix typo.
Thomas Munro
2016-06-17 12:55:30 -04:00
Robert Haas 292794f82b Remove PID from 'parallel worker' context message.
Discussion: <bfd204ab-ab1a-792a-b345-0274a09a4b5f@2ndquadrant.com>
2016-06-17 09:26:17 -04:00
Tom Lane bfb937427b Fix fuzzy thinking in ReinitializeParallelDSM().
The fact that no workers were successfully launched in the previous
iteration does not excuse us from setting up properly to try again.
This appears to explain crashes I saw in parallel regression testing
due to error_mqh being NULL when it shouldn't be.

Minor other cosmetic fixes too.
2016-06-16 15:20:29 -04:00
Robert Haas 38e9f90a22 Fix lazy_scan_heap so that it won't mark pages all-frozen too soon.
Commit a892234f83 added a new bit per
page to the visibility map fork indicating whether the page is
all-frozen, but incorrectly assumed that if lazy_scan_heap chose to
freeze a tuple then that tuple would not need to later be frozen
again. This turns out to be false, because xmin and xmax (and
conceivably xvac, if dealing with tuples from very old releases) could
be frozen at separate times.

Thanks to Andres Freund for help in uncovering and tracking down this
issue.
2016-06-15 14:30:06 -04:00
Tom Lane cae1c788b9 Improve the situation for parallel query versus temp relations.
Transmit the leader's temp-namespace state to workers.  This is important
because without it, the workers do not really have the same search path
as the leader.  For example, there is no good reason (and no extant code
either) to prevent a worker from executing a temp function that the
leader created previously; but as things stood it would fail to find the
temp function, and then either fail or execute the wrong function entirely.

We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.

Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session.  While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway.  We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables.  (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)

Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open().  This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.

Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
2016-06-09 20:16:11 -04:00
Robert Haas 4bc424b968 pgindent run for 9.6 2016-06-09 18:02:36 -04:00
Robert Haas c9ce4a1c61 Eliminate "parallel degree" terminology.
This terminology provoked widespread complaints.  So, instead, rename
the GUC max_parallel_degree to max_parallel_workers_per_gather
(leaving room for a possible future GUC max_parallel_workers that acts
as a system-wide limit), and rename the parallel_degree reloption to
parallel_workers.  Rename structure members to match.

These changes create a dump/restore hazard for users of PostgreSQL
9.6beta1 who have set the reloption (or applied the GUC using ALTER
USER or ALTER DATABASE).
2016-06-09 10:00:26 -04:00
Peter Eisentraut 5c6d2a5e7c Message style and wording fixes 2016-06-07 14:18:55 -04:00
Robert Haas c6dbf1fe79 Stop the executor if no more tuples can be sent from worker to leader.
If a Gather node has read as many tuples as it needs (for example, due
to Limit) it may detach the queue connecting it to the worker before
reading all of the worker's tuples.  Rather than let the worker
continue to generate and send all of the results, have it stop after
sending the next tuple.

More could be done here to stop the worker even quicker, but this is
about as well as we can hope to do for 9.6.

This is in response to a problem report from Andreas Seltenreich.
Commit 44339b892a should be actually be
sufficient to fix that example even without this change, but it seems
better to do this, too, since we might otherwise waste quite a large
amount of effort in one or more workers.

Discussion: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com

Amit Kapila
2016-06-06 14:52:58 -04:00
Robert Haas 932b97a011 Fix typo.
Jim Nasby
2016-06-06 07:58:50 -04:00
Greg Stark e1623c3959 Fix various common mispellings.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.

Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
2016-06-03 16:08:45 +01:00
Robert Haas fdfaccfa79 Cosmetic improvements to freeze map code.
Per post-commit review comments from Andres Freund, improve variable
names, comments, and in one place, slightly improve the code structure.

Masahiko Sawada
2016-06-03 08:43:41 -04:00
Kevin Grittner 7392eed7c2 Fix btree mark/restore bug.
Commit 2ed5b87f96 introduced a bug in
mark/restore, in an attempt to optimize repeated restores to the
same page.  This caused an assertion failure during a merge join
which fed directly from an index scan, although the impact would
not be limited to that case.  Revert the bad chunk of code from
that commit.

While investigating this bug it was discovered that a particular
"paranoia" set of the mark position field would not prevent bad
behavior; it would just make it harder to diagnose.  Change that
into an assertion, which will draw attention to any future problem
in that area more directly.

Backpatch to 9.5, where the bug was introduced.

Bug #14169 reported by Shinta Koyanagi.
Preliminary analysis by Tom Lane identified which commit caused
the bug.
2016-06-02 12:23:01 -05:00
Alvaro Herrera 975ad4e602 Fix PageAddItem BRIN bug
BRIN was relying on the ability to remove a tuple from an index page,
then putting another tuple in the same line pointer.  But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer.  PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
	WARNING:  specified item offset is too large
	PANIC:  failed to add BRIN tuple

To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).

Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if an TID found in the revmap
is not marked as live by the page header.  This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.

Backpatch to 9.5.

Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
2016-05-30 14:47:22 -04:00
Tom Lane 1e0d6512e5 Fix BTREE_BUILD_STATS build.
Commit 65c5fcd353 broke this by removing a
header include directive that is conditionally required.  Add that back
to nbtree.c, with annotation to keep pgrminclude from re-breaking it.

Peter Geoghegan

Report: <CAM3SWZTNjHFYW_UG8bu0BnogqQ2HfsTgkzXLueuUhfTcYbu5HA@mail.gmail.com>
2016-05-23 19:41:11 -04:00
Teodor Sigaev 7c979c95a3 Allocate all page images at once in generic wal interface
That reduces number of allocation.

Per gripe from Michael Paquier and Tom Lane suggestion.
2016-05-17 22:09:22 +03:00
Teodor Sigaev 7c8345f67f Correctly align page's images in generic wal API
Page image should be MAXALIGN'ed because existing code could directly align
pointers in page instead of align offset from beginning of page.

Found during play with indexes as extenstion, Alexander Korotkov and me
2016-05-17 00:01:35 +03:00
Alvaro Herrera cca2a27860 Fix bogus comments
Some comments mentioned XLogReplayBuffer, but there's no such function:
that was an interim name for a function that got renamed to
XLogReadBufferForRedo, before commit 2c03216d83 was pushed.
2016-05-12 16:07:07 -03:00
Alvaro Herrera bdb9e3dc1d Fix obsolete comment 2016-05-12 15:36:51 -03:00
Tom Lane 9eb7a0ac6b Fix poorly-worded log message.
Euler Taveira
2016-05-08 01:37:07 -04:00
Robert Haas c7ea68ff8d Limit maximum parallel degree to 1024.
This new limit affects both the max_parallel_degree GUC and the
parallel_degree reloption.  There may some day be a use case for using
more than 1024 CPUs for a single query, but that's surely not the case
right now.  Not only do not very many people have that many CPUs, but
the code hasn't been tested at that kind of scale and is very unlikely
to perform well, or even work at all, without a lot more work.  The
issue addressed by commit 06bd458cb8 is
probably just one problem of many.

The idea of a more reasonable limit here was suggested by Tom Lane;
the value of 1024 was suggested by Amit Kapila.
2016-05-06 14:50:54 -04:00
Robert Haas 06bd458cb8 Use mul_size when multiplying by the number of parallel workers.
That way, if the result overflows size_t, you'll get an error instead
of undefined behavior, which seems like a plus.  This also has the
effect of casting the number of workers from int to Size, which is
better because it's harder to overflow int than size_t.

Dilip Kumar reported this issue and provided a patch upon which this
patch is based, but his version did use mul_size.
2016-05-06 14:32:58 -04:00
Kevin Grittner 2cc41acd8f Fix hash index vs "snapshot too old" problemms
Hash indexes are not WAL-logged, and so do not maintain the LSN of
index pages.  Since the "snapshot too old" feature counts on
detecting error conditions using the LSN of a table and all indexes
on it, this makes it impossible to safely do early vacuuming on any
table with a hash index, so add this to the tests for whether the
xid used to vacuum a table can be adjusted based on
old_snapshot_threshold.

While at it, add a paragraph to the docs for old_snapshot_threshold
which specifically mentions this and other aspects of the feature
which may otherwise surprise users.

Problem reported and patch reviewed by Amit Kapila
2016-05-06 07:47:12 -05:00
Alvaro Herrera c1543a81a7 Revert timeline following in replication slots
This reverts commits f07d18b6e9, 82c83b3372, 3a3b309041, and
24c5f1a103.

This feature has shown enough immaturity that it was deemed better to
rip it out before rushing some more fixes at the last minute.  There are
discussions on larger changes in this area for the next release.
2016-05-04 17:32:22 -03:00
Teodor Sigaev f8467f7da8 Prevent to use magic constants
Use macroses for definition amstrategies/amsupport fields instead of
hardcoded values.

Author: Nikolay Shaplov with addition for contrib/bloom
2016-04-28 16:39:25 +03:00
Teodor Sigaev e2c79e14d9 Prevent multiple cleanup process for pending list in GIN.
Previously, ginInsertCleanup could exit early if it detects that someone else
is cleaning up the pending list, without waiting for that someone else to
finish the job. But in this case vacuum could miss tuples to be deleted.

Cleanup process now locks metapage with a help of heavyweight
LockPage(ExclusiveLock), and it guarantees that there is no another cleanup
process at the same time. Lock is taken differently depending on caller of
cleanup process: any vacuums and gin_clean_pending_list() will be blocked
until lock becomes available, ordinary insert uses conditional lock to
prevent indefinite waiting on lock.

Insert into pending list doesn't use this lock, so insertion isn't blocked.

Also, patch adds stopping of cleanup process when at-start-cleanup-tail is
reached in order to prevent infinite cleanup in case of massive insertion. But
it will stop only for automatic maintenance tasks like autovacuum.

Patch introduces choice of limit of memory to use: autovacuum_work_mem,
maintenance_work_mem or work_mem depending on call path.

Patch for previous releases should be reworked due to changes between 9.6 and
previous ones in this area.

Discover and diagnostics by Jeff Janes and Tomas Vondra

Patch by me with some ideas of Jeff Janes
2016-04-28 16:21:42 +03:00
Andres Freund c6ff84b06a Emit invalidations to standby for transactions without xid.
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.

Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.

The solution is to WAL log such invalidations in transactions without an
xid. We considered to alternatively force-assign an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.

Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.

XXX:

Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion:
    CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
2016-04-26 20:21:54 -07:00
Tom Lane bde361fef5 Fix memory leak and other bugs in ginPlaceToPage() & subroutines.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not.  Subsequent patches
band-aided over some of the problems with this design by making things
even messier.

One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud).  This would not typically
be noticeable during retail index updates.  It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.

Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state.  There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.

To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts.  The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page.  The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path.  The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage().  Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)

In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.

Report: <571276DD.5050303@dalibo.com>
2016-04-20 14:25:15 -04:00
Kevin Grittner a343e223a5 Revert no-op changes to BufferGetPage()
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature.  Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming).  The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.

This change should have little or no effect on generated executable
code.

Motivated by the back-patching pain of Tom Lane and Robert Haas
2016-04-20 08:31:19 -05:00
Tom Lane f0e766bd7f Fix memory leak in GIN index scans.
The code had a query-lifespan memory leak when encountering GIN entries
that have posting lists (rather than posting trees, ie, there are a
relatively small number of heap tuples containing this index key value).
With a suitable data distribution this could add up to a lot of leakage.
Problem seems to have been introduced by commit 36a35c550, so back-patch
to 9.4.

Julien Rouhaud
2016-04-15 00:02:26 -04:00
Tom Lane 5713f03973 Improve API of GenericXLogRegister().
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future.  Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:42:06 -04:00
Tom Lane bdf7db8192 In generic WAL application and replay, ensure page "hole" is always zero.
The previous coding could allow the contents of the "hole" between pd_lower
and pd_upper to diverge during replay from what it had been when the update
was originally applied.  This would pose a problem if checksums were in
use, and in any case would complicate forensic comparisons between master
and slave servers.  So force the "hole" to contain zeroes, both at initial
application of a generically-logged action, and at replay.

Alexander Korotkov, adjusted slightly by me
2016-04-12 11:14:00 -04:00
Stephen Frost cd13471f2e Correct copyright for newly added genericdesc.c
It's 2016 these days (no, not entirely sure how we got here either).

Pointed out by Amit Langote
2016-04-12 08:45:09 -04:00