Commit Graph

3934 Commits

Thomas Munro 74618e77b4 Handle lack of DSM slots in parallel btree build.
If no DSM slots are available, a ParallelContext can still be
created, but its seg pointer is NULL.  Teach parallel btree build
to cope with that by falling back to a regular non-parallel build,
to avoid crashing with a segmentation fault.

Back-patch to 11, where parallel CREATE INDEX landed.

Reported-by: Nicola Contu
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGJgJEBnkuODBVomyK3MWFvDBbMVj%3Dgdt6DnRPU-5sQ6UQ%40mail.gmail.com
2020-01-31 10:25:34 +13:00
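
A minimal sketch of the fallback described above, assuming the NULL check sits right after DSM initialization in _bt_begin_parallel() (placement and cleanup calls are assumptions, not quoted from the commit):

    InitializeParallelDSM(pcxt);

    /* No DSM slots were available: back out and build serially instead. */
    if (pcxt->seg == NULL)
    {
        DestroyParallelContext(pcxt);
        ExitParallelMode();
        return;             /* caller proceeds with a non-parallel build */
    }
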
Alvaro Herrera c9d2977519 Clean up newlines following left parentheses
We used to strategically place newlines after some function call left
parentheses to make pgindent move the argument list a few chars to the
left, so that the whole line would fit under 80 chars.  However,
pgindent no longer does that, so the newlines just made the code
vertically longer for no reason.  Remove those newlines, and reflow some
of those lines so they read more naturally.

Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
2020-01-30 13:42:14 -03:00
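
A hypothetical before/after pair showing the kind of reflow involved (the call and its arguments are invented for illustration):

    /* before: newline placed right after the left parenthesis */
    appendStringInfo(
                     &buf, "%s: %d", label, value);

    /* after: the newline is removed and the arguments reflowed */
    appendStringInfo(&buf, "%s: %d",
                     label, value);
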
Alvaro Herrera 4e89c79a52 Remove excess parens in ereport() calls
Cosmetic cleanup, not worth backpatching.

Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
Reviewed-by: Tom Lane, Michael Paquier
2020-01-30 13:32:04 -03:00
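
For instance, with a generic ereport() call (not one taken from the commit), the cleanup drops the historical extra parentheses around the auxiliary-function list:

    /* before */
    ereport(ERROR,
            (errcode(ERRCODE_SYNTAX_ERROR),
             errmsg("syntax error")));

    /* after */
    ereport(ERROR,
            errcode(ERRCODE_SYNTAX_ERROR),
            errmsg("syntax error"));
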
Peter Eisentraut dc788668bb Fail if recovery target is not reached
Before, if a recovery target was configured, but the archive ended
before the target was reached, recovery would end and the server would
promote without further notice.  That was deemed to be pretty wrong.
With this change, if the recovery target is not reached, it is a fatal
error.

Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no
2020-01-29 15:58:14 +01:00
Heikki Linnakangas 30012a04a6 Fix randAccess setting in ReadRecord()
Commit 38a957316d got this backwards.

Author: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/20200128.194408.2260703306774646445.horikyota.ntt@gmail.com
2020-01-28 12:55:30 +02:00
Thomas Munro 11da6bccd1 Fix compile error on HP C.
Per build farm animal anole, after commit 6f38d4dac3.
2020-01-28 20:30:40 +13:00
Thomas Munro 6f38d4dac3 Remove dependency on HeapTuple from predicate locking functions.
The following changes make the predicate locking functions more
generic and suitable for use by future access methods:

- PredicateLockTuple() is renamed to PredicateLockTID().  It takes
  ItemPointer and inserting transaction ID instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer.

- CheckForSerializableConflictOut() no longer takes HeapTuple or buffer.

Author: Ashwin Agrawal
Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2020-01-28 13:13:04 +13:00
Heikki Linnakangas 38a957316d Refactor XLogReadRecord(), adding XLogBeginRead() function.
The signature of XLogReadRecord() required the caller to pass the starting
WAL position as argument, or InvalidXLogRecPtr to continue reading at the
end of the previous record. That's slightly awkward for the callers, as most
of them don't want to randomly jump around in the WAL stream, but start
reading at one position and then read everything from that point onwards.
Remove the 'RecPtr' argument and add a new function XLogBeginRead() to
specify the starting position instead. That's more convenient for the
callers. Also, xlogreader holds state that is reset when you change the
starting position, so having a separate function for doing that feels like
a more natural fit.

This changes the XLogFindNextRecord() function so that it no longer resets
the xlogreader's state to what it was before the call. Instead, it
positions the xlogreader at the found record, like XLogBeginRead() does.

Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/5382a7a3-debe-be31-c860-cb810c08f366%40iki.fi
2020-01-26 11:39:00 +02:00
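
A sketch of the resulting calling convention (variable names and loop body are illustrative assumptions):

    char       *errormsg;
    XLogRecord *record;

    /* Position the reader once; XLogReadRecord() no longer takes an LSN. */
    XLogBeginRead(xlogreader, startpoint);
    while ((record = XLogReadRecord(xlogreader, &errormsg)) != NULL)
    {
        /* ... process each record from startpoint onwards ... */
    }
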
Michael Paquier f942dfb952 Clarify some comments in vacuumlazy.c
Author: Justin Pryzby
Discussion: https://postgr.es/m/20200113004542.GA26045@telsasoft.com
2020-01-23 15:56:56 +09:00
Fujii Masao 41c184bc64 Add GUC ignore_invalid_pages.
Detection of WAL records having references to invalid pages
during recovery causes PostgreSQL to raise a PANIC-level error,
aborting the recovery. Setting ignore_invalid_pages to on causes
the system to ignore those WAL records (but still report a warning),
and continue recovery. This behavior may cause crashes, data loss,
propagation or hiding of corruption, or other serious problems.
However, it may allow you to get past the PANIC-level error,
finish recovery, and let the server start up.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
2020-01-22 11:56:34 +09:00
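
A rough sketch of the intended effect, assuming the new GUC simply downgrades the error level used where invalid page references are reported (an illustration, not the literal patch):

    /* PANIC as before, unless the user opted into ignoring invalid pages. */
    int         elevel = ignore_invalid_pages ? WARNING : PANIC;

    elog(elevel, "WAL contains references to invalid pages");
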
Amit Kapila 79a3efb84d Fix the computation of max dead tuples during the vacuum.
In commit 40d964ec99, we changed the way memory is allocated for dead
tuples but forgot to update the place where we compute the maximum
number of dead tuples.  This could lead to invalid memory requests.

Reported-by: Andres Freund
Diagnosed-by: Andres Freund
Author: Masahiko Sawada
Reviewed-by: Amit Kapila and Dilip Kumar
Discussion: https://postgr.es/m/20200121060020.e3cr7s7fj5rw4lok@alap3.anarazel.de
2020-01-22 07:43:51 +05:30
Heikki Linnakangas 4c87010981 Fix crash in BRIN inclusion op functions, due to missing datum copy.
The BRIN add_value() and union() functions need to make a longer-lived
copy of the argument, if they want to store it in the BrinValues struct
also passed as argument. The functions for the "inclusion operator
classes" used with box, range and inet types didn't take into account
that the union helper function might return its argument as is, without
making a copy. Check for that case, and make a copy if necessary. That
case arises at least with the range_union() function, when one of the
arguments is an 'empty' range:

CREATE TABLE brintest (n numrange);
CREATE INDEX brinidx ON brintest USING brin (n);
INSERT INTO brintest VALUES ('empty');
INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric));
INSERT INTO brintest VALUES ('(-1, 0)');

SELECT brin_desummarize_range('brinidx', 0);
SELECT brin_summarize_range('brinidx', 0);

Backpatch down to 9.5, where BRIN was introduced.

Discussion: https://www.postgresql.org/message-id/e6e1d6eb-0a67-36aa-e779-bcca59167c14%40iki.fi
Reviewed-by: Emre Hasegeli, Tom Lane, Alvaro Herrera
2020-01-20 10:36:35 +02:00
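
The essence of the fix, as a hedged sketch (variable names assumed; the real code also has to consult the attribute's typbyval/typlen):

    /* The union helper may hand back one of its arguments unchanged ... */
    result = union_helper(a, b);

    /* ... so copy it before stashing it in the long-lived BrinValues. */
    if (DatumGetPointer(result) == DatumGetPointer(a))
        result = datumCopy(result, attr->attbyval, attr->attlen);
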
Amit Kapila 40d964ec99 Allow vacuum command to process indexes in parallel.
This feature allows VACUUM to leverage multiple CPUs in order to
process indexes.  This enables us to perform index vacuuming and index
cleanup with background workers.  It adds a PARALLEL option to the VACUUM
command, with which the user can specify the number of workers used to
perform the command, limited by the number of indexes on the
table.  Specifying zero as the number of workers disables parallelism.
This option can't be used with the FULL option.

Each index is processed by at most one vacuum process.  Therefore parallel
vacuum can be used when the table has at least two indexes.

The parallel degree is either specified by the user or determined based on
the number of indexes that the table has, and is further limited by
max_parallel_maintenance_workers.  An index can participate in parallel
vacuum iff its size is greater than min_parallel_index_scan_size.

Author: Masahiko Sawada and Amit Kapila
Reviewed-by: Dilip Kumar, Amit Kapila, Robert Haas, Tomas Vondra,
Mahendra Singh and Sergei Kornilov
Tested-by: Mahendra Singh and Prabhat Sahu
Discussion:
https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
https://postgr.es/m/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0+w@mail.gmail.com
2020-01-20 07:57:49 +05:30
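
A simplified sketch of the worker-count rule described above (helper name and exact clamping are assumptions):

    /*
     * Parallel degree: the user's request, or the number of eligible
     * indexes, capped by max_parallel_maintenance_workers.
     */
    static int
    compute_parallel_degree(int nindexes_eligible, int nrequested)
    {
        int     nworkers = (nrequested > 0) ? nrequested : nindexes_eligible;

        nworkers = Min(nworkers, nindexes_eligible);
        return Min(nworkers, max_parallel_maintenance_workers);
    }
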
Alexander Korotkov 4b754d6c16 Avoid full scan of GIN indexes when possible
The strategy of a GIN index scan is driven by the opclass-specific
extract_query method.  One search mode this method can request is
GIN_SEARCH_MODE_ALL, which means that a matching tuple may contain none of
the extracted entries.  A simple example is the '!term' tsquery, which
doesn't need any term to exist in the matching tsvector.

In order to handle such a scan key, GIN calculates a virtual entry, which
contains all TIDs of all entries of the attribute.  In effect, this is a
full scan of the index attribute.  That is typically very slow, but it
allows GIN to handle some queries correctly.  However, the current
algorithm calculates such a virtual entry for each GIN_SEARCH_MODE_ALL
scan key, even when there are multiple such keys for the same attribute.
This is clearly not optimal.

This commit improves the situation by introducing "exclude only" scan keys.
Such scan keys cannot return a set of matching TIDs on their own; they can
only filter the TIDs produced by normal scan keys.  Therefore, each
attribute must have at least one normal scan key, while the rest may be
"exclude only" if their search mode is GIN_SEARCH_MODE_ALL.

The same optimization might be applied to the whole scan, not just
per-attribute, but that leads to the problem of eliminating NULL values.
There is a trade-off between multiple possible ways to do this, so we will
probably want to do it later using some cost-based decision algorithm.

Discussion: https://postgr.es/m/CAOBaU_YGP5-BEt5Cc0%3DzMve92vocPzD%2BXiZgiZs1kjY0cj%3DXBg%40mail.gmail.com
Author: Nikita Glukhov, Alexander Korotkov, Tom Lane, Julien Rouhaud
Reviewed-by: Julien Rouhaud, Tomas Vondra, Tom Lane
2020-01-18 01:11:39 +03:00
Amit Kapila 4d8a8d0c73 Introduce IndexAM fields for parallel vacuum.
Introduce new fields amusemaintenanceworkmem and amparallelvacuumoptions
in IndexAmRoutine for parallel vacuum.  amusemaintenanceworkmem tells
whether a particular IndexAM uses maintenance_work_mem or not.  This will
help in controlling the memory used by individual workers, as otherwise
each worker can consume memory equal to maintenance_work_mem.
amparallelvacuumoptions tells whether a particular IndexAM participates in
a parallel vacuum and, if so, in which phases (bulkdelete, vacuumcleanup)
of vacuum.

Author: Masahiko Sawada and Amit Kapila
Reviewed-by: Dilip Kumar, Amit Kapila, Tomas Vondra and Robert Haas
Discussion:
https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
https://postgr.es/m/CAA4eK1LmcD5aPogzwim5Nn58Ki+74a6Edghx4Wd8hAskvHaq5A@mail.gmail.com
2020-01-15 07:24:14 +05:30
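
A sketch of how an index AM handler might fill the new fields, using the option flags this work introduces (the particular settings below are illustrative, not any specific AM's choices):

    amroutine->amusemaintenanceworkmem = false; /* no private memory copy */
    amroutine->amparallelvacuumoptions =
        VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
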
Michael Paquier 7689d907bb Fix comment in heapam.c
Improvement per suggestion from Tom Lane.

Author: Daniel Gustafsson
Discussion: https://postgr.es/m/FED18699-4270-4778-8DA8-10F119A5ECF3@yesql.se
2020-01-13 17:57:38 +09:00
Amit Kapila 4e514c6180 Delete empty pages in each pass during GIST VACUUM.
Earlier, we used to postpone deleting empty pages till the second stage of
vacuum to amortize the cost of scanning internal pages.  However, that can
sometimes (say, if vacuum is canceled or errors out between the first and
second stage) delay the recycling of those pages.

Another thing is that to facilitate deleting empty pages in the second
stage, we need to share the information about internal and empty pages
between the different stages of vacuum.  It would be quite tricky to share
this information via DSM, which is required for the upcoming parallel
vacuum patch.

Also, this brings the logic to reclaim deleted pages closer to nbtree,
where we delete empty pages in each pass.

Overall, the advantages of deleting empty pages in each pass outweigh the
advantages of postponing it.

Author: Dilip Kumar, with changes by Amit Kapila
Reviewed-by: Sawada Masahiko and Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1LGr+MN0xHZpJ2dfS8QNQ1a_aROKowZB+MPNep8FVtwAA@mail.gmail.com
2020-01-13 07:59:44 +05:30
Alvaro Herrera f5d28710c7 Reimplement nullification of walsender timestamp
Make the value null only at pg_stat_activity-output time, as suggested
by Tom Lane, instead of messing with the internal state.  This should
appease buildfarm members with force_parallel_mode=regress, which are
running parallel queries on logical replication walsenders.

The fact that walsenders can run parallel queries should perhaps be
studied more carefully, but for the moment let's get rid of the red
blots in the buildfarm.

Backpatch to pg10, like the previous commit.

Discussion: https://postgr.es/m/30804.1578438763@sss.pgh.pa.us
2020-01-08 14:33:49 -03:00
Alvaro Herrera b175bd59fa pg_stat_activity: show NULL stmt start time for walsenders
Returning a non-NULL time is pointless, since a walsender is not a
process that would be running normal transactions anyway, but the code
was unintentionally exposing the process start time intermittently,
which was not only bogus but also confused monitoring systems looking
for idle transactions.  Fix by avoiding all updates in walsenders.

Backpatch to 11, where walsenders started appearing in pg_stat_activity.

Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/20191209234409.exe7osmyalwkt5j4@development
2020-01-07 17:38:48 -03:00
Robert Haas ce242ae154 tableam: New callback relation_fetch_toast_slice.
Instead of always calling heap_fetch_toast_slice during detoasting,
invoke a table AM callback which, when the toast table is a heap
table, will be heap_fetch_toast_slice.

This makes it possible for a table AM other than heap to be used
as a TOAST table. It also completes the series of commits intended
to improve the interaction of tableam with TOAST that began with
commit 8b94dab06617ef80a0901ab103ebd8754427ef5a; detoast.c is
now, hopefully, fully AM-independent.

Patch by me, reviewed by Andres Freund and Peter Eisentraut.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2020-01-07 14:36:38 -05:00
Robert Haas 83322e38da tableam: Allow choice of toast AM.
Previously, the toast table had to be implemented by the same AM that
was used for the main table, which was bad, because the detoasting
code won't work with anything but heap. This commit doesn't fix the
latter problem, although there's another patch coming which does,
but it does let you pick something that works (i.e. heap, right now).

Patch by me, reviewed by Andres Freund.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2020-01-07 14:23:25 -05:00
Peter Geoghegan fc31001123 Remove redundant incomplete split assertion.
The fastpath insert optimization's incomplete split flag Assert() is
redundant.  We'll reach the more general Assert() within
_bt_findinsertloc() in all cases. (Besides, Assert()'ing that the
rightmost page doesn't have the flag set never made much sense.)
2020-01-05 17:42:13 -08:00
Peter Geoghegan d2e5e20e57 Add xl_btree_delete optimization.
Commit 558a9165e0 taught _bt_delitems_delete() to produce its own XID
horizon on the primary.  Standbys no longer needed to generate their own
latestRemovedXid, since they could just use the explicitly logged value
from the primary instead.  The deleted offset numbers array from the
xl_btree_delete WAL record was no longer used by the REDO routine for
anything other than deleting the items.

This enables a minor optimization:  We now treat the array as buffer
state, not generic WAL data, following _bt_delitems_vacuum()'s example.
This should be a minor win, since it allows us to avoid including the
deleted items array in cases where XLogInsert() stores the whole buffer
anyway.  The primary goal here is to make the code more maintainable,
though.  Removing inessential differences between the two functions
highlights the fundamental differences that remain.

Also change xl_btree_delete to use uint32 for the size of the array of
item offsets being deleted.  This brings xl_btree_delete closer to
xl_btree_vacuum.  Furthermore, it seems like a good idea to use an
explicit-width integer type (the field was previously an "int").

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

Discussion: https://postgr.es/m/CAH2-Wzkz4TjmezzfAbaV1zYrh=fr0bCpzuJTvBe5iUQ3aUPsCQ@mail.gmail.com
2020-01-03 12:18:13 -08:00
Peter Geoghegan 0c41c83d8f Clear up btree_xlog_split() alignment comment.
Adjust a comment that describes how alignment of the new left page high
key works in btree_xlog_split(), the nbtree page split REDO routine.
The wording used before commit 2c03216d83 is much clearer, so go back
to that.
2020-01-02 18:30:25 -08:00
Peter Geoghegan 44e44bd258 Correct _bt_delitems_vacuum() lock comments.
The expectation within _bt_delitems_vacuum() is that caller has a
super-exclusive/cleanup buffer lock (not just a pin and a write lock).
2020-01-02 13:30:40 -08:00
Peter Geoghegan 4b25f5d0ba Revise BTP_HAS_GARBAGE nbtree VACUUM comments.
_bt_delitems_vacuum() comments claimed that it isn't worth another scan
of the page to avoid falsely unsetting the BTP_HAS_GARBAGE page flag
hint (this happens to be the same wording that was removed from
_bt_delitems_delete() by my recent commit fe97c61c).  The comments made
little sense, though.  The issue can't have much to do with performing a
second scan of the target leaf page, since an LP_DEAD test could easily
be performed in the first scan of the page anyway (the scan that takes
place in btvacuumpage() caller).

Revise the explanation.  It makes much more sense to frame this as an
issue about recovery conflicts.  _bt_delitems_vacuum() cannot easily
generate an XID cutoff in the same way that _bt_delitems_delete() is
designed to.

Falsely unsetting the page flag is not ideal, and is likely to happen
more often than was supposed by the original comments.  Explain why it
usually isn't a problem in practice.  There may be an argument for
_bt_delitems_vacuum() not clearing the BTP_HAS_GARBAGE bit, removing the
question of it being falsely unset by VACUUM (there may even be an
argument for not using a page level hint at all).  This can be revisited
later.
2020-01-01 17:29:41 -08:00
Peter Geoghegan c5f3b53b0e Update btree_xlog_delete() comments.
Commit fe97c61c updated LP_DEAD item deletion comments, but missed a
minor discrepancy on the REDO side.  Fix it now.

In passing, don't talk about the btree_xlog_vacuum() behavior within
btree_xlog_delete().  The reliance on XLOG_HEAP2_CLEANUP_INFO records
for recovery conflicts is already discussed within btvacuumpage() and
mentioned again in passing above btree_xlog_vacuum(), which seems
sufficient.
2020-01-01 11:32:07 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Michael Paquier 7854e07f25 Revert "Rename files and headers related to index AM"
This follows multiple complaints from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug into more before actually
happening, if it happens at all.

Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
2019-12-27 08:09:00 +09:00
Michael Paquier 1ab41a3c8e Refactor code dedicated to index vacuuming in vacuumlazy.c
The part in charge of vacuuming all the indexes of a relation was
duplicated, with the same progress-reporting handling repeated in both
copies.  While at it, update the progress reporting for heap vacuuming in the
subroutine doing the actual work, keeping the status update local.  This
way, any future caller of lazy_vacuum_heap() does not have to worry
about doing any progress reporting update.

Author: Justin Pryzby, Michael Paquier
Discussion: https://postgr.es/m/20191120210600.GC30362@telsasoft.com
2019-12-26 17:01:23 +09:00
Michael Paquier 8ce3aa9b59 Rename files and headers related to index AM
The following renaming is done so that source files related to index
access methods are more consistent with those for table access methods
(the original names used for index AMs were too generic, and could be
confused as also covering features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c.  Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.

This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.

Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
2019-12-25 10:23:39 +09:00
Alvaro Herrera c4dcd9144b Avoid splitting C string literals with \-newline
Using \ is unnecessary and ugly, so remove that.  While at it, stitch
the literals back into a single line: we've long discouraged splitting
error message literals even when they go past the 80 chars line limit,
to improve greppability.

Leave contrib/tablefunc alone.

Discussion: https://postgr.es/m/20191223195156.GA12271@alvherre.pgsql
2019-12-24 12:44:12 -03:00
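
An illustrative before/after (the message text is invented); note that with \-newline the next line's indentation becomes part of the literal, which is part of the ugliness:

    /* before: split with \-newline, hard to grep for the full message */
    errmsg("could not read block %u \
in file \"%s\"", blkno, path);

    /* after: one literal on one line */
    errmsg("could not read block %u in file \"%s\"", blkno, path);
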
Peter Geoghegan fe97c61c87 Update nbtree LP_DEAD item deletion comments.
Comments about the consequences of clearing the BTP_HAS_GARBAGE page
flag bit that apply only to VACUUM were added to code that deals with
opportunistic deletion of LP_DEAD items by commit a760893d.  The same
comment block was added to both _bt_delitems_vacuum() and
_bt_delitems_delete().  Correct _bt_delitems_delete()'s copy of the
comment block.

_bt_delitems_delete() reliably deletes items that were found by caller
to have their LP_DEAD bit set.  There is no question about whether or
not unsetting the BTP_HAS_GARBAGE bit can miss some LP_DEAD items that
were set recently.

Also tweak a related section of the nbtree README.
2019-12-22 19:57:35 -08:00
Peter Geoghegan 9f83468b35 Remove unneeded "pin scan" nbtree VACUUM code.
The REDO routine for nbtree's xl_btree_vacuum record type hasn't
performed a "pin scan" since commit 3e4b7d87 went in, so clearly there
isn't any point in VACUUM WAL-logging information that won't actually be
used.  Finish off the work of commit 3e4b7d87 (and the closely related
preceding commit 687f2cd7) by removing the code that generates this
unused information.  Also remove the REDO routine code disabled by
commit 3e4b7d87.

Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a
new "ndeleted" field.  The new field isn't actually needed right now,
since we could continue to infer the array length from the overall
record length.  However, an upcoming patch to add deduplication to
nbtree needs to add an "items updated" field to xl_btree_vacuum, so we
might as well start being explicit about the number of items now.
(Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum
struct without any fields; the C standard says that that's undefined.)

nbtree VACUUM no longer forces writing a WAL record for the last block
in the index.  Writing out a WAL record with no items for the final
block was supposed to force processing of a lastBlockVacuumed field by a
pin scan.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
2019-12-19 11:35:55 -08:00
Bruce Momjian b93e9a5c94 revert: Remove meaningless assignments in nbtree code
Reverts commit 05684c8255.

Reported-by: Tom Lane

Discussion: https://postgr.es/m/404.1576770942@sss.pgh.pa.us

Backpatch-through: master
2019-12-19 11:19:10 -05:00
Bruce Momjian 05684c8255 Remove meaningless assignments in nbtree code
Reported-by: Ranier Vilela

Discussion: https://postgr.es/m/MN2PR18MB2927BB876D12A70FDBE8F35AE3450@MN2PR18MB2927.namprd18.prod.outlook.com

Backpatch-through: master
2019-12-19 10:33:48 -05:00
Robert Haas 303640199d Fix minor problems with non-exclusive backup cleanup.
The previous coding imagined that it could call before_shmem_exit()
when a non-exclusive backup began and then remove the previously-added
handler by calling cancel_before_shmem_exit() when that backup
ended. However, this only works provided that nothing else in the
system has registered a before_shmem_exit() hook in the interim,
because cancel_before_shmem_exit() is documented to remove a callback
only if it is the latest callback registered. It also only works
if nothing can ERROR out between the time that sessionBackupState
is reset and the time that cancel_before_shmem_exit() is called, which
doesn't seem to be strictly true.

To fix, leave the handler installed for the lifetime of the session,
arrange to install it just once, and teach it to quietly do nothing if
there isn't a non-exclusive backup in process.

This is a bug, but for now I'm not going to back-patch, because the
consequences are minor. It's possible to cause a spurious warning
to be generated, but that doesn't really matter. It's also possible
to trigger an assertion failure, but production builds shouldn't
have assertions enabled.

Patch by me, reviewed by Kyotaro Horiguchi, Michael Paquier (who
preferred a different approach, but got outvoted), Fujii Masao,
and Tom Lane, and with comments by various others.

Discussion: http://postgr.es/m/CA+TgmobMjnyBfNhGTKQEDbqXYE3_rXWpc4CM63fhyerNCes3mA@mail.gmail.com
2019-12-19 09:06:54 -05:00
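
A hedged sketch of the new arrangement (function and flag names are assumptions; sessionBackupState is the existing session state): the callback is registered at most once and checks session state itself.

    static void
    nonexclusive_backup_cleanup(int code, Datum arg)
    {
        /* Quietly do nothing unless a non-exclusive backup is in progress. */
        if (sessionBackupState == SESSION_BACKUP_NON_EXCLUSIVE)
            do_pg_abort_backup();
    }

    /* At backup start, register the handler only the first time through. */
    if (!backup_cleanup_registered)
    {
        before_shmem_exit(nonexclusive_backup_cleanup, (Datum) 0);
        backup_cleanup_registered = true;
    }
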
Robert Haas e9fd0415e6 Move heap-specific detoasting logic into a separate function.
The new function, heap_fetch_toast_slice, is shared between
toast_fetch_datum_slice and toast_fetch_datum, and does all the
work of scanning the TOAST table, fetching chunks, and storing
them into the space allocated for the result varlena.

As an incidental side effect, this allows toast_fetch_datum_slice
to perform the scan with only a single scankey if all chunks are
being fetched, which might have some tiny performance benefit.

Discussion: http://postgr.es/m/CA+TgmobBzxwFojJ0zV0Own3dr09y43hp+OzU2VW+nos4PMXWEg@mail.gmail.com
2019-12-18 11:08:59 -05:00
Michael Paquier 2032645b19 Fix compiler warning in non-assert builds
Oversight in commit e1551f9.

Reported-by: Erik Rijkers
Discussion: https://postgr.es/m/b7ad911d3eaa29af9fcdb9ccb26c363c@xs4all.nl
2019-12-18 16:55:25 +09:00
Michael Paquier e1551f96e6 Refactor attribute mappings used in logical tuple conversion
Tuple conversion support in tupconvert.c is able to convert rowtypes
between two relations, inner and outer, which are logically equivalent
but have a different ordering or even dropped columns (used mainly for
inheritance tree and partitions).  This makes use of attribute mappings,
which are simple arrays made of AttrNumber elements with a length
matching the number of attributes of the outer relation.  The length of
the attribute mapping has been treated as completely independent of the
mapping itself until now, making it easy to pass down an incorrect
mapping length.

This commit refactors the code related to attribute mappings and moves
it into an independent facility called attmap.c, extracted from
tupconvert.c.  This merges the attribute mapping with its length,
avoiding the need to guess which mapping length to use, as that is
computed once, when the map is built.

This will avoid mistakes like what has been fixed in dc816e58, which has
used an incorrect mapping length by matching it with the number of
attributes of an inner relation (a child partition) instead of an outer
relation (a partitioned table).

Author: Michael Paquier
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20191121042556.GD153437@paquier.xyz
2019-12-18 16:23:02 +09:00
Michael Paquier 70116493a8 Remove shadow variables linked to RedoRecPtr in xlog.c
This changes the routines in charge of recycling WAL segments past the
last redo LSN so that they no longer use "RedoRecPtr" as a local variable,
since that name is also available in the context of the session as a
static declaration; the local variable is renamed to "lastredoptr".  This
confusion was introduced by d9fadbf, so backpatch down to v11 like that
commit.

Thanks to Tom Lane, Robert Haas, Alvaro Herrera, Mark Dilger and Kyotaro
Horiguchi for the input provided.

Author: Ranier Vilela
Discussion: https://postgr.es/m/MN2PR18MB2927F7B5F690065E1194B258E35D0@MN2PR18MB2927.namprd18.prod.outlook.com
Backpatch-through: 11
2019-12-18 10:11:13 +09:00
Robert Haas 5184f110aa Fix bad formula in previous commit.
Commit d5406dea25 used a slightly
novel, and wrong, approach to compute the length of the last
toast chunk. It worked fine unless the last chunk happened to
have the largest possible size.
2019-12-17 15:53:17 -05:00
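
The arithmetic at issue, sketched with assumed variable names; a remainder-style formula (whether the original used exactly this form is an assumption) yields 0 rather than TOAST_MAX_CHUNK_SIZE when the value's size divides evenly:

    /* wrong: fails when attrsize is an exact multiple of the chunk size */
    expected = attrsize % TOAST_MAX_CHUNK_SIZE;

    /* right: subtract the full-size chunks that precede the last one */
    if (chunkno < totalchunks - 1)
        expected = TOAST_MAX_CHUNK_SIZE;
    else
        expected = attrsize - (totalchunks - 1) * TOAST_MAX_CHUNK_SIZE;
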
Robert Haas d5406dea25 Code cleanup for toast_fetch_datum and toast_fetch_datum_slice.
Rework some of the checks for bad TOAST chunks to be a bit simpler
and easier to understand. These checks verify that (1) we get all
and only the chunk numbers we expect to see and (2) each chunk has
the expected size. However, the existing code was a bit hard to
understand, at least for me; try to make it clearer.

As part of that, have toast_fetch_datum_slice check the relationship
between endchunk and totalchunks only with an Assert() rather than
checking every chunk number against both values. There's no need to
check that relationship in production builds because it's not a
function of whether on-disk corruption is present; it's just a
question of whether the code does the right math.

Also, have toast_fetch_datum_slice() use ereport(ERROR) rather than
elog(ERROR). Commit fd6ec93bf8 made
the two functions inconsistent with each other.

Rename assorted variables for better clarity and consistency, and
move assorted variables from function scope to the function's main
loop.  Remove entirely a few variables that are used only once.

Patch by me, reviewed by Peter Eisentraut.

Discussion: http://postgr.es/m/CA+TgmobBzxwFojJ0zV0Own3dr09y43hp+OzU2VW+nos4PMXWEg@mail.gmail.com
2019-12-17 14:27:09 -05:00
Amit Kapila af3290f5e7 Change overly strict Assert in TransactionGroupUpdateXidStatus.
This Assert assumed that an overflowed transaction can never get registered
for the group update.  But that is not true, because even when the number
of children for a transaction is reduced, the overflow flag is not
reset.  And, for group update, we only care about the current number of
children for a transaction that is being committed.

Based on comments by Andres Freund, remove a redundant Assert in
TransactionIdSetPageStatus as we already had a static Assert for the same
condition a few lines earlier.

Reported-by: Vignesh C
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 11
Discussion: https://postgr.es/m/CAFiTN-s5=uJw-Z6JC9gcqtBSjXsrHnU63PXBrA=pnBjqnkm5UA@mail.gmail.com
2019-12-17 09:29:22 +05:30
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan 9067b83955 Update nbtree README's "Scans during Recovery".
get_actual_variable_range() hasn't used a dirty snapshot since commit
3ca930fc3, which invented a new snapshot type specifically to meet
selfuncs.c's requirements (HeapTupleSatisfiesNonVacuumable() type
snapshots were added).

Discussion: https://postgr.es/m/CAH2-Wzn2pSqEOcBDAA40CnO82oEy-EOpE2bNh_XL_cfFoA86jw@mail.gmail.com
2019-12-16 17:11:35 -08:00
Alvaro Herrera 91fca4bb60 Demote variable from global to local
recoveryDelayUntilTime was introduced by commit 36da3cfb45 as a global
because its method of operation was devilishly intrincate.  Commit
c945af80cf removed all that complexity and could have turned it into a
local variable, but didn't.  Do so now.

Discussion: https://postgr.es/m/20191213200751.GA10731@alvherre.pgsql
Reviewed-by: Michaël Paquier, Daniel Gustafsson
2019-12-16 14:23:56 -03:00
Heikki Linnakangas 741b884353 Fix yet another crash in page split during GiST index creation.
Commit a7ee7c8513 fixed a bug in GiST page split during index creation,
where we failed to re-find the position of a downlink after the page
containing it was split. However, that fix was incomplete; the other call
to gistinserttuples() in the same function needs to also clear
'downlinkoffnum'.

Fixes bug #16134 reported by Alexander Lakhin, for real this time. The
previous fix was enough to fix the crash with the reproducer script for
bug #16162, but the original script for #16134 was still crashing.

Backpatch to v12, like the previous incomplete fix.

Discussion: https://www.postgresql.org/message-id/d869f537-abe4-d2ea-0510-38cd053f5152%40gmail.com
2019-12-16 13:57:41 +02:00
Michael Paquier e5a02e0fc6 Remove duplicated progress reporting during heap scan of VACUUM
This was introduced by c16dc1a, when progress reporting for VACUUM
was added.  As this issue just causes some extra work and is
harmless, no backpatch is done.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20191213030831.GT2082@telsasoft.com
2019-12-15 22:05:33 +09:00
Heikki Linnakangas a7ee7c8513 Fix crash when a page was split during GiST index creation.
The bug was similar to the one that was fixed in commit 22251686f0. When
we split page X and insert the downlink for the new page, the parent page
might also need to be split. When that happens, the downlink offset number
we remembered for X is no longer valid. We correctly called
gistFindCorrectParent() to re-find it, but gistFindCorrectParent() doesn't
do anything if the LSN of the page hasn't changed, and we stopped updating
LSNs during index build in commit 9155580fd5. The buggy codepath was taken
if the page was split into three or more pages, and inserting the downlink
caused the parent page to split. To fix, explicitly mark the downlink
offset number as invalid, to force gistFindCorrectParent() to re-find it.

Fixes bug #16134 reported by Alexander Lakhin, reported again as #16162 by
Andreas Kunert. Thanks to Jeff Janes, Tom Lane and Tomas Vondra for
debugging. Backpatch to v12, where we stopped WAL-logging during index
build.

Discussion: https://www.postgresql.org/message-id/16134-0423f729671dec64%40postgresql.org
Discussion: https://www.postgresql.org/message-id/16162-45d21b7b6c1a3105%40postgresql.org
2019-12-13 23:58:10 +02:00
Michael Paquier 68ab982906 Fix thinkos from commit 9989d37
Error messages referring to incorrect WAL segment names could have been
generated for an fsync() failure or when creating a new segment at the
end of recovery.
2019-12-03 18:59:09 +09:00
Michael Paquier 9989d37d1c Remove XLogFileNameP() from the tree
XLogFileNameP() is a wrapper routine able to build a palloc'd string for
a WAL segment name, which is used for error string generation.  There
were several code paths where it gets called in a critical section,
where memory allocation is not allowed.  This results in triggering
an assertion failure instead of generating the wanted error message.

Another, more annoying, problem is that if the allocation to generate
the WAL segment name fails on OOM, then the failure would be escalated
to a PANIC.

This removes the routine; all its callers are replaced with logic
using a fixed-size buffer.  This way, all the existing mistakes are
fixed and future ones are prevented.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Álvaro Herrera
Discussion: https://postgr.es/m/CA+fd4k5gC9H4uoWMLg9K_QfNrnkkdEw+-AFveob9YX7z8JnKTA@mail.gmail.com
2019-12-03 15:06:04 +09:00
Alvaro Herrera 3974c4a724 Remove useless "return;" lines
Discussion: https://postgr.es/m/20191128144653.GA27883@alvherre.pgsql
2019-11-28 16:48:37 -03:00
Alvaro Herrera 0dc8ead463 Refactor WAL file-reading code into WALRead()
XLogReader, walsender and pg_waldump all had their own routines to read
data from WAL files to memory, with slightly different approaches
according to the particular conditions of each environment.  There's a
lot of commonality, so we can refactor that into a single routine
WALRead in XLogReader, and move the differences to a separate (simpler)
callback that just opens the next WAL-segment.  This results in a
clearer (ahem) code flow.

The error reporting needs are covered by filling in a new error-info
struct, WALReadError, and it's the caller's responsibility to act on it.
The backend has WALReadRaiseError() to do so.

We no longer ever need to seek in this interface; switch to using
pg_pread().

Author: Antonin Houska, with contributions from Álvaro Herrera
Reviewed-by: Michaël Paquier, Kyotaro Horiguchi
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-11-25 15:04:54 -03:00
Michael Paquier 4cb658af70 Refactor reloption handling for index AMs in-core
This reworks the reloption parsing and build of a couple of index AMs by
creating new structures for each index AM's options.  This split was
already done for BRIN, GIN and GiST (which actually has a fillfactor
parameter), but not for hash, B-tree and SPGiST which relied on
StdRdOptions due to an overlap with the default option set.

This saves a couple of bytes for rd_options in each relcache entry with
indexes making use of relation options, and brings more consistency
between all index AMs.  While on it, add a couple of AssertMacro() calls
to make sure that utility macros to grab values of reloptions are used
with the expected index AM.

Author: Nikolay Shaplov
Reviewed-by: Amit Langote, Michael Paquier, Álvaro Herrera, Dent John
Discussion: https://postgr.es/m/4127670.gFlpRb6XCm@x200m
2019-11-25 09:40:53 +09:00
Tom Lane 6b802cfc7f Avoid assertion failure with LISTEN in a serializable transaction.
If LISTEN is the only action in a serializable-mode transaction,
and the session was not previously listening, and the notify queue
is not empty, predicate.c reported an assertion failure.  That
happened because we'd acquire the transaction's initial snapshot
during PreCommit_Notify, which was called *after* predicate.c
expects any such snapshot to have been established.

To fix, just swap the order of the PreCommit_Notify and
PreCommit_CheckForSerializationFailure calls during CommitTransaction.
This will imply holding the notify-insertion lock slightly longer,
but the difference could only be meaningful in serializable mode,
which is an expensive option anyway.

It appears that this is just an assertion failure, with no
consequences in non-assert builds.  A snapshot used only to scan
the notify queue could not have been involved in any serialization
conflicts, so there would be nothing for
PreCommit_CheckForSerializationFailure to do except assign it a
prepareSeqNo and set the SXACT_FLAG_PREPARED flag.  And given no
conflicts, neither of those omissions affect the behavior of
ReleasePredicateLocks.  This admittedly once-over-lightly analysis
is backed up by the lack of field reports of trouble.

Per report from Mark Dilger.  The bug is old, so back-patch to all
supported branches; but the new test case only goes back to 9.6,
for lack of adequate isolationtester infrastructure before that.

Discussion: https://postgr.es/m/3ac7f397-4d5f-be8e-f354-440020675694@gmail.com
Discussion: https://postgr.es/m/13881.1574557302@sss.pgh.pa.us
2019-11-24 15:57:49 -05:00
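
The shape of the fix in CommitTransaction(), sketched with the surrounding calls omitted:

    /*
     * Acquire the notify-queue snapshot first, so that any snapshot this
     * transaction uses exists before predicate.c's pre-commit check runs.
     */
    PreCommit_Notify();
    PreCommit_CheckForSerializationFailure();
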
Peter Eisentraut 2e4db241bf Remove configure --disable-float4-byval
This build option was only useful to maintain compatibility for
version-0 functions, but those are no longer supported, so this option
can be removed.

float4 is now always pass-by-value; the pass-by-reference code path is
completely removed.

Discussion: https://www.postgresql.org/message-id/flat/f3e1e576-2749-bbd7-2d57-3f9dcf75255a@2ndquadrant.com
2019-11-21 18:29:21 +01:00
Fujii Masao e6d8069522 Make DROP DATABASE command generate fewer WAL records.
Previously DROP DATABASE generated as many XLOG_DBASE_DROP WAL records
as the number of tablespaces that the database to drop uses.  This caused
shared_buffers to be scanned once per tablespace during recovery, because
WAL replay of each XLOG_DBASE_DROP record needs that full scan.  This
could make the recovery time longer, especially when shared_buffers is
large.

This commit changes DROP DATABASE so that it generates only one
XLOG_DBASE_DROP record, and registers the information of all the
tablespaces into it.  WAL replay of the XLOG_DBASE_DROP record then needs
a full scan of shared_buffers only once, which may improve recovery
performance.

Author: Fujii Masao
Reviewed-by: Kirk Jamison, Simon Riggs
Discussion: https://postgr.es/m/CAHGQGwF8YwNH0ZaL+2wjZPkj+ji9UhC+Z4ScnG97WKtVY5L9iw@mail.gmail.com
2019-11-21 21:10:37 +09:00
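
The likely shape of the consolidated WAL record, sketched from the description above (field names are assumptions):

    typedef struct xl_dbase_drop_rec
    {
        Oid     db_id;
        int     ntablespaces;       /* number of tablespace IDs that follow */
        Oid     tablespace_ids[FLEXIBLE_ARRAY_MEMBER];
    } xl_dbase_drop_rec;
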
Peter Geoghegan 9f0f12ac57 Fix HeapTupleSatisfiesNonVacuumable() comment.
Oversight in commit 63746189b2.
2019-11-20 11:36:54 -08:00
Alexander Korotkov b107140804 Fix page modification outside of critical section in GIN
By oversight, 52ac6cd2d0 made ginDeletePage() set pd_prune_xid of the page
to be deleted before entering the critical section.  It appears that only
versions 11 and later are affected by this oversight.

Backpatch-through: 11
2019-11-20 00:12:33 +03:00
Alexander Korotkov 32ca32d0be Revise GIN README
We find GIN concurrency bugs from time to time.  One of the problems here is
that GIN concurrency isn't well-documented in the README.  So, it might even
be hard to distinguish design bugs from implementation bugs.

This commit revises the concurrency section in the GIN README, providing
more details.  Some examples are illustrated in ASCII art.

Also, this commit adds an explanation of how the tuple layout of internal
GIN B-tree pages differs from nbtree.
Discussion: https://postgr.es/m/CAPpHfduXR_ywyaVN4%2BOYEGaw%3DcPLzWX6RxYLBncKw8de9vOkqw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 9.4
2019-11-20 00:04:22 +03:00
Alexander Korotkov d5ad7a09af Fix traversing to the deleted GIN page via downlink
The current GIN code does not appear to handle traversing to a deleted page
via a downlink.  This commit fixes that by stepping right from the deleted
page, like we do in nbtree.

This commit also fixes how the 'deleted' flag is set on GIN pages: other
page flags are no longer erased once a page is deleted.  That helps to keep
our assertions true if we arrive at a deleted page via a downlink.

Discussion: https://postgr.es/m/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 9.4
2019-11-20 00:04:22 +03:00
Alexander Korotkov e14641197a Fix deadlock between ginDeletePage() and ginStepRight()
When ginDeletePage() is about to delete a page, it locks the page's left
sibling to revise the rightlink.  So, it locks pages in a right-to-left
manner.  At the same time, ginStepRight() locks pages in a left-to-right
manner, and that could cause a deadlock.

This commit makes ginScanToDelete() keep an exclusive lock on the left
siblings of the currently investigated path.  That eliminates the need to
relock the left sibling in ginDeletePage().  Thus, a deadlock with
ginStepRight() can't happen anymore.

Reported-by: Chen Huajun
Discussion: https://postgr.es/m/5c332bd1.87b6.16d7c17aa98.Coremail.chjischj%40163.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 10
2019-11-20 00:04:09 +03:00
Peter Geoghegan 2110f71696 nbtree: Tweak _bt_pgaddtup() comments.
Make it clear that _bt_pgaddtup() truncates the first data item on an
internal page because its key is supposed to be treated as minus
infinity within _bt_compare().
2019-11-18 13:04:53 -08:00
Tomas Vondra 2dc08bd617 Properly determine length for on-disk TOAST values
In detoast_attr_slice, VARSIZE_ANY was used to compute the compressed
length of on-disk TOAST values.  That's incorrect, because the varlena
value may be just a TOAST pointer, producing either a bogus value or a
crash.

This is likely why the code was crashing on big-endian machines before
540f316809 replaced the VARSIZE with VARSIZE_ANY, which however only
masked the issue.

Reported-by: Rushabh Lathia
Discussion: https://postgr.es/m/CAL-OGkthU9Gs7TZchf5OWaL-Gsi=hXqufTxKv9qpNG73d5na_g@mail.gmail.com
2019-11-16 03:07:11 +01:00
Michael Paquier 50d22de932 Cleanup code in reloptions.h regarding reloption handling
Since ba748f7, reloptions.h has included a set of macros to handle
reloption types in a way similar to how parseRelOptions() works.  They have
never been used in the core code, and we now have simpler methods to parse
and fill in rd_options for a given relation depending on its relkind, so
remove this interface to simplify things.

Per discussion between Amit Langote, Álvaro Herrera and me.

Discussion: https://postgr.es/m/CA+HiwqE6zbNO92az6pp5GiTw4tr-9rfCE0t84whQSP+YwSKjMQ@mail.gmail.com
2019-11-14 13:59:59 +09:00
Michael Paquier 1bbd608fda Split handling of reloptions for partitioned tables
Partitioned tables do not have relation options yet, but, similarly to
what's done for views which have their own parsing table, it could make
sense to introduce new parameters for some of the existing default ones
like fillfactor, autovacuum, etc.  Splitting things has the advantage of
making the information stored in rd_options include only the necessary
information, reducing the amount of memory used for a relcache entry
with partitioned tables if new reloptions are introduced at this level.

Author:  Nikolay Shaplov
Reviewed-by: Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/1627387.Qykg9O6zpu@x200m
2019-11-14 12:34:28 +09:00
Fujii Masao 7b8a899bde Make pg_waldump report more detailed information about PREPARE TRANSACTION records.
This commit changes xact_desc() so that it reports detailed information
about PREPARE TRANSACTION records: the GID (global transaction identifier),
the timestamp of the prepare, delete-on-abort/commit relations,
the XIDs of subtransactions, and invalidation messages. These are helpful
when diagnosing 2PC-related troubles.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andrey Lepikhov, Kyotaro Horiguchi, Julien Rouhaud, Alvaro Herrera
Discussion: https://postgr.es/m/CAHGQGwEvhASad4JJnCv=0dW2TJypZgW_Vpb-oZik2a3utCqcrA@mail.gmail.com
2019-11-13 16:59:17 +09:00
Peter Geoghegan 1f55ebae27 Make _bt_keep_natts_fast() use datum_image_eq().
An upcoming patch that adds deduplication to the nbtree AM will rely on
_bt_keep_natts_fast() understanding that differences in TOAST input
state can never affect its answer.  In particular, two opclass-equal
datums (with opclasses deemed safe for deduplication) should never be
treated as unequal by _bt_keep_natts_fast() due to TOAST input
differences.

This also seems like a good idea on general principle.  nbtsplitloc.c
will now occasionally make better decisions about where to split a leaf
page.  The behavior of _bt_keep_natts_fast() is now somewhat closer to
the behavior of _bt_keep_natts().

Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
2019-11-12 13:08:41 -08:00
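
A sketch of the comparison this implies inside _bt_keep_natts_fast() (loop context omitted; "att" stands for the column's pg_attribute entry):

    /* Binary-image equality is insensitive to TOAST input state. */
    if (!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
        break;                  /* first distinguishing column found */
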
Amit Kapila 14aec03502 Make the order of the header file includes consistent in backend modules.
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.

In passing, remove a couple of duplicate inclusions.

Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-12 08:30:16 +05:30
Thomas Munro 695c5977c8 Optimize TransactionIdIsCurrentTransactionId().
If the passed in xid is the current top transaction, we can do a fast
check and exit early.  This should work well for the current heap but
also works very well for proposed AMs that don't use a separate xid
for subtransactions.

Author: Ashwin Agrawal, based on a suggestion from Andres Freund
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2019-11-11 16:33:04 +13:00
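
The fast path amounts to something like the following check near the top of the function (a sketch; both helpers are existing transaction-machinery facilities):

    /* Compare against the current top-level XID before anything costly. */
    if (TransactionIdEquals(xid, GetTopTransactionIdIfAny()))
        return true;
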
Andres Freund aae50236e4 Pass ItemPointer not HeapTuple to IndexBuildCallback.
Not all AMs use HeapTuples internally, making it inconvenient to pass
a HeapTuple. As the index callbacks really only need the TID, not the
full tuple, modify the callback to take only an ItemPointer.

Author: Ashwin Agrawal
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeis6=8ehuR=VNtHvj3z16cYfCwPdTcpaxU+sfSUJ5QgR3g@mail.gmail.com
2019-11-08 11:49:29 -08:00
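
The signature change, sketched as before/after (parameter names are assumptions):

    /* before: the callback received the whole heap tuple */
    typedef void (*IndexBuildCallback) (Relation index, HeapTuple htup,
                                        Datum *values, bool *isnull,
                                        bool tupleIsAlive, void *state);

    /* after: only the TID is passed */
    typedef void (*IndexBuildCallback) (Relation index, ItemPointer tid,
                                        Datum *values, bool *isnull,
                                        bool tupleIsAlive, void *state);
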
Peter Eisentraut b85e43feb3 More precise errors from initial pg_control check
Use a separate error message for invalid checkpoint location and
invalid state instead of just "invalid data" for both.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/20191107041630.GK1768@paquier.xyz
2019-11-08 08:03:16 +01:00
Peter Geoghegan e86c8ef243 Use "low key" terminology in nbtsort.c.
nbtree index builds once stashed the "minimum key" for a page, which was
used as the basis of the pivot tuple that gets placed in the next level
up (i.e. the tuple that stores the downlink to the page in question).
It doesn't quite work that way anymore, so the "minimum key" terminology
now seems misleading (these days the minimum key is actually a straight
copy of the high key from the left sibling, which is a distinct thing in
subtle but important ways).  Rename this concept to "low key".  This
name is a lot clearer given that there is now a sharp distinction
between pivot and non-pivot tuples.  Also remove comments that describe
obsolete details about how the minimum key concept used to work.

Rather than generating the minus infinity item for the leftmost page on
a level by copying the new item and truncating that copy, simply
allocate a small buffer.  The old approach confusingly created the
impression that the new item had some kind of significance.  This was
another artifact of how things used to work before commits 8224de4f and
dd299df8.
2019-11-07 17:12:09 -08:00
Fujii Masao a0c96856e8 Fix assertion failure when running pg_waldump -s.
If there is a WAL page that a continuation WAL record just fits within
(i.e., the continuation record ends exactly at the end of the page) and
the LSN in such a page is specified with the -s option, pg_waldump
previously caused an assertion failure. The cause of this assertion
failure was that XLogFindNextRecord(), which pg_waldump -s calls,
mistakenly handled such a special WAL page.

This commit changes XLogFindNextRecord() so that it can handle
such WAL pages correctly.

Back-patch to all supported versions.

Author: Andrey Lepikhov
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/99303554-5dd5-06e6-f943-b3005ccd6edd@postgrespro.ru
2019-11-07 16:31:36 +09:00
Thomas Munro 7815e7efdb Add reusable routine for making arrays unique.
Introduce qunique() and qunique_arg(), which can be used after qsort()
and qsort_arg() respectively to remove duplicate values.  Use it where
appropriate.

Author: Thomas Munro
Reviewed-by: Tom Lane (in an earlier version)
Discussion: https://postgr.es/m/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com
2019-11-07 17:00:48 +13:00
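
Typical intended use, sketched (array and comparator names assumed):

    /* Sort first, then squeeze out adjacent duplicates in place. */
    qsort(oids, n, sizeof(Oid), oid_cmp);
    n = qunique(oids, n, sizeof(Oid), oid_cmp);     /* returns the new count */
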
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
of conflicts is the list of objects in makefiles.  Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempts to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Michael Paquier 3534fa2233 Refactor code building relation options
Historically, the code to build relation options has been shaped the
same way in multiple code paths: a set of input datums is parsed against
a static table, which is then filled with the option values.  This
introduces a new common routine in reloptions.c to
do most of the legwork for the in-core code paths.

Author: Amit Langote
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CA+HiwqGsoSn_uTPPYT19WrtR7oYpYtv4CdS0xuedTKiHHWuk_g@mail.gmail.com
2019-11-05 09:17:05 +09:00
Tom Lane ec28808ba8 Fix ginEntryInsert's counting of GIN leaf tuples.
As the code stands, nEntries counts the number of ginEntryInsert()
calls, so that's what you end up with at the end of a GIN index build.
However, ginvacuumcleanup() recomputes nEntries as the number of
surviving leaf tuples, and that's generally consistent with the way that
gincostestimate() uses the value.  So let's clearly define nEntries
as the number of leaf tuples, and therefore adjust ginEntryInsert() to
increment it only when we make a new one, not when we add TIDs into an
existing tuple or posting tree.

In practice this inconsistency probably has little impact, so I don't
feel a need to back-patch.

Insung Moon and Keisuke Kuroda

Discussion: https://postgr.es/m/CAEMmqBuH_O-oXL+3_ArQ6F5cJ7kXVow2SGQB3HRacku_T+xkmA@mail.gmail.com
2019-11-04 14:16:42 -05:00
Michael Paquier 6ca86bb7e9 Fix typos in the code
Author: Vignesh C
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/CALDaNm0ni+GAOe4+fbXiOxNrVudajMYmhJFtXGX-zBPoN8ixhw@mail.gmail.com
2019-10-30 10:03:00 +09:00
Michael Paquier 51970fa8df Fix initialization of fake LSN for unlogged relations
Commit 9155580 changed the value of the first fake LSN for unlogged
relations from 1 to FirstNormalUnloggedLSN (aka 1000), as GiST requires a
non-zero LSN on some pages for its interlocking logic to work, but the
value was still initialized to 1 at the beginning of recovery or
after running pg_resetwal.  This fixes the initialization for both code
paths.

Author: Takayuki Tsunakawa
Reviewed-by: Dilip Kumar, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/OSBPR01MB2503CE851940C17DE44AE3D9FE6F0@OSBPR01MB2503.jpnprd01.prod.outlook.com
Backpatch-through: 12
2019-10-27 13:54:12 +09:00
Fujii Masao 3b0c59ac1c Fix typo in xlog.c.
Author: Fujii Masao
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAHGQGwH7dtYvOZZ8c0AG5AJwH5pfiRdKaCptY1_RdHy0HYeRfQ@mail.gmail.com
2019-10-24 14:13:36 +09:00
Peter Eisentraut f86f46d091 Fix comment
The last argument of smgrextend() was renamed from isTemp to skipFsync
in debcec7dc3, but the comments at two
call sites were not updated.
2019-10-22 09:58:20 +02:00
Amit Kapila 70a6c37d52 Fix memory leak introduced in commit 7df159a620.
For GiST indexes, we memorize all internal and empty leaf pages during the
first vacuum stage.  They are used in the second stage, to delete all the
empty pages.  There was a memory context, page_set_context, created for
this purpose, but we never used it.

Reported-by: Amit Kapila
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 12, where it got introduced
Discussion: https://postgr.es/m/CAA4eK1LGr+MN0xHZpJ2dfS8QNQ1a_aROKowZB+MPNep8FVtwAA@mail.gmail.com
2019-10-21 08:57:32 +05:30
Fujii Masao ec1259e880 Fix failure of archive recovery with recovery_min_apply_delay enabled.
The recovery_min_apply_delay parameter is intended for use with streaming
replication deployments. However, the documentation clearly explains that
the parameter will be honored in all cases if it's specified. So it should
take effect even in archive recovery. But, previously, archive recovery
with recovery_min_apply_delay enabled always failed, and caused an
assertion failure if --enable-cassert was used.

The cause of this problem is that the ownership of recoveryWakeupLatch,
which recovery_min_apply_delay uses, was taken only when standby mode
was requested. So an unowned latch could be used in archive recovery,
which caused the failure.

This commit changes the recovery code so that the ownership of
recoveryWakeupLatch is taken even in archive recovery, which prevents
archive recovery with recovery_min_apply_delay from failing.

Back-patch to v9.4 where recovery_min_apply_delay was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
2019-10-18 22:32:18 +09:00
Fujii Masao 9b95a36be8 Make crash recovery ignore recovery_min_apply_delay setting.
In v11 or before, this setting could not take effect in crash recovery
because it's specified in recovery.conf and crash recovery always
starts without recovery.conf. But commit 2dedf4d9a8 integrated
recovery.conf into postgresql.conf, which unexpectedly allowed
this setting to take effect even in crash recovery. This is definitely
not good behavior.

To fix the issue, this commit makes crash recovery always ignore
recovery_min_apply_delay setting.

Back-patch to v12 where the issue was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEyD6HdZLfdWc+95g=VQFPR4zQL4n+yHxQgGEGjaSVheQ@mail.gmail.com
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-10-18 22:24:18 +09:00
Fujii Masao 20961ceaf0 Make crash recovery ignore restore_command and recovery_end_command settings.
In v11 or before, those settings could not take effect in crash recovery
because they are specified in recovery.conf and crash recovery always
starts without recovery.conf. But commit 2dedf4d9a8 integrated
recovery.conf into postgresql.conf, which unexpectedly allowed
those settings to take effect even in crash recovery. This is definitely
not good behavior.

To fix the issue, this commit makes crash recovery always ignore
restore_command and recovery_end_command settings.

Back-patch to v12 where the issue was added.

Author: Fujii Masao
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-10-11 15:47:59 +09:00
Michael Paquier b8e19b932a Flush logical mapping files with fd opened for read/write at checkpoint
The file descriptor was opened with read-only to fsync a regular file,
which would cause EBADFD errors on some platforms.

This is similar to the recent fix done by a586cc4b (which was broken by
me with 82a5649), except that I noticed this issue while monitoring the
backend code for similar mistakes.  Backpatch to 9.4, as this has
existed since logical decoding was introduced in b89e151.
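
The shape of the fix, sketched (error handling abbreviated):

    fd = OpenTransientFile(path, O_RDWR | PG_BINARY);   /* was: O_RDONLY */
    if (pg_fsync(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));
    CloseTransientFile(fd);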

Author: Michael Paquier
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20191006045548.GA14532@paquier.xyz
Backpatch-through: 9.4
2019-10-09 13:30:43 +09:00
Robert Haas 2e8b6bfa90 Rename some toasting functions based on whether they are heap-specific.
The old names for the attribute-detoasting functions included
the word "heap," which seems outdated now that the heap is only one of
potentially many table access methods.

On the other hand, toast_insert_or_update and toast_delete are
heap-specific, so rename them by adding "heap_" as a prefix.

Not all of the work of making the TOAST system fully accessible to AMs
other than the heap is done yet, but there seems to be little harm in
getting this renaming out of the way now. Commit
8b94dab066 already divided up the
functions among various files partially according to whether it was
intended that they should be heap-specific or AM-agnostic, so this is
just clarifying the division contemplated by that commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-10-04 14:24:46 -04:00
Robert Haas 967e276e9f Remove AtSubStart_Notify.
Allocate notify-related state lazily instead. This makes trivial
subtransactions noticeably faster.

Patch by me, reviewed and tested by Dilip Kumar, Kyotaro Horiguchi,
and Jeevan Ladhe.

Discussion: https://postgr.es/m/CA+TgmobE1J22S1eC-6N-je9LgrcwZypkwp+zH6JXo9mc=4Nk3A@mail.gmail.com
2019-10-04 08:19:25 -04:00
Michael Paquier df86e52cac Remove temporary WAL and history files at the end of archive recovery
cbc55da has reworked the order of some actions at the end of archive
recovery.  Unfortunately this overlooked the fact that the startup
process needs to remove RECOVERYXLOG (the temporary WAL segment newly
recovered from the archives) and RECOVERYHISTORY (the temporary history
file) at this step, leaving the files around even after recovery ended.

Backpatch to 9.5, like the previous commit.

Author: Sawada Masahiko
Reviewed-by: Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBO_eDQub6zojFnWtnmutRBWvYf7=cW4Hsqj+U_R26w3Q@mail.gmail.com
Backpatch-through: 9.5
2019-10-02 15:53:07 +09:00
Tomas Vondra 11a078cf87 Optimize partial TOAST decompression
Commit 4d0e994eed added support for partial TOAST decompression, so the
decompression is interrupted after producing the requested prefix. For
prefixes and slices near the beginning of the entry, this may save a lot
of decompression work.

That however only deals with decompression - the whole compressed entry
was still fetched and re-assembled, even though the decompression used
only a small fraction of it.  This commit improves that by computing how
much compressed data may be needed to decompress the requested prefix,
and then fetching only the necessary part.

We always need to fetch a bit more compressed data than the requested
(uncompressed) prefix, because the prefix may not be compressible at all
and pglz itself adds a bit of overhead. That means this optimization is
most effective when the requested prefix is much smaller than the whole
compressed entry.
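
A sketch of the bound computation (the committed helper is
pglz_maximum_compressed_size(); the exact arithmetic there may differ
slightly):

    static int32
    max_compressed_size(int32 rawsize, int32 total_compressed_size)
    {
        /* worst case: one pglz control bit per output byte, ~9/8 of rawsize */
        int32       bound = (int32) (((int64) rawsize * 9 + 8) / 8);

        /* never fetch more than the compressed datum actually holds */
        return Min(bound, total_compressed_size);
    }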

Author: Binguo Bao
Reviewed-by: Andrey Borodin, Tomas Vondra, Paul Ramsey
Discussion: https://www.postgresql.org/message-id/flat/CAL-OGkthU9Gs7TZchf5OWaL-Gsi=hXqufTxKv9qpNG73d5na_g@mail.gmail.com
2019-10-01 14:28:28 +02:00
Fujii Masao 7acf8a876b Make crash recovery ignore recovery target settings.
In v11 or before, recovery target settings could not take effect in
crash recovery because they are specified in recovery.conf and
crash recovery always starts without recovery.conf. But commit
2dedf4d9a8 integrated recovery.conf into postgresql.conf, which
unexpectedly allowed recovery target settings to take effect
even in crash recovery. This is definitely not good behavior.

To fix the issue, this commit makes crash recovery always ignore
recovery target settings.

Back-patch to v12.

Author: Peter Eisentraut
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/e445616d-023e-a268-8aa1-67b8b335340c@pgmasters.net
2019-09-30 10:18:15 +09:00
Michael Paquier fbfa566488 Fix lockmode initialization for custom relation options
The code was enforcing AccessExclusiveLock for all custom relation
options, which is incorrect as the APIs allow a custom lock level to be
set.

While on it, fix a couple of inconsistencies in the tests and the README
of dummy_index_am.

Oversights in commit 773df88.

Discussion: https://postgr.es/m/20190925234152.GA2115@paquier.xyz
2019-09-27 09:31:20 +09:00
Michael Paquier 6e22813b2d Fix comment in xlogreader.c
This has been introduced by 709d003, that has moved readSegNo, readOff
and readPageTLI into a new structure called WALOpenSegment initialized
separately.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20190926.110809.248342687.horikyota.ntt@gmail.com
2019-09-26 11:53:37 +09:00
Alvaro Herrera 773df883e8 Support reloptions of enum type
All our current in core relation options of type string (not many,
admittedly) behave in reality like enums.  But after seeing an
implementation for enum reloptions, it's clear that strings are messier,
so introduce the new reloption type.  Switch all string options to be
enums instead.

Fortunately we have a recently introduced test module for reloptions, so
we don't lose coverage of string reloptions, which may still be used by
third-party modules.

Authors: Nikolay Shaplov, Álvaro Herrera
Reviewed-by: Nikita Glukhov, Aleksandr Parfenov
Discussion: https://postgr.es/m/43332102.S2V5pIjXRx@x200m
2019-09-25 15:56:52 -03:00
Michael Paquier 69f9410807 Allow definition of lock mode for custom reloptions
Relation options can define a lock mode other than AccessExclusiveLock
since 47167b7, but modules defining custom relation options did not
really have a way to enforce that.  Correct that by extending the
current API set so that modules can define a custom lock mode.

Author: Michael Paquier
Reviewed-by: Kuntal Ghosh
Discussion: https://postgr.es/m/20190920013831.GD1844@paquier.xyz
2019-09-25 10:13:52 +09:00
Michael Paquier 736b84eede Fix failure with lock mode used for custom relation options
In-core relation options can use a custom lock mode since 47167b7, which
lowered the lock level available for some autovacuum parameters.  However
it forgot to consider custom relation options.  This causes failures
with ALTER TABLE SET when changing a custom relation option, as its lock
is not defined.  The existing APIs to define a custom reloption do not
allow a custom lock mode to be defined, so enforce its initialization to
AccessExclusiveLock, which should be safe enough in all cases.  An
upcoming patch will extend the existing APIs to allow a custom lock mode
to be defined.

The problem can be reproduced with bloom indexes, so add a test there.

Reported-by: Nikolay Shaplov
Analyzed-by: Thomas Munro, Michael Paquier
Author: Michael Paquier
Reviewed-by: Kuntal Ghosh
Discussion: https://postgr.es/m/20190920013831.GD1844@paquier.xyz
Backpatch-through: 9.6
2019-09-25 10:07:23 +09:00
Alexander Korotkov 90c0987258 Fix bug in pairingheap_SpGistSearchItem_cmp()
Our item contains only so->numberOfNonNullOrderBys distances.  Reflect
that in the loop's upper bound.
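
In sketch form, the comparator's loop now runs over the stored distances
only (body elided):

    for (i = 0; i < so->numberOfNonNullOrderBys; i++)
    {
        /* ... compare a->distances[i] against b->distances[i] ... */
    }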

Discussion: https://postgr.es/m/53536807-784c-e029-6e92-6da802ab8d60%40postgrespro.ru
Author: Nikita Glukhov
Backpatch-through: 12
2019-09-25 01:47:36 +03:00
Alvaro Herrera 709d003fbd Rework WAL-reading supporting structs
The state-tracking of WAL reading in various places was pretty messy,
mostly because the ancient physical-replication WAL reading code wasn't
using the XLogReader abstraction.  This led to some untidy code.  Make
it prettier by creating two additional supporting structs,
WALSegmentContext and WALOpenSegment which keep track of WAL-reading
state.  This makes code cleaner, as well as supports more future
cleanup.

Author: Antonin Houska
Reviewed-by: Álvaro Herrera and (older versions) Robert Haas
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-09-24 16:39:53 -03:00
Fujii Masao 6d05086c0a Speedup truncations of relation forks.
When a relation is truncated, shared_buffers needs to be scanned
so that any buffers for the relation forks are invalidated in it.
Previously, shared_buffers was scanned for each relation fork, i.e.,
MAIN, FSM and VM, when VACUUM truncated off any empty pages
at the end of the relation or TRUNCATE truncated the relation in place.
Since shared_buffers needed to be scanned multiple times,
it could take a long time to finish those commands, especially
when shared_buffers was large.

This commit changes the logic so that shared_buffers is scanned only
one time for those three relation forks.
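
A simplified sketch of the single-pass scan (locking, pinning and error
paths omitted; roughly the shape of the buffer invalidation loop after
this change):

    int         i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        int         j;

        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
            continue;

        /* one pass over shared_buffers covers all forks being truncated */
        for (j = 0; j < nforks; j++)
        {
            if (bufHdr->tag.forkNum == forkNum[j] &&
                bufHdr->tag.blockNum >= firstDelBlock[j])
            {
                InvalidateBuffer(bufHdr);   /* locking and pin checks omitted */
                break;
            }
        }
    }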

Author: Kirk Jamison
Reviewed-by: Masahiko Sawada, Thomas Munro, Alvaro Herrera, Takayuki Tsunakawa and Fujii Masao
Discussion: https://postgr.es/m/D09B13F772D2274BB348A310EE3027C64E2067@g01jpexmbkw24
2019-09-24 17:31:26 +09:00
Peter Eisentraut 887248e97e Message style fixes 2019-09-23 13:38:39 +02:00
Alexander Korotkov 8c8a267201 Fix freeing old values in index_store_float8_orderby_distances()
6cae9d2c10 introduced an error in freeing old values in the
index_store_float8_orderby_distances() function: it looks for the old
value in scan->xs_orderbynulls[i] after setting the new value there.
This commit fixes that.  It also removes the short-circuit handling of
the distances == NULL situation.  Now distances == NULL is treated the
same way as an array with all null distances, that is, previous values
are freed if any.
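
A sketch of the corrected ordering for the pass-by-reference float8 case
(simplified from the function body):

    /* consult the old null flag before overwriting it */
    if (!scan->xs_orderbynulls[i])
        pfree(DatumGetPointer(scan->xs_orderbyvals[i]));

    if (distances[i].isnull)
    {
        scan->xs_orderbyvals[i] = (Datum) 0;
        scan->xs_orderbynulls[i] = true;
    }
    else
    {
        scan->xs_orderbyvals[i] = Float8GetDatum(distances[i].value);
        scan->xs_orderbynulls[i] = false;
    }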

Reported-by: Tom Lane, Nikita Glukhov
Discussion: https://postgr.es/m/CAPpHfdu2wcoAVAm3Ek66rP%3Duo_C-D84%2B%2Buf1VEcbyi_caBXWCA%40mail.gmail.com
Discussion: https://postgr.es/m/426580d3-a668-b9d1-7b8e-f74d1a6524e0%40postgrespro.ru
Backpatch-through: 12
2019-09-20 01:19:08 +03:00
Alexander Korotkov 6cae9d2c10 Improve handling of NULLs in KNN-GiST and KNN-SP-GiST
This commit improves the subject in two ways:

 * It removes the ugliness of 02f90879e7, which stored distance values and null
   flags in two separate arrays after the GISTSearchItem struct.  Instead we pack
   both the distance value and the null flag into the IndexOrderByDistance
   struct (see the sketch below).  The alignment overhead should be negligible,
   because we typically deal with at most a few "col op const" expressions in an
   ORDER BY clause.
 * It fixes the handling of "col op NULL" expressions in KNN-SP-GiST.  Now, these
   expressions are not passed to support functions, which can't deal with them.
   Instead, a NULL result is implicitly assumed.  In the future we may decide to
   teach support functions to deal with NULL arguments, but the current solution
   is a bugfix suitable for backpatching.
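
The packed representation looks roughly like this (as added to
access/genam.h):

    typedef struct IndexOrderByDistance
    {
        double      value;
        bool        isnull;
    } IndexOrderByDistance;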

Reported-by: Nikita Glukhov
Discussion: https://postgr.es/m/826f57ee-afc7-8977-c44c-6111d18b02ec%40postgrespro.ru
Author: Nikita Glukhov
Reviewed-by: Alexander Korotkov
Backpatch-through: 9.4
2019-09-19 21:48:39 +03:00
Peter Eisentraut 48770492c3 Add some const decorations to array constants
Author: Mark G <markg735@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEeOP_YFVeFjq4zDZLDQbLSRFxBiTpwBQHxCNgGd%2Bp5VztTXyQ%40mail.gmail.com
2019-09-17 22:03:00 +02:00
Peter Geoghegan 3b6b54f178 Fix nbtree page split rmgr desc routine.
Include newitemoff in rmgr desc output for nbtree page split records.
In passing, correct an obsolete comment that claimed that newitemoff is
only logged for _L variant nbtree page split WAL records.

Both issues were oversights in commit 2c03216d83, which revamped the
WAL format.

Author: Peter Geoghegan
Backpatch: 9.5-, where the WAL format was revamped.
2019-09-12 15:45:08 -07:00
Peter Geoghegan 1b9becd43c Remove redundant _bt_truncate() comment paragraph. 2019-09-12 09:51:27 -07:00
Peter Geoghegan 55d015bde0 Add _bt_binsrch() scantid assertion to nbtree.
Assert that _bt_binsrch() binary searches with scantid set in insertion
scankey cannot be performed on leaf pages.  Leaf-level binary searches
where scantid is set must use _bt_binsrch_insert() instead.

_bt_binsrch_insert() is likely to have additional responsibilities in
the future, such as searching within GIN-style posting lists using
scantid.  It seems like a good idea to tighten things up now.
2019-09-09 11:41:19 -07:00
Alexander Korotkov 02f90879e7 Fix handling of NULL distances in KNN-GiST
In order to implement NULLS LAST semantics, GiST previously assumed the
distance to a NULL value to be Inf.  However, our distance functions can
return Inf and NaN for non-null values.  In such cases, the NULLS LAST
semantics appear to be broken.  This commit fixes that by introducing a
separate array of null flags for distances.

Backpatch to all supported versions.

Discussion: https://postgr.es/m/CAPpHfdsNvNdA0DBS%2BwMpFrgwT6C3-q50sFVGLSiuWnV3FqOJuQ%40mail.gmail.com
Author: Alexander Korotkov
Backpatch-through: 9.4
2019-09-08 22:08:12 +03:00
Alexander Korotkov e5d8f35961 Fix handling Inf and Nan values in GiST pairing heap comparator
Previously, plain float comparison was used in the GiST pairing heap.  Such
comparison doesn't provide a proper ordering for value sets containing Inf
and NaN values.  This commit fixes that by using float8_cmp_internal().
Note that there is a remaining problem with NULL distances, which are
represented as Inf in the pairing heap.  It will be fixed in a subsequent
commit.
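
The shape of the change in the pairing heap comparator, sketched (sa/sb
are the two search items; the negation orders smaller distances first):

    int         cmp;

    /* NaN/Inf-safe total ordering instead of a plain '<' comparison */
    cmp = -float8_cmp_internal(sa->distances[i], sb->distances[i]);
    if (cmp != 0)
        return cmp;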

Backpatch to all supported versions.

Reported-by: Andrey Borodin
Discussion: https://postgr.es/m/CAPpHfdsNvNdA0DBS%2BwMpFrgwT6C3-q50sFVGLSiuWnV3FqOJuQ%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Heikki Linnakangas
Backpatch-through: 9.4
2019-09-08 22:08:12 +03:00
Peter Eisentraut 862ef372d6 Fix behavior of AND CHAIN outside of explicit transaction blocks
When using COMMIT AND CHAIN or ROLLBACK AND CHAIN not in an explicit
transaction block, the previous implementation would leave a
transaction block active in the ROLLBACK case but not the COMMIT case.
To fix for now, error out when using these commands not in an explicit
transaction block.  This restriction could be lifted if a sensible
definition and implementation is found.

Bug: #15977
Author: fn ln <emuser20140816@gmail.com>
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
2019-09-08 16:23:03 +02:00
Robert Haas bd124996ef Create an API for inserting and deleting rows in TOAST tables.
This moves much of the non-heap-specific logic from toast_delete and
toast_insert_or_update into helper functions accessible via a new
header, toast_helper.h.  Using the functions in this module, a table
AM can implement creation and deletion of TOAST table rows with
much less code duplication than was possible heretofore.  Some
table AMs won't want to use the TOAST logic at all, but for those
that do this will make that easier.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-06 10:38:51 -04:00
Fujii Masao 946647f845 Make pg_promote() detect postmaster death while waiting for promotion to end.
Previously, even if the postmaster died and WaitLatch() woke up with
that event while pg_promote() was waiting for the standby promotion to
finish, pg_promote() did nothing special and kept waiting until the
timeout occurred.  This could cause a busy loop.

This patch makes pg_promote() return false immediately when the
postmaster dies, to avoid such a busy loop.
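
The shape of the fix in pg_promote(), sketched (the wait-event constant
is assumed):

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   1000L,
                   WAIT_EVENT_PROMOTE);

    /* new: give up immediately instead of spinning until timeout */
    if (rc & WL_POSTMASTER_DEATH)
        PG_RETURN_BOOL(false);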

Back-patch to v12 where pg_promote() was added.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAHGQGwEs9ROgSp+QF+YdDU+xP8W=CY1k-_Ov-d_Z3JY+to3eXA@mail.gmail.com
2019-09-06 14:27:25 +09:00
Robert Haas 8b94dab066 Split tuptoaster.c into three separate files.
detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.

toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.

heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.

detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap.  At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-05 13:15:10 -04:00
Alvaro Herrera 25dcc9d35d Make XLogReaderInvalReadState static
This function is only used by xlogreader.c itself, so there's no need to
export it.  It was introduced by commit 3b02ea4f07 with the apparent
intention that it could be used externally, but I couldn't find any
external code calling it.

I (Álvaro) couldn't resist the urge to sort nearby function prototypes
properly while at it.

Author: Antonin Houska
Discussion: https://postgr.es/m/14984.1554998742@spoje.net
2019-09-03 17:41:43 -04:00
Alvaro Herrera fe66125974 Remove 'msg' parameter from convert_tuples_by_name
The message was included as a parameter when this function was added in
dcb2bda9b7, but I don't think it has ever served any useful purpose.
Let's stop spreading it pointlessly.

Reviewed by Amit Langote and Peter Eisentraut.

Discussion: https://postgr.es/m/20190806224728.GA17233@alvherre.pgsql
2019-09-03 14:47:29 -04:00
Peter Eisentraut 396e4afdbc Better error messages for short reads/writes in SLRU
This avoids getting a

    Could not read from file ...: Success.

for a short read or write (since errno is not set in that case).
Instead, report more specific error messages.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/5de61b6b-8be9-7771-0048-860328efe027%402ndquadrant.com
2019-09-03 08:30:21 +02:00
Tom Lane f63a5ead9d Avoid touching replica identity index in ExtractReplicaIdentity().
In what seems like a fit of misplaced optimization,
ExtractReplicaIdentity() accessed the relation's replica-identity
index without taking any lock on it.  Usually, the surrounding query
already holds some lock so this is safe enough ... but in the case
of a previously-planned delete, there might be no existing lock.
Given a suitable test case, this is exposed in v12 and HEAD by an
assertion added by commit b04aeb0a0.

The whole thing's rather poorly thought out anyway; rather than
looking directly at the index, we should use the index-attributes
bitmap that's held by the parent table's relcache entry, as the
caller functions do.  This is more consistent and likely a bit
faster, since it avoids a cache lookup.  Hence, change to doing it
that way.

While at it, rather than blithely assuming that the identity
columns are non-null (with catastrophic results if that's wrong),
add assertion checks that they aren't null.  Possibly those should
be actual test-and-elog, but I'll leave it like this for now.

In principle, this is a bug that's been there since this code was
introduced (in 9.4).  In practice, the risk seems quite low, since
we do have a lock on the index's parent table, so concurrent
changes to the index's catalog entries seem unlikely.  Given the
precedent that commit 9c703c169 wasn't back-patched, I won't risk
back-patching this further than v12.

Per report from Hadi Moshayedi.

Discussion: https://postgr.es/m/CAK=1=Wrek44Ese1V7LjKiQS-Nd-5LgLi_5_CskGbpggKEf3tKQ@mail.gmail.com
2019-09-02 16:10:37 -04:00
Heikki Linnakangas bde7493d10 Fix overflow check and comment in GIN posting list encoding.
The comment did not match what the code actually did for integers with
the 43rd bit set. You get an integer like that, if you have a posting
list with two adjacent TIDs that are more than 2^31 blocks apart.
According to the comment, we would store that in 6 bytes, with no
continuation bit on the 6th byte, but in reality, the code encodes it
using 7 bytes, with a continuation bit on the 6th byte as normal.

The decoding routine also handled these 7-byte integers correctly, except
for an overflow check that assumed that one integer needs at most 6 bytes.
Fix the overflow check, and fix the comment to match what the code
actually does. Also fix the comment that claimed that there are 17 unused
bits in the 64-bit representation of an item pointer. In reality, there
are 64-32-11=21.

Fitting any item pointer into max 6 bytes was an important property when
this was written, because in the old pre-9.4 format, item pointers were
stored as plain arrays, with 6 bytes for every item pointer. The maximum
of 6 bytes per integer in the new format guaranteed that we could convert
any page from the old format to the new format after upgrade, so that the
new format was never larger than the old format. But we hardly need to
worry about that anymore, and running into that problem during upgrade,
where an item pointer is expanded from 6 to 7 bytes such that the data
doesn't fit on a page anymore, is implausible in practice anyway.

Backpatch to all supported versions.

This also includes a little test module to test these large distances
between item pointers, without requiring a 16 TB table. It is not
backpatched, I'm including it more for the benefit of future development
of new posting list formats.

Discussion: https://www.postgresql.org/message-id/33bfc20a-5c86-f50c-f5a5-58e9925d05ff%40iki.fi
Reviewed-by: Masahiko Sawada, Alexander Korotkov
2019-08-28 12:55:33 +03:00
Peter Geoghegan b8b3a276d4 Remove obsolete nbtree page deletion comment.
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed the concept of a half-dead internal page.  Remove a
comment about half dead parent pages that was overlooked.
2019-08-27 14:01:43 -07:00
Peter Geoghegan 867d25ccb4 Explain subtlety in nbtree locking protocol.
The Postgres approach to coupling locks during an ascent of the tree is
slightly different to the approach taken by Lehman and Yao.  Add a new
paragraph to the "Differences to the Lehman & Yao algorithm" section of
the nbtree README that explains the similarities and differences.
2019-08-23 20:24:49 -07:00
Peter Geoghegan 091bd6befc Update comments on nbtree stack struct.
Adjust the struct comment that describes how page splits use their
descent stack to cascade up the tree from the leaf level.

In passing, fix up some unrelated nbtree comments that had typos or were
obsolete.
2019-08-21 13:50:27 -07:00
Alvaro Herrera f8cf524da1 Fix bogus comment
Author: Alexander Lakhin
Discussion: https://postgr.es/m/20190819072244.GE18166@paquier.xyz
2019-08-20 16:04:09 -04:00
Michael Paquier c96581abe4 Fix inconsistencies and typos in the tree, take 11
This fixes various typos in docs and comments, and removes some orphaned
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/5da8e325-c665-da95-21e0-c8a99ea61fbf@gmail.com
2019-08-19 16:21:39 +09:00
Andres Freund fb3b098fe8 Remove fmgr.h includes from headers that don't really need it.
Most of the fmgr.h includes were obsoleted by 352a24a1f9.  A
few others can be obsoleted by using the underlying struct type in an
implementation detail.

Author: Andres Freund
Discussion: https://postgr.es/m/20190803193733.g3l3x3o42uv4qj7l@alap3.anarazel.de
2019-08-16 10:35:31 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Geoghegan 68ef887842 Remove obsolete nbtree README commentary.
Commit d2086b08b0 removed almost all cases where nbtree must release a
read buffer lock and acquire a write buffer lock instead, so remaining
cases in which that's still necessary are not notable enough to appear
in the nbtree README.

More importantly, holding on to a buffer pin in cases where nbtree must
trade a read lock for a write lock is very unlikely to save any I/O.
This seems to have been a long overlooked throwback to a time when
nbtree cared about write-ordering dependencies, and performed
synchronous buffer writes.  It hasn't worked that way in many years.
2019-08-13 17:16:44 -07:00
Peter Geoghegan af0ba49809 Use PageIndexTupleOverwrite() within nbtree.
Use the PageIndexTupleOverwrite() bufpage.c routine within nbtree
instead of deleting a tuple and re-inserting its replacement.  This
makes the intent of affected code slightly clearer.  It also makes
CREATE INDEX slightly faster, since there is no longer a need to shift
every leaf page's line pointer array back and forth during index builds.

Author: Peter Geoghegan, Anastasia Lubennikova
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wz=Zk=B9+Vwm376WuO7YTjFc2SSskifQm4Nme3RRRPtOSQ@mail.gmail.com
2019-08-13 11:54:26 -07:00
Michael Paquier 66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Heikki Linnakangas 1169fcf129 Fix predicate-locking of HOT updated rows.
In serializable mode, heap_hot_search_buffer() incorrectly acquired a
predicate lock on the root tuple, not the returned tuple that satisfied
the visibility checks. As explained in README-SSI, the predicate lock does
not need to be copied or extended to other tuple versions, but for that to
work, the correct, visible, tuple version must be locked in the first
place.

The original SSI commit had this bug in it, but it was fixed back in 2013,
in commit 81fbbfe335. But unfortunately, it was reintroduced a few months
later in commit b89e151054. Wising up from that, add a regression test
to cover this, so that it doesn't get reintroduced again. Also, move the
code that sets 't_self', so that it happens at the same time that the
other HeapTuple fields are set, to make it more clear that all the code in
the loop operate on the "current" tuple in the chain, not the root tuple.

Bug spotted by Andres Freund, analysis and original fix by Thomas Munro,
test case and some additional changes to the fix by Heikki Linnakangas.
Backpatch to all supported versions (9.4).

Discussion: https://www.postgresql.org/message-id/20190731210630.nqhszuktygwftjty%40alap3.anarazel.de
2019-08-07 12:40:49 +03:00
Michael Paquier 8548ddc61b Fix inconsistencies and typos in the tree, take 9
This addresses more issues with code comments, variable names and
unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com
2019-08-05 12:14:58 +09:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Andres Freund 6384e87be2 Remove superfluous semicolon.
Author: Andres Freund
2019-07-30 22:09:53 -07:00
Michael Paquier eb43f3d193 Fix inconsistencies and typos in the tree
This is numbered take 8, and addresses again a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/b137b5eb-9c95-9c2f-586e-38aba7d59788@gmail.com
2019-07-29 12:28:30 +09:00
Heikki Linnakangas 6655a7299d Use full 64-bit XID for checking if a deleted GiST page is old enough.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. B-tree has the same problem, and has had since time immemorial,
but let's at least fix this in GiST, where this is new.

Backpatch to v12, where GiST page deletion was introduced.

Reviewed-by: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/835A15A5-F1B4-4446-A711-BF48357EB602%40yandex-team.ru
2019-07-24 20:24:07 +03:00
Heikki Linnakangas 9eb5607e69 Refactor checks for deleted GiST pages.
The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.

Backpatch to v12, where GiST page deletion was introduced.

Reviewed-by: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/835A15A5-F1B4-4446-A711-BF48357EB602%40yandex-team.ru
2019-07-24 20:24:05 +03:00
Michael Paquier 23bccc823d Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com
2019-07-22 10:01:50 +09:00
Peter Geoghegan d004147eb3 Fix nbtree metapage cache upgrade bug.
Commit 857f9c36cd, which taught nbtree VACUUM to avoid unnecessary
index scans, bumped the nbtree version number from 2 to 3, while adding
the ability for nbtree indexes to be upgraded on-the-fly.  Various
assertions that assumed that an nbtree index was always on version 2 had
to be changed to accept any supported version (version 2 or 3 on
Postgres 11).

However, a few assertions were missed in the initial commit, all of
which were in code paths that cache a local copy of the metapage
metadata, where the index had been expected to be on the current version
(no longer version 2) as a generic sanity check.  Rather than simply
update the assertions, follow-up commit 0a64b45152 intentionally made
the metapage caching code update the per-backend cached metadata version
without changing the on-disk version at the same time.  This could even
happen when the planner needed to determine the height of a B-Tree for
costing purposes.  The assertions only fail on Postgres v12 when
upgrading from v10, because they were adjusted to use the authoritative
shared memory metapage by v12's commit dd299df8.

To fix, remove the cache-only upgrade mechanism entirely, and update the
assertions themselves to accept any supported version (go back to using
the cached version in v12).  The fix is almost a full revert of commit
0a64b45152 on the v11 branch.

VACUUM only considers the authoritative metapage, and never bothers with
a locally cached version, whereas everywhere else isn't interested in
the metapage fields that were added by commit 857f9c36cd.  It seems
unlikely that this bug has affected any user on v11.

Reported-By: Christoph Berg
Bug: #15896
Discussion: https://postgr.es/m/15896-5b25e260fdb0b081%40postgresql.org
Backpatch: 11-, where VACUUM was taught to avoid unnecessary index scans.
2019-07-18 13:22:56 -07:00
Tom Lane d97b714a21 Avoid using lcons and list_delete_first where it's easy to do so.
Formerly, lcons was about the same speed as lappend, but with the new
List implementation, that's not so; with a long List, data movement
imposes an O(N) cost on lcons and list_delete_first, but not lappend.

Hence, invent list_delete_last with semantics parallel to
list_delete_first (but O(1) cost), and change various places to use
lappend and list_delete_last where this can be done without much
violence to the code logic.

There are quite a few places that construct result lists using lcons not
lappend.  Some have semantic rationales for that; I added comments about
it to a couple that didn't have them already.  In many such places though,
I think the coding is that way only because back in the dark ages lcons
was faster than lappend.  Hence, switch to lappend where this can be done
without causing semantic changes.

In ExecInitExprRec(), this results in aggregates and window functions that
are in the same plan node being executed in a different order than before.
Generally, the executions of such functions ought to be independent of
each other, so this shouldn't result in visibly different query results.
But if you push it, as one regression test case does, you can show that
the order is different.  The new order seems saner; it's closer to
the order of the functions in the query text.  And we never documented
or promised anything about this, anyway.

Also, in gistfinishsplit(), don't bother building a reverse-order list;
it's easy now to iterate backwards through the original list.

It'd be possible to go further towards removing uses of lcons and
list_delete_first, but it'd require more extensive logic changes,
and I'm not convinced it's worth it.  Most of the remaining uses
deal with queues that probably never get long enough to be worth
sweating over.  (Actually, I doubt that any of the changes in this
patch will have measurable performance effects either.  But better
to have good examples than bad ones in the code base.)

Patch by me, thanks to David Rowley and Daniel Gustafsson for review.

Discussion: https://postgr.es/m/21272.1563318411@sss.pgh.pa.us
2019-07-17 11:15:34 -04:00
Michael Paquier 0896ae561b Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues around:
- Fixes for typos and incorrect reference names.
- Removal of unneeded comments.
- Removal of unreferenced functions and structures.
- Fixes regarding variable name consistency.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com
2019-07-16 13:23:53 +09:00
Peter Geoghegan bfdbac2ab3 Correct nbtsplitloc.c comment.
The logic just added by commit e3899ffd falls back on a 50:50 page split
in the event of a new item that's just to the right of our provisional
"many duplicates" split point.  Fix a comment that incorrectly claimed
that the new item had to be just to the left of our provisional split
point.

Backpatch: 12-, just like commit e3899ffd.
2019-07-15 14:35:06 -07:00
Peter Geoghegan e3899ffd8b Fix pathological nbtree split point choice issue.
Specific ever-decreasing insertion patterns could cause successive
unbalanced nbtree page splits.  Problem cases involve a large group of
duplicates to the left, and ever-decreasing insertions to the right.

To fix, detect the situation by considering the newitem offset before
performing a split using nbtsplitloc.c's "many duplicates" strategy.  If
the new item was inserted just to the right of our provisional "many
duplicates" split point, infer ever-decreasing insertions and fall back
on a 50:50 (space delta optimal) split.  This seems to barely affect
cases that already had acceptable space utilization.

An alternative fix also seems possible.  Instead of changing
nbtsplitloc.c split choice logic, we could instead teach _bt_truncate()
to generate a new value for new high keys by interpolating from the
lastleft and firstright key values.  That would certainly be a more
elegant fix, but it isn't suitable for backpatching.

Discussion: https://postgr.es/m/CAH2-WznCNvhZpxa__GqAa1fgQ9uYdVc=_apArkW2nc-K3O7_NA@mail.gmail.com
Backpatch: 12-, where the nbtree page split enhancements were introduced.
2019-07-15 13:19:13 -07:00
Tom Lane 1cff1b95ab Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link.  We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells.  That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.

In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links.  Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header.  (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)

Of course this is not without downsides.  The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.

For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state.  We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value.  (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.

The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext().  The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell().  However, we no longer
need a previous-cell pointer to delete a list cell efficiently.  Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
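
For example, deletion during traversal now looks like this (should_drop()
is a hypothetical predicate):

    ListCell   *lc;

    foreach(lc, mylist)
    {
        if (should_drop(lfirst(lc)))
            mylist = foreach_delete_current(mylist, lc);
    }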

There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents.  To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen.  This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).

There are two notable API differences from the previous code:

* lnext() now requires the List's header pointer in addition to the
current cell's address.

* list_delete_cell() no longer requires a previous-cell argument.

These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.

Programmers should be aware of these significant performance changes:

* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.

* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.)  Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long.  Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.

* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index.  These have the
data-movement penalty explained above, but there's no search penalty.

* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists.  The second argument is
now declared "const List *" to reflect that it isn't changed.

This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests.  As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.

Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit.  Code using those should have been gone a dozen years ago.

Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.

Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 13:41:58 -04:00
Thomas Munro 67b9b3ca32 Provide XLogRecGetFullXid().
In order to be able to work with FullTransactionId values during replay
without increasing the size of the WAL, infer the epoch.  In general we
can't do that safely, but during replay we can because we know that
nextFullXid can't advance concurrently.
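
A sketch of the inference (safe only in the startup process, where
nextFullXid is stable):

    TransactionId xid = XLogRecGetXid(record);
    TransactionId next_xid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
    uint32      epoch = EpochFromFullTransactionId(ShmemVariableCache->nextFullXid);

    /*
     * No xid in a WAL record can be ahead of nextFullXid, so an apparently
     * larger value must belong to the previous epoch.
     */
    if (xid > next_xid)
        epoch--;

    return FullTransactionIdFromEpochAndXid(epoch, xid);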

Prevent frontend code from seeing this new function, due to the above
restriction.  Perhaps in future it will be possible to extract the value
entirely from independent WAL records, and then this restriction can be
lifted.

Author: Thomas Munro, based on earlier code from Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKG%2BmLmuDjMi6o1dxkKvGRL56Y2Rz%2BiXAcrZV03G9ZuFQ8Q%40mail.gmail.com
2019-07-15 17:04:29 +12:00
Alexander Korotkov c085e1c1cb Add support for <-> (box, point) operator to GiST box_ops
Index-based calculation of this operator is exact.  So, the signature of
the gist_bbox_distance() function is changed so that the caller is
responsible for setting the *recheck flag.

Discussion: https://postgr.es/m/f71ba19d-d989-63b6-f04a-abf02ad9345d%40postgrespro.ru
Author: Nikita Glukhov
Reviewed-by: Tom Lane, Alexander Korotkov
2019-07-14 15:09:15 +03:00
Michael Paquier fa19a08d71 Fix variable initialization when using buffering build with GiST
This can cause valgrind to complain, as the flag marking a buffer as a
temporary copy was not getting initialized.

While on it, fill newly-created buffer pages with zeros.  This does
not matter when loading a block from a temporary file, but it makes the
push of an index tuple into a new buffer page safer.

This has been introduced by 1d27dcf, so backpatch all the way down to
9.4.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/15899-0d24fb273b3dd90c@postgresql.org
Backpatch-through: 9.4
2019-07-10 15:14:54 +09:00
Robert Haas 554106b116 tableam: Provide helper functions for relation sizing.
Most block-based table AMs will need the exact same implementation of
the relation_size callback as the heap, and if they use a standard
page layout, they will likely need an implementation of the
relation_estimate_size callback that is very similar to that of the
heap.  Rearrange to facilitate code reuse.

Patch by me, reviewed by Michael Paquier, Daniel Gustafsson, and
Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZ6DBPnP1E-vRpQZUJQijJFD54F+SR_pxGiAAS-MyrigA@mail.gmail.com
2019-07-08 14:51:53 -04:00
Michael Paquier 6b8548964b Fix inconsistencies in the code
This addresses a couple of issues in the code:
- Typos and inconsistencies in comments and function declarations.
- Removal of unreferenced function declarations.
- Removal of unnecessary compile flags.
- A cleanup error in regressplans.sh.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/0c991fdf-2670-1997-c027-772a420c4604@gmail.com
2019-07-08 13:15:09 +09:00
Peter Eisentraut 7e9a4c5c3d Use consistent style for checking return from system calls
Use

    if (something() != 0)
        error ...

instead of just

    if (something)
        error ...

The latter is not incorrect, but it's a bit confusing and not the
common style.

Discussion: https://www.postgresql.org/message-id/flat/5de61b6b-8be9-7771-0048-860328efe027%402ndquadrant.com
2019-07-07 15:28:49 +02:00
Amit Kapila 78d41f6c9b Add missing assertions for required table am callbacks.
Reported-by: Ashwin Agrawal
Author: Ashwin Agrawal
Reviewed-by: Amit Kapila
Backpatch-through: 12, where it was introduced
Discussion: https://postgr.es/m/CALfoeisgdZhYDrJOukaBzvXfJOK2FQ0szVMK7dzmcy6w93iDUA@mail.gmail.com
2019-07-06 11:41:23 +05:30
Peter Eisentraut 6a1cd8b923 Unwind some workarounds for lack of portable int64 format specifier
Because there is no portable int64/uint64 format specifier and we
can't stick macros like INT64_FORMAT into the middle of a translatable
string, we have been using various workarounds that put the number to
be printed into a string buffer first.  Now that we always use our own
sprintf(), we can rely on %lld and %llu to work, so we can use those.

This patch undoes this workaround in a few places where it was
egregiously verbose.
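
The pattern, before and after ("bytes" and the message text are
hypothetical):

    char        buf[32];

    /* before: stage the value through a buffer to dodge INT64_FORMAT */
    snprintf(buf, sizeof(buf), INT64_FORMAT, bytes);
    ereport(ERROR, (errmsg("%s bytes required", buf)));

    /* after: cast to long long and use %lld in the translatable string */
    ereport(ERROR, (errmsg("%lld bytes required", (long long) bytes)));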

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/CAH2-Wz%3DWbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A%40mail.gmail.com
2019-07-04 17:01:43 +02:00
David Rowley 8abc13a889 Use appendStringInfoString and appendPQExpBufferStr where possible
This changes various places where appendPQExpBuffer was used in places
where it was possible to use appendPQExpBufferStr, and likewise for
appendStringInfo and appendStringInfoString.  This is really just a
stylistic improvement, but there are also small performance gains to be
had from doing this.
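
The substitution, illustrated (buf and pqbuf are hypothetical
StringInfoData and PQExpBufferData variables):

    /* no format conversion needed, so skip the format-string machinery */
    appendStringInfoString(&buf, "SELECT 1");   /* was: appendStringInfo(&buf, "SELECT 1") */
    appendPQExpBufferStr(&pqbuf, "SELECT 1");   /* was: appendPQExpBuffer(&pqbuf, "SELECT 1") */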

Discussion: http://postgr.es/m/CAKJS1f9P=M-3ULmPvr8iCno8yvfDViHibJjpriHU8+SXUgeZ=w@mail.gmail.com
2019-07-04 13:01:13 +12:00
Peter Geoghegan 66c5bd3a6f Remove obsolete nbtree "get root" comment.
Remove a very old Berkeley era comment that doesn't seem to have
anything to do with the current locking considerations within
_bt_getroot().

Discussion: https://postgr.es/m/CAH2-WzmA2H+rL-xxF5o6QhMD+9x6cJTnz2Mr3Li_pbPBmqoTBQ@mail.gmail.com
2019-07-01 22:28:08 -07:00
Michael Paquier c74d49d41c Fix many typos and inconsistencies
Author: Alexander Lakhin
Discussion: https://postgr.es/m/af27d1b3-a128-9d62-46e0-88f424397f44@gmail.com
2019-07-01 10:00:23 +09:00
Peter Eisentraut 21f428ebde Don't call data type input functions in GUC check hooks
Instead of calling pg_lsn_in() in check_recovery_target_lsn and
timestamptz_in() in check_recovery_target_time, reorganize the
respective code so that we don't raise any errors in the check hooks.
The previous code tried to use PG_TRY/PG_CATCH to handle errors in a
way that is not safe, so now the code contains no ereport() calls and
can operate safely within the GUC error handling system.

Moreover, since the interpretation of the recovery_target_time string
may depend on the time zone, we cannot do the final processing of that
string until all the GUC processing is done.  Instead,
check_recovery_target_time() now does some parsing for syntax
checking, but the actual conversion to a timestamptz value is done
later in the recovery code that uses it.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/20190611061115.njjwkagvxp4qujhp%40alap3.anarazel.de
2019-06-30 10:27:43 +02:00
Peter Eisentraut f2f0082ef5 Update comment
Function was renamed/replaced in
c2fe139c20 but the header comment was
not updated.
2019-06-27 15:57:14 +02:00
Michael Paquier ce59b75d44 Add toast-level reloption for vacuum_index_cleanup
a96c41f has introduced the option for heap, but it still lacked the
variant to control the behavior for toast relations.

While on it, refactor the tests so that they stress more scenarios with
the various values that vacuum_index_cleanup can use.  It would be
useful to couple those tests with pageinspect to check that pages are
actually cleaned up, but this is left for later.

Author: Masahiko Sawada, Michael Paquier
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAD21AoCqs8iN04RX=i1KtLSaX5RrTEM04b7NHYps4+rqtpWNEg@mail.gmail.com
2019-06-25 09:09:27 +09:00
Thomas Munro 89ff7c08ee Remove unnecessary comment.
Author: Vik Fearing
Discussion: https://postgr.es/m/150d3e9f-c7ec-3fb3-4fdb-def47c4144af%402ndquadrant.com
2019-06-23 22:19:59 +12:00
Magnus Hagander 66013fe730 Fix typo
Author: Daniel Gustafsson
2019-06-19 14:59:26 +02:00
Michael Paquier 3c28fd2281 Fix description of WAL record XLOG_BTREE_META_CLEANUP
This record uses one metadata buffer and registers some data associated
with the buffer, but when parsing the record for its description, the
record data was accessed directly even though there is none.  This
usually leads to an incorrect description, but can also cause crashes
like in pg_waldump.  Instead, fix things so that the parsing uses the
data associated with the metadata block.

This is an oversight from 3d92796, so backpatch down to 11.

Author: Michael Paquier
Discussion: https://postgr.es/m/20190617013059.GA3153@paquier.xyz
Backpatch-through: 11
2019-06-19 11:02:19 +09:00
Andres Freund 23224563d9 Fix memory corruption/crash in ANALYZE.
This fixes an embarrassing oversight I (Andres) made in 737a292b,
namely missing two places where liverows/deadrows were used when
converting those variables to pointers, leading to incrementing the
pointer rather than the value.

It's actually not that easy to trigger a crash: one needs tuples
deleted by the current transaction, followed by a tuple deleted in
another session, all in one page.  Which is presumably why this hasn't
been noticed before.

Reported-By: Steve Singer
Author: Steve Singer
Discussion: https://postgr.es/m/c7988239-d42c-ddc4-41db-171b23b35e4f@ssinger.info
2019-06-18 15:51:04 -07:00
Alvaro Herrera 8b21b416ed Avoid spurious deadlocks when upgrading a tuple lock
This puts back reverted commit de87a084c0, with some bug fixes.

When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for T1
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for T1
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once T1 releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after T1 finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
Discussion: https://postgr.es/m/2815.1560521451@sss.pgh.pa.us
2019-06-18 18:23:16 -04:00
Michael Paquier 3412030205 Fix more typos and inconsistencies in the tree
Author: Alexander Lakhin
Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com
2019-06-17 16:13:16 +09:00
Alvaro Herrera 9d20b0ec8f Revert "Avoid spurious deadlocks when upgrading a tuple lock"
This reverts commits 3da73d6839 and de87a084c0.

This code has some tricky corner cases that I'm not sure are correct and
not properly tested anyway, so I'm reverting the whole thing for next
week's releases (reintroducing the deadlock bug that we set to fix).
I'll try again afterwards.

Discussion: https://postgr.es/m/E1hbXKQ-0003g1-0C@gemulon.postgresql.org
2019-06-16 22:24:21 -04:00
Alvaro Herrera 3da73d6839 Silence compiler warning
Introduced in de87a084c0.
2019-06-14 11:33:40 -04:00
Michael Paquier f43608bda2 Fix typos and inconsistencies in code comments
Author: Alexander Lakhin
Discussion: https://postgr.es/m/dec6aae8-2d63-639f-4d50-20e229fb83e3@gmail.com
2019-06-14 09:34:34 +09:00
Alvaro Herrera de87a084c0 Avoid spurious deadlocks when upgrading a tuple lock
When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for T1
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for T1
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once T1 releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after X finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
2019-06-13 17:28:24 -04:00
Andres Freund fff2a7d7bd Don't access catalogs to validate GUCs when not connected to a DB.
Vignesh found this bug in the check function for
default_table_access_method's check hook, but that was just copied
from older GUCs. Investigation by Michael and me then found the bug in
further places.

When not connected to a database (e.g. in a walsender connection), we
cannot perform (most) GUC checks that need database access. Even when
only shared tables are needed, unless they're
nailed (c.f. RelationCacheInitializePhase2()), they cannot be accessed
without pg_class etc. being present.

Fix by extending the existing IsTransactionState() checks to also
check for MyDatabaseOid.
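
In sketch form, the guard in an affected check hook (the global is
MyDatabaseId in the code):

    /*
     * Catalog access is only safe inside a transaction and when actually
     * connected to a database.
     */
    if (!IsTransactionState() || !OidIsValid(MyDatabaseId))
        return true;        /* cannot validate here; accept for now */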

Reported-By: Vignesh C, Michael Paquier, Andres Freund
Author: Vignesh C, Andres Freund
Discussion: https://postgr.es/m/CALDaNm1KXK9gbZfY-p_peRFm_XrBh1OwQO1Kk6Gig0c0fVZ2uw%40mail.gmail.com
Backpatch: 9.4-
2019-06-10 23:34:50 -07:00
Noah Misch 31d250e049 Update stale comments, and fix comment typos. 2019-06-08 10:12:26 -07:00
Amit Kapila 92c4abc736 Fix assorted inconsistencies.
There were a number of issues in the recent commits which include typos,
code and comments mismatch, leftover function declarations.  Fix them.

Reported-by: Alexander Lakhin
Author: Alexander Lakhin, Amit Kapila and Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
2019-06-08 08:16:38 +05:30
Alvaro Herrera e8bdea58f9 Fix message style
Mark one message not for translation, and prefer "cannot" over "may
not", per commentary from Robert Haas.

Discussion: https://postgr.es/m/20190430145813.GA29872@alvherre.pgsql
2019-06-06 12:57:57 -04:00
Michael Paquier 1fb6f62a84 Fix typos in various places
Author: Andrea Gelmini
Reviewed-by: Michael Paquier, Justin Pryzby
Discussion: https://postgr.es/m/20190528181718.GA39034@glet
2019-06-03 13:44:03 +09:00
Alvaro Herrera d22f885f89 Fix double-phrase typo in message
New in 147e3722f7.
2019-05-31 10:08:37 -04:00
Amit Kapila 9679345f3c Fix typos.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-26 18:28:18 +05:30
Andres Freund 73b8c3bd28 tableam: Rename wrapper functions to match callback names.
Some of the wrapper functions didn't match the callback names, many of
them due to staying "consistent" with historic naming of the wrapped
functionality. We decided that for most cases it's more important
for tableam to be consistent going forward than with the past.

The one exception is beginscan/endscan/...  because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).

Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
2019-05-23 16:32:36 -07:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Robert Haas 1171d7d585 tableam: Move heap-specific logic from needs_toast_table below tableam.
This allows table AMs to completely suppress TOAST table creation, or
to modify the conditions under which they are created.

Patch by me.  Reviewed by Andres Freund.

Discussion: http://postgr.es/m/CA+Tgmoa4O2n=yphqD2pERUnYmUO84bH1SqMsA-nSxBGsZ7gWfA@mail.gmail.com
2019-05-21 11:57:13 -04:00
Fujii Masao b8e2170e40 Fix comment for issue_xlog_fsync().
"segno" is the argument for the function, not "log" and "seg".

Author: Antonin Houska
Discussion: https://postgr.es/m/11863.1558361020@spoje.net
2019-05-21 00:44:00 +09:00
Andres Freund c3b23ae457 Don't take predicate locks for analyze scans, refactor scan option passing.
Before this commit, when ANALYZE was run on a table and serializable
was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION
LEVEL SERIALIZABLE, or default_transaction_isolation being set to
serializable), a null pointer dereference led to a crash.
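
For example, a sequence along these lines (the table name is invented
for illustration, not taken from the original report) was enough to hit
the crash:

SET default_transaction_isolation TO 'serializable';
ANALYZE some_table;  -- previously dereferenced a null pointer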

The analyze scan doesn't need a snapshot (nor predicate locking), but
before this commit a scan only contained information about being a
bitmap or sample scan.

Refactor the option passing to the scan_begin callback to use a
bitmask instead. Alternatively we could have added a new boolean
parameter, but that seems harder to read. Even before this issue,
various people (Heikki, Tom, Robert) had suggested using a bitmask.

These changes don't change the scan APIs outside of tableam. The flags
argument could be exposed, but that's not necessary to fix this
problem. Also, the wrapper table_beginscan* functions encapsulate most
of that complexity.

After these changes, fixing the bug is trivial: just don't acquire a
predicate lock for analyze-style scans. That was already done for
bitmap heap scans.  Add an assert that a snapshot is passed when
acquiring the predicate lock, so this kind of bug no longer requires
running with serializable to be caught.

Also add a comment about sample scans currently requiring predicate
locking the entire relation, that previously wasn't remarked upon.

Reported-By: Joe Wildish
Author: Andres Freund
Discussion:
    https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx
    https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
    https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de
2019-05-19 15:10:28 -07:00
Tom Lane d307954a7d "A void function may not return a value".
Per buildfarm.
2019-05-18 00:40:39 -04:00
Andres Freund 147e3722f7 tableam: Avoid relying on relation size to determine validity of tids.
Instead add a tableam callback to do so. To avoid adding per-validation
overhead, pass a scan to tuple_tid_valid. In heap's case
we'd otherwise have incurred a RelationGetNumberOfBlocks() call for each
tid - which would have added noticeable overhead to nodeTidscan.c.

Author: Andres Freund
Reviewed-By: Ashwin Agrawal
Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
2019-05-17 18:56:55 -07:00
Andres Freund 7f44ede594 tableam: Don't assume that every AM uses md.c style storage.
Previously various parts of the code routed size requests through
RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the
AM, but not otherwise.

Add a tableam callback to return the size of the table. As not every
AM will use postgres' BLCKSZ, have it return bytes, and have
RelationGetNumberOfBlocksInFork() round the byte size up into blocks.

To allow code outside of the AM to determine the actual relation size,
map InvalidForkNumber to the total size of a relation, as not every AM
might need just the postgres-defined forks.

A few users of RelationGetNumberOfBlocks() ought to be converted away
from that. One case, the use of it to determine whether a tid is
valid, will be fixed in a follow up commit. Others will have to wait
for v13.

Author: Andres Freund
Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de
2019-05-17 18:56:47 -07:00
Peter Geoghegan 3f58cc6dd8 Remove extra nbtree half-dead internal page check.
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page.  Logic to prevent this from happening was
added by commit 8da3183780, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e9).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.

However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion.  Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.

The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt.  Commit
efada2b8e9 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).

It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX.  However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).

Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
2019-05-16 15:11:58 -07:00
Peter Geoghegan 489e431ba5 Remove obsolete nbtree insertion comment.
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes.  This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.

I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards.  The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
2019-05-15 16:53:11 -07:00
Peter Geoghegan 7505da2f45 Reverse order of newitem nbtree candidate splits.
Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split.  The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).

However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before.  As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page.  This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.

To fix, switch the order in which newitem candidate splits are recorded.
This also makes it possible to describe candidate split points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.
2019-05-15 12:22:07 -07:00
Andres Freund aa4b8c61d2 Handle table_complete_speculative's succeeded argument as documented.
For some reason both the call site and the heapam implementation had
the meaning inverted (i.e. succeeded == true was passed in case of
conflict). That's confusing.

I (Andres) briefly pondered whether it'd be better to rename
table_complete_speculative's argument to 'bool specConflict' or such,
but decided not to. The 'complete' in the function name for me makes
`succeeded` sound a bit better.

Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas
Discussion:
   https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.com
   https://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi
2019-05-14 12:19:32 -07:00
Heikki Linnakangas 22251686f0 Detect internal GiST page splits correctly during index build.
As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.

To fix, add a flag to the stack that keeps track of the state while
descending tree, to indicate that a page was split, and that we need to
retry the descend from the parent.

Thomas Munro first reported that the contrib/intarray regression test was
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.

Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split causes the "grandparent"
to be split.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com
2019-05-14 13:18:44 +03:00
Heikki Linnakangas e95d550bbb Fix comment on when HOT update is possible.
The conditions listed in this comment have changed several times, and at
some point the thing that the "if so" referred to was negated.

The text was OK up to 9.6. It was differently wrong in v10, v11 and
master, so fix in all those versions.
2019-05-14 13:06:28 +03:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Peter Geoghegan 9b42e71376 Don't leave behind junk nbtree pages during split.
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR.  At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section.  This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions.  The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog().  There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.

Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes.  This happened at a point
that broke the rule around zeroing the right page in _bt_split().  If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents.  This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.

To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired.  A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.

Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
2019-05-13 10:27:59 -07:00
Tom Lane 2d7d946cd3 Clean up the behavior and API of catalog.c's is-catalog-relation tests.
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.

The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed.  That's even sillier.  With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.

With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID.  Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)".  But keep
IsCatalogRelation as a convenience function.

This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid.  The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.

While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.

Also improve the comments in catalog.c.

There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId.  I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId.  But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.

Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
2019-05-08 23:27:38 -04:00
Peter Geoghegan d95e36dc38 Remove obsolete nbtree split REDO routine comment.
Commit dd299df818, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits.  It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page.  Remove a comment that referred to the previous practice.
2019-05-08 12:47:20 -07:00
Heikki Linnakangas 256be1050c Remove leftover reference to old "flat file" mechanism in a comment.
The flat file mechanism was removed in PostgreSQL 9.0.
2019-05-08 09:32:34 +03:00
Peter Geoghegan d65b5ccad6 Correct obsolete nbtsort.c minimum key comment.
It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.
2019-05-07 21:42:12 -07:00
Fujii Masao b84dbc8eb8 Add TRUNCATE parameter to VACUUM.
This commit adds a new parameter to the VACUUM command, TRUNCATE,
which specifies that VACUUM should attempt to truncate off
any empty pages at the end of the table and allow the disk space
for the truncated pages to be returned to the operating system.

This parameter, if specified, overrides the vacuum_truncate
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
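
For example (table name invented for illustration):

VACUUM (TRUNCATE FALSE) some_table;  -- keep trailing empty pages
VACUUM (TRUNCATE TRUE) some_table;   -- truncate them, overriding the reloption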

Author: Fujii Masao
Reviewed-by: Julien Rouhaud, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoD+qtrSDL=GSma4Wd3kLYLeRC0hPna-YAdkDeV4z156vg@mail.gmail.com
2019-05-08 02:10:33 +09:00
Amit Kapila 7db0cde6b5 Revert "Avoid the creation of the free space map for small heap relations".
This feature was using a process-local map to track the first few blocks
in the relation.  The map was reset each time we got a block with enough
free space.  It was discussed that it would be better to track this map on
a per-relation basis in relcache and then invalidate it whenever
vacuum frees up some space in a page or when the FSM is created.  The new
design would be better both in terms of API design and performance.

List of commits reverted, in reverse chronological order:

06c8a5090e  Improve code comments in b0eaa4c51b.
13e8643bfc  During pg_upgrade, conditionally skip transfer of FSMs.
6f918159a9  Add more tests for FSM.
9c32e4c350  Clear the local map when not used.
29d108cdec  Update the documentation for FSM behavior..
08ecdfe7e5  Make FSM test portable.
b0eaa4c51b  Avoid creation of the free space map for small heap relations.

Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
2019-05-07 09:30:24 +05:30
Peter Geoghegan 7b37f4b02e Correct more obsolete nbtree page split comments.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself.  Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust.  Fix the comments that were missed the first time
around now.

In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert a new downlink.
2019-05-03 13:34:45 -07:00
Alvaro Herrera 2bf372a4ae heap_prepare_freeze_tuple: Simplify coding
Commit d2599ecfcc introduced some contorted, confused code here:
readers would think that it's possible for HeapTupleHeaderGetXmin to return
a non-frozen value for some frozen tuples, which would be disastrous.
There's no actual bug, but it seems better to make it clearer.

Per gripe from Tom Lane and Andres Freund.
Discussion: https://postgr.es/m/30116.1555430496@sss.pgh.pa.us
2019-05-02 16:14:08 -04:00
Peter Geoghegan 6dd86c269d Fix nbtsort.c's page space accounting.
Commit dd299df818, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe.  In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page.  This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.

Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on the current page.  This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for firstright
tuple, which was just short of what was needed following "truncation").

Several conditions needed to be met all at once for CREATE INDEX to
fail.  The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit.  The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.

To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.

Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
2019-05-02 12:33:35 -07:00
Robert Haas dd69597988 Fix some problems with VACUUM (INDEX_CLEANUP FALSE).
The new nleft_dead_tuples and nleft_dead_itemids fields are confusing
and do not seem like the correct way forward.  One of them is tested
via an assertion that can fail, as it has already done on buildfarm
member topminnow.  Remove the assertion and the fields.

Change the logic for the case where a tuple is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum.  Previously, tupgone = true was set in
that case, which leads to treating the tuple as one that will be
removed.  In a normal vacuum, that's OK, because we'll remove
index entries for it and then the second heap pass will remove the
tuple itself, but when index cleanup is disabled, those things
don't happen, so we must instead treat it as a recently-dead
tuple that we have voluntarily chosen to keep.

Report and analysis by Tom Lane.  This patch loosely based on one
from Masahiko Sawada, but I changed most of it.
2019-05-02 10:07:13 -04:00
Alvaro Herrera 9f8b717a80 Message style fixes 2019-04-30 10:33:37 -04:00
Alvaro Herrera 9a83afecb7 Widen tuple counter variables from long to int64
Mistake in ab0dfc961b6a; progress reporting would have wrapped around
for indexes created with more than 2^31 tuples.

Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com
2019-04-30 10:27:38 -04:00
Andres Freund 5c1560606d Fix several recently introduced issues around handling new relation forks.
Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".

1) To pass data to the relation_set_new_filenode()
   RelationSetNewRelfilenode() was made to update RelationData.rd_rel
   directly. That's not OK however, as it makes the relcache entries
   temporarily inconsistent. Among other scenarios, that is a problem
   if a REINDEX targets an index on pg_class, because of the
   CatalogTupleUpdate() in RelationSetNewRelfilenode().  Presumably
   that was introduced because other places in the code do so - while
   those aren't "good practice" they don't appear to be actively
   buggy (e.g. because system tables may not be targeted).

   I (Andres) should have caught this while reviewing and significantly
   evolving the code in that commit, mea culpa.

   Fix that by instead passing in the new RelFileNode as separate
   argument to relation_set_new_filenode() and rely on the relcache to
   update the catalog entry. Also revert that the
   RelationMapUpdateMap() call was changed to immediate, and undo some
   other more unnecessary changes.

2) Document that the relation_set_new_filenode callback cannot rely on the
   whole relcache entry to be valid. It might be worthwhile to
   refactor the code to never have to rely on that, but given the way
   heap_create() is currently coded, that'd be a large change.

3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
   table AM might not use shared buffers at all. Move to
   index_copy_data() and heapam_relation_copy_data().

4) heapam_relation_set_new_filenode() previously sometimes accessed
   rel->rd_rel->relpersistence rather than the `persistence`
   argument. Code movement mistake.

5) Previously heapam_relation_set_new_filenode() re-opened the smgr
   relation to create the init fork, if necessary. Instead have
   RelationCreateStorage() return the SMgrRelation and use it to
   create the init fork.

6) Add a note about the danger of modifying the relcache directly to
   ATExecSetTableSpace() - it's currently not a bug because there's a
   check ERRORing for catalog tables.

Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.

Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
2019-04-29 19:28:05 -07:00
Peter Geoghegan 9ee7414ed0 Remove obsolete _bt_insert_parent() comment.
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty().  It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
2019-04-29 14:14:38 -07:00
Andres Freund fdc7efcc30 Allow pg_class xid & multixid horizons to not be set.
This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented it from actually working (the test AM code contained
the change itself). Defang the asserts in the general code, and move
the stronger ones into heap AM.

Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid /
relminmxid. Change the table_relation_copy_for_cluster() interface to
allow the AM to overwrite the horizons that get set on the pg_class
entry.  This'd also in the future allow AMs like heap to compute a
relfrozenxid during rewrite that's the table's actual minimum rather
than a pre-determined value.  Arguably it'd have been better to move
the whole computation / setting of those values into the callback, but
it seems likely that for other reasons it'd be better to be able to
use one value to vacuum/cluster multiple tables (e.g. a toast's
horizon shouldn't be different than the table's).

Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
2019-04-23 21:42:12 -07:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.
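
For reference, such "duplicates" arise because SQL treats NULLs as
distinct for uniqueness purposes, so a unique index accepts any number of
them (table name invented for illustration):

CREATE TABLE some_table (a int UNIQUE);
INSERT INTO some_table VALUES (NULL);
INSERT INTO some_table VALUES (NULL);  -- also succeeds; NULLs never conflict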

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
Andres Freund b5f58cf213 Convert gist to compute page level xid horizon on primary.
Due to parallel development, gist added the missing conflict
information in c952eae52a, while 558a9165e0 moved that computation
to the primary for the index types that already had it.  Thus adapt
gist to also compute on the primary, using
index_compute_xid_horizon_for_tuples() instead of its own copy of the
logic.

This also adds pg_waldump support for XLOG_GIST_DELETE records, which
previously was not properly present.

Bumps WAL version.

Author: Andres Freund
Discussion: https://postgr.es/m/20190406050243.bszosdg4buvabfrt@alap3.anarazel.de
2019-04-22 14:28:30 -07:00
Michael Paquier 47ac2033d4 Simplify some ERROR paths clearing wait events and transient files
Transient files and wait events get normally cleaned up when seeing an
exception (be it in the context of a transaction for a backend or
another process like the checkpointer), hence there is little point in
complicating error code paths to do this work.  This shaves a bit of
code, and removes some extra handling of errno which needed to be
preserved during the cleanup steps.

Reported-by: Masahiko Sawada
Author: Michael Paquier
Reviewed-by: Tom Lane, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com
2019-04-17 09:51:45 +09:00
Alexander Korotkov 1e87198182 Fix division by zero in _bt_vacuum_needs_cleanup()
Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when
metad->btm_last_cleanup_num_heap_tuples == 0.  This commit adjusts the
expression so that division by zero cannot happen.

Reported-by: Piotr Stefaniak
Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com
Reviewed-by: Masahiko Sawada
Backpatch-through: 11
2019-04-15 20:20:43 +03:00
Thomas Munro f7feb020c3 Fix GetNewTransactionId()'s interaction with xidVacLimit.
Commit ad308058 switched to returning a FullTransactionId, but
failed to load the potentially updated value in the case where
xidVacLimit is reached and we release and reacquire the lock.
Repair, closing bug #15727.

While reviewing that commit, also fix the size computation used
by EstimateTransactionStateSize() and switch to the mul_size()
macro traditionally used in such expressions.

Author: Thomas Munro
Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/15727-0be246e7d852d229%40postgresql.org
2019-04-12 16:47:50 +12:00
Michael Paquier d87ab88686 Fix typos in reloptions.c
Author: Kirk Jamison
Discussion: https://postgr.es/m/D09B13F772D2274BB348A310EE3027C6493463@g01jpexmbkw24
2019-04-12 12:56:38 +09:00
Amit Kapila bdf35744bd Avoid counting transaction stats for parallel worker cooperating
transaction.

The transaction that is initiated by the parallel worker to cooperate
with the actual transaction started by the main backend to complete the
query execution should not be counted as a separate transaction.  The
other internal transactions started and committed by the parallel worker
are still counted as separate transactions, as that is what we do in
other places like autovacuum.

This will partially fix the bloat in transaction stats due to additional
transactions performed by parallel workers.  For a complete fix, we need to
decide how we want to show all the transactions that are started internally
for various operations and that is a matter of separate patch.

Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
2019-04-10 08:24:15 +05:30
Fujii Masao 119dcfad98 Add vacuum_truncate reloption.
vacuum_truncate controls whether vacuum tries to truncate off
any empty pages at the end of the table. Previously vacuum always
tried to do the truncation. However, the truncation could cause
some problems; for example, ACCESS EXCLUSIVE lock needs to
be taken on the table during the truncation and can cause
query cancellations on the standby even if hot_standby_feedback
is true. Setting this reloption to false can be helpful to avoid
such problems.
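
For example (table name invented for illustration):

ALTER TABLE some_table SET (vacuum_truncate = false);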

Author: Tsunakawa Takayuki
Reviewed-By: Julien Rouhaud, Masahiko Sawada, Michael Paquier, Kirk Jamison and Fujii Masao
Discussion: https://postgr.es/m/CAHGQGwE5UqFqSq1=kV3QtTUtXphTdyHA-8rAj4A=Y+e4kyp3BQ@mail.gmail.com
2019-04-08 16:43:57 +09:00
Andres Freund 41f5e04aec Fix a number of issues around modifying a previously updated row.
This commit fixes three, unfortunately related, issues:

1) Since 5db6df0c01, the introduction of DML via tableam, it was
   possible to trigger "ERROR: unexpected table_lock_tuple status: 1"
   when updating a row that was previously updated in the same
   transaction - but only when the row had, before that, been
   updated by a concurrent transaction (and READ COMMITTED was
   used). The reason was that that case simply wasn't
   expected. Fixing that led to:

2) Even before the above commit, there were error checks (introduced
   in 6868ed7491) preventing a row being updated by different
   commands within the same statement (say in a function called by an
   UPDATE) - but that check wasn't performed when the row was first
   updated in a concurrent transaction - instead the second update was
   silently skipped in that case. After this change we throw the same
   error as we'd without the concurrent transaction.

3) The error messages (introduced in 6868ed7491) preventing such
   updates emitted the same error message for both DELETE and
   UPDATE ("tuple to be updated was already modified by an operation
   triggered by the current command"). While that could be changed
   separately, it made it hard to write tests that verify the
   correct behavior of the code.

This commit changes heap's implementation of table_lock_tuple() to
return TM_SelfModified instead of TM_Invisible (previously loosely
modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to
handle that in response to table_lock_tuple() and not just in response
to table_(delete|update).

Additionally it fixes the wrong error message (see 3 above). The
comment for table_lock_tuple() is also adjusted to state that
TM_Deleted won't return information in TM_FailureData - it'll not
always be available.

This also adds tests to ensure that DELETE/UPDATE correctly error out
when affecting a row that concurrently was modified by another
transaction.

Author: Andres Freund
Reported-By: Tom Lane, when investigating a bug fix to another bug
    by Amit Langote
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us
2019-04-07 22:14:47 -07:00
Andres Freund 57a7a3adfe Remove unused struct member, enforce multi_insert callback presence.
Author: David Rowley, Andres Freund
Discussion: https://postgr.es/m/CAKJS1f9=9phmm66diAji4gvHnWSrK7BGFoNct+mEUT_c8pPOjw@mail.gmail.com
2019-04-04 17:39:39 -07:00
Andres Freund ea97e440b8 Harden tableam against nonexistent / wrong kind of AMs.
Previously it was allowed to set default_table_access_method to an
empty string. That makes sense for default_tablespace, where that was
copied from, as it signals falling back to the database's default
tablespace. As there is no equivalent for table AMs, forbid that.

Also make sure to throw a usable error when creating a table using an
index AM, by using get_am_type_oid() to implement get_table_am_oid()
instead of a separate copy. Previously we'd error out only later, in
GetTableAmRoutine().

Thirdly remove GetTableAmRoutineByAmId() - it was only used in an
earlier version of 8586bf7ed8.

Add tests for the above (some for index AMs as well).
2019-04-04 17:39:39 -07:00
Andres Freund 86b85044e8 tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.

A simple conversion of COPY FROM to use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there are constant partition changes is measurable.

Thus instead revamp the COPY FROM buffering for partitioned tables to
allow buffering inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such cases
up significantly.

By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function calls).

Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
2019-04-04 16:28:18 -07:00
Tom Lane 9c703c169a Make queries' locking of indexes more consistent.
The assertions added by commit b04aeb0a0 exposed that there are some
code paths wherein the executor will try to open an index without
holding any lock on it.  We do have some lock on the index's table,
so it seems likely that there's no fatal problem with this (for
instance, the index couldn't get dropped from under us).  Still,
it's bad practice and we should fix it.

To do so, remove the optimizations in ExecInitIndexScan and friends
that tried to avoid taking a lock on an index belonging to a target
relation, and just take the lock always.  In non-bug cases, this
will result in no additional shared-memory access, since we'll find
in the local lock table that we already have a lock of the desired
type; hence, no significant performance degradation should occur.

Also, adjust the planner and executor so that the type of lock taken
on an index is always identical to the type of lock taken for its table,
by relying on the recently added RangeTblEntry.rellockmode field.
This avoids some corner cases where that might not have been true
before (possibly resulting in extra locking overhead), and prevents
future maintenance issues from having multiple bits of logic that
all needed to be in sync.  In addition, this change removes all core
calls to ExecRelationIsTargetRelation, which avoids a possible O(N^2)
startup penalty for queries with large numbers of target relations.
(We'd probably remove that function altogether, were it not that we
advertise it as something that FDWs might want to use.)

Also adjust some places in selfuncs.c to not take any lock on indexes
they are transiently opening, since we can assume that plancat.c
did that already.

In passing, change gin_clean_pending_list() to take RowExclusiveLock
not AccessShareLock on its target index.  Although it's not clear that
that's actually a bug, it seemed very strange for a function that's
explicitly going to modify the index to use only AccessShareLock.

David Rowley, reviewed by Julien Rouhaud and Amit Langote,
a bit of further tweaking by me

Discussion: https://postgr.es/m/19465.1541636036@sss.pgh.pa.us
2019-04-04 15:12:58 -04:00
Robert Haas a96c41feec Allow VACUUM to be run with index cleanup disabled.
This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default.  It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption.  If neither the reloption nor the VACUUM option is
used, the default is true, as before.
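
For example (table name invented for illustration):

ALTER TABLE some_table SET (vacuum_index_cleanup = false);
VACUUM (INDEX_CLEANUP TRUE) some_table;  -- overrides the reloption for one run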

Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.

Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
2019-04-04 15:04:43 -04:00
Peter Geoghegan 74eb2176bf Invalidate binary search bounds consistently.
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78.  This resulted in
problems after waiting for the conflicting xact to commit or abort.  The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search.  Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.

Ashutosh Sharma, with a few additional tweaks by me.

Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
2019-04-04 09:38:08 -07:00
Thomas Munro 3eb77eba5a Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.

For now only md.c uses the new interface, but other users are being
proposed.  Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.

Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key.  This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).

Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
2019-04-04 23:38:38 +13:00
Alvaro Herrera 799e220346 Log all statements from a sample of transactions
This is useful to obtain a view of the different transaction types in an
application, regardless of the durations of the statements each runs.
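
For example, assuming the new setting is named log_transaction_sample_rate:

SET log_transaction_sample_rate = 0.01;  -- log all statements of ~1% of transactions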

Author: Adrien Nayrat
Reviewed-by: Masahiko Sawada, Hayato Kuroda, Andres Freund
2019-04-03 18:43:59 -03:00
Heikki Linnakangas 9155580fd5 Generate less WAL during GiST, GIN and SP-GiST index build.
Instead of WAL-logging every modification during the build separately,
first build the index without any WAL-logging, and make a separate pass
through the index at the end, to write all pages to the WAL. This
significantly reduces the amount of WAL generated, and is usually also
faster, despite the extra I/O needed for the extra scan through the index.
WAL generated this way is also faster to replay.

For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
be marked with a valid (i.e. non-zero) LSN, so that the parent-child
LSN-NSN interlock works correctly. We now use magic value 1 for that during
index build. Change the fake LSN counter to begin from 1000, so that 1 is
safely smaller than any real or fake LSN. 2 would've been enough for our
purposes, but let's reserve a bigger range, in case we need more special
values in the future.

Author: Anastasia Lubennikova, Andrey V. Lepikhov
Reviewed-by: Heikki Linnakangas, Dmitry Dolgov
2019-04-03 17:03:15 +03:00
Alvaro Herrera 5f768045a1 Correctly initialize newly added struct member
Valgrind was rightly complaining that IndexVacuumInfo->report_progress
(added by commit ab0dfc961b) was not being initialized in some code
paths.  Repair.

Per buildfarm member lousyjack.
2019-04-03 09:58:47 -03:00
Alvaro Herrera ab0dfc961b Report progress of CREATE INDEX operations
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.

There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific.  The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.

The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan.  (The index validation index scan requires
patching each AM, which has not been included here.)

Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-04-02 15:18:08 -03:00
Stephen Frost 4d0e994eed Add support for partial TOAST decompression
When asked for a slice of a TOAST entry, decompress enough to return the
slice instead of decompressing the entire object.

For use cases where the slice is at, or near, the beginning of the entry,
this avoids a lot of unnecessary decompression work.
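
For example, a query like the following (names invented for illustration)
only needs a prefix slice of the stored value, so only a prefix of the
compressed data has to be decompressed:

SELECT substr(big_column, 1, 100) FROM some_table;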

This changes the signature of pglz_decompress() by adding a boolean to
indicate if it's ok for the call to finish before consuming all of the
source or destination buffers.

Author: Paul Ramsey
Reviewed-By: Rafia Sabih, Darafei Praliaskouski, Regina Obe
Discussion: https://postgr.es/m/CACowWR07EDm7Y4m2kbhN_jnys%3DBBf9A6768RyQdKm_%3DNpkcaWg%40mail.gmail.com
2019-04-02 12:35:32 -04:00
Thomas Munro 475861b261 Add wal_recycle and wal_init_zero GUCs.
On at least ZFS, it can be beneficial to create new WAL files every
time and not to bother zero-filling them.  Since it's not clear which
other filesystems might benefit from one or both of those things,
add individual GUCs to control those two behaviors independently and
make only very general statements in the docs.
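
For example, on a filesystem like ZFS one might try:

ALTER SYSTEM SET wal_recycle = off;    -- create new WAL files instead of recycling
ALTER SYSTEM SET wal_init_zero = off;  -- skip zero-filling new WAL files
SELECT pg_reload_conf();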

Author: Jerry Jelinek, with some adjustments by Thomas Munro
Reviewed-by: Alvaro Herrera, Andres Freund, Tomas Vondra, Robert Haas and others
Discussion: https://postgr.es/m/CACPQ5Fo00QR7LNAcd1ZjgoBi4y97%2BK760YABs0vQHH5dLdkkMA%40mail.gmail.com
2019-04-02 14:37:14 +13:00
Andres Freund d45e401586 tableam: Add table_finish_bulk_insert().
This replaces the previous calls of heap_sync() in places using
bulk-insert. By passing in the flags used for bulk-insert the AM can
decide (first at insert time and then during the finish call) which of
the optimizations apply to it, and what operations are necessary to
finish a bulk insert operation.

Also change HEAP_INSERT_* flags to TABLE_INSERT, and rename hi_options
to ti_options.

These changes are made even in copy.c, which hasn't yet been converted
to tableam. There's no harm in doing so.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-04-01 14:41:42 -07:00
Thomas Munro 4fd05bb55b Fix deadlock in heap_compute_xid_horizon_for_tuples().
We can't call code that uses syscache while we hold buffer locks
on a catalog relation.  If passed such a relation, just fall back
to the general effective_io_concurrency GUC rather than trying to
look up the containing tablespace's IO concurrency setting.

We might find a better way to control prefetching in follow-up
work, but for now this is enough to avoid the deadlock introduced
by commit 558a9165e0.

Reviewed-by: Andres Freund
Diagnosed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGLCwPF0S4Mk7S8qw%2BDK0Bq65LueN9rofAA3HHSYikW-Zw%40mail.gmail.com
Discussion: https://postgr.es/m/962831d8-c18d-180d-75fb-8b842e3a2742%40chrullrich.net
2019-04-02 09:29:49 +13:00
Peter Eisentraut cc8d415117 Unified logging system for command-line programs
This unifies the various ad hoc logging (message printing, error
printing) systems used throughout the command-line programs.

Features:

- Program name is automatically prefixed.

- Message string does not end with newline.  This removes a common
  source of inconsistencies and omissions.

- Additionally, a final newline is automatically stripped, simplifying
  use of PQerrorMessage() etc., another common source of mistakes.

- I converted error message strings to use %m where possible.

- As a result of the above several points, more translatable message
  strings can be shared between different components and between
  frontends and backend, without gratuitous punctuation or whitespace
  differences.

- There is support for setting a "log level".  This is not meant to be
  user-facing, but can be used internally to implement debug or
  verbose modes.

- Lazy argument evaluation, so no significant overhead if logging at
  some level is disabled.

- Some color in the messages, similar to gcc and clang.  Set
  PG_COLOR=auto to try it out.  Some colors are predefined, but can be
  customized by setting PG_COLORS.

- Common files (common/, fe_utils/, etc.) can handle logging much more
  simply by just using one API without worrying too much about the
  context of the calling program, requiring callbacks, or having to
  pass "progname" around everywhere.

- Some programs called setvbuf() to make sure that stderr is
  unbuffered, even on Windows.  But not all programs did that.  This
  is now done centrally.

Soft goals:

- Reduces vertical space use and visual complexity of error reporting
  in the source code.

- Encourages more deliberate classification of messages.  For example,
  in some cases it wasn't clear without analyzing the surrounding code
  whether a message was meant as an error or just an info.

- Concepts and terms are vaguely aligned with popular logging
  frameworks such as log4j and Python logging.

This is all just about printing stuff out.  Nothing affects program
flow (e.g., fatal exits).  The uses are just too varied to do that.
Some existing code had wrappers that do some kind of print-and-exit,
and I adapted those.

I tried to keep the output mostly the same, but there is a lot of
historical baggage to unwind and special cases to consider, and I
might not always have succeeded.  One significant change is that
pg_rewind used to write all error messages to stdout.  That is now
changed to stderr.

Reviewed-by: Donald Dong <xdong@csumb.edu>
Reviewed-by: Arthur Zakirov <a.zakirov@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/6a609b43-4f57-7348-6480-bd022f924310@2ndquadrant.com
2019-04-01 20:01:35 +02:00
Andres Freund bfbcad478f tableam: bitmap table scan.
This moves bitmap heap scan support to below an optional tableam
callback. It's optional as the whole concept of bitmap heapscans is
fairly block specific.

This basically moves the work previously done in bitgetpage() into the
new scan_bitmap_next_block callback, and the direct poking into the
buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple()
callback.

The abstraction is currently somewhat leaky because
nodeBitmapHeapscan.c's prefetching and visibilitymap based logic
remains - it's likely that we'll later have to move more into the
AM. But it's not trivial to do so without introducing a significant
amount of code duplication between the AMs, so that's a project for
later.

Note that now nodeBitmapHeapscan.c and the associated node types are a
bit misnamed. But it's not clear whether renaming wouldn't be a cure
worse than the disease. Either way, that'd be best done in a separate
commit.

Author: Andres Freund
Reviewed-By: Robert Haas (in an older version)
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-31 18:37:57 -07:00
Andres Freund 73c954d248 tableam: sample scan.
This moves sample scan support to below tableam. It's not optional as
there is, in contrast to e.g. bitmap heap scans, no alternative way to
perform tablesample queries. If an AM can't deal with the block based
API, it will have to throw an ERROR.

The tableam callbacks for this are block based, but given the current
TsmRoutine interface, that seems to be required.

The new interface doesn't require TsmRoutines to perform visibility
checks anymore - that requires the TsmRoutine to know details about
the AM, which we want to avoid.  To continue to allow taking the
returned number of tuples into account, SampleScanState now has a donetuples
field (which previously e.g. existed in SystemRowsSamplerData), which
is only incremented after the visibility check succeeds.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-31 18:37:57 -07:00
Andres Freund 4bb50236eb tableam: Formatting and other minor cleanups.
The superfluous heapam_xlog.h includes were reported by Peter
Geoghegan.
2019-03-31 18:16:53 -07:00
Peter Geoghegan 76a39f2295 Fix nbtree high key "continuescan" row compare bug.
Commit 29b64d1d mishandled skipping over truncated high key attributes
during row comparisons.  The row comparison key matching loop would loop
forever when a truncated attribute was encountered for a row compare
subkey.  Fix by following the example of other code in the loop: advance
the current subkey, or break out of the loop when the last subkey is
reached.

Add test coverage for the relevant _bt_check_rowcompare() code path.
The new test case is somewhat tied to nbtree implementation details,
which isn't ideal, but seems unavoidable.
2019-03-31 17:24:04 -07:00
Michael Paquier 2aa6e331ea Skip redundant anti-wraparound vacuums
An anti-wraparound vacuum has to be by definition aggressive as it needs
to work on all the pages of a relation.  However it can happen that due
to some concurrent activity an anti-wraparound vacuum is marked as
non-aggressive, which makes it redundant with a previous run, and
it is actually useless as an anti-wraparound vacuum should process all
the pages of a relation.  This commit makes such vacuums be skipped.

A non-aggressive anti-wraparound vacuum can be produced easily by mixing
low values of autovacuum_freeze_max_age (to control anti-wraparound) and
autovacuum_freeze_table_age (to control the aggressiveness).

28a8fa9 added some extra logging printing all the possible
combinations of anti-wraparound and aggressive vacuums; this now gets
simplified, as a non-aggressive anti-wraparound vacuum gets skipped.

Per discussion mainly between Andrew Dunstan, Robert Haas, Álvaro
Herrera, Kyotaro Horiguchi, Masahiko Sawada, and myself.

Author: Kyotaro Horiguchi, Michael Paquier
Reviewed-by: Andrew Dunstan, Álvaro Herrera
Discussion: https://postgr.es/m/20180914153554.562muwr3uwujno75@alvherre.pgsql
2019-03-31 22:59:12 +09:00
Andres Freund 696d78469f tableam: Move heap specific logic from estimate_rel_size below tableam.
This just moves the table/matview[/toast] determination of relation
size to a callback, and uses a copy of the existing logic to implement
that callback for heap.
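
A hedged sketch of the callback's shape (signature assumed):

    /* estimate the size of rel for the planner; heap's implementation
     * is a copy of the previous estimate_rel_size() logic */
    void        (*relation_estimate_size) (Relation rel, int32 *attr_widths,
                                           BlockNumber *pages, double *tuples,
                                           double *allvisfrac);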

It probably would make sense to also move the index specific logic
into a callback, so the metapage handling (and probably more) can be
index specific. But that's a separate task.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-30 19:26:36 -07:00
Andres Freund 737a292b5d tableam: VACUUM and ANALYZE support.
This is a relatively straightforward move of the current
implementation to sit below tableam. As the current analyze sampling
implementation is pretty inherently block based, the tableam analyze
interface is as well. It might make sense to generalize that at some
point, but that seems like a larger project that shouldn't be
undertaken at the same time as the introduction of tableam.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-30 19:25:58 -07:00
Peter Eisentraut fc22b6623b Generated columns
This is an SQL-standard feature that allows creating columns that are
computed from expressions rather than assigned, similar to a view or
materialized view but on a column basis.

This implements one kind of generated column: stored (computed on
write).  Another kind, virtual (computed on read), is planned for the
future, and some room is left for it.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b151f851-4019-bdb1-699e-ebab07d2f40a@2ndquadrant.com
2019-03-30 08:15:57 +01:00
Peter Geoghegan 9c7fb7e6d8 Tweak some nbtree-related code comments. 2019-03-29 12:29:05 -07:00
Andres Freund ffa8444ce4 tableam: Comment fixes.
Author: Haribabu Kommi
Discussion: https://postgr.es/m/CAJrrPGeeYOqP3hkZyohDx_8dot4zvPuPMDBmhJ=iC85cTBNeYw@mail.gmail.com
2019-03-29 08:17:26 -07:00
Robert Haas c900c15269 Warn more strongly about the dangers of exclusive backup mode.
Especially, warn about the hazards of mishandling the backup_label
file.  Adjust a couple of server messages to be more clear about
the hazards associated with removing backup_label files, too.

David Steele and Robert Haas, reviewed by Laurenz Albe, Martín
Marqués, Peter Eisentraut, and Magnus Hagander.

Discussion: http://postgr.es/m/7d85c387-000e-16f0-e00b-50bf83c22127@pgmasters.net
2019-03-29 08:15:16 -04:00
Andres Freund d25f519107 tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.
This moves the responsibility for:
- creating the storage necessary for a relation, including creating a
  new relfilenode for a relation with existing storage
- non-transactional truncation of a relation
- VACUUM FULL / CLUSTER's rewrite of a table
below tableam.

This is fairly straightforward, with a bit of complexity smattered in
to move the computation of xid / multixid horizons below the AM, as
they don't make sense for every table AM.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-28 20:01:43 -07:00
Thomas Munro 7e69323bf7 Fix typo.
Author: Masahiko Sawada
2019-03-29 10:03:58 +13:00
Thomas Munro ad308058cc Use FullTransactionId for the transaction stack.
Provide GetTopFullTransactionId() and GetCurrentFullTransactionId().
The intended users of these interfaces are access methods that use
xids for visibility checks but don't want to have to go back and
"freeze" existing references some time later before the 32 bit xid
counter wraps around.
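
For example (the function is from this commit; the AM field is
hypothetical):

    /* a 64 bit xid can be stored as-is; no later freezing is needed */
    FullTransactionId fxid = GetTopFullTransactionId();

    my_tuple_header->xmin_full = fxid;    /* hypothetical AM field */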

Use a new struct to serialize the transaction state for parallel
query, because FullTransactionId doesn't fit into the previous
serialization scheme very well.

Author: Thomas Munro
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:24:43 +13:00
Thomas Munro 2fc7af5e96 Add basic infrastructure for 64 bit transaction IDs.
Instead of inferring epoch progress from xids and checkpoints,
introduce a 64 bit FullTransactionId type and use it to track xid
generation.  This fixes an unlikely bug where the epoch is reported
incorrectly if the range of active xids wraps around more than once
between checkpoints.

The only user-visible effect of this commit is to correct the epoch
used by txid_current() and txid_status(), also visible with
pg_controldata, in those rare circumstances.  It also creates some
basic infrastructure so that later patches can use 64 bit
transaction IDs in more places.

The new type is a struct that we pass by value, as a form of strong
typedef.  This prevents the sort of accidental confusion between
TransactionId and FullTransactionId that would be possible if we
were to use a plain old uint64.
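
A sketch of the strong-typedef idea (accessor names assumed):

    typedef struct FullTransactionId
    {
        uint64      value;
    } FullTransactionId;

    /* low 32 bits are the traditional xid, high bits are the epoch */
    #define XidFromFullTransactionId(x)    ((TransactionId) (x).value)
    #define EpochFromFullTransactionId(x)  ((uint32) ((x).value >> 32))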

Author: Thomas Munro
Reported-by: Amit Kapila
Reviewed-by: Andres Freund, Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/CAA4eK1%2BMv%2Bmb0HFfWM9Srtc6MVe160WFurXV68iAFMcagRZ0dQ%40mail.gmail.com
2019-03-28 18:12:20 +13:00
Andres Freund 2a96909a4a tableam: Support for an index build's initial table scan(s).
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM.  While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.

Currently the range based interface wouldn't make much sense for non
block based table AMs. But that seems acceptable for now.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-27 19:59:06 -07:00
Tom Lane a51cc7e9e6 Suppress uninitialized-variable warning.
Apparently Andres' compiler is smart enough to see that hpage
must be initialized before use ... but mine isn't.
2019-03-27 11:10:42 -04:00
Michael Paquier 1983af8e89 Switch some palloc/memset calls to palloc0
Some code paths have been doing allocations followed by an
immediate memset() to initialize the allocated area with zeros; this
is a bit overkill, as there are already interfaces to do both things
in one call.
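
That is, the pattern

    ptr = palloc(size);
    memset(ptr, 0, size);

becomes simply

    ptr = palloc0(size);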

Author: Daniel Gustafsson
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/vN0OodBPkKs7g2Z1uyk3CUEmhdtspHgYCImhlmSxv1Xn6nY1ZnaaGHL8EWUIQ-NEv36tyc4G5-uA3UXUF2l4sFXtK_EQgLN1hcgunlFVKhA=@yesql.se
2019-03-27 12:02:50 +09:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to by the index looks
   like. That was easy enough before the introduction of tableam (we
   knew it had to be
   heap, although some trickery around logging the heap relfilenodes
   was required). But to properly handle table AMs we need
   per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old. But
   computing the horizon by visiting the heap only works once
   consistency is reached, whereas we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there are no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency even when it is not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00
Tom Lane 7c366ac969 Fix oversight in data-type change for autovacuum_vacuum_cost_delay.
Commit caf626b2c missed that the relevant reloptions entry needs
to be moved from the intRelOpts[] array to realRelOpts[].
Somewhat surprisingly, it seems to work anyway, perhaps because
the desired default and limit values are all integers.  We ought
to have either a simpler data structure or better cross-checking
here, but that's for another patch.

Nikolay Shaplov

Discussion: https://postgr.es/m/4861742.12LTaSB3sv@x200m
2019-03-26 13:32:38 -04:00
Andres Freund 2ac1b2b175 Remove heap_hot_search().
After 71bdc99d0d, "tableam: Add helper for indexes to check if a
corresponding table tuples exist." there's no in-core user left. As
there's unlikely to be an external user, and such an external user
could easily be adjusted to use table_index_fetch_tuple_check(),
remove heap_hot_search().
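
A sketch of the replacement call (signature assumed):

    bool        all_dead;

    /* does any visible tuple exist for the given tid? */
    if (table_index_fetch_tuple_check(rel, tid, snapshot, &all_dead))
    {
        /* conflict is with a live tuple */
    }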

Per complaint from Peter Geoghegan

Author: Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com
2019-03-25 19:04:41 -07:00
Andres Freund 2e3da03e9e tableam: Add table_get_latest_tid, to wrap heap_get_latest_tid.
This primarily is to allow WHERE CURRENT OF to continue to work as it
currently does. It's not clear to me that these semantics make sense
for every AM, but it works for the in-core heap, and the out-of-core
zheap. We can refine it further at a later point if necessary.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 17:14:48 -07:00
Andres Freund 71bdc99d0d tableam: Add helper for indexes to check if a corresponding table tuples exist.
This is, likely exclusively, useful to verify that conflicts detected
in a unique index are with live tuples, rather than dead ones.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 16:52:55 -07:00
Peter Geoghegan f21668f328 Add "split after new tuple" nbtree optimization.
Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values in indexes with multiple columns.  When this insertion pattern is
detected, page splits split just after the new item that provoked a page
split (or apply leaf fillfactor in the style of a rightmost page split).
This optimization is a variation of the long established leaf fillfactor
optimization used during rightmost page splits.

50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average.  Future insertions can never make use of the
free space left behind.  With this patch, affected cases have leaf pages
that are about 90% full on average (assuming a fillfactor of 90).

Localized monotonically increasing insertion patterns are presumed to be
fairly common in real-world applications.  There is a fair amount of
anecdotal evidence for this.  Both pg_depend system catalog indexes
(pg_depend_depender_index and pg_depend_reference_index) are at least
20% smaller after the regression tests are run when the optimization is
available.  Furthermore, many of the indexes created by a fair use
implementation of TPC-C for Postgres are consistently about 40% smaller
when the optimization is available.

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
2019-03-25 09:44:25 -07:00
Andres Freund 9a8ee1dc65 tableam: Add and use table_fetch_row_version().
This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.
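
Roughly (name from this commit; signature assumed):

    /* fetch the exact tuple version named by tid, if visible */
    if (table_fetch_row_version(rel, tid, snapshot, slot))
    {
        /* slot now holds that version */
    }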

Note that this is different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.

This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.

Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 00:17:59 -07:00
Peter Geoghegan 59ab3be9e4 Remove dead code from nbtsplitloc.c.
It doesn't make sense to consider the possibility that there will only
be one candidate split point when choosing among split points to find
the split with the lowest penalty.  This is a vestige of an earlier
version of the patch that became commit fab25024.

Issue spotted while rereviewing coverage of the nbtree patch series
using gcov.
2019-03-24 12:28:58 -07:00
Peter Eisentraut 280a408b48 Transaction chaining
Add command variants COMMIT AND CHAIN and ROLLBACK AND CHAIN, which
start new transactions with the same transaction characteristics as the
just finished one, per SQL standard.

Support for transaction chaining in PL/pgSQL is also added.  This
functionality is especially useful when running COMMIT in a loop in
PL/pgSQL.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/28536681-324b-10dc-ade8-ab46f7645a5a@2ndquadrant.com
2019-03-24 11:33:02 +01:00
Andres Freund 5db6df0c01 tableam: Add tuple_{insert, delete, update, lock} and use.
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one.  It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch().  Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it seems
like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and
its members have been expanded to differentiate between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.
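
A hedged sketch of the expanded result type (members per the
description above; the exact set is assumed):

    typedef enum TM_Result
    {
        TM_Ok,
        TM_Invisible,
        TM_SelfModified,
        TM_Updated,         /* concurrently updated */
        TM_Deleted,         /* concurrently deleted, now distinct */
        TM_BeingModified,
        TM_WouldBlock
    } TM_Result;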

The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
succeeded = false.

Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.

Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-23 19:55:57 -07:00
Peter Geoghegan 29b64d1de7 Add nbtree high key "continuescan" optimization.
Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so.  The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan orientated"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all.  The new high key check
is complementary.

The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes.  The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page.  A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.

Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans.  Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
2019-03-23 11:01:53 -07:00
Heikki Linnakangas d1b9ee4e44 Fix bug in the GiST vacuum's 2nd stage.
We mustn't assume that the IndexVacuumInfo pointer passed to the
bulkdelete() stage is still valid in the vacuumcleanup() stage.

Per very pink buildfarm.
2019-03-22 14:11:46 +02:00
Heikki Linnakangas 7df159a620 Delete empty pages during GiST VACUUM.
To do this, we scan the GiST index two times. In the first pass we make
note of empty leaf pages and internal pages. In the second pass we scan
through the internal pages, looking for downlinks to the empty pages.

Deleting internal pages is still not supported; as in nbtree, the last
child of an internal page is never deleted. That means that if you have a
workload where new keys are always inserted to a different area than where
old keys are removed, the index will still grow without bound. But the rate
of growth will be an order of magnitude slower than before.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
2019-03-22 13:21:45 +02:00
Peter Eisentraut 5e1963fb76 Collations with nondeterministic comparison
This adds a flag "deterministic" to collations.  If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal.  That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider.  At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations.  Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account.  For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison.  For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
2019-03-22 12:12:43 +01:00
Peter Geoghegan 3d0dcc5c7f Fix spurious compiler warning in nbtxlog.c.
Cleanup from commit dd299df8.

Per complaint from Tom Lane.
2019-03-20 14:04:35 -07:00
Peter Geoghegan fab2502433 Consider secondary factors during nbtree splits.
Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
2019-03-20 10:12:19 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.
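
A sketch of the dedicated struct (field set assumed; the heap TID
tiebreaker pieces only arrive with the follow-up patch mentioned
below):

    typedef struct BTScanInsertData
    {
        bool        nextkey;    /* find first item > scankey, not >= ? */
        int         keysz;      /* number of preprocessed scan keys */
        ScanKeyData scankeys[INDEX_MAX_KEYS];
    } BTScanInsertData;

    typedef BTScanInsertData *BTScanInsert;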

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Peter Geoghegan 1009920aaa Tweak nbtsearch.c function prototype order.
nbtsearch.c's static function prototypes were slightly out of order.
Make the order consistent with static function definition order.
2019-03-19 09:59:05 -07:00
Tom Lane f2004f19ed Fix memory leak in printtup.c.
Commit f2dec34e1 changed things so that printtup's output stringinfo
buffer was allocated outside the per-row temporary context, not inside
it.  This creates a need to free that buffer explicitly when the temp
context is freed, but that was overlooked.  In most cases, this is all
happening inside a portal or executor context that will go away shortly
anyhow, but that's not always true.  Notably, the stringinfo ends up
getting leaked when JDBC uses row-at-a-time fetches.  For a query
that returns wide rows, that adds up after a while.

Per bug #15700 from Matthias Otterbach.  Back-patch to v11 where the
faulty code was added.

Discussion: https://postgr.es/m/15700-8c408321a87d56bb@postgresql.org
2019-03-18 17:54:41 -04:00
Robert Haas f41551f61f Fold vacuum's 'int options' parameter into VacuumParams.
Many places need both, so this allows a few functions to take one
fewer parameter.  More importantly, as soon as we add a VACUUM
option that takes a non-Boolean parameter, we need to replace
'int options' with a struct, and it seems better to think
of adding more fields to VacuumParams rather than passing around
both VacuumParams and a separate struct as well.

Patch by me, reviewed by Masahiko Sawada

Discussion: http://postgr.es/m/CA+Tgmob6g6-s50fyv8E8he7APfwCYYJ4z0wbZC2yZeSz=26CYQ@mail.gmail.com
2019-03-18 13:57:33 -04:00
Michael Paquier 8b938d36f7 Refactor more code logic to update the control file
ce6afc6 has begun the refactoring work by plugging pg_rewind into a
central routine to update the control file, and left around two extra
copies, with one in xlog.c for the backend and one in pg_resetwal.c.  By
adding an extra option to the central routine in controldata_utils.c to
control whether a flush of the control file needs to be done, it proves
straightforward to make xlog.c and pg_resetwal.c use the central code
path, on condition of moving the wait event tracking there.  Hence, this
allows having only one central code path to update the control file,
shaving off the duplicated code.

This refactoring actually fixes a problem in pg_resetwal.  Previously,
the control file was first removed before being recreated.  So if a
crash happened between the moment the file was removed and the moment
the file was created, then it would have been possible to not have a
control file anymore in the database folder.

Author: Fabien Coelho
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/alpine.DEB.2.21.1903170935210.2506@lancre
2019-03-18 12:59:35 +09:00
Peter Eisentraut 893d6f8a1f Avoid casting away a const 2019-03-16 10:13:03 +01:00
Thomas Munro bb16aba50c Enable parallel query with SERIALIZABLE isolation.
Previously, the SERIALIZABLE isolation level prevented parallel query
from being used.  Allow the two features to be used together by
sharing the leader's SERIALIZABLEXACT with parallel workers.

An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to
share, and new logic is introduced to coordinate the early release
of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE
optimization, as follows:

The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by
some other transaction) will 'partially release' the SERIALIZABLEXACT,
meaning that the conflicts and locks it holds are released, but the
SERIALIZABLEXACT itself will remain active because other backends
might still have a pointer to it.

Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears
its own MySerializableXact variable and frees local resources so that
it can skip SSI checks for the rest of the transaction.  In the
special case of the leader process, it transfers the SERIALIZABLEXACT
to a new variable SavedSerializableXact, so that it can be completely
released at the end of the transaction after all workers have exited.

Remove the serializable_okay flag added to CreateParallelContext() by
commit 9da0cc35, because it's now redundant.

Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner
Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
2019-03-15 17:47:04 +13:00
Peter Geoghegan 3f34283973 Correct obsolete nbtree page split comment.
Commit 40dae7ec53, which made the nbtree page split algorithm more
robust, made _bt_insert_parent() only unlock the right child of the
parent page before inserting a new downlink into the parent.  Update a
comment from the Berkeley days claiming that both left and right child
pages are unlocked before the new downlink actually gets inserted.

The claim that it is okay to release both locks early based on Lehman
and Yao's say-so never made much sense.  Lehman and Yao must sometimes
"couple" buffer locks across a pair of internal pages when relocating a
downlink, unlike the corresponding code within _bt_getstackbuf().
2019-03-12 16:40:05 -07:00
Andres Freund 8cacea7a72 Ensure sufficient alignment for ParallelTableScanDescData in BTShared.
Previously ParallelTableScanDescData was just a member in BTShared,
but after c2fe139c2 that doesn't guarantee sufficient alignment as
specific AMs might (are likely to) need atomic variables in the
struct.

One might think that MAXALIGNing would be sufficient, but as a
comment in shm_toc_allocate() explains, that's not enough. For now,
copy the hack described there.

For parallel sequential scans no such change is needed, as its
allocations go through shm_toc_allocate().

An alternative approach would have been to allocate the parallel scan
descriptor in a separate TOC entry, but there seems little benefit in
doing so.

Per buildfarm member dromedary.

Author: Andres Freund
Discussion: https://postgr.es/m/20190311203126.ty5gbfz42gjbm6i6@alap3.anarazel.de
2019-03-11 14:26:43 -07:00
Andres Freund c2fe139c20 tableam: Add and use scan APIs.
To allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with a table_ version.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for other AMs. Instead there's
   table_scan_getnextslot().  But note that heap_getnext() lives on,
   it's still used widely to access catalog tables.

   This is achieved by new scan_begin, scan_end, scan_rescan,
   scan_getnextslot callbacks.

2) The portion of parallel scans that's shared between backends needs
   to be manageable without the user doing per-AM work. To achieve
   that new parallelscan_{estimate, initialize, reinitialize}
   callbacks are introduced, which operate on a new
   ParallelTableScanDesc, which again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block oriented callbacks that can be shared between such AMs are
   provided and used by heap. table_block_parallelscan_{estimate,
   initialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, init} for use in AMs. These
   operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{end, begin,
   reset} to create the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple.  Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching the
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls and working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore that seems cleaner.

To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AM's
   type. table_slot_callbacks() and table_slot_create() are based
   upon that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   also would have been needed to be adapted for
   table_slot_callbacks(), making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a few
   places now don't have a buffer + HeapTuple around, but a
   slot (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will so for the
   foreseeable future, the index API (see 1) above) now only deals with
   slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
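
As a rough usage sketch of the new scan API (per 1) above; details
assumed):

    TableScanDesc scan = table_beginscan(rel, snapshot, 0, NULL);
    TupleTableSlot *slot = table_slot_create(rel, NULL);

    while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
    {
        /* ... process slot, without assuming a HeapTuple ... */
    }

    table_endscan(scan);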

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 12:46:41 -07:00
Amit Kapila a6e48da088 Fix typos in commit 8586bf7ed8.
Author: Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1KNv1Mg2krf4E9ssWFnE=8A9mZ1VbVywXBZTFSzb+wP2g@mail.gmail.com
2019-03-11 09:58:46 -07:00
Alvaro Herrera af38498d4c Move hash_any prototype from access/hash.h to utils/hashutils.h
... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.

access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.

Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.

To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h.  (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)

Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
2019-03-11 13:17:50 -03:00
Michael Paquier f2d84a4a6b Adjust error message for partial writes in WAL segments
93473c6 removed openLogOff, changing along the way the error message
used to report partial writes to WAL segments.  The newly-introduced
error message used the offset up to which the write had happened, while
always keeping the same total length to write.  This changes the error
message so that the number of bytes left to write is reported.

Reported-by: Michael Paquier
Author: Robert Haas
Discussion: https://postgr.es/m/20190306235251.GA17293@paquier.xyz
2019-03-11 09:31:25 +09:00
Tom Lane caf626b2cd Convert [autovacuum_]vacuum_cost_delay into floating-point GUCs.
This change makes it possible to specify sub-millisecond delays,
which work well on most modern platforms, though that was not true
when the cost-delay feature was designed.

To support this without breaking existing configuration entries,
improve guc.c to allow floating-point GUCs to have units.  Also,
allow "us" (microseconds) as an input/output unit for time-unit GUCs.
(It's not allowed as a base unit, at least not yet.)

Likewise change the autovacuum_vacuum_cost_delay reloption to be
floating-point; this forces a catversion bump because the layout of
StdRdOptions changes.

This patch doesn't in itself change the default values or allowed
ranges for these parameters, and it should not affect the behavior
for any already-allowed setting for them.

Discussion: https://postgr.es/m/1798.1552165479@sss.pgh.pa.us
2019-03-10 15:01:39 -04:00
Alexander Korotkov f2e403803f Support for INCLUDE attributes in GiST indexes
Similarly to B-tree, the GiST index access method gets support for INCLUDE
attributes.  These attributes aren't used for tree navigation and aren't
present in non-leaf pages.  But they are present in leaf pages and can be
fetched during index-only scan.

The point of having INCLUDE attributes in GiST indexes is slightly different
from the point of having them in B-tree.  The main point of INCLUDE attributes
in B-tree is to define a UNIQUE constraint over part of the attributes
enabled for index-only scan.  In GiST the main point of INCLUDE
attributes is to use index-only scan for attributes whose data types
don't have GiST opclasses.

Discussion: https://postgr.es/m/73A1A452-AD5F-40D4-BD61-978622FF75C1%40yandex-team.ru
Author: Andrey Borodin, with small changes by me
Reviewed-by: Andreas Karlsson
2019-03-10 11:37:17 +03:00
Michael Paquier 82a5649fb9 Tighten use of OpenTransientFile and CloseTransientFile
This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary.  These have been reported by Joe Conway.
2) When opening transient files, it is up to the caller to close the
file descriptors opened.  In error code paths, CloseTransientFile() gets
called to clean up things before issuing an error.  However in normal
exit paths, a lot of callers of CloseTransientFile() never actually
reported errors, which could leave a file descriptor open without
knowing about it.  This is an issue I complained about a couple of
times, but never had the courage to write and submit a patch, so here we
go.
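
The normal-exit pattern then becomes something like this (error
message wording illustrative):

    int         fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);

    /* ... read from fd ... */

    if (CloseTransientFile(fd))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));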

Note that one frontend code path is impacted by this commit, so that an
error is issued when fetching control file data, making backend and
frontend be treated consistently.

Reported-by: Joe Conway, Michael Paquier
Author: Michael Paquier
Reviewed-by: Álvaro Herrera, Georgios Kokolatos, Joe Conway
Discussion: https://postgr.es/m/20190301023338.GD1348@paquier.xyz
Discussion: https://postgr.es/m/c49b69ec-e2f7-ff33-4f17-0eaa4f2cef27@joeconway.com
2019-03-09 08:50:55 +09:00
Andres Freund 8586bf7ed8 tableam: introduce table AM infrastructure.
This introduces the concept of table access methods, i.e. CREATE
  ACCESS METHOD ... TYPE TABLE and
  CREATE TABLE ... USING (storage-engine).
No table access functionality is delegated to table AMs as of this
commit; that'll be done in following commits.

Subsequent commits will incrementally abstract table access
functionality to be routed through table access methods. That change
is too large to be reviewed & committed at once, so it'll be done
incrementally.

Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and definitely is a lot more work,
without a lot of benefit.

Table access methods are specified similarly to index access methods,
i.e. pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs that struct needs to live as long
as a backend, typically that's achieved by just returning a pointer to
a constant struct.
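
For heap that looks roughly like:

    static const TableAmRoutine heapam_methods = {
        .type = T_TableAmRoutine,
        /* ... callbacks ... */
    };

    Datum
    heap_tableam_handler(PG_FUNCTION_ARGS)
    {
        /* constant struct, so the pointer stays valid as long as the
         * backend lives */
        PG_RETURN_POINTER(&heapam_methods);
    }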

Psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs.  It's quite possible that this behaviour
still needs to be fine-tuned.

For now it's not allowed to set a table AM for a partitioned table, as
we've not resolved how partitions would inherit that. Disallowing
allows us to introduce, if we decide that's the way forward, such a
behaviour without a compatibility break.

Catversion bumped, to add the heap table AM and references to it.

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
    https://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.de
    https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
2019-03-06 09:54:38 -08:00
Robert Haas 93473c6ac8 Removed unused variable, openLogOff.
Antonin Houska

Discussion: http://postgr.es/m/30413.1551870730@localhost
2019-03-06 09:44:08 -05:00
Heikki Linnakangas fe280694d0 Scan GiST indexes in physical order during VACUUM.
Scanning an index in physical order is faster than walking it in logical
order, because sequential I/O is faster than random I/O. The idea and code
structure is borrowed from B-tree vacuum code.

Patch by Andrey Borodin, with changes by me. Based on early work by
Konstantin Kuznetsov, although the patch has been rewritten multiple times
since his original version.

Discussion: https://www.postgresql.org/message-id/1B9FAC6F-FA19-4A24-8C1B-F4F574844892%40yandex-team.ru
2019-03-05 15:19:48 +02:00
Peter Geoghegan 35bc0ec7c8 Note case where nbtree VACUUM finishes splits.
The nbtree README claims that VACUUM can never finish interrupted page
splits by design.  That isn't entirely accurate, though.  Note an
exception to the general rule.

Discussion: https://postgr.es/m/CAH2-Wz=_Xvv8byzK_LvY4ci76OgsHCQzoKF7We8yG9waO7j6rA@mail.gmail.com
2019-03-04 17:57:36 -08:00
Peter Geoghegan 72c7c4e386 Correct obsolete nbtree page split WAL comment.
Commit 2c03216d83, which revamped the WAL record format, failed to
update a comment referencing the old API.  Update the comment.
2019-03-04 12:32:40 -08:00
Tom Lane 80b9e9c466 Improve performance of index-only scans with many index columns.
StoreIndexTuple was a loop over index_getattr, which is O(N^2)
if the index columns are variable-width, and the performance
impact is already quite visible at ten columns.  The obvious
move is to replace that with a call to index_deform_tuple ...
but that's *also* a loop over index_getattr.  Improve it to
be essentially a clone of heap_deform_tuple.

(There are a few other places that loop over all index columns
with index_getattr, and perhaps should be changed likewise,
but most of them don't seem performance-critical.  Anyway, the
rest would mostly only be interested in the index key columns,
which there aren't likely to be so many of.  Wide index tuples
are a new thing with INCLUDE.)

Konstantin Knizhnik

Discussion: https://postgr.es/m/e06b2d27-04fc-5c0e-bb8c-ecd72aa24959@postgrespro.ru
2019-03-03 16:57:14 -05:00
Amit Kapila 9c32e4c350 Clear the local map when not used.
After commit b0eaa4c51b, we use a local map of pages to find the required
space for small relations.  We do clear this map when we have found a block
with enough free space, when we extend the relation, or on transaction
abort so that it can be used next time.  However, we missed clearing it
when we didn't find any pages to try from the map, which leads to an
assertion failure when we later try to use it after relation extension.

In passing, I have improved some comments in this area.

Reported-by: Tom Lane based on buildfarm results
Author: Amit Kapila
Reviewed-by: John Naylor
Tested-by: Kuntal Ghosh
Discussion: https://postgr.es/m/32368.1551114120@sss.pgh.pa.us
2019-03-01 07:38:47 +05:30
Tom Lane c94fb8e8ac Standardize some more loops that chase down parallel lists.
We have forboth() and forthree() macros that simplify iterating
through several parallel lists, but not everyplace that could
reasonably use those was doing so.  Also invent forfour() and
forfive() macros to do the same for four or five parallel lists,
and use those where applicable.
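
For example (list names illustrative):

    ListCell   *lc1, *lc2, *lc3, *lc4;

    forfour(lc1, colnames, lc2, coltypes, lc3, coltypmods, lc4, colcollations)
    {
        /* all four cells advance in lockstep */
    }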

The immediate motivation for doing this is to reduce the number
of ad-hoc lnext() calls, to reduce the footprint of a WIP patch.
However, it seems like good cleanup and error-proofing anyway;
the places that were combining forthree() with a manually iterated
loop seem particularly illegible and bug-prone.

There was some speculation about restructuring related parsetree
representations to reduce the need for parallel list chasing of
this sort.  Perhaps that's a win, or perhaps not, but in any case
it would be considerably more invasive than this patch; and it's
not particularly related to my immediate goal of improving the
List infrastructure.  So I'll leave that question for another day.

Patch by me; thanks to David Rowley for review.

Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-02-28 14:25:01 -05:00
Peter Geoghegan 2ab23445bc Remove unneeded argument from _bt_getstackbuf().
_bt_getstackbuf() is called at exactly two points following commit
efada2b8e9 (one call site is concerned with page splits, while the
other is concerned with page deletion).  The parent buffer returned by
_bt_getstackbuf() is write-locked in both cases.  Remove the 'access'
argument and make _bt_getstackbuf() assume that callers require a
write-lock.
2019-02-25 17:47:43 -08:00
Peter Geoghegan 067786cea0 Correct obsolete nbtree page deletion comment.
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed _bt_getstackbuf() calls from _bt_pagedel().  It failed
to update a comment that referenced the earlier approach.  Update the
comment to explain that the _bt_getstackbuf() page deletion call site
mirrors the only other remaining _bt_getstackbuf() call site, which is
reached during page splits.
2019-02-25 16:54:18 -08:00
Michael Paquier effe7d9552 Make release of 2PC identifier and locks consistent in COMMIT PREPARED
When preparing a transaction in two-phase commit, a dummy PGPROC entry
holding the GID used for the transaction is registered, which gets
released once COMMIT PREPARED is run.  Prior to releasing its shared memory
state, all the locks taken in the prepared transaction are released
using a dedicated set of callbacks (pgstat and multixact having similar
callbacks), which may cause the locks to be released before the GID is
set free.

Hence, there is a small window where lock conflicts could happen, for
example:
- Transaction A releases its locks, still holding its GID in shared
memory.
- Transaction B acquires a lock which conflicted with the locks of
transaction A.
- Transaction B continues its processing, trying to reuse the same GID
as transaction A.
- Transaction B fails because of a conflicting GID, still in use by
transaction A.

This commit changes the shared memory state release so that post-commit
callbacks and predicate lock cleanup happen consistently with the shared
memory state cleanup for the dummy PGPROC entry.  The race window is
small and 2PC had this issue from the start, so no backpatch is done.
On top of that, the fixes discussed involved ABI breakages, which are not
welcome in stable branches.

Reported-by: Oleksii Kliukin, Ildar Musin
Diagnosed-by: Oleksii Kliukin, Ildar Musin
Author: Michael Paquier
Reviewed-by: Masahiko Sawada, Oleksii Kliukin
Discussion: https://postgr.es/m/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6@hintbits.com
2019-02-25 14:19:34 +09:00
Michael Paquier 4c23216002 Fix incorrect function reference in comment of twophase.c
The header block of TwoPhaseGetDummyBackendId incorrectly mentioned
TwoPhaseGetDummyProc.

Reported-by: Oleksii Kliukin
Discussion: https://postgr.es/m/D8336E40-BBE1-4954-98BB-7830D3F5CB36@hintbits.com
2019-02-23 08:40:01 +09:00
Michael Paquier 0dd6ff0ac8 Avoid some unnecessary block reads in WAL reader
When reading a new page internally and depending on the way the WAL
reader facility gets used by plugins, the current implementation of the
WAL reader may end up reading a block multiple times when it is not
actually necessary, as the requested data length may be equal to what
has already been read.  This can happen for any size, but is more likely to
happen at the end of a page.  This can cause performance penalties in
plugins which rely on the block reads to be purely sequential, zlib not
liking backward reads for example.  The new behavior also shaves some
cycles when doing recovery.

Author: Arthur Zakirov
Reviewed-by: Andrey Lepikhov, Michael Paquier, Grigory Smolkin
Discussion: https://postgr.es/m/2ddf4a32-517e-d6f4-d992-4a63b6035bfd@postgrespro.ru
2019-02-18 09:52:02 +09:00
Tom Lane 02a6a54ecd Make use of compiler builtins and/or assembly for CLZ, CTZ, POPCNT.
Test for the compiler builtins __builtin_clz, __builtin_ctz, and
__builtin_popcount, and make use of these in preference to
handwritten C code if they're available.  Create src/port
infrastructure for "leftmost one", "rightmost one", and "popcount"
so as to centralize these decisions.
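
The pattern is roughly this (names illustrative, not the exact
src/port ones):

    static inline int
    popcount32_sketch(uint32 word)
    {
    #ifdef HAVE__BUILTIN_POPCOUNT
        return __builtin_popcount(word);
    #else
        /* Kernighan's loop: clear the lowest set bit each iteration */
        int         result = 0;

        while (word != 0)
        {
            word &= word - 1;
            result++;
        }
        return result;
    #endif
    }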

On x86_64, __builtin_popcount generally won't make use of the POPCNT
opcode because that's not universally supported yet.  Provide code
that checks CPUID and then calls POPCNT via asm() if available.
This requires indirecting through a function pointer, which is
an annoying amount of overhead for a one-instruction operation,
but it's probably not worth working harder than this for our
current use-cases.

I'm not sure we've found all the existing places that could profit
from this new infrastructure; but we at least touched all the
ones that used copied-and-pasted versions of the bitmapset.c code,
and got rid of multiple copies of the associated constant arrays.

While at it, replace c-compiler.m4's one-per-builtin-function
macros with a single one that can handle all the cases we need
to worry about so far.  Also, because I'm paranoid, make those
checks into AC_LINK checks rather than just AC_COMPILE; the
former coding failed to verify that libgcc has support for the
builtin, in cases where it's not inline code.

David Rowley, Thomas Munro, Alvaro Herrera, Tom Lane

Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
2019-02-15 23:22:33 -05:00
Alvaro Herrera 457aef0f1f Revert attempts to use POPCNT etc instructions
This reverts commits fc6c72747a, 109de05cbb, d0b4663c23 and
711bab1e4d.

Somebody will have to try harder before submitting this patch again.
I've spent entirely too much time on it already, and the #ifdef maze yet
to be written in order for it to build at all got on my nerves.  The
amount of work needed to get a platform-specific performance improvement
that's barely above the noise level is not worth it.
2019-02-15 16:32:30 -03:00
Alvaro Herrera 711bab1e4d Add basic support for using the POPCNT and SSE4.2s LZCNT opcodes
These opcodes have been around in the AMD world since 2007, and 2008 in
the case of Intel.  They're supported in GCC and Clang via some __builtin
macros.  The opcodes may be unavailable during runtime, in which case we
fall back on a C-based implementation of the code.  In order to get the
POPCNT instruction we must pass the -mpopcnt option to the compiler.  We
do this only for the pg_bitutils.c file.

David Rowley (with fragments taken from a patch by Thomas Munro)

Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
2019-02-13 16:10:06 -03:00
Peter Eisentraut 37d9916020 More unconstify use
Replace casts whose only purpose is to cast away const with the
unconstify() macro.
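
That is (variable name illustrative):

    /* before: the cast silently discards constness */
    char       *p = (char *) some_const_string;

    /* after: only constness is cast away, and the intent is explicit */
    char       *p = unconstify(char *, some_const_string);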

Discussion: https://www.postgresql.org/message-id/flat/53a28052-f9f3-1808-fed9-460fd43035ab%402ndquadrant.com
2019-02-13 11:50:16 +01:00
Michael Paquier b7ec820559 Fix description of WAL record XLOG_PARAMETER_CHANGE
max_wal_senders and max_worker_processes got reversed in the output
generated because of ea92368.

Reported-by: Kevin Hale Boyes
Discussion: https://postgr.es/m/CADAecHVAD4=26KAx4nj5DBvxqqvJkuwsy+riiiNhQqwnZg2K8Q@mail.gmail.com
2019-02-12 13:10:59 +09:00