Commit Graph

637 Commits

Author SHA1 Message Date
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
David Rowley af527705ed Adjust design of per-worker parallel seqscan data struct
The design of the data structures which allow storage of the per-worker
memory during parallel seq scans were not ideal. The work done in
56788d215 required an additional data structure to allow workers to
remember the range of pages that had been allocated to them for
processing during a parallel seqscan.  That commit added a void pointer
field to TableScanDescData to allow heapam to store the per-worker
allocation information.  However putting the field there made very little
sense given that we have AM specific structs for that, e.g.
HeapScanDescData.

Here we remove the void pointer field from TableScanDescData and add a
dedicated field for this purpose to HeapScanDescData.

Previously we also allocated memory for this parallel per-worker data for
all scans, regardless if it was a parallel scan or not.  This was just a
wasted allocation for non-parallel scans, so here we make the allocation
conditional on the scan being parallel.

Also, add previously missing pfree() to free the per-worker data in
heap_endscan().

Reported-by: Andres Freund
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de
2021-03-30 10:17:09 +13:00
Michael Paquier a1999a01bb Sanitize the term "combo CID" in code comments
Combo CIDs were referred in the code comments using different terms
across various places of the code, so unify a bit the term used with
what is currently in use in some of the READMEs.

Author: "Hou, Zhijie"
Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
2021-03-25 16:08:03 +09:00
Peter Geoghegan fbe4cb3bd4 Fix comment about promising tuples.
Oversight in commit d168b66682, which added bottom-up index deletion.
2021-03-16 13:38:52 -07:00
David Rowley bb437f995d Add TID Range Scans to support efficient scanning ranges of TIDs
This adds a new executor node named TID Range Scan.  The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column.  These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.

To support this, two new optional callback functions have been added to
table AM.  scan_set_tidrange is used to set the scan range to just the
given range of TIDs.  scan_getnextslot_tidrange fetches the next tuple
in the given range.

For AMs were scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine.  The query planner won't
generate TID Range Scan Paths in that case.

Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
2021-02-27 22:59:36 +13:00
Alvaro Herrera 0f5505a881
Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls
Pavan Deolasee recently noted that a few of the
HeapTupleHeaderIndicatesMovedPartitions calls added by commit
5db6df0c01 are useless, since they are done after comparing t_self
with t_ctid.  But because t_self can never be set to the magical values
that indicate that the tuple moved partition, this can never succeed: if
the first test fails (so we know t_self equals t_ctid), necessarily the
second test will also fail.

So these checks can be removed and no harm is done.  There's no bug
here, just a code legibility issue.

Reported-by: Pavan Deolasee <pavan.deolasee@gmail.com>
Discussion: https://postgr.es/m/20200929164411.GA15497@alvherre.pgsql
2021-02-22 16:51:44 -03:00
David Rowley 16dfe253e3 Fix hypothetical bug in heap backward scans
Both heapgettup() and heapgettup_pagemode() incorrectly set the first page
to scan in a backward scan in which the number of pages to scan was
specified by heap_setscanlimits().  The code incorrectly started the scan
at the end of the relation when startBlk was 0, or otherwise at
startBlk - 1, neither of which is correct when only scanning a subset of
pages.

The fix here checks if heap_setscanlimits() has changed the number of
pages to scan and if so we set the first page to scan as the final page in
the specified range during backward scans.

Proper adjustment of this code was forgotten when heap_setscanlimits() was
added in 7516f5259 back in 9.5.  However, practice, nowhere in core code
performs backward scans after having used heap_setscanlimits(), yet, it is
possible an extension uses the heap functions in this way, hence
backpatch.

An upcoming patch does use heap_setscanlimits() with backward scans, so
this must be fixed before that can go in.

Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvpGc9h0_oVD2CtgBcxCS1N-qDYZSeBRnUh+0CWJA9cMaA@mail.gmail.com
Backpatch-through: 9.5, all supported versions
2021-01-25 19:52:18 +13:00
Tomas Vondra 39b66a91bd Fix COPY FREEZE with CLOBBER_CACHE_ALWAYS
This adds code omitted from commit 7db0cd2145 by accident, which had
two consequences. Firstly, only rows inserted by heap_multi_insert were
frozen as expected when running COPY FREEZE, while heap_insert left
rows unfrozen. That however includes rows in TOAST tables, so a lot of
data might have been left unfrozen. Secondly, page might have been left
partially empty after relcache invalidation.

This addresses both of those issues.

Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
2021-01-24 01:08:11 +01:00
Tomas Vondra 7db0cd2145 Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.

This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.

Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
             Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
             Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:28:26 +01:00
Peter Geoghegan d168b66682 Enhance nbtree index tuple deletion.
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
2021-01-13 09:21:32 -08:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Peter Geoghegan 32d6287d2e Get heap page max offset with buffer lock held.
On further reflection it seems better to call PageGetMaxOffsetNumber()
after acquiring a buffer lock on the page.  This shouldn't really
matter, but doing it this way is cleaner.

Follow-up to commit 42288174.

Backpatch: 12-, just like commit 42288174
2020-12-30 17:21:42 -08:00
Peter Geoghegan 4228817449 Fix index deletion latestRemovedXid bug.
The logic for determining the latest removed XID for the purposes of
generating recovery conflicts in REDO routines was subtly broken.  It
failed to follow links from HOT chains, and so failed to consider all
relevant heap tuple headers in some cases.

To fix, expand the loop that deals with LP_REDIRECT line pointers to
also deal with HOT chains.  The new version of the loop is loosely based
on a similar loop from heap_prune_chain().

The impact of this bug is probably quite limited, since the horizon code
necessarily deals with heap tuples that are pointed to by LP_DEAD-set
index tuples.  The process of setting LP_DEAD index tuples (e.g. within
the kill_prior_tuple mechanism) is highly correlated with opportunistic
pruning of pointed-to heap tuples.  Plus the question of generating a
recovery conflict usually comes up some time after index tuple LP_DEAD
bits were initially set, unlike heap pruning, where a latestRemovedXid
is generated at the point of the pruning operation (heap pruning has no
deferred "would-be page split" style processing that produces conflicts
lazily).

Only backpatch to Postgres 12, the first version where this logic runs
during original execution (following commit 558a9165e0).  The index
latestRemovedXid mechanism has had the same bug since it first appeared
over 10 years ago (in commit a760893d), but backpatching to all
supported versions now seems like a bad idea on balance.  Running the
new improved code during recovery seems risky, especially given the lack
of complaints from the field.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=Eib393+HHcERK_9MtgNS7Ew1HY=RDC_g6GL46zM5C6Q@mail.gmail.com
Backpatch: 12-
2020-12-30 16:29:05 -08:00
Michael Paquier 947789f1f5 Avoid using tuple from syscache for update of pg_database.datfrozenxid
pg_database.datfrozenxid gets updated using an in-place update at the
end of vacuum or autovacuum.  Since 96cdeae, as pg_database has a toast
relation, it is possible for a pg_database tuple to have toast values
if there is a large set of ACLs in place.  In such a case, the in-place
update would fail because of the flattening of the toast values done for
the catcache entry fetched.  Instead of using a copy from the catcache,
this changes the logic to fetch the copy of the tuple by directly
scanning pg_database.

Per the lack of complaints on the matter, no backpatch is done.  Note
that before 96cdeae, attempting to insert such a tuple to pg_database
would cause a "row is too big" error, so the end-of-vacuum problem was
not reachable.

Author: Ashwin Agrawal, Junfeng Yang
Discussion: https://postgr.es/m/DM5PR0501MB38800D9E4605BCA72DD35557CCE10@DM5PR0501MB3880.namprd05.prod.outlook.com
2020-12-08 12:13:19 +09:00
Peter Geoghegan 180cf876d4 Remove ineffective heapam CHECK_FOR_INTERRUPTS().
Remove a CHECK_FOR_INTERRUPTS() call that could never actually handle an
interrupt.  We always have a heap page buffer lock at this point.
Having a useless CHECK_FOR_INTERRUPTS() call is harmless but misleading.

It is probably possible to work around the immediate problem by moving
the CHECK_FOR_INTERRUPTS() to before the heap page buffer lock is
acquired.  That isn't enough to make the function responsive to
interrupts, though.  The index AM caller will still hold an exclusive
buffer lock of its own.
2020-11-09 09:00:12 -08:00
Alvaro Herrera 93f84d59f8 Revert "Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls"
This reverts commit 85adb5e91e.  It was not intended for commit just
yet.
2020-10-15 15:16:11 -03:00
Alvaro Herrera 85adb5e91e Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls
Pavan Deolasee recently noted that a few of the
HeapTupleHeaderIndicatesMovedPartitions calls added by commit
5db6df0c01 are useless, since they are done after comparing t_self
with t_ctid.  But because t_self can never be set to the magical values
that indicate that the tuple moved partition, this can never succeed: if
the first test fails (so we know t_self equals t_ctid), necessarily the
second test will also fail.

So these checks can be removed and no harm is done.

Discussion: https://postgr.es/m/20200929164411.GA15497@alvherre.pgsql
2020-10-15 14:32:34 -03:00
Amit Kapila 0d32511eca Fix comments in heapam.c.
After commits 85f6b49c2c and 3ba59ccc89, we can allow parallel inserts
which was earlier not possible as parallel group members won't conflict
for relation extension and page lock.  In those commits, we forgot to
update comments at few places.

Author: Amit Kapila
Reviewed-by: Robert Haas and Dilip Kumar
Backpatch-through: 13
Discussion: https://postgr.es/m/CAFiTN-tMrQh5FFMPx5aWJ+1gi1H6JxktEhq5mDwCHgnEO5oBkA@mail.gmail.com
2020-09-18 09:50:44 +05:30
Andres Freund c62a0a49f3 Revert "Make vacuum a bit more verbose to debug BF failure."
This reverts commit 49967da65a.

Enough time has passed that we can be confident that 07f32fcd23
resolved the issue. Therefore we can remove the temporary debugging
aids.

Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/E1k7tGP-0005V0-5k@gemulon.postgresql.org
2020-08-20 12:59:00 -07:00
Andres Freund 49967da65a Make vacuum a bit more verbose to debug BF failure.
This is temporary. While possibly some more error checking / debugging
in this path would be a good thing, it'll not look exactly like this.

Discussion: https://postgr.es/m/20200816181604.l54m6kss5ntd6xow@alap3.anarazel.de
2020-08-16 12:57:01 -07:00
Noah Misch 676a9c3cc4 Correct several behavior descriptions in comments.
Reuse cautionary language from src/test/ssl/README in
src/test/kerberos/README.  SLRUs have had access to six-character
segments names since commit 73c986adde,
and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in
commit 558a9165e0.  The other corrections
are more self-evident.
2020-08-15 20:21:52 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Amit Kapila 7259736a6e Implement streaming mode in ReorderBuffer.
Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, sometimes if we have incomplete toast or speculative insert we
spill to the disk because we can't generate the complete tuple and stream.
And, as soon as we get the complete tuple we stream the transaction
including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such a sqlerrcode aborts the decoding of the current
transaction and continue with the decoding of other transactions.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-08-08 07:47:06 +05:30
Michael Paquier e3931d01f3 Use multi-inserts for pg_attribute and pg_shdepend
For pg_attribute, this allows to insert at once a full set of attributes
for a relation (roughly 15% of WAL reduction in extreme cases).  For
pg_shdepend, this reduces the work done when creating new shared
dependencies from a database template.  The number of slots used for the
insertion is capped at 64kB of data inserted for both, depending on the
number of items to insert and the length of the rows involved.

More can be done for other catalogs, like pg_depend.  This part requires
a different approach as the number of slots to use depends also on the
number of entries discarded as pinned dependencies.  This is also
related to the rework or dependency handling for ALTER TABLE and CREATE
TABLE, mainly.

Author: Daniel Gustafsson
Reviewed-by: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
2020-07-31 10:54:26 +09:00
Thomas Munro cb04ad4985 Move syncscan.c to src/backend/access/common.
Since the tableam.c code needs to make use of the syncscan.c routines
itself, and since other block-oriented AMs might also want to use it one
day, it didn't make sense for it to live under src/backend/access/heap.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGLCnG%3DNEAByg6bk%2BCT9JZD97Y%3DAxKhh27Su9FeGWOKvDg%40mail.gmail.com
2020-07-29 16:59:33 +12:00
David Rowley 56788d2156 Allocate consecutive blocks during parallel seqscans
Previously we would allocate blocks to parallel workers during a parallel
sequential scan 1 block at a time.  Since other workers were likely to
request a block before a worker returns for another block number to work
on, this could lead to non-sequential I/O patterns in each worker which
could cause the operating system's readahead to perform poorly or not at
all.

Here we change things so that we allocate consecutive "chunks" of blocks
to workers and have them work on those until they're done, at which time
we allocate another chunk for the worker.  The size of these chunks is
based on the size of the relation.

Initial patch here was by Thomas Munro which showed some good improvements
just having a fixed chunk size of 64 blocks with a simple ramp-down near
the end of the scan. The revisions of the patch to make the chunk size
based on the relation size and the adjusted ramp-down in powers of two was
done by me, along with quite extensive benchmarking to determine the
optimal chunk sizes.

For the most part, benchmarks have shown significant performance
improvements for large parallel sequential scans on Linux, FreeBSD and
Windows using SSDs.  It's less clear how this affects the performance of
cloud providers.  Tests done so far are unable to obtain stable enough
performance to provide meaningful benchmark results.  It is possible that
this could cause some performance regressions on more obscure filesystems,
so we may need to later provide users with some ability to get something
closer to the old behavior.  For now, let's leave that until we see that
it's really required.

Author: Thomas Munro, David Rowley
Reviewed-by: Ranier Vilela, Soumyadeep Chakraborty, Robert Haas
Reviewed-by: Amit Kapila, Kirk Jamison
Discussion: https://postgr.es/m/CA+hUKGJ_EErDv41YycXcbMbCBkztA34+z1ts9VQH+ACRuvpxig@mail.gmail.com
2020-07-26 21:02:45 +12:00
Michael Paquier 9168793d72 Fix comments related to table AMs
Incorrect function names were referenced.  As this fixes some portions
of tableam.h, that is mentioned in the docs as something to look at when
implementing a table AM, backpatch down to 12 where this has been
introduced.

Author: Hironobu Suzuki
Discussion: https://postgr.es/m/8fe6d672-28dd-3f1d-7aed-ac2f6d599d3f@interdb.jp
Backpatch-through: 12
2020-07-14 13:17:11 +09:00
Andres Freund 5e7bbb5286 code: replace 'master' with 'primary' where appropriate.
Also changed "in the primary" to "on the primary", and added a few
"the" before "primary".

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 12:57:23 -07:00
Thomas Munro 7aa4fb5925 Improve comments for [Heap]CheckForSerializableConflictOut().
Rewrite the documentation of these functions, in light of recent bug fix
commit 5940ffb2.

Back-patch to 13 where the check-for-conflict-out code was split up into
AM-specific and generic parts, and new documentation was added that now
looked wrong.

Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io
2020-06-12 10:55:38 +12:00
Peter Geoghegan 5940ffb221 Avoid update conflict out serialization anomalies.
SSI's HeapCheckForSerializableConflictOut() test failed to correctly
handle conditions involving a concurrently inserted tuple which is later
concurrently updated by a separate transaction .  A SELECT statement
that called HeapCheckForSerializableConflictOut() could end up using the
same XID (updater's XID) for both the original tuple, and the successor
tuple, missing the XID of the xact that created the original tuple
entirely.  This only happened when neither tuple from the chain was
visible to the transaction's MVCC snapshot.

The observable symptoms of this bug were subtle.  A pair of transactions
could commit, with the later transaction failing to observe the effects
of the earlier transaction (because of the confusion created by the
update to the non-visible row).  This bug dates all the way back to
commit dafaa3ef, which added SSI.

To fix, make sure that we check the xmin of concurrently inserted tuples
that happen to also have been updated concurrently.

Author: Peter Geoghegan
Reported-By: Kyle Kingsbury
Reviewed-By: Thomas Munro
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443@jepsen.io
Backpatch: All supported versions
2020-06-11 10:09:47 -07:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Peter Eisentraut aaf069aa34 Remove HEAPDEBUGALL
This has been broken since PostgreSQL 12 and was probably never really
used.  PostgreSQL 12 added an analogous HEAPAMSLOTDEBUGALL, which
still works right now, but it's also not very useful, so remove that
as well.

Discussion: https://www.postgresql.org/message-id/flat/645c0646-4218-d4c3-409a-a7003a0c108d%402ndquadrant.com
2020-04-22 08:35:33 +02:00
Andres Freund f946069e68 Use TransactionXmin instead of RecentGlobalXmin in heap_abort_speculative().
There's a very low risk that RecentGlobalXmin could be far enough in
the past to be older than relfrozenxid, or even wrapped
around. Luckily the consequences of that having happened wouldn't be
too bad - the page wouldn't be pruned for a while.

Avoid that risk by using TransactionXmin instead. As that's announced
via MyPgXact->xmin, it is protected against wrapping around (see code
comments for details around relfrozenxid).

Author: Andres Freund
Discussion: https://postgr.es/m/20200328213023.s4eyijhdosuc4vcj@alap3.anarazel.de
Backpatch: 9.5-
2020-04-05 17:47:30 -07:00
Noah Misch c6b92041d3 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 12:25:34 -07:00
Noah Misch de9396326e Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac2.  Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64.  Back-patch to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-22 09:24:09 -07:00
Noah Misch cb2fd7eac2 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21 09:38:26 -07:00
Thomas Munro fc34b0d9de Introduce a maintenance_io_concurrency setting.
Introduce a GUC and a tablespace option to control I/O prefetching, much
like effective_io_concurrency, but for work that is done on behalf of
many client sessions.

Use the new setting in heapam.c instead of the hard-coded formula
effective_io_concurrency + 10 introduced by commit 558a9165e0.  Go with
a default value of 10 for now, because it's a round number pretty close
to the value used for that existing case.

Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
2020-03-16 17:14:26 +13:00
Michael Paquier 7d672b76bf Issue properly WAL record for CID of first catalog tuple in multi-insert
Multi-insert for heap is not yet used actively for catalogs, but the
code to support this case is in place for logical decoding.  The
existing code forgot to issue a XLOG_HEAP2_NEW_CID record for the first
tuple inserted, leading to failures when attempting to use multiple
inserts for catalogs at decoding time.  This commit fixes the problem by
WAL-logging the needed CID.

This is not an active bug, so no back-patch is done.

Author: Daniel Gustafsson
Discussion: https://postgr.es/m/E0D4CC67-A1CF-4DF4-991D-B3AC2EB5FAE9@yesql.se
2020-02-25 07:55:22 +09:00
Alvaro Herrera c9d2977519 Clean up newlines following left parentheses
We used to strategically place newlines after some function call left
parentheses to make pgindent move the argument list a few chars to the
left, so that the whole line would fit under 80 chars.  However,
pgindent no longer does that, so the newlines just made the code
vertically longer for no reason.  Remove those newlines, and reflow some
of those lines for some extra naturality.

Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
2020-01-30 13:42:14 -03:00
Thomas Munro 11da6bccd1 Fix compile error on HP C.
Per build farm animal anole, after commit 6f38d4dac3.
2020-01-28 20:30:40 +13:00
Thomas Munro 6f38d4dac3 Remove dependency on HeapTuple from predicate locking functions.
The following changes make the predicate locking functions more
generic and suitable for use by future access methods:

- PredicateLockTuple() is renamed to PredicateLockTID().  It takes
  ItemPointer and inserting transaction ID instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer.

- CheckForSerializableConflictOut() no longer takes HeapTuple or buffer.

Author: Ashwin Agrawal
Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2020-01-28 13:13:04 +13:00
Michael Paquier 7689d907bb Fix comment in heapam.c
Improvement per suggestion from Tom Lane.

Author: Daniel Gustafsson
Discussion: https://postgr.es/m/FED18699-4270-4778-8DA8-10F119A5ECF3@yesql.se
2020-01-13 17:57:38 +09:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Michael Paquier 7854e07f25 Revert "Rename files and headers related to index AM"
This follows multiple complains from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug more before actually
happening, if it happens.

Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
2019-12-27 08:09:00 +09:00
Michael Paquier 8ce3aa9b59 Rename files and headers related to index AM
The following renaming is done so as source files related to index
access methods are more consistent with table access methods (the
original names used for index AMs ware too generic, and could be
confused as including features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c.  Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.

This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.

Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
2019-12-25 10:23:39 +09:00
Robert Haas 2e8b6bfa90 Rename some toasting functions based on whether they are heap-specific.
The old names for the attribute-detoasting functions names included
the word "heap," which seems outdated now that the heap is only one of
potentially many table access methods.

On the other hand, toast_insert_or_update and toast_delete are
heap-specific, so rename them by adding "heap_" as a prefix.

Not all of the work of making the TOAST system fully accessible to AMs
other than the heap is done yet, but there seems to be little harm in
getting this renaming out of the way now. Commit
8b94dab066 already divided up the
functions among various files partially according to whether it was
intended that they should be heap-specific or AM-agnostic, so this is
just clarifying the division contemplated by that commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-10-04 14:24:46 -04:00
Robert Haas 8b94dab066 Split tuptoaster.c into three separate files.
detoast.c/h contain functions required to detoast a datum, partially
or completely, plus a few other utility functions for examining the
size of toasted datums.

toast_internals.c/h contain functions that are used internally to the
TOAST subsystem but which (mostly) do not need to be accessed from
outside.

heaptoast.c/h contains code that is intrinsically specific to the
heap AM, either because it operates on HeapTuples or is based on the
layout of a heap page.

detoast.c and toast_internals.c are placed in
src/backend/access/common rather than src/backend/access/heap.  At
present, both files still have dependencies on the heap, but that will
be improved in a future commit.

Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro,
Andres Freund, and Álvaro Herrera.

Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
2019-09-05 13:15:10 -04:00
Tom Lane f63a5ead9d Avoid touching replica identity index in ExtractReplicaIdentity().
In what seems like a fit of misplaced optimization,
ExtractReplicaIdentity() accessed the relation's replica-identity
index without taking any lock on it.  Usually, the surrounding query
already holds some lock so this is safe enough ... but in the case
of a previously-planned delete, there might be no existing lock.
Given a suitable test case, this is exposed in v12 and HEAD by an
assertion added by commit b04aeb0a0.

The whole thing's rather poorly thought out anyway; rather than
looking directly at the index, we should use the index-attributes
bitmap that's held by the parent table's relcache entry, as the
caller functions do.  This is more consistent and likely a bit
faster, since it avoids a cache lookup.  Hence, change to doing it
that way.

While at it, rather than blithely assuming that the identity
columns are non-null (with catastrophic results if that's wrong),
add assertion checks that they aren't null.  Possibly those should
be actual test-and-elog, but I'll leave it like this for now.

In principle, this is a bug that's been there since this code was
introduced (in 9.4).  In practice, the risk seems quite low, since
we do have a lock on the index's parent table, so concurrent
changes to the index's catalog entries seem unlikely.  Given the
precedent that commit 9c703c169 wasn't back-patched, I won't risk
back-patching this further than v12.

Per report from Hadi Moshayedi.

Discussion: https://postgr.es/m/CAK=1=Wrek44Ese1V7LjKiQS-Nd-5LgLi_5_CskGbpggKEf3tKQ@mail.gmail.com
2019-09-02 16:10:37 -04:00
Alvaro Herrera f8cf524da1 Fix bogus comment
Author: Alexander Lakhin
Discussion: https://postgr.es/m/20190819072244.GE18166@paquier.xyz
2019-08-20 16:04:09 -04:00
Michael Paquier 66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Heikki Linnakangas 1169fcf129 Fix predicate-locking of HOT updated rows.
In serializable mode, heap_hot_search_buffer() incorrectly acquired a
predicate lock on the root tuple, not the returned tuple that satisfied
the visibility checks. As explained in README-SSI, the predicate lock does
not need to be copied or extended to other tuple versions, but for that to
work, the correct, visible, tuple version must be locked in the first
place.

The original SSI commit had this bug in it, but it was fixed back in 2013,
in commit 81fbbfe335. But unfortunately, it was reintroduced a few months
later in commit b89e151054. Wising up from that, add a regression test
to cover this, so that it doesn't get reintroduced again. Also, move the
code that sets 't_self', so that it happens at the same time that the
other HeapTuple fields are set, to make it more clear that all the code in
the loop operate on the "current" tuple in the chain, not the root tuple.

Bug spotted by Andres Freund, analysis and original fix by Thomas Munro,
test case and some additional changes to the fix by Heikki Linnakangas.
Backpatch to all supported versions (9.4).

Discussion: https://www.postgresql.org/message-id/20190731210630.nqhszuktygwftjty%40alap3.anarazel.de
2019-08-07 12:40:49 +03:00
Michael Paquier 8548ddc61b Fix inconsistencies and typos in the tree, take 9
This addresses more issues with code comments, variable names and
unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com
2019-08-05 12:14:58 +09:00
Michael Paquier 23bccc823d Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues with code
comments, variable names and unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/dff75442-2468-f74f-568c-6006e141062f@gmail.com
2019-07-22 10:01:50 +09:00
Alvaro Herrera 8b21b416ed Avoid spurious deadlocks when upgrading a tuple lock
This puts back reverted commit de87a084c0, with some bug fixes.

When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for T1
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for T1
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once T1 releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after T1 finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
Discussion: https://postgr.es/m/2815.1560521451@sss.pgh.pa.us
2019-06-18 18:23:16 -04:00
Alvaro Herrera 9d20b0ec8f Revert "Avoid spurious deadlocks when upgrading a tuple lock"
This reverts commits 3da73d6839 and de87a084c0.

This code has some tricky corner cases that I'm not sure are correct and
not properly tested anyway, so I'm reverting the whole thing for next
week's releases (reintroducing the deadlock bug that we set to fix).
I'll try again afterwards.

Discussion: https://postgr.es/m/E1hbXKQ-0003g1-0C@gemulon.postgresql.org
2019-06-16 22:24:21 -04:00
Alvaro Herrera 3da73d6839 Silence compiler warning
Introduced in de87a084c0.
2019-06-14 11:33:40 -04:00
Alvaro Herrera de87a084c0 Avoid spurious deadlocks when upgrading a tuple lock
When two (or more) transactions are waiting for transaction T1 to release a
tuple-level lock, and transaction T1 upgrades its lock to a higher level, a
spurious deadlock can be reported among the waiting transactions when T1
finishes.  The simplest example case seems to be:

T1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for X
Z: select id from job where name = 'a' for key share;
T1: update job set name = 'b' where id = 1;
Z: update job set name = 'c' where id = 1; -- starts waiting for X
T1: rollback;

At this point, transaction Y is rolled back on account of a deadlock: Y
holds the heavyweight tuple lock and is waiting for the Xmax to be released,
while Z holds part of the multixact and tries to acquire the heavyweight
lock (per protocol) and goes to sleep; once X releases its part of the
multixact, Z is awakened only to be put back to sleep on the heavyweight
lock that Y is holding while sleeping.  Kaboom.

This can be avoided by having Z skip the heavyweight lock acquisition.  As
far as I can see, the biggest downside is that if there are multiple Z
transactions, the order in which they resume after X finishes is not
guaranteed.

Backpatch to 9.6.  The patch applies cleanly on 9.5, but the new tests don't
work there (because isolationtester is not smart enough), so I'm not going
to risk it.

Author: Oleksii Kliukin
Discussion: https://postgr.es/m/B9C9D7CD-EB94-4635-91B6-E558ACEC0EC3@hintbits.com
2019-06-13 17:28:24 -04:00
Amit Kapila 92c4abc736 Fix assorted inconsistencies.
There were a number of issues in the recent commits which include typos,
code and comments mismatch, leftover function declarations.  Fix them.

Reported-by: Alexander Lakhin
Author: Alexander Lakhin, Amit Kapila and Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
2019-06-08 08:16:38 +05:30
Alvaro Herrera e8bdea58f9 Fix message style
Mark one message not for translation, and prefer "cannot" over "may
not", per commentary from Robert Haas.

Discussion: https://postgr.es/m/20190430145813.GA29872@alvherre.pgsql
2019-06-06 12:57:57 -04:00
Andres Freund 73b8c3bd28 tableam: Rename wrapper functions to match callback names.
Some of the wrapper functions didn't match the callback names. Many of
them due to staying "consistent" with historic naming of the wrapped
functionality. We decided that for most cases it's more important to
be for tableam to be consistent going forward, than with the past.

The one exception is beginscan/endscan/...  because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).

Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
2019-05-23 16:32:36 -07:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Andres Freund c3b23ae457 Don't to predicate lock for analyze scans, refactor scan option passing.
Before this commit, when ANALYZE was run on a table and serializable
was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION
LEVEL SERIALIZABLE, or default_transaction_isolation being set to
serializable) a null pointer dereference lead to a crash.

The analyze scan doesn't need a snapshot (nor predicate locking), but
before this commit a scan only contained information about being a
bitmap or sample scan.

Refactor the option passing to the scan_begin callback to use a
bitmask instead. Alternatively we could have added a new boolean
parameter, but that seems harder to read. Even before this issue
various people (Heikki, Tom, Robert) suggested doing so.

These changes don't change the scan APIs outside of tableam. The flags
argument could be exposed, it's not necessary to fix this
problem. Also the wrapper table_beginscan* functions encapsulate most
of that complexity.

After these changes fixing the bug is trivial, just don't acquire
predicate lock for analyze style scans. That was already done for
bitmap heap scans.  Add an assert that a snapshot is passed when
acquiring the predicate lock, so this kind of bug doesn't require
running with serializable.

Also add a comment about sample scans currently requiring predicate
locking the entire relation, that previously wasn't remarked upon.

Reported-By: Joe Wildish
Author: Andres Freund
Discussion:
    https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx
    https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
    https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de
2019-05-19 15:10:28 -07:00
Andres Freund 147e3722f7 tableam: Avoid relying on relation size to determine validity of tids.
Instead add a tableam callback to do so. To avoid adding per
validation overhead, pass a scan to tuple_tid_valid. In heap's case
we'd otherwise incurred a RelationGetNumberOfBlocks() call for each
tid - which'd have added noticable overhead to nodeTidscan.c.

Author: Andres Freund
Reviewed-By: Ashwin Agrawal
Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
2019-05-17 18:56:55 -07:00
Heikki Linnakangas e95d550bbb Fix comment on when HOT update is possible.
The conditions listed in this comment have changed several times, and at
some point the thing that the "if so" referred to was negated.

The text was OK up to 9.6. It was differently wrong in v10, v11 and
master, so fix in all those versions.
2019-05-14 13:06:28 +03:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Alvaro Herrera 2bf372a4ae heap_prepare_freeze_tuple: Simplify coding
Commit d2599ecfcc introduced some contorted, confused code around:
readers would think that it's possible for HeapTupleHeaderGetXmin return
a non-frozen value for some frozen tuples, which would be disastrous.
There's no actual bug, but it seems better to make it clearer.

Per gripe from Tom Lane and Andres Freund.
Discussion: https://postgr.es/m/30116.1555430496@sss.pgh.pa.us
2019-05-02 16:14:08 -04:00
Andres Freund 86b85044e8 tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.

A simple conversion of COPY FROM use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there's constant partition changes is measurable.

Thus instead revamp the COPY FROM buffering for partitioned tables to
allow to buffer inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actuall speed such cases
up significantly.

By using slots COPY of very narrow rows into unlogged / temporary
might slow down very slightly (due to the indirect function calls).

Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
2019-04-04 16:28:18 -07:00
Thomas Munro 4fd05bb55b Fix deadlock in heap_compute_xid_horizon_for_tuples().
We can't call code that uses syscache while we hold buffer locks
on a catalog relation.  If passed such a relation, just fall back
to the general effective_io_concurrency GUC rather than trying to
look up the containing tablespace's IO concurrency setting.

We might find a better way to control prefetching in follow-up
work, but for now this is enough to avoid the deadlock introduced
by commit 558a9165e0.

Reviewed-by: Andres Freund
Diagnosed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGLCwPF0S4Mk7S8qw%2BDK0Bq65LueN9rofAA3HHSYikW-Zw%40mail.gmail.com
Discussion: https://postgr.es/m/962831d8-c18d-180d-75fb-8b842e3a2742%40chrullrich.net
2019-04-02 09:29:49 +13:00
Tom Lane a51cc7e9e6 Suppress uninitialized-variable warning.
Apparently Andres' compiler is smart enough to see that hpage
must be initialized before use ... but mine isn't.
2019-03-27 11:10:42 -04:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to looks like. That was
   easy enough before the introducing of tableam (we knew it had to be
   heap, although some trickery around logging the heap relfilenodes
   was required). But to properly handle table AMs we need
   per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old. But
   computing the horizon by visiting the heap only works once
   consistency is reached, but we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00
Andres Freund 2ac1b2b175 Remove heap_hot_search().
After 71bdc99d0d, "tableam: Add helper for indexes to check if a
corresponding table tuples exist." there's no in-core user left. As
there's unlikely to be an external user, and such an external user
could easily be adjusted to use table_index_fetch_tuple_check(),
remove heap_hot_search().

Per complaint from Peter Geoghegan

Author: Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzn0Oq4ftJrTqRAsWy2WGjv0QrJcwoZ+yqWsF_Z5vjUBFw@mail.gmail.com
2019-03-25 19:04:41 -07:00
Andres Freund 9a8ee1dc65 tableam: Add and use table_fetch_row_version().
This is essentially the tableam version of heapam_fetch(),
i.e. fetching a tuple identified by a tid, performing visibility
checks.

Note that this different from table_index_fetch_tuple(), which is for
index lookups. It therefore has to handle a tid pointing to an earlier
version of a tuple if the AM uses an optimization like heap's HOT. Add
comments to that end.

This commit removes the stats_relation argument from heap_fetch, as
it's been unused for a long time.

Author: Andres Freund
Reviewed-By: Haribabu Kommi
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 00:17:59 -07:00
Andres Freund 5db6df0c01 tableam: Add tuple_{insert, delete, update, lock} and use.
This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call-sites and the EPQ mechanism had
to have awareness of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that find the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one.  It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch().  Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split into two,
to avoid callers needing AM specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possible slightly less helpful for users, it seems
like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and
its members been expanded to differentiated between updating and
deleting. HeapUpdateFailureData has been renamed to TM_FailureData.

The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
succeeded = false).

Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.

Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-23 19:55:57 -07:00
Andres Freund c2fe139c20 tableam: Add and use scan APIs.
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with a table_ version.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for a other AMs. Instead there's
   table_scan_getnextslot().  But note that heap_getnext() lives on,
   it's still used widely to access catalog tables.

   This is achieved by new scan_begin, scan_end, scan_rescan,
   scan_getnextslot callbacks.

2) The portion of parallel scans that's shared between backends need
   to be able to do so without the user doing per-AM work. To achieve
   that new parallelscan_{estimate, initialize, reinitialize}
   callbacks are introduced, which operate on a new
   ParallelTableScanDesc, which again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block oriented callbacks that can be shared between such AMs are
   provided and used by heap. table_block_parallelscan_{estimate,
   intiialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, init} for use in AMs. These
   operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{end, begin,
   reset} to create the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple.  Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching the
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls and working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore that seems cleaner.

To be able to sensible adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AMs
   type. table_slot_callbacks() and table_slot_create() are based
   upon that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   also would have been needed to be adapted for
   table_slot_callbacks(), making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a few
   places now don't have a buffer + HeapTuple around, but a
   slot (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will so for the
   foreseeable future, the index API (see 1) above) now only deals with
   slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 12:46:41 -07:00
Michael Paquier eb8c9f0bc3 Fix use of dangling pointer in heap_delete() when logging replica identity
When logging the replica identity of a deleted tuple, XLOG_HEAP_DELETE
records include references of the old tuple.  Its data is stored in an
intermediate variable used to register this information for the WAL
record, but this variable gets away from the stack when the record gets
actually inserted.

Spotted by clang's AddressSanitizer.

Author: Stas Kelvish
Discussion: https://postgr.es/m/085C8825-AD86-4E93-AF80-E26CDF03D1EA@postgrespro.ru
Backpatch-through: 9.4
2019-02-01 10:35:16 +09:00
Andres Freund ebcc7bf949 Rephrase references to "time qualification".
Now that the relevant code has, for other reasons, moved out of
tqual.[ch], it seems time to refer to visiblity rather than time
qualification.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 17:07:10 -08:00
Andres Freund c91560defc Move remaining code from tqual.[ch] to heapam.h / heapam_visibility.c.
Given these routines are heap specific, and that there will be more
generic visibility support in via table AM, it makes sense to move the
prototypes to heapam.h (routines like HeapTupleSatisfiesVacuum will
not be exposed in a generic fashion, because they are too storage
specific).

Similarly, the code in tqual.c is specific to heap, so moving it into
access/heap/ makes sense.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 17:07:10 -08:00
Andres Freund e0c4ec0728 Replace uses of heap_open et al with the corresponding table_* function.
Author: Andres Freund
Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de
2019-01-21 10:51:37 -08:00
Andres Freund 4b21acf522 Introduce access/{table.h, relation.h}, for generic functions from heapam.h.
access/heapam contains functions that are very storage specific (say
heap_insert() and a lot of lower level functions), and fairly generic
infrastructure like relation_open(), heap_open() etc.  In the upcoming
pluggable storage work we're introducing a layer between table
accesses in general and heapam, to allow for different storage
methods. For a bit cleaner separation it thus seems advantageous to
move generic functions like the aforementioned to their own headers.

access/relation.h will contain relation_open() etc, and access/table.h
will contain table_open() (formerly known as heap_open()). I've decided
for table.h not to include relation.h, but we might change that at a
later stage.

relation.h already exists in another directory, but the other
plausible name (rel.h) also conflicts. It'd be nice if there were a
non-conflicting name, but nobody came up with a suggestion. It's
possible that the appropriate way to address the naming conflict would
be to rename nodes/relation.h, which isn't particularly well named.

To avoid breaking a lot of extensions that just use heap_open() etc,
table.h has macros mapping the old names to the new ones, and heapam.h
includes relation, table.h.  That also allows to keep the
bulk renaming of existing callers in a separate commit.

Author: Andres Freund
Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de
2019-01-21 10:51:36 -08:00
Andres Freund 93507e67c9 Adjust some more comments for WITH OIDS removal.
I missed these in 578b229718.

Author: Andres Freund
2019-01-21 09:13:43 -08:00
Andres Freund 90525d7b4e Don't duplicate parallel seqscan shmem sizing logic in nbtree.
This is architecturally mildly problematic, which becomes more
pronounced with the upcoming introduction of pluggable storage.

To fix, teach heap_parallelscan_estimate() to deal with SnapshotAny
snapshots, and then use it from _bt_parallel_estimate_shared().

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-15 12:19:21 -08:00
Tom Lane 1c53c4dec3 Finish reverting "recheck_on_update" patch.
This reverts commit c203d6cf8 and some follow-on fixes, completing the
task begun in commit 5d28c9bd7.  If that feature is ever resurrected,
the code will look quite a bit different from this, so it seems best
to start from a clean slate.

The v11 branch is not touched; in that branch, the recheck_on_update
storage option remains present, but nonfunctional and undocumented.

Discussion: https://postgr.es/m/20190114223409.3tcvejfhlvbucrv5@alap3.anarazel.de
2019-01-15 12:07:10 -05:00
Andres Freund 0944ec54de Don't include genam.h from execnodes.h and relscan.h anymore.
This is the genam.h equivalent of 4c850ecec6 (which removed
heapam.h from a lot of other headers).  There's still a few header
includes of genam.h, but not from central headers anymore.

As a few headers are not indirectly included anymore, execnodes.h and
relscan.h need a few additional includes. Some of the depended on
types were replacable by using the underlying structs, but e.g. for
Snapshot in execnodes.h that'd have gotten more invasive than
reasonable in this commit.

Like the aforementioned commit 4c850ecec6, this requires adding new
genam.h includes to a number of backend files, which likely is also
required in a few external projects.

Author: Andres Freund
Discussion: https://postgr.es/m/20190114000701.y4ttcb74jpskkcfb@alap3.anarazel.de
2019-01-14 17:02:12 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Andres Freund 578b229718 Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.

This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row.  Neither pg_dump nor COPY included the contents of the
oid column by default.

The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.

WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.

Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
  WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
  issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
  restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
  OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
  plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.

The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.

The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such.  This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.

The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.

Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).

The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.

While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.

Catversion bump, for obvious reasons.

Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-20 16:00:17 -08:00
Andres Freund 1a0586de36 Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.

Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots.  Thus different types of
slots are needed, which requires adapting creators of slots.

The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.

Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.

But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.

Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().

The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.

This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit.  The different types of slots introduced will, for now, still
use the same backing implementation.

While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.

Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Andres Freund e9edc1ba0b Fix logical decoding error when system table w/ toast is repeatedly rewritten.
Repeatedly rewriting a mapped catalog table with VACUUM FULL or
CLUSTER could cause logical decoding to fail with:
ERROR, "could not map filenode \"%s\" to relation OID"

To trigger the problem the rewritten catalog had to have live tuples
with toasted columns.

The problem was triggered as during catalog table rewrites the
heap_insert() check that prevents logical decoding information to be
emitted for system catalogs, failed to treat the new heap's toast table
as a system catalog (because the new heap is not recognized as a
catalog table via RelationIsLogicallyLogged()). The relmapper, in
contrast to the normal catalog contents, does not contain historical
information. After a single rewrite of a mapped table the new relation
is known to the relmapper, but if the table is rewritten twice before
logical decoding occurs, the relfilenode cannot be mapped to a
relation anymore.  Which then leads us to error out.   This only
happens for toast tables, because the main table contents aren't
re-inserted with heap_insert().

The fix is simple, add a new heap_insert() flag that prevents logical
decoding information from being emitted, and accept during decoding
that there might not be tuple data for toast tables.

Unfortunately that does not fix pre-existing logical decoding
errors. Doing so would require not throwing an error when a filenode
cannot be mapped to a relation during decoding, and that seems too
likely to hide bugs.  If it's crucial to fix decoding for an existing
slot, temporarily changing the ERROR in ReorderBufferCommit() to a
WARNING appears to be the best fix.

Author: Andres Freund
Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Backpatch: 9.4-, where logical decoding was introduced
2018-10-10 13:53:02 -07:00
Tom Lane 29ef2b310d Restore sane locking behavior during parallel query.
Commit 9a3cebeaa changed things so that parallel workers didn't obtain
any lock of their own on tables they access.  That was clearly a bad
idea, but I'd mistakenly supposed that it was the intended end result
of the series of patches for simplifying the executor's lock management.
Undo that change in relation_open(), and adjust ExecOpenScanRelation()
so that it gets the correct lock if inside a parallel worker.

In passing, clean up some more obsolete comments about when locks
are acquired.

Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
2018-10-06 15:49:37 -04:00
Tom Lane 9a3cebeaa7 Change executor to just Assert that table locks were already obtained.
Instead of locking tables during executor startup, just Assert that
suitable locks were obtained already during the parse/plan pipeline
(or re-obtained by the plan cache).  This must be so, else we have a
hazard that concurrent DDL has invalidated the plan.

This is pretty inefficient as well as undercommented, but it's all going
to go away shortly, so I didn't try hard.  This commit is just another
attempt to use the buildfarm to see if we've missed anything in the plan
to simplify the executor's table management.

Note that the change needed here in relation_open() exposes that
parallel workers now really are accessing tables without holding any
lock of their own, whereas they were not doing that before this commit.
This does not give me a warm fuzzy feeling about that aspect of parallel
query; it does not seem like a good design, and we now know that it's
had exactly no actual testing.  I think that we should modify parallel
query so that that change can be reverted.

Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
2018-10-03 16:05:12 -04:00
Tom Lane b04aeb0a05 Add assertions that we hold some relevant lock during relation open.
Opening a relation with no lock at all is unsafe; there's no guarantee
that we'll see a consistent state of the relevant catalog entries.
While use of MVCC scans to read the catalogs partially addresses that
complaint, it's still possible to switch to a new catalog snapshot
partway through loading the relcache entry.  Moreover, whether or not
you trust the reasoning behind sometimes using less than
AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not
valid if concurrent users of the table don't hold a lock corresponding
to the operation they want to perform.

Hence, add some assertion-build-only checks that require any caller
of relation_open(x, NoLock) to hold at least AccessShareLock.  This
isn't a full solution, since we can't verify that the lock level is
semantically appropriate for the action --- but it's definitely of
some use, because it's already caught two bugs.

We can also assert that callers of addRangeTableEntryForRelation()
hold at least the lock level specified for the new RTE.

Amit Langote and Tom Lane

Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
2018-10-01 12:43:21 -04:00
Andres Freund 29c94e03c7 Split ExecStoreTuple into ExecStoreHeapTuple and ExecStoreBufferHeapTuple.
Upcoming changes introduce further types of tuple table slots, in
preparation of making table storage pluggable. New storage methods
will have different representation of tuples, therefore the slot
accessor should refer explicitly to heap tuples.

Instead of just renaming the functions, split it into one function
that accepts heap tuples not residing in buffers, and one accepting
ones in buffers.  Previously one function was used for both, but that
was a bit awkward already, and splitting will allow us to represent
slot types for tuples in buffers and normal memory separately.

This is split out from the patch introducing abstract slots, as this
largely consists out of mechanical changes.

Author: Ashutosh Bapat
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
2018-09-25 16:27:48 -07:00
Tom Lane 44cac93464 Avoid using potentially-under-aligned page buffers.
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd.  However, that policy's
been ignored in an increasing number of places.  We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway.  But this is not
something to rely on.  Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.

To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.

I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster.  I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.

Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.

Patch by me; thanks to Michael Paquier for review.

Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
2018-09-01 15:27:17 -04:00
Alvaro Herrera ab7dbd681c Update FSM on WAL replay of page all-visible/frozen
We aren't very strict about keeping FSM up to date on WAL replay,
because per-page freespace values aren't critical in replicas (can't
write to heap in a replica; and if the replica is promoted, the values
would be updated by VACUUM anyway).  However, VACUUM since 9.6 can skip
processing pages marked all-visible or all-frozen, and if such pages are
recorded in FSM with wrong values, those values are blindly propagated
to FSM's upper layers by VACUUM's FreeSpaceMapVacuum.  (This rationale
assumes that crashes are not very frequent, because those would cause
outdated FSM to occur in the primary.)

Even when the FSM is outdated in standby, things are not too bad
normally, because, most per-page FSM values will be zero (other than
those propagated with the base-backup that created the standby); only
once the remaining free space is less than 0.2*BLCKSZ the per-page value
is maintained by WAL replay of heap ins/upd/del.  However, if
wal_log_hints=on causes complete FSM pages to be propagated to a standby
via full-page images, many too-optimistic per-page values can end up
being registered in the standby.

Incorrect per-page values aren't critical in most cases, since an
inserter that is given a page that doesn't actually contain the claimed
free space will update FSM with the correct value, and retry until it
finds a usable page.  However, if there are many such updates to do, an
inserter can spend a long time doing them before a usable page is found;
in a heavily trafficked insert-only table with many concurrent inserters
this has been observed to cause several second stalls, causing visible
application malfunction.

To fix this problem, it seems sufficient to have heap_xlog_visible
(replay of setting all-visible and all-frozen VM bits for a heap page)
update the FSM value for the page being processed.  This fixes the
per-page counters together with making the page skippable to vacuum, so
when vacuum does FreeSpaceMapVacuum, the values propagated to FSM upper
layers are the correct ones, avoiding the problem.

While at it, apply the same fix to heap_xlog_clean (replay of tuple
removal by HOT pruning and vacuum).  This makes any space freed by the
cleaning available earlier than the next vacuum in the promoted replica.

Backpatch to 9.6, where this problem was diagnosed on an insert-only
table with all-frozen pages, which were introduced as a concept in that
release.  Theoretically it could apply with all-visible pages to older
branches, but there's been no report of that and it doesn't backpatch
cleanly anyway.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql
2018-08-15 18:09:29 -03:00
Alvaro Herrera d2599ecfcc Don't mark pages all-visible spuriously
Dan Wood diagnosed a long-standing problem that pages containing tuples
that are locked by multixacts containing live lockers may spuriously end
up as candidates for getting their all-visible flag set.  This has the
long-term effect that multixacts remain unfrozen; this may previously
pass undetected, but since commit XYZ it would be reported as
  "ERROR: found multixact 134100944 from before relminmxid 192042633"
because when a later vacuum tries to freeze the page it detects that a
multixact that should have gotten frozen, wasn't.

Dan proposed a (correct) patch that simply sets a variable to its
correct value, after a bogus initialization.  But, per discussion, it
seems better coding to avoid the bogus initializations altogether, since
they could give rise to more bugs later.  Therefore this fix rewrites
the logic a little bit to avoid depending on the bogus initializations.

This bug was part of a family introduced in 9.6 by commit a892234f830e;
later, commit 38e9f90a22 fixed most of them, but this one was
unnoticed.

Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera
Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera
Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com
2018-05-04 18:24:45 -03:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Simon Riggs 08ea7a2291 Revert MERGE patch
This reverts commits d204ef6377,
83454e3c2b and a few more commits thereafter
(complete list at the end) related to MERGE feature.

While the feature was fully functional, with sufficient test coverage and
necessary documentation, it was felt that some parts of the executor and
parse-analyzer can use a different design and it wasn't possible to do that in
the available time. So it was decided to revert the patch for PG11 and retry
again in the future.

Thanks again to all reviewers and bug reporters.

List of commits reverted, in reverse chronological order:

 f1464c5380 Improve parse representation for MERGE
 ddb4158579 MERGE syntax diagram correction
 530e69e59b Allow cpluspluscheck to pass by renaming variable
 01b88b4df5 MERGE minor errata
 3af7b2b0d4 MERGE fix variable warning in non-assert builds
 a5d86181ec MERGE INSERT allows only one VALUES clause
 4b2d44031f MERGE post-commit review
 4923550c20 Tab completion for MERGE
 aa3faa3c7a WITH support in MERGE
 83454e3c2b New files for MERGE
 d204ef6377 MERGE SQL Command following SQL:2016

Author: Pavan Deolasee
Reviewed-by: Michael Paquier
2018-04-12 11:22:56 +01:00
Andres Freund f16241bef7 Raise error when affecting tuple moved into different partition.
When an update moves a row between partitions (supported since
2f17844104), our normal logic for following update chains in READ
COMMITTED mode doesn't work anymore. Cross partition updates are
modeled as an delete from the old and insert into the new
partition. No ctid chain exists across partitions, and there's no
convenient space to introduce that link.

Not throwing an error in a partitioned context when one would have
been thrown without partitioning is obviously problematic. This commit
introduces infrastructure to detect when a tuple has been moved, not
just plainly deleted. That allows to throw an error when encountering
a deletion that's actually a move, while attempting to following a
ctid chain.

The row deleted as part of a cross partition update is marked by
pointing it's t_ctid to an invalid block, instead of self as a normal
update would.  That was deemed to be the least invasive and most
future proof way to represent the knowledge, given how few infomask
bits are there to be recycled (there's also some locking issues with
using infomask bits).

External code following ctid chains should be updated to check for
moved tuples. The most likely consequence of not doing so is a missed
error.

Author: Amul Sul, editorialized by me
Reviewed-By: Amit Kapila, Pavan Deolasee, Andres Freund, Robert Haas
Discussion: http://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com
2018-04-07 13:24:27 -07:00
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Peter Eisentraut 5dfd1e5a66 Logical decoding of TRUNCATE
Add a new WAL record type for TRUNCATE, which is only used when
wal_level >= logical.  (For physical replication, TRUNCATE is already
replicated via SMGR records.)  Add new callback for logical decoding
output plugins to receive TRUNCATE actions.

Author: Simon Riggs <simon@2ndquadrant.com>
Author: Marco Nenciarini <marco.nenciarini@2ndquadrant.it>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
2018-04-07 11:34:10 -04:00