Commit Graph

4877 Commits

Peter Geoghegan 5bf748b86b Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient.  All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.

The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value.  Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan.  Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.

Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute.  These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes.  They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem).  They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).

There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions.  Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.

Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
2024-04-06 11:47:10 -04:00
Alvaro Herrera ee1cbe806d Operate XLogCtl->log{Write,Flush}Result with atomics
This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them.  We now use plain atomic writes instead,
with WALWriteLock held.  Readers can use atomic reads, without any locking.

This allows some code to be reordered: some places were a bit
contorted to avoid repeated spinlock acquisition, but that's no longer a
concern, so we can switch them to more natural coding.  Some further
changes are possible (perhaps leading to performance wins), but in this
commit I made only rather minimal ones, to avoid increasing the blast radius.

Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
2024-04-05 16:14:39 +02:00
David Rowley 3a4a3537a9 Secondary refactor of heap scanning functions
Similar to 44086b097, refactor heap scanning functions to be more
suitable for the read stream API.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
2024-04-04 19:22:45 +13:00
David Rowley 44086b0975 Preliminary refactor of heap scanning functions
To allow the use of the read stream API added in b5a9b18cd for
sequential scans on heap tables, here we make some adjustments to make
that change less invasive and perhaps make the code easier to follow in
the process.

Here heapgetpage() gets broken into two functions:

1) The part which reads the block has now been moved into a function
   named heapfetchbuf().
2) The part which performed pruning and populated the scan's
   rs_vistuples[] array is now moved into a new function named
   heap_prepare_pagescan().

The functionality provided by heap_prepare_pagescan() was only ever
required by SO_ALLOW_PAGEMODE scans, so the branching previously done
in heapgetpage() is no longer needed: we simply don't call
heap_prepare_pagescan() from heapgettup() in the refactored code.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
2024-04-04 16:41:13 +13:00
Tom Lane 06286709ee Invent SERIALIZE option for EXPLAIN.
EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about
the volume of data emitted by a query, as well as the time taken
to convert the data to the on-the-wire format.  Previously there
was no way to investigate this without actually sending the data
to the client, in which case network transmission costs might
swamp what you wanted to see.  In particular this feature allows
investigating the costs of de-TOASTing compressed or out-of-line
data during formatting.
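
For example (the table name is purely illustrative):

    -- Measures the volume of data the query would emit and the time spent
    -- converting it to the on-the-wire format, without sending any rows
    -- to the client.
    EXPLAIN (ANALYZE, SERIALIZE) SELECT * FROM documents;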

Stepan Rutz and Matthias van de Meent,
reviewed by Tomas Vondra and myself

Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
2024-04-03 17:41:57 -04:00
Alvaro Herrera c9920a9068 Split XLogCtl->LogwrtResult into separate struct members
After this change we have XLogCtl->logWriteResult and ->logFlushResult.
There's no functional change, other than the fact that the assignment
from shared memory to local is no longer done via struct assignment, but
instead using a macro that copies each member separately.

The current representation is inconvenient going forward; notably, we
would like to add a new member "Copy" (to keep track of the last
position copied into WAL buffers), which would break the symmetry between
the values in shared memory and those kept locally.

This also gives us freedom to later change the concurrency model for the
values in shared memory: we can make them use atomics instead of relying
on the info_lck spinlock.

Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
2024-04-03 19:55:11 +02:00
Heikki Linnakangas 6dbb490261 Combine freezing and pruning steps in VACUUM
Execute both freezing and pruning of tuples in the same
heap_page_prune() function, now called heap_page_prune_and_freeze(),
and emit a single WAL record containing all changes. That reduces the
overall amount of WAL generated.

This moves the freezing logic from vacuumlazy.c to the
heap_page_prune_and_freeze() function. The main difference in the
coding is that in vacuumlazy.c, we looked at the tuples after the
pruning had already happened, but in heap_page_prune_and_freeze() we
operate on the tuples before pruning. The heap_prepare_freeze_tuple()
function is now invoked after we have determined that a tuple is not
going to be pruned away.

VACUUM no longer needs to loop through the items on the page after
pruning. heap_page_prune_and_freeze() does all the work. It now
returns the list of dead offsets, including existing LP_DEAD items, to
the caller. Similarly it's now responsible for tracking 'all_visible',
'all_frozen', and 'hastup' on the caller's behalf.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-03 19:32:28 +03:00
Heikki Linnakangas 26d138f644 Refactor how heap_prune_chain() updates prunable_xid
In preparation for freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending on the kind of line pointer. That's not too
interesting right now, but it makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-03 19:32:21 +03:00
Alexander Korotkov bf1e650806 Use the pairing heap instead of a flat array for LSN replay waiters
06c418e163 introduced the pg_wal_replay_wait() procedure, which allows
waiting for a particular LSN to be replayed on a standby.  The waiters were
stored in a flat array.  Even though scanning small arrays is fast, that
might be a problem at scale (a lot of waiting processes).

This commit replaces the flat shared memory array with a pairing heap,
which holds the waiter with the least LSN at the top.  This gives us O(log N)
complexity for both inserting and removing waiters.

Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
2024-04-03 18:15:41 +03:00
Daniel Gustafsson 226261f387 Add error codes to some PANIC/FATAL errors reports
This adds errcodes to a set of PANIC and FATAL errors in xlog.c
and relcache.c, which previously had no errcode set at all, in
order to make fleet-wide analysis of error logs easier. There are
many more ereports/elogs left which could benefit from having an
errcode, but this at least makes a dent in the issue.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
2024-04-03 09:19:25 +02:00
Alexander Korotkov 06c418e163 Implement pg_wal_replay_wait() stored procedure
pg_wal_replay_wait() is to be used on a standby and waits for a specific
WAL location to be replayed before starting the transaction.  This is
useful when the user makes some data changes on the primary and needs a
guarantee of seeing those changes on the standby.
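
A sketch of the intended usage on a standby (the LSN literal and the query
are illustrative, and the argument list is assumed to be just the target
LSN):

    -- Wait until this primary LSN has been replayed on the standby, then
    -- read data the primary is known to have written before that point.
    CALL pg_wal_replay_wait('0/306EE20');
    SELECT * FROM orders WHERE order_id = 42;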

The queue of waiters is stored in a shared memory array sorted by LSN.
During WAL replay, waiters whose LSNs have already been replayed are removed
from the array and woken up by setting their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records, resulting in a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not as a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
2024-04-02 22:48:03 +03:00
Alexander Korotkov 867cc7b6dd Revert "Custom reloptions for table AM"
This reverts commit c95c25f9af due to multiple
design issues spotted after commit.

Reported-by: Jeff Davis
Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
2024-04-02 11:29:00 +03:00
Masahiko Sawada 667e65aac3 Use TidStore for dead tuple TIDs storage during lazy vacuum.
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.
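
For example, a monitoring query against the renamed columns might look
like this (a minimal sketch; the other columns shown already existed):

    SELECT pid, relid::regclass, phase,
           max_dead_tuple_bytes, dead_tuple_bytes
    FROM pg_stat_progress_vacuum;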

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-04-02 10:15:37 +09:00
Heikki Linnakangas 3d0f730bf1 Introduce 'options' argument to heap_page_prune()
Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW,
which replaces the old boolean argument, but upcoming patches will
introduce at least one more. Having a lot of boolean arguments makes
it hard to see at the call sites what the arguments mean, so prefer a
bitmask of options with human-readable names.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov
2024-04-02 00:56:05 +03:00
Heikki Linnakangas 6f47f68831 Handle non-chain tuples outside of heap_prune_chain()
Handle dead branches of aborted HOT chains outside heap_prune_chain()
as a separate phase. This simplifies the logic in heap_prune_chain(),
as well as allowing us to clean up more RECENTLY_DEAD -> DEAD chains.

To accomplish this efficiently, partition the tuples into HOT and non-HOT
while first collecting visibility information for each tuple in
heap_page_prune(). Then call heap_prune_chain() only on potential
chain members, and mop up the leftover HOT tuples afterwards.

As part of this, keep track of which items on the page have already been
processed, in a 'processed' array. This replaces the 'marked' array,
which was only set for tuples marked for removal or redirection. The
'processed' array is also updated for items that are left unchanged,
once we conclude that an item can be left unchanged. At the end of
pruning, every item on the page should be marked as processed in the
array; an assertion is added for that.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-01 20:33:50 +03:00
Heikki Linnakangas 7aa00f1360 Refactor heap_prune_chain()
Keep track of the number of deleted tuples in PruneState and record this
information when recording a tuple as dead, unused or redirected. This
removes a special case from the traversal and chain processing logic, and
sets a precedent of recording the impact of prune actions in the record
functions themselves. This paradigm will be used in future commits which
move tracking of additional statistics on pruning actions from
lazy_scan_prune() to heap_prune_chain().

Simplify heap_prune_chain()'s chain traversal logic by handling each
case explicitly. That is, do not attempt to share code when processing
different types of chains. For each category of chain, process it
specifically and procedurally: first handling the root, then any
intervening tuples, and, finally, the end of the chain.

While we are at it, add a few new comments to heap_prune_chain()
clarifying some special cases involving RECENTLY_DEAD tuples.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-01 13:28:44 +03:00
Heikki Linnakangas 9917e79d99 Minor refactoring in heap_page_prune
Pass 'page', 'blockno' and 'maxoff' to heap_prune_chain() as
arguments, so that it doesn't need to fetch them from the buffer. This
saves a few cycles per chain.

Remove the "if (off_loc != NULL)" checks, and require the caller to
pass a non-NULL 'off_loc'. Pass a pointer to a dummy local variable
when it's not needed. Those checks are cheap, but it's still better to
avoid them in the per-chain loops when we can do so easily.

The CPU time savings from these changes are hardly measurable, but
fewer instructions are good anyway, so why not. I spotted the potential
for these while reviewing Melanie Plageman's patch set to combine
prune and freeze records.

Discussion: https://www.postgresql.org/message-id/CAAKRu_abm2tHhrc0QSQa%3D%3DsHe%3DVA1%3Doz1dJMQYUOKuHmu%2B9Xrg%40mail.gmail.com
2024-04-01 12:07:30 +03:00
Alexander Korotkov b1484a3f19 Let table AM insertion methods control index insertion
Previously, the executor did index inserts unconditionally after calling
the table AM interface methods tuple_insert() and multi_insert().  This commit
introduces a new parameter, insert_indexes, for these two methods.  Setting
'*insert_indexes' to true preserves the current logic.  Setting it to false
indicates that the table AM handles index inserts itself and doesn't want the
caller to do that.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger
2024-03-30 22:53:56 +02:00
Alexander Korotkov c95c25f9af Custom reloptions for table AM
Let a table AM define custom reloptions for its tables.  This allows
specifying AM-specific parameters in the WITH clause when creating a table.
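
A hypothetical usage sketch (the access method and reloption names are
invented for illustration):

    CREATE TABLE measurements (ts timestamptz, value float8)
        USING my_table_am
        WITH (my_am_compression = 'lz4');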

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent
2024-03-30 22:36:25 +02:00
Alexander Korotkov 27bc1772fc Generalize relation analyze in table AM interface
Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get the
next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
2024-03-30 22:34:04 +02:00
Tom Lane 0075d78947 Allow "internal" subtransactions in parallel mode.
Allow use of BeginInternalSubTransaction() in parallel mode, so long
as the subtransaction doesn't attempt to acquire an XID or increment
the command counter.  Given those restrictions, the other parallel
processes don't need to know about the subtransaction at all, so
this should be safe.  The benefit is that it allows subtransactions
intended for error recovery, such as pl/pgsql exception blocks,
to be used in PARALLEL SAFE functions.
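
For example, a hypothetical function like this one (name and body are
illustrative) can now run its exception block, which uses an internal
subtransaction for error recovery, inside a parallel worker:

    CREATE FUNCTION safe_div(a numeric, b numeric) RETURNS numeric
        LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE
    AS $$
    BEGIN
        RETURN a / b;
    EXCEPTION
        WHEN division_by_zero THEN
            -- error recovery without acquiring an XID or incrementing
            -- the command counter
            RETURN NULL;
    END;
    $$;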

Another reason for doing this is that the API of
BeginInternalSubTransaction() doesn't allow reporting failure.
pl/python for one, and perhaps other PLs, copes very poorly with an
error longjmp out of BeginInternalSubTransaction().  The headline
feature of this patch removes the only easily-triggerable failure
case within that function.  There remain some resource-exhaustion
and similar cases, which we now deal with by promoting them to FATAL
errors, so that callers need not try to clean up.  (It is likely
that such errors would leave us with corrupted transaction state
inside xact.c, making recovery difficult if not impossible anyway.)

Although this work started because of a report of a pl/python crash,
we're not going to do anything about that in the back branches.
Back-patching this particular fix is obviously not very wise.
While we could contemplate some narrower band-aid, pl/python is
already an untrusted language, so it seems okay to classify this
as a "so don't do that" case.

Patch by me, per report from Hao Zhang.  Thanks to Robert Haas for
review.

Discussion: https://postgr.es/m/CALY6Dr-2yLVeVPhNMhuBnRgOZo1UjoTETgtKBx1B2gUi8yy+3g@mail.gmail.com
2024-03-28 12:43:10 -04:00
Heikki Linnakangas 427005742b Remove obsolete comment about VACUUM retrying pruning
Commit 1ccc1e05ae removed the retry logic that the comment talked
about.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240328015326.x5gnzsohl6j23b42@liskov
2024-03-28 10:18:48 +02:00
Masahiko Sawada 2d8f56dabb Rethink create and attach APIs of shared TidStore.
Previously, the behavior of TidStoreCreate() was inconsistent between
local and shared TidStore instances in terms of memory limitation. For
local TidStore, a memory context was created with initial and maximum
memory block sizes, as well as a minimum memory context size, based on
the specified max_bytes values. However, for shared TidStore, the
provided DSA area was used for TID storage. Although commit bb952c8c8b
allowed specifying the initial and maximum DSA segment sizes, callers
would have needed to clamp their own limits, which was neither consistent
nor user-friendly.

With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using a provided DSA
area. The initial and maximum DSA segment sizes are chosen based on
the specified max_bytes. Other processes can attach to the shared
TidStore using the DSA area returned by the new TidStoreGetDSA()
function and the handle returned by TidStoreGetHandle(). The created
DSA has the same lifetime as the shared TidStore and is deleted when
all processes detach from it.

To improve clarity, the TidStoreCreate() function has been divided
into two separate functions: TidStoreCreateLocal() and
TidStoreCreateShared().

Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAyc1j%3DBCdUqZfk6qbdjZ68UgRx1Gkpk0oah4K7S0Ri9g%40mail.gmail.com
2024-03-28 10:03:28 +09:00
Alexander Korotkov 818861eb57 Fix some typos and grammar issues from commit 87985cc925
Reported-by: Alexander Lakhin
2024-03-27 11:47:41 +02:00
Masahiko Sawada 4edb37e322 Fix a calculation in TidStoreCreate().
Since max_bytes is expected to be in bytes, not kilobytes, it
should not be multiplied by 1024.

Introduced by 30e144287a.

Reported-by: John Naylor, David Rowley
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CANWCAZZTE-14ofsucofTuhFsfuDGBNf%3DNZb22TMYT8bxA41oQQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvojg82NDaDEpj1WEZSbVTafj%3DDRmW%2BFrkBdW8ScL4OFxA%40mail.gmail.com
2024-03-26 13:06:06 +09:00
Alexander Korotkov 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (the default), we have
the following sequence of actions when tuple_update()/tuple_delete() finds
that the tuple has been updated by a concurrent transaction:

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This commit eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

Also, this commit makes tuple_update()/tuple_delete() optionally save the old
tuple into a dedicated slot.  That saves the effort of re-fetching tuples in
certain cases.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2024-03-26 01:27:56 +02:00
Heikki Linnakangas f83d709760 Merge prune, freeze and vacuum WAL record formats
The new combined WAL record is now used for pruning, freezing, and the
second pass of vacuum. This is in preparation for changing VACUUM to
write a combined prune+freeze record per page, instead of two separate
records. The new WAL record format now supports that, but the code
still always writes separate records for pruning and freezing.

This reserves separate XLOG_HEAP2_* info codes for when the pruning
record is emitted for on-access pruning or for VACUUM, per Peter
Geoghegan's suggestion. The record format is identical, but having
separate info codes makes it easier to analyze pruning and vacuuming
with pg_waldump.

The function to emit the new WAL record, log_heap_prune_and_freeze(),
is in pruneheap.c. The existing heap_log_freeze_plan() and its
subroutines are moved to pruneheap.c without changes, to keep them
together with log_heap_prune_and_freeze().

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_azf-zH%3DDgVbquZ3tFWjMY1w5pO8m-TXJaMdri8z3933g@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAAKRu_b2oE4GL%3Dq4g9mcByS9yT7wTQvEH9OLpabj28e%2BWKFi2A@mail.gmail.com
2024-03-25 14:59:58 +02:00
Alexander Korotkov b670b93a66 Fix an oversight in refactoring in 06b10f80ba.
The oversight went against the intention of skipping the prechecking-keys
optimization on the first page of range queries, so as not to affect the
performance of point queries.

Reported-by: Anton Melnikov
Discussion: https://postgr.es/m/30cd7524-b9f1-4cf8-9c4a-223eb2e34441%40postgrespro.ru
Author: Pavel Borisov
2024-03-22 15:25:53 +02:00
Alexander Korotkov c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
This allows a table AM to return a native tuple slot even if a
VirtualTupleTableSlot is given as input.  Native tuple slots have knowledge
about system attributes, which could be accessed in the future.  The
table_multi_insert() method can already modify the input 'slots' array.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
2024-03-21 23:00:40 +02:00
Alexander Korotkov 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
The new table AM method free_rd_amcache is responsible for freeing all the
memory related to rd_amcache and setting rd_amcache to NULL.  If the new
method is not specified, we still assume rd_amcache to be a single chunk of
memory, which can simply be pfree'd.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
2024-03-21 23:00:34 +02:00
Masahiko Sawada 30e144287a Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently.
TIDStore is a data structure designed to efficiently store large sets
of TIDs. For TID storage, it employs a radix tree, where the key is
a block number, and the value is a bitmap representing offset
numbers. The TIDStore can be created on a DSA area and used by
multiple backend processes simultaneously.

There are potential future users such as tidbitmap.c, though it's very
likely the interface will need to evolve as we come to understand the
needs of different kinds of users. For example, we can support
updating the offset bitmap of existing values.

Currently, the TIDStore is not used for anything yet, aside from the
test code. But an upcoming patch will use it.

This includes a unit test module, in src/test/modules/test_tidstore.

Co-authored-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-03-21 10:08:42 +09:00
Heikki Linnakangas c9c260decd Remove unused PruneState member rel
PruneState->rel is no longer being used, so just remove it.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
2024-03-20 10:13:42 +02:00
Heikki Linnakangas c33084205a Reorganize heap_page_prune() function comment
heap_page_prune()'s function header comment didn't explain the
parameters in the same order they appear in the function. Fix that.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
2024-03-20 10:13:39 +02:00
Peter Eisentraut 20e58105ba Separate equalRowTypes() from equalTupleDescs()
This introduces a new function equalRowTypes() that is effectively a
subset of equalTupleDescs() but only compares the number of attributes
and attribute name, type, typmod, and collation.  This is enough for
most existing uses of equalTupleDescs(), which are changed to use the
new function.  The only remaining callers of equalTupleDescs() are
those that really want to check the full tuple descriptor as such,
without concern about record or row or record type semantics.

The existing function hashTupleDesc() is renamed to hashRowType(),
because it now corresponds more to equalRowTypes().

The purpose of this change is to be clearer about the semantics of the
equality asked for by each caller.  (At least one caller had a comment
that questioned whether equalTupleDescs() was too restrictive.)  For
example, 4f622503d6 removed attstattarget from the tuple descriptor
structure.  It was not fully clear at the time how this should affect
equalTupleDescs().  Now the answer is clear: By their own definitions,
equalRowTypes() does not care, and equalTupleDescs() just compares
whatever is in the tuple descriptor but does not care why it is in
there.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/f656d6d9-6660-4518-a006-2f65cafbebd1%40eisentraut.org
2024-03-17 05:58:04 +01:00
Heikki Linnakangas 84c18acaf6 Remove redundant snapshot copying from parallel leader to workers
The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But the BitmapHeapScan node also had
bespoke code to pass the snapshot from the leader to the workers. That
was redundant, so remove it.

The removed code was analogous to the snapshot serialization in
table_parallelscan_initialize(), but that was the wrong role model. A
parallel bitmap heap scan is more like an independent non-parallel
bitmap heap scan in each parallel worker as far as the table AM is
concerned, because the coordination is done in nodeBitmapHeapscan.c,
and the table AM doesn't need to know anything about it.

This relies on the assumption that es_snapshot ==
GetActiveSnapshot(). That's not a new assumption; things would get
weird if you used the QueryDesc's snapshot for visibility checks in
the scans but the active snapshot for evaluating quals, for
example. This could use some refactoring and cleanup, but for now,
just add some assertions.

Reviewed-by: Dilip Kumar, Robert Haas
Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf61a@iki.fi
2024-03-14 15:18:10 +02:00
Peter Eisentraut 97d85be365 Make the order of the header file includes consistent
Similar to commit 7e735035f2.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com
2024-03-13 15:07:00 +01:00
Peter Geoghegan 3045324214 Update obsolete index scan TID comments.
Oversight in commit c2fe139c20.
2024-03-11 18:07:10 -04:00
Heikki Linnakangas 3d8652cd32 Remove unneeded vacuum_delay_point from heap_vac_scan_get_next_block
heap_vac_scan_get_next_block() does relatively little work, so there
is no need to call vacuum_delay_point(). A future commit will call
heap_vac_scan_get_next_block() from a callback, and we would like to
avoid calling vacuum_delay_point() in that callback.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com
2024-03-11 20:45:33 +02:00
Heikki Linnakangas 4e76f984a7 Confine vacuum skip logic to lazy_scan_skip()
Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more
code into the function, so that the caller doesn't need to know about
ranges or skipping anymore. heap_vac_scan_next_block() returns the
next block to process, and the logic for determining that block is all
within the function. This makes the skipping logic easier to
understand, as it's all in the same function, and makes the calling
code easier to understand as it's less cluttered. The state variables
needed to manage the skipping logic are moved to LVRelState.

heap_vac_scan_next_block() now manages its own VM buffer separately
from the caller's vmbuffer variable. The caller's vmbuffer holds the
VM page for the current block it's processing, while
heap_vac_scan_next_block() keeps a pin on the VM page for the next
unskippable block. Most of the time they are the same, so we hold two
pins on the same buffer, but it's more convenient to manage them
separately.

For readability inside heap_vac_scan_next_block(), move the logic of
finding the next unskippable block to a separate function, and add some
comments.

This refactoring will also help future patches to switch to using a
streaming read interface, and eventually AIO
(https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com)

Author: Melanie Plageman, Heikki Linnakangas
Reviewed-by: Andres Freund (older version)
Discussion: https://postgr.es/m/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com
2024-03-11 20:43:58 +02:00
Heikki Linnakangas 674e49c73c Set all_visible_according_to_vm correctly with DISABLE_PAGE_SKIPPING
It's important for 'all_visible_according_to_vm' to correctly reflect
whether the VM bit is set or not, even when we are not trusting the VM
to skip pages, because contrary to what the comment said,
lazy_scan_prune() relies on it.

If it's incorrectly set to 'false', when the VM bit is in fact set,
lazy_scan_prune() will try to set the VM bit again and dirty the page
unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all
heap pages were dirtied, even if there were no changes. We would also
fail to clear any VM bits that were set incorrectly.
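
For example (the table name is illustrative), this is the kind of
command that was affected:

    VACUUM (DISABLE_PAGE_SKIPPING) orders;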

This was broken in commit 980ae17310, so backpatch to v16.

Backpatch-through: 16
Reviewed-by: Melanie Plageman, Peter Geoghegan
Discussion: https://www.postgresql.org/message-id/3df2b582-dc1c-46b6-99b6-38eddd1b2784@iki.fi
2024-03-11 09:28:09 +02:00
Michael Paquier f500ba07fa Add some checkpoint and redo LSNs to a couple of recovery errors
Two FATALs and one PANIC gain details about the LSNs they fail at:
- When restoring from a backup_label, the FATAL log generated when not
finding the checkpoint record now reports its LSN.
- When restoring from a backup_label, the FATAL log generated when not
finding the redo record referenced by a checkpoint record now shows both
the redo and checkpoint record LSNs.
- When not restoring from a backup_label, the PANIC error generated when
not finding the checkpoint record now reports its LSN.

This information is useful when debugging corruption issues, and these
LSNs may not show up in the logs depending on the level of logging
configured in the backend.

Author: David Steele
Discussion: https://postgr.es/m/0e90da89-77ca-4ccf-872c-9626d755e288@pgmasters.net
2024-03-11 09:08:05 +09:00
Alexander Korotkov 6f38c43eb1 Avoid stack overflow in ShowTransactionStateRec()
The function recurses, but didn't perform stack-depth checks. It's
just a debugging aid, so instead of the usual check_stack_depth()
call, stop the printing if we'd risk stack overflow.

Here's an example of how to test this:

    (n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null

In passing, swap the order of building the list of child XIDs and recursing
to the parent. That saves memory while recursing, reducing the risk of
out-of-memory errors with lots of subtransactions. The saving is not very
significant in practice, but this order seems more logical anyway.

Report by Egor Chindyaskin and Alexander Lakhin.

Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru
Author: Heikki Linnakangas
Reviewed-by: Robert Haas, Andres Freund, Alexander Korotkov
2024-03-08 13:18:30 +02:00
Alexander Korotkov fefd9a3fed Turn tail recursion into iteration in CommitTransactionCommand()
Usually the compiler will optimize away the tail recursion anyway, but
if it doesn't, you can drive the function into stack overflow. For
example:

    (n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null

In order to get better readability and fewer changes to the existing code,
the recursion-replacing loop is implemented as a wrapper function.

Report by Egor Chindyaskin and Alexander Lakhin.
Discussion: https://postgr.es/m/1672760457.940462079%40f306.i.mail.ru
Author: Alexander Korotkov, Heikki Linnakangas
2024-03-08 13:18:30 +02:00
Alvaro Herrera 0d3a71d0c8 Fix misspelled assertions
Remove an extra & operator, per Tom Lane.  My bugs, introduced with
commit 53c2a97a92.

Discussion: https://postgr.es/m/3885480.1709590472@sss.pgh.pa.us
2024-03-05 12:13:12 +01:00
Alvaro Herrera 1a2654b32b Rework redundant code in subtrans.c
When this code was written, the duplication didn't matter, but with all the
SLRU-bank stuff we just added, it has become excessive.  Turn it into a
simpler loop with no code duplication.  Also add a test so that this
code becomes covered.

Discussion: https://postgr.es/m/202403041517.3a35jw53os65@alvherre.pgsql
2024-03-05 12:09:18 +01:00
Jeff Davis 59825d1639 Fix buildfarm failures from 2af07e2f74.
Use GUC_ACTION_SAVE rather than GUC_ACTION_SET, which is necessary for
working with parallel query.

Now that the call requires more arguments, wrap the call in a new
function to avoid code duplication and offer a place for a comment.

Discussion: https://postgr.es/m/E1rhJpO-0027Wf-9L@gemulon.postgresql.org
2024-03-04 19:42:16 -08:00
Jeff Davis 2af07e2f74 Fix search_path to a safe value during maintenance operations.
While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.
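
As an illustration (the function, table, and index names are
hypothetical), such a function would be declared along these lines:

    CREATE FUNCTION norm_title(text) RETURNS text
        LANGUAGE sql IMMUTABLE
        SET search_path = 'pg_catalog, pg_temp'
        AS $$ SELECT lower($1) $$;

    CREATE INDEX books_norm_title_idx ON books (norm_title(title));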

This change was previously committed as 05e1737351, then reverted in
commit 2fcc7ee7af because it was too late in the cycle.

Preparation for the MAINTAIN privilege, which was previously reverted
due to search_path manipulation hazards.

Discussion: https://postgr.es/m/d4ccaf3658cb3c281ec88c851a09733cd9482f22.camel@j-davis.com
Discussion: https://postgr.es/m/E1q7j7Y-000z1H-Hr%40gemulon.postgresql.org
Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark, Nathan Bossart, Noah Misch
2024-03-04 17:31:38 -08:00
Alvaro Herrera a0b808baef Rework locking code in GetMultiXactIdMembers
After commit 53c2a97a92, the code flow around the "retry" goto label
in GetMultiXactIdMembers was confused about what was possible: we never
return there with a held lock, so there's no point in testing for one.
This realization lets us simplify the code a bit.  While at it, make the
scope of a couple of local variables in the same function a bit tighter.

Per Coverity.
2024-03-04 17:48:01 +01:00
Alvaro Herrera f9baaf96d3 Simplify coding in slru.c
New code in 53c2a97a92 uses direct array access to
shared->bank_locks[bankno].lock, which can be made a little more
legible by using the SimpleLruGetBankLock helper function.
Nothing terribly serious, but let's add some clarity.

Discussion: https://postgr.es/m/202403041517.3a35jw53os65@alvherre.pgsql
2024-03-04 17:37:47 +01:00
Peter Eisentraut dbbca2cf29 Remove unused #include's from backend .c files
as determined by include-what-you-use (IWYU)

While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that.  In some cases, a more
specific #include replaces another less specific one.

Some manual adjustments of the automatic result:

- IWYU currently doesn't know about includes that provide global
  variable declarations (like -Wmissing-variable-declarations), so
  those includes are being kept manually.

- All includes for port(ability) headers are being kept for now, to
  play it safe.

- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
  patch from exploding in size.

Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.

As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.

Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
2024-03-04 12:02:20 +01:00