Commit Graph

25412 Commits

Author SHA1 Message Date
Thomas Munro 3bd8439ed6 Allow BufferAccessStrategy to limit pin count.
While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers.  For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing.  Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
2024-04-06 23:11:45 +13:00
Thomas Munro aa1e8c2064 Improve read_stream.c's fast path.
The "fast path" for well cached scans that don't do any I/O was
accidentally coded in a way that could only be triggered by pg_prewarm's
usage pattern, which starts out with a higher distance because of the
flags it passes in.  We want it to work for streaming sequential scans
too, once that patch is committed.  Adjust.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
2024-04-06 18:00:35 +13:00
Tom Lane 3c5ff36aba Suppress "variable may be used uninitialized" warning.
Buildfarm member caiman is showing this, which surprises me because
it's very late-model gcc (14.0.1) and ought to be smart enough to
know that elog(ERROR) doesn't return.  But we're likely to see the
same from stupider compilers too, so add a dummy initialization in
our usual style.
2024-04-05 10:58:30 -04:00
Tomas Vondra 10e3226ba1 Align blocks in incremental backups to BLCKSZ
Align blocks stored in incremental files to BLCKSZ, so that the
incremental backups work well with CoW filesystems.

The header of the incremental file is padded with \0 to a multiple of
BLCKSZ, so that the block data (also BLCKSZ) is aligned to BLCKSZ. The
padding is added only to files containing block data, so files with just
the header remain small. This adds a bit of extra space, but as the
number of blocks increases the overhead gets negligible very quickly.
And as the padding is \0 bytes, it does compress extremely well.

The alignment is important for CoW filesystems that usually require the
blocks to be aligned to filesystem page size for features like block
sharing, deduplication etc. to work well. With the variable sized header
the blocks in the increments were not aligned at all, negating the
benefits of the CoW filesystems.

This matters even for non-CoW filesystems, for example when placed on a
RAID array. If the block is not aligned, it may easily span multiple
devices, causing read and write amplification.

It might be better to align the blocks to the filesystem page, not
BLCKSZ, but we have no good way to determine that. Even if we determine
the page size at the time of taking the backup, the backup may move. For
now the BLCKSZ seems sufficient - the filesystem page is usually 4K, so
the default BLCKSZ (8K by default) is aligned to that.

Author: Tomas Vondra
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/3024283a-7491-4240-80d0-421575f6bb23%40enterprisedb.com
2024-04-05 16:30:01 +02:00
Alvaro Herrera ee1cbe806d
Operate XLogCtl->log{Write,Flush}Result with atomics
This removes the need to hold both the info_lck spinlock and
WALWriteLock to update them.  We use stock atomic write instead, with
WALWriteLock held.  Readers can use atomic read, without any locking.

This allows for some code to be reordered: some places were a bit
contorted to avoid repeated spinlock acquisition, but that's no longer a
concern, so we can turn them to more natural coding.  Some further
changes are possible (maybe to performance wins), but in this commit I
did rather minimal ones only, to avoid increasing the blast radius.

Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
2024-04-05 16:14:39 +02:00
Amit Kapila 6f132ed693 Allow synced slots to have their inactive_since.
This commit does two things:
1) Maintains inactive_since for sync slots whenever the slot is released
just like any other regular slot.

2) Ensures the value is set to the current timestamp during the promotion
of standby to help correctly interpret the time after promotion. We don't
want the slots to appear inactive for a long time after promotion if they
haven't been synchronized recently. This would also avoid the invalidation
of such slots immediately after promotion if tomorrow we have a feature
that invalidates slots based on their inactivity time. Whoever acquires
the slot i.e. makes the slot active will reset it to NULL.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik, Masahiko Sawada
Discussion: https://postgr.es/m/CAA4eK1KrPGwfZV9LYGidjxHeW+rxJ=E2ThjXvwRGLO=iLNuo=Q@mail.gmail.com
Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmob_Ta-t2ty8QrKHBGnNLrf4ZYcwhGHGFsuUoFrAEDw4sA@mail.gmail.com
2024-04-05 09:48:49 +05:30
Michael Paquier f98dbdeb51 Add "ABI_compatibility" regions to wait_event_names.txt
The current design behind the automatic generation of the C code and
documentation related to wait events introduced in fa88928470 does not
offer a way to attach new wait events without breaking ABI
compatibility, as all the events are forcibly reordered for each section
in the input file wait_event_names.txt.  Adding new wait events to
stable branches is something that has happened in the past, 0b6517a3b7
being a recent example of that with VERSION_FILE_SYNC, so we need a way
to generate any C code for wait events while maintaining compatibility
on stable branches already released.

This commit solves this issue by adding a new region called
"ABI_compatibility" (keyword could be updated to something else if
someone had a better idea) to each section of wait_event_names.txt, so
as one can add new wait events to stable branches in
wait_event_names.txt while keeping the code ABI-compatible.
"ABI_compatibility" has no impact on the documentation generated: all
the wait events of one section are still alphabetically ordered.  LWLock
and Lock sections generate their C code elsewhere, so they do not need
an "ABI_compatibility" region.

For example, let's imagine a wait_event_names.txt like that:
Section: ClassName - Foo
FOO_1	"Waiting in Foo 1"
FOO_2	"Waiting in Foo 2"
ABI_compatibility:
NEW_FOO_1	"Waiting in New Foo 1"
NEW_BAR_1	"Waiting in New Bar 1"

This results in the following enum, where the events in the ABI region
are listed last with the same ordering as in wait_event_names.txt:
typedef enum
{
    WAIT_EVENT_FOO_1,
    WAIT_EVENT_FOO_2,
    WAIT_EVENT_NEW_FOO_1,
    WAIT_EVENT_NEW_BAR_1
} WaitEventFoo;

New wait events added in stable branches should be added at the end of
each ABI_compatibility region, and ABI_compatibility should remain empty
on HEAD and unreleased stable branches.

This design has been suggested by Noah Misch and me.

Reported-by: Noah Misch
Author: Bertrand Drouvot
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/20240317183114.16@rfd.leadboat.com
2024-04-05 08:56:52 +09:00
Amit Langote 6f4d63e989 Add missing initialization in transformJsonFuncExpr()
de3600452b added some code for the new JSON_TABLE_OP to that function
but missed to initialize the default_format variable.

Reported-by: Erik Rijkers <er@xs4all.nl>
Discussion: https://postgr.es/m/254b2fa2-2f6b-a30a-20ee-21f8a2c12a50@xs4all.nl
2024-04-04 22:01:13 +09:00
Amit Langote 2f6e78b061 Fix typo introduced in 6185c9737
Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxGHiU0p0usjh5hnR0_ByZn4tq1FC3eKAtrQgJeKU6W9kw@mail.gmail.com
2024-04-04 20:53:23 +09:00
Amit Langote de3600452b Add basic JSON_TABLE() functionality
JSON_TABLE() allows JSON data to be converted into a relational view
and thus used, for example, in a FROM clause, like other tabular
data.  Data to show in the view is selected from a source JSON object
using a JSON path expression to get a sequence of JSON objects that's
called a "row pattern", which becomes the source to compute the
SQL/JSON values that populate the view's output columns.  Column
values themselves are computed using JSON path expressions applied to
each of the JSON objects comprising the "row pattern", for which the
SQL/JSON query functions added in 6185c9737c are used.

To implement JSON_TABLE() as a table function, this augments the
TableFunc and TableFuncScanState nodes that are currently used to
support XMLTABLE() with some JSON_TABLE()-specific fields.

Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN
clauses, which are required to provide more flexibility to extract
data out of nested JSON objects, but they are not implemented here
to keep this commit of manageable size.

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>

Reviewers have included (in no particular order):

Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
2024-04-04 20:20:15 +09:00
Andrew Dunstan 222e11a10a Use incremental parsing of backup manifests.
This changes the three callers to json_parse_manifest() to use
json_parse_manifest_incremental_chunk() if appropriate. In the case of
the backend caller, since we don't know the size of the manifest in
advance we always call the incremental parser.

Author: Andrew Dunstan
Reviewed-By: Jacob Champion

Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net
2024-04-04 06:46:40 -04:00
David Rowley 3a4a3537a9 Secondary refactor of heap scanning functions
Similar to 44086b097, refactor heap scanning functions to be more
suitable for the read stream API.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
2024-04-04 19:22:45 +13:00
Michael Paquier 2a217c3717 Coordinate emit_log_hook and all log destinations to share the same timeval
This would cause the timestamp values used by emit_log_hook and all the
other log destinations to differ, because the timestamps are reset
before sending the logs to the server and after calling the hook.

This change matters for emit_log_hook when generating log information
with 'n' or 'm' in log_line_prefix through log_status_format(), or when
doing direct calls to get_formatted_log_time() like in the JSON or CSV
log formats.

While on it, this commit fixes a couple of comments related to the
formatted timestamps where the JSON was not mentioned.  Oversight in
dc686681e0, that I have noticed while reviewing this patch.

Author: Kambam Vinay, Michael Paquier
Discussion: https://postgr.es/m/CANiRfmsK36A0i8mnQtzaxhSm3CUCimPwJPp4WQNq53OdSNkgWg@mail.gmail.com
2024-04-04 14:15:22 +09:00
David Rowley 44086b0975 Preliminary refactor of heap scanning functions
To allow the use of the read stream API added in b5a9b18cd for
sequential scans on heap tables, here we make some adjustments to make
that change less invasive and perhaps make the code easier to follow in
the process.

Here heapgetpage() gets broken into two functions:

1) The part which reads the block has now been moved into a function
   named heapfetchbuf().
2) The part which performed pruning and populated the scan's
   rs_vistuples[] array is now moved into a new function named
   heap_prepare_pagescan().

The functionality provided by heap_prepare_pagescan() was only ever
required by SO_ALLOW_PAGEMODE scans, so the branching that was
previously done in heapgetpage() is no longer needed as we simply just
don't call heap_prepare_pagescan() from heapgettup() in the refactored
code.

Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
2024-04-04 16:41:13 +13:00
Jeff Davis 71b66171d0 CREATE INDEX: do not update stats during binary upgrade.
During binary upgrade, indexes are created before the data is moved
into place, so it will always be zero.

This is not currently a major problem, but will be when we try to
preserve statistics during upgrade.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=daPdFB8V0tgFxK-dLowFsAEzWRWJHyxij7BG3kBjcouA@mail.gmail.com
2024-04-03 16:12:45 -07:00
Tom Lane 06286709ee Invent SERIALIZE option for EXPLAIN.
EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about
the volume of data emitted by a query, as well as the time taken
to convert the data to the on-the-wire format.  Previously there
was no way to investigate this without actually sending the data
to the client, in which case network transmission costs might
swamp what you wanted to see.  In particular this feature allows
investigating the costs of de-TOASTing compressed or out-of-line
data during formatting.

Stepan Rutz and Matthias van de Meent,
reviewed by Tomas Vondra and myself

Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
2024-04-03 17:41:57 -04:00
Alvaro Herrera c9920a9068
Split XLogCtl->LogwrtResult into separate struct members
After this change we have XLogCtl->logWriteResult and ->logFlushResult.
There's no functional change, other than the fact that the assignment
from shared memory to local is no longer done via struct assignment, but
instead using a macro that copies each member separately.

The current representation is inconvenient going forward; notably, we
would like to add a new member "Copy" (to keep track of the last
position copied into WAL buffers), so the symmetry between the values in
shared memory vs. those in local would be lost.

This also gives us freedom to later change the concurrency model for the
values in shared memory: we can make them use atomics instead of relying
on the info_lck spinlock.

Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
2024-04-03 19:55:11 +02:00
Heikki Linnakangas 6dbb490261 Combine freezing and pruning steps in VACUUM
Execute both freezing and pruning of tuples in the same
heap_page_prune() function, now called heap_page_prune_and_freeze(),
and emit a single WAL record containing all changes. That reduces the
overall amount of WAL generated.

This moves the freezing logic from vacuumlazy.c to the
heap_page_prune_and_freeze() function. The main difference in the
coding is that in vacuumlazy.c, we looked at the tuples after the
pruning had already happened, but in heap_page_prune_and_freeze() we
operate on the tuples before pruning. The heap_prepare_freeze_tuple()
function is now invoked after we have determined that a tuple is not
going to be pruned away.

VACUUM no longer needs to loop through the items on the page after
pruning. heap_page_prune_and_freeze() does all the work. It now
returns the list of dead offsets, including existing LP_DEAD items, to
the caller. Similarly it's now responsible for tracking 'all_visible',
'all_frozen', and 'hastup' on the caller's behalf.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-03 19:32:28 +03:00
Heikki Linnakangas 26d138f644 Refactor how heap_prune_chain() updates prunable_xid
In preparation of freezing and counting tuples which are not
candidates for pruning, split heap_prune_record_unchanged() into
multiple functions, depending the kind of line pointer. That's not too
interesting right now, but makes the next commit smaller.

Recording the lowest soon-to-be prunable xid is one of the actions we
take for unchanged LP_NORMAL item pointers but not for others, so move
that to the new heap_prune_record_unchanged_lp_normal() function. The
next commit will add more actions to these functions.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-03 19:32:21 +03:00
Alvaro Herrera be2f073100
Fix zeroing of pg_serial page without SLRU bank lock
Bug in commit 53c2a97a9266: we failed to acquire the correct SLRU bank
lock when iterating to zero-out intermediate pages in predicate.c.
Rewrite the code block so that we follow the locking protocol correctly.

Also update an outdated comment in the same file -- SerialSLRULock
exists no more.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/2a25eaf4-a3a4-5fd1-6241-9d7c73142085@gmail.com
2024-04-03 17:49:44 +02:00
Alexander Korotkov bf1e650806 Use the pairing heap instead of a flat array for LSN replay waiters
06c418e163 introduced pg_wal_replay_wait() procedure allowing to wait for
the particular LSN to be replayed on standby.  The waiters were stored in
the flat array.  Even though scanning small arrays is fast, that might be a
problem at scale (a lot of waiting processes).

This commit replaces the flat shared memory array with the pairing heap,
which holds the waiter with the least LSN at the top.  This gives us O(log N)
complexity for both inserting and removing waiters.

Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
2024-04-03 18:15:41 +03:00
Amit Kapila 2ec005b4e2 Ensure that the sync slots reach a consistent state after promotion without losing data.
We were directly copying the LSN locations while syncing the slots on the
standby. Now, it is possible that at some particular restart_lsn there are
some running xacts, which means if we start reading the WAL from that
location after promotion, we won't reach a consistent snapshot state at
that point. However, on the primary, we would have already been in a
consistent snapshot state at that restart_lsn so we would have just
serialized the existing snapshot.

To avoid this problem we will use the advance_slot functionality unless
the snapshot already exists at the synced restart_lsn location. This will
help us to ensure that snapbuilder/slot statuses are updated properly
without generating any changes. Note that the synced slot will remain as
RS_TEMPORARY till the decoding from corresponding restart_lsn can reach a
consistent snapshot state after which they will be marked as
RS_PERSISTENT.

Per buildfarm

Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Shveta Malik, Bharath Rupireddy, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2024-04-03 14:04:59 +05:30
Alexander Korotkov e37662f221 Minor improvements for waitlsn.c
* Remove extra includes
 * Fill 'cur' in addLSNWaiter() before taking the spinlock
 * Initialize 'endtime' with zero in WaitForLSN() to avoid compiler warning

Reported-by: Alvaro Herrera, Masahiko Sawada, Daniel Gustafsson
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Discussion: https://postgr.es/m/CAD21AoAx7irptnPH1OkkkNh9E0M6X-phfX7sYZfwoMsc1qV1sQ%40mail.gmail.com
2024-04-03 11:32:39 +03:00
Daniel Gustafsson 226261f387 Add error codes to some PANIC/FATAL errors reports
This adds errcodes to a set of PANIC and FATAL errors in xlog.c
and relcache.c,  which previously had no errcode at all set, in
order to make fleetwide analysis of errorlogs easier. There are
many more ereport/elogs left which could benefit from having an
errcode but this at least makes a dent in the issue.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
2024-04-03 09:19:25 +02:00
Nathan Bossart c627d944e6 Add built-in ERROR handling for archive callbacks.
Presently, the archiver process restarts when an archive callback
ERRORs.  To avoid this, archive module authors can use sigsetjmp(),
manage a memory context, etc., but that requires a lot of extra
code that will likely look roughly the same between modules.  This
commit adds basic archive callback ERROR handling to pgarch.c so
that module authors won't ordinarily need to worry about this.
While this built-in handler attempts to clean up anything that an
archive module could conceivably have left behind, it is possible
that some modules are doing unexpected things that require
additional cleanup.  Module authors should be sure to do any extra
required cleanup in a PG_CATCH block within the archiving callback.

The archiving callback is now called in a short-lived memory
context that the archiver process resets between invocations.  If a
module requires longer-lived storage, it must maintain its own
memory context.

Thanks to these changes, the basic_archive module can be greatly
simplified.

Suggested-by: Andres Freund
Reviewed-by: Andres Freund, Yong Li
Discussion: https://postgr.es/m/20230217215624.GA3131134%40nathanxps13
2024-04-02 22:28:11 -05:00
Masahiko Sawada 5bec1d6bc5 Improve eviction algorithm in ReorderBuffer using max-heap for many subtransactions.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. This could lead to a significant replication lag
especially in the case where there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts with empty. While the max-heap is empty, we don't
do anything for the max-heap when updating the memory
counter. Therefore, we get the largest transaction in O(N) time, where
N is the number of transactions including top-level transactions and
subtransactions.

We build the max-heap just before selecting the largest transactions
if the number of transactions being decoded is higher than the
threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap,
we also update the max-heap when updating the memory counter. The
intention is to efficiently find the largest transaction in O(1) time
instead of incurring the cost of memory counter updates (O(log
N)). Once the number of transactions got lower than the threshold, we
reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Peter Smith, Álvaro Herrera,
Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
2024-04-03 11:40:42 +09:00
David Rowley 7487044d6c Don't adjust ressortgroupref in generate_setop_child_grouplist()
This is already done inside assignSortGroupRef(), therefore is
redundant.

Oversight from 66c0185a3.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3703023.1711654574@sss.pgh.pa.us
2024-04-03 15:39:29 +13:00
Masahiko Sawada b840508644 Add functions to binaryheap for efficient key removal and update.
Previously, binaryheap didn't support updating a key and removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
2024-04-03 10:44:21 +09:00
Alexander Korotkov 2c91e13013 Move WaitLSNShmemInit() to CreateOrAttachShmemStructs()
Thanks to Andres Freund, Thomas Munrom and David Rowley for investigating
this issue.

Discussion: https://postgr.es/m/CAPpHfdvap5mMLikt8CUjA0osAvCJHT0qnYeR3f84EJ_Kvse0mg%40mail.gmail.com
2024-04-03 02:55:03 +03:00
David Rowley 3b1a7eb289 Don't zero tuple_fraction when planning UNIONs with ORDER BYs
Since 66c0185a3, the planner is able to use Merge Append -> Unique to
implement UNION queries and each subquery is prompted to produce Paths
correctly sorted by the UNION's targetlist.

Here we remove some now redundant code which was zeroing the
tuple_fraction at the parent level.  This will allow the planner to
consider cheap startup paths when planning the UNION's subqueries.

EXCEPT and INTERSECT set operations still have the tuple_fraction zeroed
in generate_nonunion_paths().  These operations currently always read
all of their subqueries' tuples.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3703023.1711654574@sss.pgh.pa.us
2024-04-03 11:40:33 +13:00
Alexander Korotkov 06c418e163 Implement pg_wal_replay_wait() stored procedure
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
2024-04-02 22:48:03 +03:00
Tom Lane 6faca9ae28 Avoid deadlock during orphan temp table removal.
If temp tables have dependencies (such as sequences) then it's
possible for autovacuum's cleanup of orphan temp tables to deadlock
against an incoming backend that's trying to clean out the temp
namespace for its own use.  That can happen because RemoveTempRelations'
performDeletion call can visit objects within the namespace in
an order different from the order in which a per-table deletion
will visit them.

To fix, observe that performDeletion will begin by taking an exclusive
lock on the temp namespace (even though it won't actually delete it).
So, if we can get a shared lock on the namespace, we can be sure we're
not running concurrently with RemoveTempRelations, while also not
conflicting with ordinary use of the namespace.  This requires
introducing a conditional version of LockDatabaseObject, but that's no
big deal.  (It's surprising we've got along without that this long.)

Report and patch by Mikhail Zhilin.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/c43ce028-2bc2-4865-9b89-3f706246eed5@postgrespro.ru
2024-04-02 14:59:32 -04:00
Nathan Bossart 4133c1f45c Avoid function call overhead of pg_popcount() in syslogger.c.
Instead of calling the pg_popcount() function for a single byte, we
can look up the value in the pg_number_of_ones array.

Discussion: https://postgr.es/m/20240401221117.GB2362108%40nathanxps13
2024-04-02 10:32:49 -05:00
Thomas Munro b5a9b18cd0 Provide API for streaming relation data.
Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream.  This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers().  It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.

Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.

Several patches using this API are proposed separately.

This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:49:46 +13:00
Thomas Munro 210622c60e Provide vectored variant of ReadBuffer().
Break ReadBuffer() up into two steps.  StartReadBuffers() and
WaitReadBuffers() give us two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs.  Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once.  The following commit will add a higher level mechanism
for doing that automatically with a practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
2024-04-03 00:23:20 +13:00
Alexander Korotkov 867cc7b6dd Revert "Custom reloptions for table AM"
This reverts commit c95c25f9af due to multiple
design issues spotted after commit.

Reported-by: Jeff Davis
Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
2024-04-02 11:29:00 +03:00
Masahiko Sawada 667e65aac3 Use TidStore for dead tuple TIDs storage during lazy vacuum.
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:

* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
  branch prediction behavior.

This commit replaces that array with the TID store from commit
30e144287a.

Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.

Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.

For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.

Bump catalog version.

Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-04-02 10:15:37 +09:00
David Rowley d5d2205c8d Fix assert failure when planning setop subqueries with CTEs
66c0185a3 adjusted the UNION planner to request that union child queries
produce Paths correctly ordered to implement the UNION by way of
MergeAppend followed by Unique.  The code there made a bad assumption
that if the root->parent_root->parse had setOperations set that the
query must be the child subquery of a set operation.  That's not true
when it comes to planning a non-inlined CTE which is parented by a set
operation.  This causes issues as the CTE's targetlist has no
requirement to match up to the SetOperationStmt's groupClauses

Fix this by adding a new parameter to both subquery_planner() and
grouping_planner() to explicitly pass the SetOperationStmt only when
planning set operation child subqueries.

Thank you to Tom Lane for helping to rationalize the decision on the
best function signature for subquery_planner().

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/242fc7c6-a8aa-2daf-ac4c-0a231e2619c1@gmail.com
2024-04-02 12:15:45 +13:00
Heikki Linnakangas 3d0f730bf1 Introduce 'options' argument to heap_page_prune()
Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW
which replaces the old boolean argument, but upcoming patches will
introduce at least one more. Having a lot of boolean arguments makes
it hard to see at the call sites what the arguments mean, so prefer a
bitmask of options with human-readable names.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov
2024-04-02 00:56:05 +03:00
Heikki Linnakangas 6f47f68831 Handle non-chain tuples outside of heap_prune_chain()
Handle dead branches of aborted HOT chains outside heap_prune_chain()
as a separate phase. This simplifies the logic in heap_prune_chain(),
as well as allowing us to clean up more RECENTLY_DEAD -> DEAD chains.

To accomplish this efficiently, partition tuples into HOT and non-HOT
while first collecting visibility information for each tuple in
heap_page_prune(). Then call heap_prune_chain() only on potential
chain members. Then mop up the leftover HOT tuples afterwards.

As part of this, keep track of which items on page have already been
processed, in 'processed' array. This replaces the 'marked' array
which was only set for tuples marked for removal or redirection. The
'processed' array is updated also for items that are left unchanged,
when we conclude that an item can be left unchanged. At the end of
pruning, every item on the page should be marked as processed in the
array; an assertion is added for that.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-01 20:33:50 +03:00
Heikki Linnakangas 7aa00f1360 Refactor heap_prune_chain()
Keep track of the number of deleted tuples in PruneState and record this
information when recording a tuple dead, unused or redirected. This
removes a special case from the traversal and chain processing logic as
well as setting a precedent of recording the impact of prune actions in
the record functions themselves. This paradigm will be used in future
commits which move tracking of additional statistics on pruning actions
from lazy_scan_prune() to heap_prune_chain().

Simplify heap_prune_chain()'s chain traversal logic by handling each
case explicitly. That is, do not attempt to share code when processing
different types of chains. For each category of chain, process it
specifically and procedurally: first handling the root, then any
intervening tuples, and, finally, the end of the chain.

While we are at it, add a few new comments to heap_prune_chain()
clarifying some special cases involving RECENTLY_DEAD tuples.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
2024-04-01 13:28:44 +03:00
Heikki Linnakangas 9917e79d99 Minor refactoring in heap_page_prune
Pass 'page', 'blockno' and 'maxoff' to heap_prune_chain() as
arguments, so that it doesn't need to fetch them from the buffer. This
saves a few cycles per chain.

Remove the "if (off_loc != NULL)" checks, and require the caller to
pass a non-NULL 'off_loc'. Pass a pointer to a dummy local variable
when it's not needed. Those checks are cheap, but it's still better to
avoid them in the per-chain loops when we can do so easily.

The CPU time saving from these changes are hardly measurable, but
fewer instructions is good anyway, so why not. I spotted the potential
for these while reviewing Melanie Plageman's patch set to combine
prune and freeze records.

Discussion: https://www.postgresql.org/message-id/CAAKRu_abm2tHhrc0QSQa%3D%3DsHe%3DVA1%3Doz1dJMQYUOKuHmu%2B9Xrg%40mail.gmail.com
2024-04-01 12:07:30 +03:00
Masahiko Sawada f5a227895e Add new COPY option LOG_VERBOSITY.
This commit adds a new COPY option LOG_VERBOSITY, which controls the
amount of messages emitted during processing. Valid values are
'default' and 'verbose'.

This is currently used in COPY FROM when ON_ERROR option is set to
ignore. If 'verbose' is specified, a NOTICE message is emitted for
each discarded row, providing additional information such as line
number, column name, and the malformed value. This helps users to
identify problematic rows that failed to load.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/CALj2ACUk700cYhx1ATRQyRw-fBM%2BaRo6auRAitKGff7XNmYfqQ%40mail.gmail.com
2024-04-01 15:25:25 +09:00
Alexander Korotkov b1484a3f19 Let table AM insertion methods control index insertion
Previously, the executor did index insert unconditionally after calling
table AM interface methods tuple_insert() and multi_insert().  This commit
introduces the new parameter insert_indexes for these two methods.  Setting
'*insert_indexes' to true saves the current logic.  Setting it to false
indicates that table AM cares about index inserts itself and doesn't want the
caller to do that.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger
2024-03-30 22:53:56 +02:00
Alexander Korotkov c95c25f9af Custom reloptions for table AM
Let table AM define custom reloptions for its tables.  This allows to
specify AM-specific parameters by WITH clause when creating a table.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent
2024-03-30 22:36:25 +02:00
Alexander Korotkov 27bc1772fc Generalize relation analyze in table AM interface
Currently, there is just one algorithm for sampling tuples from a table written
in acquire_sample_rows().  Custom table AM can just redefine the way to get the
next block/tuple by implementing scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample this way index-organized tables.  This commit allows table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
2024-03-30 22:34:04 +02:00
Tom Lane b154d8a6d0 Add pg_basetype() function to extract a domain's base type.
This SQL-callable function behaves much like our internal utility
function getBaseType(), except it returns NULL rather than failing for
an invalid type OID.  (That behavior is modeled on our experience with
other catalog-inquiry functions such as the ACL checking functions.)
The key advantage over doing a join to pg_type is that it will loop
as needed to find the bottom base type of a nest of domains.

Steve Chavez, reviewed by jian he and others

Discussion: https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=K3bknVfnFjAv1diVJxFHJvg@mail.gmail.com
2024-03-30 13:57:19 -04:00
Dean Rasheed 0294df2f1f Add support for MERGE ... WHEN NOT MATCHED BY SOURCE.
This allows MERGE commands to include WHEN NOT MATCHED BY SOURCE
actions, which operate on rows that exist in the target relation, but
not in the data source. These actions can execute UPDATE, DELETE, or
DO NOTHING sub-commands.

This is in contrast to already-supported WHEN NOT MATCHED actions,
which operate on rows that exist in the data source, but not in the
target relation. To make this distinction clearer, such actions may
now be written as WHEN NOT MATCHED BY TARGET.

Writing WHEN NOT MATCHED without specifying BY SOURCE or BY TARGET is
equivalent to writing WHEN NOT MATCHED BY TARGET.

Dean Rasheed, reviewed by Alvaro Herrera, Ted Yu and Vik Fearing.

Discussion: https://postgr.es/m/CAEZATCWqnKGc57Y_JanUBHQXNKcXd7r=0R4NEZUVwP+syRkWbA@mail.gmail.com
2024-03-30 10:00:26 +00:00
Jeff Davis 46e5441fa5 Add unicode_strtitle() for Unicode Default Case Conversion.
This brings the titlecasing implementation for the builtin provider
out of formatting.c and into unicode_case.c, along with
unicode_strlower() and unicode_strupper(). Accepts an arbitrary word
boundary callback.

Simple for now, but can be extended to support the Unicode Default
Case Conversion algorithm with full case mapping.

Discussion: https://postgr.es/m/3bc653b5d562ae9e2838b11cb696816c328a489a.camel@j-davis.com
Reviewed-by: Peter Eisentraut
2024-03-29 17:35:07 -07:00
Jeff Davis 46a44dc372 Use version for builtin collations.
Given that the version field already exists, there's little reason not
to use it. Suggestion from Peter Eisentraut.

Discussion: https://postgr.es/m/613c120a-5413-4fa7-a501-6590eae558f8@eisentraut.org
Reviewed-by: Peter Eisentraut
2024-03-29 10:53:26 -07:00
Robert Haas d3ae2a24f2 Add allow_alter_system GUC.
This is marked PGC_SIGHUP, so it can only be set in a configuration
file, not anywhere else; and it is also marked GUC_DISALLOW_IN_AUTO_FILE,
so it can't be set using ALTER SYSTEM. When set to false, the
ALTER SYSTEM command is disallowed.

There was considerable concern that this would be misinterpreted as
a security feature, which it is not, because a determined superuser
has various ways of bypassing it. Hence, a lot of work has gone into
wordsmithing the documentation, in the hopes of avoiding any such
confusion.

Jelte Fennemia-Nio and Gabriele Bartolini, with wording suggestions
for the documentation from many others.

Discussion: http://postgr.es/m/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com
2024-03-29 08:45:11 -04:00
Tom Lane 0075d78947 Allow "internal" subtransactions in parallel mode.
Allow use of BeginInternalSubTransaction() in parallel mode, so long
as the subtransaction doesn't attempt to acquire an XID or increment
the command counter.  Given those restrictions, the other parallel
processes don't need to know about the subtransaction at all, so
this should be safe.  The benefit is that it allows subtransactions
intended for error recovery, such as pl/pgsql exception blocks,
to be used in PARALLEL SAFE functions.

Another reason for doing this is that the API of
BeginInternalSubTransaction() doesn't allow reporting failure.
pl/python for one, and perhaps other PLs, copes very poorly with an
error longjmp out of BeginInternalSubTransaction().  The headline
feature of this patch removes the only easily-triggerable failure
case within that function.  There remain some resource-exhaustion
and similar cases, which we now deal with by promoting them to FATAL
errors, so that callers need not try to clean up.  (It is likely
that such errors would leave us with corrupted transaction state
inside xact.c, making recovery difficult if not impossible anyway.)

Although this work started because of a report of a pl/python crash,
we're not going to do anything about that in the back branches.
Back-patching this particular fix is obviously not very wise.
While we could contemplate some narrower band-aid, pl/python is
already an untrusted language, so it seems okay to classify this
as a "so don't do that" case.

Patch by me, per report from Hao Zhang.  Thanks to Robert Haas for
review.

Discussion: https://postgr.es/m/CALY6Dr-2yLVeVPhNMhuBnRgOZo1UjoTETgtKBx1B2gUi8yy+3g@mail.gmail.com
2024-03-28 12:43:10 -04:00
Alvaro Herrera e2395cdbe8
ALTER TABLE: rework determination of access method ID
Avoid setting an access method OID for relation kinds that don't take
one.  Code review for new feature added in 374c7a2290.

Author: Justin Pryzby <pryzby@telsasoft.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/e5516ac1-5264-c3c0-d822-9e6f614ea93b@gmail.com
2024-03-28 16:51:20 +01:00
Tom Lane be98a550cc Update comment in set_dummy_rel_pathlist().
This comment claimed that set_dummy_rel_pathlist() has callers
other than (possibly indirectly) set_rel_size().  It doesn't,
so revise the argument to not rely on that.

Noted by Richard Guo.

Discussion: https://postgr.es/m/CAMbWs4-KFEU_fDuJPNCOkUu3rwvZvKBEytkd9VrM4kH4-2h1CQ@mail.gmail.com
2024-03-28 11:44:00 -04:00
Heikki Linnakangas 427005742b Remove obsolete comment about VACUUM retrying pruning
Commit 1ccc1e05ae removed the retry logic that the comment talked
about.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240328015326.x5gnzsohl6j23b42@liskov
2024-03-28 10:18:48 +02:00
Masahiko Sawada 2d8f56dabb Rethink create and attach APIs of shared TidStore.
Previously, the behavior of TidStoreCreate() was inconsistent between
local and shared TidStore instances in terms of memory limitation. For
local TidStore, a memory context was created with initial and maximum
memory block sizes, as well as a minimum memory context size, based on
the specified max_bytes values. However, for shared TidStore, the
provided DSA area was used for TID storage. Although commit bb952c8c8b
allowed specifying the initial and maximum DSA segment sizes, callers
would have needed to clamp their own limits, which was not consistent
and user-friendly.

With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using a provided DSA
area. The initial and maximum DSA segment sizes are chosen based on
the specified max_bytes. Other processes can attach to the shared
TidStore using the handle of the created DSA returned by the new
TidStoreGetDSA() function and the DSA pointer returned by
TidStoreGetHandle(). The created DSA has the same lifetime as the
shared TidStore and is deleted when all processes detach from it.

To improve clarity, the TidStoreCreate() function has been divided
into two separate functions: TidStoreCreateLocal() and
TidStoreCreateShared().

Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAyc1j%3DBCdUqZfk6qbdjZ68UgRx1Gkpk0oah4K7S0Ri9g%40mail.gmail.com
2024-03-28 10:03:28 +09:00
Tom Lane a767cdc84c Fix unnecessary use of moving-aggregate mode with non-moving frame.
When a plain aggregate is used as a window function, and the window
frame start is specified as UNBOUNDED PRECEDING, the frame's head
cannot move so we do not need to use moving-aggregate mode.  The check
for that was put into initialize_peragg(), failing to notice that
ExecInitWindowAgg() calls that function before it's filled in
winstate->frameOptions.  Since makeNode() would have zeroed the field,
this didn't provoke uninitialized-value complaints, nor would the
erroneous decision have resulted in more than a little inefficiency.
Still, it's wrong, so move the initialization of
winstate->frameOptions earlier to make it work properly.

While here, also fix a thinko in a comment.  Both errors crept in in
commit a9d9acbf2 which introduced the moving-aggregate mode.

Spotted by Vallimaharajan G.  Back-patch to all supported branches.

Discussion: https://postgr.es/m/18e7f2a5167.fe36253866818.977923893562469143@zohocorp.com
2024-03-27 13:39:03 -04:00
Robert Haas de7e96bd0f Rename COMPAT_OPTIONS_CLIENT to COMPAT_OPTIONS_OTHER.
The user-facing name is "Other Platforms and Clients", but the
internal name seems too focused on clients specifically, especially
given the plan to add a new setting to this session that is about
platform or deployment model compatibility rather than client
compatibility.

Jelte Fennema-Nio

Discussion: http://postgr.es/m/CAGECzQTfMbDiM6W3av+3weSnHxJvPmuTEcjxVvSt91sQBdOxuQ@mail.gmail.com
2024-03-27 10:45:28 -04:00
Dean Rasheed e6341323a8 Add functions to generate random numbers in a specified range.
This adds 3 new variants of the random() function:

    random(min integer, max integer) returns integer
    random(min bigint, max bigint) returns bigint
    random(min numeric, max numeric) returns numeric

Each returns a random number x in the range min <= x <= max.

For the numeric function, the number of digits after the decimal point
is equal to the number of digits that "min" or "max" has after the
decimal point, whichever has more.

The main entry points for these functions are in a new C source file.
The existing random(), random_normal(), and setseed() functions are
moved there too, so that they can all share the same PRNG state, which
is kept private to that file.

Dean Rasheed, reviewed by Jian He, David Zhang, Aleksander Alekseev,
and Tomas Vondra.

Discussion: https://postgr.es/m/CAEZATCV89Vxuq93xQdmc0t-0Y2zeeNQTdsjbmV7dyFBPykbV4Q@mail.gmail.com
2024-03-27 10:12:39 +00:00
Alexander Korotkov 818861eb57 Fix some typos and grammar issues from commit 87985cc925
Reported-by: Alexander Lakhin
2024-03-27 11:47:41 +02:00
Amit Kapila 6d49c8d4b4 Change last_inactive_time to inactive_since in pg_replication_slots.
Commit a11f330b55 added last_inactive_time to show the last time the slot
was inactive. But, it tells the last time that a currently-inactive slot
previously *WAS* active. This could be unclear, so we changed the name to
inactive_since.

Reported-by: Robert Haas
Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Shveta Malik, Amit Kapila
Discussion: https://postgr.es/m/CA+Tgmob_Ta-t2ty8QrKHBGnNLrf4ZYcwhGHGFsuUoFrAEDw4sA@mail.gmail.com
Discussion: https://postgr.es/m/CALj2ACUXS0SfbHzsX8bqo+7CZhocsV52Kiu7OWGb5HVPAmJqnA@mail.gmail.com
2024-03-27 09:27:44 +05:30
Masahiko Sawada bb952c8c8b Allow specifying initial and maximum segment sizes for DSA.
Previously, the DSA segment size always started with 1MB and grew up
to DSA_MAX_SEGMENT_SIZE. It was inconvenient in certain scenarios,
such as when the caller desired a soft constraint on the total DSA
segment size, limiting it to less than 1MB.

This commit introduces the capability to specify the initial and
maximum DSA segment sizes when creating a DSA area, providing more
flexibility and control over memory usage.

Reviewed-by: John Naylor, Tomas Vondra
Discussion: https://postgr.es/m/CAD21AoAYGGC1ePjVX0H%2Bpp9rH%3D9vuPK19nNOiu12NprdV5TVJA%40mail.gmail.com
2024-03-27 11:43:29 +09:00
Tom Lane 9d00cf4772 Remove some redundant set_cheapest() calls.
Commit e2fa76d80 centralized the responsibility for doing
set_cheapest() for a baserel, but these functions added later
seemingly didn't get the memo.  There's no apparent reason why
we need the cheapest path for these relation types to be available
any sooner than it is for other base relation types, so delete the
duplicate calls.  Doesn't save much since there's only one path
in these cases, but it might improve clarity.

Richard Guo

Discussion: https://postgr.es/m/CAMbWs4-KFEU_fDuJPNCOkUu3rwvZvKBEytkd9VrM4kH4-2h1CQ@mail.gmail.com
2024-03-26 16:02:44 -04:00
Nathan Bossart d365ae7054 Optimize roles_is_member_of() with a Bloom filter.
When the list of roles gathered by roles_is_member_of() grows very
large, a Bloom filter is created to help avoid some linear searches
through the list.  The threshold for creating the Bloom filter is
set arbitrarily high and may require future adjustment.

Suggested-by: Tom Lane
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com
2024-03-26 14:43:37 -05:00
Tom Lane fad3b5b5ac Fix failure of ALTER FOREIGN TABLE SET SCHEMA to move sequences.
Ordinary ALTER TABLE SET SCHEMA will also move any owned sequences
into the new schema.  We failed to do likewise for foreign tables,
because AlterTableNamespaceInternal believed that only certain
relkinds could have indexes, owned sequences, or constraints.
We could simply add foreign tables to that relkind list, but it
seems likely that the same oversight could be made again in
future.  Instead let's remove the relkind filter altogether.
These functions shouldn't cost much when there are no objects
that they need to process, and surely this isn't an especially
performance-critical case anyway.

Per bug #18407 from Vidushi Gupta.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/18407-4fd07373d252c6a0@postgresql.org
2024-03-26 15:28:31 -04:00
Tom Lane a65724dfa7 Propagate pathkeys from CTEs up to the outer query.
If we know the sort order of a CTE's output, and it is relevant
to the outer query, label the CTE's outer-query access path using
those pathkeys.  This may enable optimizations such as avoiding
a sort in the outer query.

The code for hoisting pathkeys into the outer query already exists
for regular RTE_SUBQUERY subqueries, but it wasn't getting used for
CTEs, possibly out of concern for maintaining an optimization fence
between the CTE and the outer query.  However, on the same arguments
used for commit f7816aec2, there seems no harm in letting the outer
query know what the inner query decided to do.

In support of this, we now remember the best Path as well as Plan
for each subquery for the rest of the planner run.  There may be
future applications for having that at hand, and it surely costs
little to build one more List.

Richard Guo (minor mods by me)

Discussion: https://postgr.es/m/CAMbWs49xYd3f8CrE8-WW3--dV1zH_sDSDn-vs2DzHj81Wcnsew@mail.gmail.com
2024-03-26 13:05:51 -04:00
Bruce Momjian e648e77e25 C comment: mention no doc for negative start of substring(text)
Also add URL to hackers discussion.

Backpatch-through: master
2024-03-26 12:27:41 -04:00
Tom Lane 8a92b70c11 Allow "make check"-style testing to work with musl C library.
The musl dynamic linker saves a pointer to the process' environment
value of LD_LIBRARY_PATH very early in startup.  When we move/clobber
the environment to make more room for ps status strings, we clobber
that value and thereby prevent libraries from being found via
LD_LIBRARY_PATH, which breaks the use of a temporary installation
for testing purposes.  To fix, stop collecting usable space for
ps status if we notice that the variable we are about to clobber
is LD_LIBRARY_PATH.  This will result in some reduction in how long
the ps status can be, but it's only likely to occur in temporary
test contexts, so it doesn't seem like a big problem.  In any case,
we don't have to do it if we see we are on glibc, which surely is
where the majority of our Linux testing is done.

Thomas Munro, Bruce Momjian, and Tom Lane, per report from Wolfgang
Walther.  Back-patch to all supported branches, with the hope that
we'll set up a buildfarm animal to test on this platform.

Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488@technowledgy.de
2024-03-26 11:44:49 -04:00
Peter Eisentraut 89e5ef7e21 Remove ObjectClass type
ObjectClass is an enum whose values correspond to catalog OIDs.  But
the extra layer of redirection, which is used only in small parts of
the code, and the similarity to ObjectType, are confusing and
cumbersome.

One advantage has been that some switches processing the OCLASS enum
don't have "default:" cases.  This is so that the compiler tells us
when we fail to add support for some new object class.  But you can
also handle that with some assertions and proper test coverage.  It's
not even clear how strong this benefit is.  For example, in
AlterObjectNamespace_oid(), you could still put a new OCLASS into the
"ignore object types that don't have schema-qualified names" case, and
it might or might not be wrong.  Also, there are already various
OCLASS switches that do have a default case, so it's not even clear
what the preferred coding style should be.

Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw%40mail.gmail.com
2024-03-26 10:08:56 +01:00
Masahiko Sawada 4edb37e322 Fix a calculation in TidStoreCreate().
Since we expect that the max_bytes is in bytes, not in kilobytes, it
should not be multiplied by 1024.

Introduced by 30e144287a.

Reported-by: John Naylor, David Rowley
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CANWCAZZTE-14ofsucofTuhFsfuDGBNf%3DNZb22TMYT8bxA41oQQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvojg82NDaDEpj1WEZSbVTafj%3DDRmW%2BFrkBdW8ScL4OFxA%40mail.gmail.com
2024-03-26 13:06:06 +09:00
Alexander Korotkov 41d3780d3d Improve error message for tts_(virtual|minimal)_is_current_xact_tuple
Discussion: https://postgr.es/m/CALT9ZEHNeagO5PLb4Nv9J_ZaCtp%2BArdVmbSLc0RHUzx_RPAa4w%40mail.gmail.com
Author: Pavel Borisov
2024-03-26 01:55:22 +02:00
Alexander Korotkov 10baee0c95 Add comments on some MinimalTupleSlots methods usage
Discussion: https://postgr.es/m/CALT9ZEHNeagO5PLb4Nv9J_ZaCtp%2BArdVmbSLc0RHUzx_RPAa4w%40mail.gmail.com
Author: Pavel Borisov
2024-03-26 01:53:34 +02:00
Alexander Korotkov 87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by the concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This commit eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

Also, this commit makes tuple_update()/tuple_delete() optionally save the old
tuple into the dedicated slot.  That saves efforts on re-fetching tuples in
certain cases.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2024-03-26 01:27:56 +02:00
Tom Lane c7076ba6ad Refactor predicate_{implied,refuted}_by_simple_clause.
Put the node-type-dependent operations into switches on nodeTag.
This should ease addition of new proof rules for other expression
node types.  There is no functional change, although some tests
are made in a different order than before.

Also, add a couple of new cross-checks in test_predtest.c.

James Coleman (part of a larger patch series)

Discussion: https://postgr.es/m/CAAaqYe8Bo4bf_i6qKj8KBsmHMYXhe3Xt6vOe3OBQnOaf3_XBWg@mail.gmail.com
2024-03-25 17:45:15 -04:00
Jeff Davis bc5fcaa289 Clarify comment for LogicalTapeSetBlocks().
Discussion: https://postgr.es/m/1229327.1711160246@sss.pgh.pa.us
Backpatch-through: 13
2024-03-25 11:51:44 -07:00
Alvaro Herrera 374c7a2290
Allow specifying an access method for partitioned tables
It's now possible to specify a table access method via
CREATE TABLE ... USING for a partitioned table, as well change it with
ALTER TABLE ... SET ACCESS METHOD.  Specifying an AM for a partitioned
table lets the value be used for all future partitions created under it,
closely mirroring the behavior of the TABLESPACE option for partitioned
tables.  Existing partitions are not modified.

For a partitioned table with no AM specified, any new partitions are
created with the default_table_access_method.

Also add ALTER TABLE ... SET ACCESS METHOD DEFAULT, which reverts to the
original state of using the default for new partitions.

The relcache of partitioned tables is not changed: rd_tableam is not
set, even if a partitioned table has a relam set.

Author: Justin Pryzby <pryzby@telsasoft.com>
Author: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Author: Michaël Paquier <michael@paquier.xyz>
Reviewed-by: The authors themselves
Discussion: https://postgr.es/m/CAE-ML+9zM4wJCGCBGv01k96qQ3gFv4WFcFy=zqPHKeaEFwwv6A@mail.gmail.com
Discussion: https://postgr.es/m/20210308010707.GA29832%40telsasoft.com
2024-03-25 16:30:36 +01:00
Daniel Gustafsson 64e401b62b Fix indentation from a11f330b5
Per buildfarm animal koel
2024-03-25 14:18:33 +01:00
Heikki Linnakangas f83d709760 Merge prune, freeze and vacuum WAL record formats
The new combined WAL record is now used for pruning, freezing and 2nd
pass of vacuum. This is in preparation for changing VACUUM to write a
combined prune+freeze record per page, instead of separate two
records. The new WAL record format now supports that, but the code
still always writes separate records for pruning and freezing.

This reserves separate XLOG_HEAP2_* info codes for when the pruning
record is emitted for on-access pruning or VACUUM, per Peter
Geoghegan's suggestion. The record format is identical, but having
separate info codes makes it easier analyze pruning and vacuuming with
pg_waldump.

The function to emit the new WAL record, log_heap_prune_and_freeze(),
is in pruneheap.c. The existing heap_log_freeze_plan() and its
subroutines are moved to pruneheap.c without changes, to keep them
together with log_heap_prune_and_freeze().

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_azf-zH%3DDgVbquZ3tFWjMY1w5pO8m-TXJaMdri8z3933g@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAAKRu_b2oE4GL%3Dq4g9mcByS9yT7wTQvEH9OLpabj28e%2BWKFi2A@mail.gmail.com
2024-03-25 14:59:58 +02:00
Amit Kapila a11f330b55 Track last_inactive_time in pg_replication_slots.
This commit adds a new property called last_inactive_time for slots. It is
set to 0 whenever a slot is made active/acquired and set to the current
timestamp whenever the slot is inactive/released or restored from the disk.
Note that we don't set the last_inactive_time for the slots currently being
synced from the primary to the standby because such slots are typically
inactive as decoding is not allowed on those.

The 'last_inactive_time' will be useful on production servers to debug and
analyze inactive replication slots. It will also help to know the lifetime
of a replication slot - one can know how long a streaming standby, logical
subscriber, or replication slot consumer is down.

The 'last_inactive_time' will also be useful to implement inactive
timeout-based replication slot invalidation in a future commit.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
2024-03-25 16:34:33 +05:30
Amit Langote 0f7863afef Code review for 6190d828cd
* Fix the comment of init_dummy_sjinfo() to remove references to
  non-existing parameters 'rel1' and 'rel2'.

* Adjust consider_new_or_clause() to call init_dummy_sjinfo() to make
  up a SpecialJoinInfo for inner joins like other code sites that
  were adjusted in 6190d828cd to do so.

Author: Richard Guo <guofenglinux@gmail.com>
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw@mail.gmail.com
2024-03-25 19:45:27 +09:00
Amit Langote 6190d828cd Do not translate dummy SpecialJoinInfos for child joins
This teaches build_child_join_sjinfo() to create the dummy
SpecialJoinInfos (those created for inner joins) directly for a given
child join, skipping the unnecessary overhead of translating the
parent joinrel's SpecialJoinInfo.

To that end, this commit moves the code to initialize the dummy
SpecialJoinInfos to a new function named init_dummy_sjinfo() and
changes the few existing sites that have this code and
build_child_join_sjinfo() to call this new function.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw@mail.gmail.com
2024-03-25 18:06:47 +09:00
Amit Langote 5278d0a2e8 Reduce memory used by partitionwise joins
Specifically, this commit reduces the memory consumed by the
SpecialJoinInfos that are allocated for child joins in
try_partitionwise_join() by freeing them at the end of creating paths
for each child join.

A SpecialJoinInfo allocated for a given child join is a copy of the
parent join's SpecialJoinInfo, which contains the translated copies
of the various Relids bitmapsets and semi_rhs_exprs, which is a List
of Nodes.  The newly added freeing step frees the struct itself and
the various bitmapsets, but not semi_rhs_exprs, because there's no
handy function to free the memory of Node trees.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw@mail.gmail.com
2024-03-25 18:06:46 +09:00
David Rowley 66c0185a3d Allow planner to use Merge Append to efficiently implement UNION
Until now, UNION queries have often been suboptimal as the planner has
only ever considered using an Append node and making the results unique
by either using a Hash Aggregate, or by Sorting the entire Append result
and running it through the Unique operator.  Both of these methods
always require reading all rows from the union subqueries.

Here we adjust the union planner so that it can request that each subquery
produce results in target list order so that these can be Merge Appended
together and made unique with a Unique node.  This can improve performance
significantly as the union child can make use of the likes of btree
indexes and/or Merge Joins to provide the top-level UNION with presorted
input.  This is especially good if the top-level UNION contains a LIMIT
node that limits the output rows to a small subset of the unioned rows as
cheap startup plans can be used.

Author: David Rowley
Reviewed-by: Richard Guo, Andy Fan
Discussion: https://postgr.es/m/CAApHDvpb_63XQodmxKUF8vb9M7CxyUyT4sWvEgqeQU-GB7QFoQ@mail.gmail.com
2024-03-25 14:31:14 +13:00
Tom Lane af1d395843 Allow more cases to pass the unsafe-use-of-new-enum-value restriction.
Up to now we've rejected cases like

BEGIN;
CREATE TYPE rainbow AS ENUM ();
ALTER TYPE rainbow ADD VALUE 'red';
-- use the value 'red', perhaps in a constraint or index
COMMIT;

The concern is that the uncommitted enum value 'red' might get into
an index and then break the index if we roll back the ALTER ADD.
If the ALTER is in the same transaction as the CREATE then it's really
perfectly safe, but we weren't taking the trouble to identify that.

pg_dump in binary-upgrade mode will emit enum definitions that look
like the above, which up to now didn't fall foul of the unsafe-usage
check because we processed each restore command as a separate
transaction.  However an upcoming patch proposes to bundle the restore
commands into large transactions to reduce XID consumption during
pg_upgrade, and that makes this behavior a problem.

To fix, remember the OIDs of enum types created in the current
transaction, and allow use of enum values that are added to one later
in the same transaction.  To do this fully correctly in the presence
of subtransactions, we'd have to track subtransaction nesting level of
the CREATE and do maintenance work at every subsequent subtransaction
exit.  That seems expensive, and we don't need it to satisfy pg_dump's
usage.  Hence, apply the additional optimization only when the CREATE
and ALTER are at outermost transaction level.

Patch by me, reviewed by Andrew Dunstan

Discussion: https://postgr.es/m/1548468.1711220438@sss.pgh.pa.us
2024-03-24 14:30:29 -04:00
Peter Eisentraut 34768ee361 Add temporal FOREIGN KEY contraints
Add PERIOD clause to foreign key constraint definitions.  This is
supported for range and multirange types.  Temporal foreign keys check
for range containment instead of equality.

This feature matches the behavior of the SQL standard temporal foreign
keys, but it works on PostgreSQL's native ranges instead of SQL's
"periods", which don't exist in PostgreSQL (yet).

Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
are not supported yet.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
2024-03-24 07:37:13 +01:00
Daniel Gustafsson 697f8d266c Revert "Add notBefore and notAfter to SSL cert info display"
This reverts commit 6acb0a628e since
LibreSSL didn't support ASN1_TIME_diff until OpenBSD 7.1, leaving
the older OpenBSD animals in the buildfarm complaining.

Per plover in the buildfarm.

Discussion: https://postgr.es/m/F0DF7102-192D-4C21-96AE-9A01AE153AD1@yesql.se
2024-03-22 22:58:41 +01:00
Tom Lane 473182c952 Use a hash table for catcache.c's CatCList objects.
Up to now, all of the "catcache list" objects within a catalog cache
were just chained together on a single dlist, requiring O(N) time to
search.  Remarkably, we've not had serious performance problems with
that so far; but we got a complaint of a bad performance regression
from v15 in a case with a large number of roles in the system, which
traced down to O(N^2) total time when we probed N catcache lists.

Replace that data structure with a hashtable having an enlargeable
number of dlists, in an exactly parallel way to the data structure
we've used for years for the plain CatCTup cache members.  The extra
cost of maintaining a hash table seems negligible, since we were
already computing a hash value for list searches.

Normally this'd be HEAD-only material, but in view of the performance
regression it seems advisable to back-patch into v16.  In the v16
version of the patch, leave the dead cc_lists field where it is and
add the new fields at the end of struct catcache, to avoid possible
ABI breakage in case any external code is looking at these structs.
(We assume no external code is actually allocating new catcache
structs.)

Per report from alex work.

Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz=vggErdSG7pv2s6vmmTOLJSg@mail.gmail.com
2024-03-22 17:13:53 -04:00
Daniel Gustafsson 6acb0a628e Add notBefore and notAfter to SSL cert info display
This adds the X509 attributes notBefore and notAfter to sslinfo
as well as pg_stat_ssl to allow verifying and identifying the
validity period of the current client certificate. OpenSSL has
APIs for extracting notAfter and notBefore, but they are only
supported in recent versions so we have to calculate the dates
by hand in order to make this work for the older versions of
OpenSSL that we still support.

Original patch by Cary Huang with additional hacking by Jacob
and myself.

Author: Cary Huang <cary.huang@highgo.ca>
Co-author: Jacob Champion <jacob.champion@enterprisedb.com>
Co-author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/182b8565486.10af1a86f158715.2387262617218380588@highgo.ca
2024-03-22 21:25:25 +01:00
Alexander Korotkov b670b93a66 Fix an oversight in refactoring in 06b10f80ba.
It was against intended skipping prechecking keys optimization in the
first page of range queries to not influence point queries performance.

Reported-by: Anton Melnikov
Discussion: https://postgr.es/m/30cd7524-b9f1-4cf8-9c4a-223eb2e34441%40postgrespro.ru
Author: Pavel Borisov
2024-03-22 15:25:53 +02:00
Peter Eisentraut d20d8fbd3e Do not output actual value of location fields in node serialization by default
This changes nodeToString() to not output the actual value of location
fields in nodes, but instead it writes -1.  This mirrors the fact that
stringToNode() also does not read location field values but always
stores -1.

For most uses of nodeToString(), which is to store nodes in catalog
fields, this is more useful.  We don't store original query texts in
catalogs, so any lingering query location values are not meaningful.

For debugging purposes, there is a new nodeToStringWithLocations(),
which mirrors the existing stringToNodeWithLocations().  This is used
for WRITE_READ_PARSE_PLAN_TREES and nodes/print.c functions, which
covers all the debugging uses.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEze2WgrCiR3JZmWyB0YTc8HV7ewRdx13j0CqD6mVkYAW+SFGQ@mail.gmail.com
2024-03-22 09:49:12 +01:00
Amit Kapila 6ae701b437 Track invalidation_reason in pg_replication_slots.
Till now, the reason for replication slot invalidation is not tracked
directly in pg_replication_slots. A recent commit 007693f2a3 added
'conflict_reason' to show the reasons for slot conflict/invalidation, but
only for logical slots.

This commit adds a new column 'invalidation_reason' to show invalidation
reasons for both physical and logical slots. And, this commit also turns
'conflict_reason' text column to 'conflicting' boolean column (effectively
reverting commit 007693f2a3). The 'conflicting' column is true for
invalidation reasons 'rows_removed' and 'wal_level_insufficient' because
those make the slot conflict with recovery. When 'conflicting' is true,
one can now look at the new 'invalidation_reason' column for the reason
for the logical slot's conflict with recovery.

The new 'invalidation_reason' column will also be useful to track other
invalidation reasons in the future commit.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/ZfR7HuzFEswakt/a%40ip-10-97-1-34.eu-west-3.compute.internal
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
2024-03-22 13:52:05 +05:30
Peter Eisentraut b4080fa3dc Make RangeTblEntry dump order consistent
Put the fields alias and eref earlier in the struct, so that it
matches the order in _outRangeTblEntry()/_readRangeTblEntry().  This
helps if we ever want to fully automate out/read of RangeTblEntry.
Also, it makes dumps in the debugger easier to read in the same way.
Internally, this makes no difference.

Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca645@eisentraut.org
2024-03-22 07:28:33 +01:00
Peter Eisentraut 367c989cd8 Remove custom _jumbleRangeTblEntry()
This is part of an effort to reduce the number of special cases in the
automatically generated node support functions.

This patch removes _jumbleRangeTblEntry() and instead adds per-field
query_jumble_ignore annotations to match the behavior of the previous
custom code.  The pg_stat_statements test suite has some coverage of
this.  It gets rid of the switch on rtekind; this should be
technically correct, since we do the equal and copy functions like
this also.

The list of fields to jumble has been checked and is considered
correct as of 8b29a119fd.

Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca645@eisentraut.org
2024-03-22 07:23:47 +01:00
Amit Langote 085e759e9d Avoid splitting errmsg string to span multiple lines
The error message being fixed was added in 6185c9737c.

While at it, add an "a" to the sentence.

Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20240322.095149.895185546948714852.horikyota.ntt%40gmail.com
2024-03-22 12:04:06 +09:00
Alexander Korotkov 0997e0af27 Add TupleTableSlotOps.is_current_xact_tuple() method
This allows us to abstract how/whether table AM uses transaction identifiers.
A custom table AM can use a custom slot, which may not store xmin directly,
but determine the tuple belonging to the current transaction in the other way.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
2024-03-21 23:00:43 +02:00
Alexander Korotkov c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
This allows table AM to return a native tuple slot even if
VirtualTupleTableSlot is given as an input.  Native tuple slots have knowledge
about system attributes, which could be accessed in the future.
table_multi_insert() method already can modify the input 'slots' array.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
2024-03-21 23:00:40 +02:00
Alexander Korotkov 02eb07ea89 Allow table AM to store complex data structures in rd_amcache
The new table AM method free_rd_amcache is responsible for freeing all the
memory related to rd_amcache and setting free_rd_amcache to NULL.  If the new
method is not specified, we still assume rd_amcache to be a single chunk of
memory, which could be just pfree'd.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
2024-03-21 23:00:34 +02:00
Amit Langote 6185c9737c Add SQL/JSON query functions
This introduces the following SQL/JSON functions for querying JSON
data using jsonpath expressions:

JSON_EXISTS(), which can be used to apply a jsonpath expression to a
JSON value to check if it yields any values.

JSON_QUERY(), which can be used to to apply a jsonpath expression to
a JSON value to get a JSON object, an array, or a string.  There are
various options to control whether multi-value result uses array
wrappers and whether the singleton scalar strings are quoted or not.

JSON_VALUE(), which can be used to apply a jsonpath expression to a
JSON value to return a single scalar value, producing an error if it
multiple values are matched.

Both JSON_VALUE() and JSON_QUERY() functions have options for
handling EMPTY and ERROR conditions, which can be used to specify
the behavior when no values are matched and when an error occurs
during jsonpath evaluation, respectively.

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Author: Jian He <jian.universality@gmail.com>

Reviewers have included (in no particular order):

Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He, Anton A. Melnikov,
Nikita Malakhov, Peter Eisentraut, Tomas Vondra

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
2024-03-21 17:07:03 +09:00
Masahiko Sawada 30e144287a Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently.
TIDStore is a data structure designed to efficiently store large sets
of TIDs. For TID storage, it employs a radix tree, where the key is
a block number, and the value is a bitmap representing offset
numbers. The TIDStore can be created on a DSA area and used by
multiple backend processes simultaneously.

There are potential future users such as tidbitmap.c, though it's very
likely the interface will need to evolve as we come to understand the
needs of different kinds of users. For example, we can support
updating the offset bitmap of existing values.

Currently, the TIDStore is not used for anything yet, aside from the
test code. But an upcoming patch will use it.

This includes a unit test module, in src/test/modules/test_tidstore.

Co-authored-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
2024-03-21 10:08:42 +09:00
Tom Lane 995e0fbc1c Un-break genbki.pl's error reporting capabilities.
This essentially reverts commit 69eb643b2, which added a fast path
in Catalog::ParseData, but neglected to preserve the behavior of
adding a line_number field in each hash.  That makes it impossible
for genbki.pl to provide any localization of error reports, which is
bad enough; but actually the affected error reports failed entirely,
producing useless bleats like "use of undefined value in sprintf".

69eb643b2 claimed to get a 15% speedup, but I'm not sure I believe
that: the time to rebuild the bki files changes by less than 1% for
me.  In any case, making debugging of mistakes in .dat files more
difficult would not be justified by even an order of magnitude
speedup here; it's just not that big a chunk of the total build time.

Per report from David Wheeler.  Although it's also broken in v16,
I don't think this is worth a back-patch, since we're very unlikely
to touch the v16 catalog data again.

Discussion: https://postgr.es/m/19238.1710953049@sss.pgh.pa.us
2024-03-20 18:02:50 -04:00
Tom Lane 1218ca9956 Add to_regtypemod function to extract typemod from a string type name.
In combination with to_regtype, this allows converting a string to
the "canonicalized" form emitted by format_type.  That usage requires
parsing the string twice, which is slightly annoying but not really
too expensive.  We considered alternatives such as returning a record
type, but that way was notationally uglier than this, and possibly
less flexible.

Like to_regtype(), we'd rather that this return NULL for any bad
input, but the underlying type-parsing logic isn't yet capable of
not throwing syntax errors.  Adjust the documentation for both
functions to point that out.

In passing, fix up a couple of nearby entries in the System Catalog
Information Functions table that had not gotten the word about our
since-v13 convention for displaying function usage examples.

David Wheeler and Erik Wienhold, reviewed by Pavel Stehule, Jim Jones,
and others.

Discussion: https://postgr.es/m/DF2324CA-2673-4ABE-B382-26B5770B6AA3@justatheory.com
2024-03-20 17:11:28 -04:00
Nathan Bossart 80686761c4 Avoid overflow in MaybeRemoveOldWalSummaries().
This commit limits the maximum value of wal_summary_keep_time to
INT_MAX / SECS_PER_MINUTE to avoid overflow when it is converted to
seconds.  In passing, use the HOURS_PER_DAY, MINS_PER_HOUR, and
SECS_PER_MINUTE macros in the code for this GUC instead of hard-
coding those values.

Discussion: https://postgr.es/m/20240314210010.GA3056455%40nathanxps13
2024-03-20 13:31:58 -05:00
Nathan Bossart 2b520860c0 Revert "Temporary patch to help debug pg_walsummary test failures."
Thanks to commits ea18eb7d62, b6ee30ec08, and 19a829a327, the
002_blocks.pl test now consistently passes, so we can remove this
temporary debugging code.

This reverts commit 5ddf997347.

Discussion: https://postgr.es/m/20240314210010.GA3056455%40nathanxps13
2024-03-20 11:34:00 -05:00
Alvaro Herrera a0390f6ca6
Review wording on tablespaces w.r.t. partitioned tables
Remove a redundant comment, and document pg_class.reltablespace properly
in catalogs.sgml.

After commits a36c84c3e4, 87259588d0 and others.

Backpatch to 12.

Discussion: https://postgr.es/m/202403191013.w2kr7wqlamqz@alvherre.pgsql
2024-03-20 15:28:14 +01:00
Alvaro Herrera da952b415f
Rework lwlocknames.txt to become lwlocklist.h
This way, we can fold the list of lock names to occur in
BuiltinTrancheNames instead of having its own separate array.  This
saves two lines of code in GetLWTrancheName and some space in
BuiltinTrancheNames, as foreseen in commit 74a7306310, as well as
removing the need for a separate lwlocknames.c file.

We still have to build lwlocknames.h using Perl code, which initially I
wanted to avoid, but it gives us the chance to cross-check
wait_event_names.txt.

Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
2024-03-20 11:55:20 +01:00
Peter Eisentraut e5da0fe3c2 Catalog domain not-null constraints
This applies the explicit catalog representation of not-null
constraints introduced by b0e96f3119 for table constraints also to
domain not-null constraints.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/9ec24d7b-633d-463a-84c6-7acff769c9e8%40eisentraut.org
2024-03-20 10:05:37 +01:00
Heikki Linnakangas c9c260decd Remove unused PruneState member rel
PruneState->rel is no longer being used, so just remove it.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
2024-03-20 10:13:42 +02:00
Heikki Linnakangas c33084205a Reorganize heap_page_prune() function comment
heap_page_prune()'s function header comment didn't explain the
parameters in the same order they appear in the function. Fix that.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
2024-03-20 10:13:39 +02:00
Heikki Linnakangas d63d486d6c Remove assertions that some compiler say are tautological
To avoid the compiler warnings:

    launch_backend.c:211:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]
    launch_backend.c:233:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]

The point of the assertions was to fail more explicitly if someone
adds a new BackendType to the end of the enum, but forgets to add it
to the child_process_kinds array. It was a pretty weak assertion to
begin with, because it wouldn't catch if you added a new BackendType
in the middle of the enum. So let's just remove it.

Per buildfarm member ayu and a few others, spotted by Tom Lane.

Discussion: https://www.postgresql.org/message-id/4119680.1710913067@sss.pgh.pa.us
2024-03-20 09:14:51 +02:00
Jeff Davis f69319f2f1 Support C.UTF-8 locale in the new builtin collation provider.
The builtin C.UTF-8 locale has similar semantics to the libc locale of
the same name. That is, code point sort order (fast, memcmp-based)
combined with Unicode semantics for character operations such as
pattern matching, regular expressions, and
LOWER()/INITCAP()/UPPER(). The character semantics are based on
Unicode simple case mappings.

The builtin provider's C.UTF-8 offers several important advantages
over libc:

 * faster sorting -- benefits from additional optimizations such as
   abbreviated keys and varstrfastcmp_c
 * faster case conversion, e.g. LOWER(), at least compared with some
   libc implementations
 * available on all platforms with identical semantics, and the
   semantics are stable, testable, and documentable within a given
   Postgres major version

Being based on memcmp, the builtin C.UTF-8 locale does not offer
natural language sort order. But it is an improvement for most use
cases that might otherwise use libc's "C.UTF-8" locale, as well as
many use cases that use libc's "C" locale.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
2024-03-19 15:24:41 -07:00
Tom Lane fd0398fcb0 Improve EXPLAIN's display of SubPlan nodes and output parameters.
Historically we've printed SubPlan expression nodes as "(SubPlan N)",
which is pretty uninformative.  Trying to reproduce the original SQL
for the subquery is still as impractical as before, and would be
mighty verbose as well.  However, we can still do better than that.
Displaying the "testexpr" when present, and adding a keyword to
indicate the SubLinkType, goes a long way toward showing what's
really going on.

In addition, this patch gets rid of EXPLAIN's use of "$n" to represent
subplan and initplan output Params.  Instead we now print "(SubPlan
N).colX" or "(InitPlan N).colX" to represent the X'th output column
of that subplan.  This eliminates confusion with the use of "$n" to
represent PARAM_EXTERN Params, and it's useful for the first part of
this change because it eliminates needing some other indication of
which subplan is referenced by a SubPlan that has a testexpr.

In passing, this adds simple regression test coverage of the
ROWCOMPARE_SUBLINK code paths, which were entirely unburdened
by testing before.

Tom Lane and Dean Rasheed, reviewed by Aleksander Alekseev.
Thanks to Chantal Keller for raising the question of whether
this area couldn't be improved.

Discussion: https://postgr.es/m/2838538.1705692747@sss.pgh.pa.us
2024-03-19 18:19:24 -04:00
Tom Lane b7e2121ab7 Postpone reparameterization of paths until create_plan().
When considering nestloop paths for individual partitions within
a partitionwise join, if the inner path is parameterized, it is
parameterized by the topmost parent of the outer rel, not the
corresponding outer rel itself.  Therefore, we need to translate the
parameterization so that the inner path is parameterized by the
corresponding outer rel.

Up to now, we did this while generating join paths.  However, that's
problematic because we must also translate some expressions that are
shared across all paths for a relation, such as restriction clauses
(kept in the RelOptInfo and/or IndexOptInfo) and TableSampleClauses
(kept in the RangeTblEntry).  The existing code fails to translate
these at all, leading to wrong answers, odd failures such as
"variable not found in subplan target list", or executor crashes.
But we can't modify them during path generation, because that would
break things if we end up choosing some non-partitioned-join path.

So this patch postpones reparameterization of the inner path until
createplan.c, where it is safe to modify the referenced RangeTblEntry,
RelOptInfo or IndexOptInfo, because we have made a final choice of which
Path to use.  We do still have to check during path generation that
the reparameterization will be possible.  So we introduce a new
function path_is_reparameterizable_by_child() to detect that.

The duplication between path_is_reparameterizable_by_child() and
reparameterize_path_by_child() is a bit annoying, but there seems
no other good answer.  A small benefit is that we can avoid building
useless reparameterized trees in cases where a non-partitioned join
is ultimately chosen.  Also, reparameterize_path_by_child() can now
be allowed to scribble on the input paths, saving a few cycles.

This fix repairs the same problems previously addressed in the
back branches by commits 62f120203 et al.

Richard Guo, reviewed at various times by Ashutosh Bapat, Andrei
Lepikhov, Alena Rybakina, Robert Haas, and myself

Discussion: https://postgr.es/m/CAMbWs496+N=UAjOc=rcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw@mail.gmail.com
2024-03-19 14:51:58 -04:00
Peter Eisentraut 605721f819 gen_node_support.pl: Mark location fields as type alias ParseLoc
Instead of the rather ugly type=int + name ~= location$, we now have a
marker type for offset pointers or sizes that are only relevant when a
query text is included, which decreases the complexity required in
gen_node_support.pl for handling these values.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEze2WgrCiR3JZmWyB0YTc8HV7ewRdx13j0CqD6mVkYAW+SFGQ@mail.gmail.com
2024-03-19 16:56:44 +01:00
Peter Eisentraut 49b579f92d Fix misleading comments
To match code changes in 229fb58d4f.
2024-03-19 10:55:51 +01:00
Peter Eisentraut 794f10f6b9 Add some UUID support functions
Add uuid_extract_timestamp() and uuid_extract_version().

Author: Andrey Borodin
Reviewed-by: Sergey Prokhorenko, Kirk Wolak, Przemysław Sztoch
Reviewed-by: Nikolay Samokhvalov, Jelte Fennema-Nio, Aleksander Alekseev
Reviewed-by: Peter Eisentraut, Chris Travers, Lukas Fittl
Discussion: https://postgr.es/m/CAAhFRxitJv%3DyoGnXUgeLB_O%2BM7J2BJAmb5jqAT9gZ3bij3uLDA%40mail.gmail.com
2024-03-19 09:32:04 +01:00
Jeff Davis 60769c62dc Fix another warning, introduced by 846311051e.
Discussion: https://postgr.es/m/3703896.1710799495@sss.pgh.pa.us
Reported-by: Tom Lane
2024-03-18 15:34:43 -07:00
Jeff Davis 846311051e Address more review comments on commit 2d819a08a1.
Based on comments from Peter Eisentraut.

 * Document CREATE DATABASE ... BUILTIN_LOCALE.
 * Determine required encoding based on locale name for CREATE
   COLLATION. Use -1 for "C" (requires catversion bump).
 * initdb output fixups.
 * Make ctype_is_c a constant true for now.
 * Fixups to ICU 010_create_database.pl test.

Discussion: https://postgr.es/m/4135cf11-206d-40ed-96c0-9363c1232379@eisentraut.org
2024-03-18 11:58:13 -07:00
Jeff Davis 61f352ece9 Fix unreachable code warning from commit 2d819a08a1.
Found by Coverity.

Discussion: https://postgr.es/m/3422201.1710711993@sss.pgh.pa.us
Reported-by: Tom Lane
2024-03-18 09:15:47 -07:00
Heikki Linnakangas 0960ae1967 Fix EXPLAIN Bitmap heap scan to count pages with no visible tuples
Previously, bitmap heap scans only counted lossy and exact pages for
explain when there was at least one visible tuple on the page.

heapam_scan_bitmap_next_block() returned true only if there was a
"valid" page with tuples to be processed. However, the lossy and exact
page counters in EXPLAIN should count the number of pages represented
in a lossy or non-lossy way in the constructed bitmap, regardless of
whether or not the pages ultimately contained visible tuples.

Backpatch to all supported versions.

Author: Melanie Plageman
Discussion: https://www.postgresql.org/message-id/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAAKRu_bxrXeZ2rCnY8LyeC2Ls88KpjWrQ%2BopUrXDRXdcfwFZGA@mail.gmail.com
2024-03-18 14:03:58 +02:00
Heikki Linnakangas 05c3980e7f Move code for backend startup to separate file
This is code that runs in the backend process after forking, rather
than postmaster. Move it out of postmaster.c for clarity.

Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
2024-03-18 11:38:10 +02:00
Heikki Linnakangas aafc05de1b Refactor postmaster child process launching
Introduce new postmaster_child_launch() function that deals with the
differences in EXEC_BACKEND mode.

Refactor the mechanism of passing information from the parent to child
process. Instead of using different command-line arguments when
launching the child process in EXEC_BACKEND mode, pass a
variable-length blob of startup data along with all the global
variables. The contents of that blob depend on the kind of child
process being launched. In !EXEC_BACKEND mode, we use the same blob,
but it's simply inherited from the parent to child process.

Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
2024-03-18 11:35:30 +02:00
Heikki Linnakangas f1baed18bc Move some functions from postmaster.c to a new source file
This just moves the functions, with no other changes, to make the next
commits smaller and easier to review. The moved functions are related
to launching postmaster child processes in EXEC_BACKEND mode.

Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
2024-03-18 11:35:05 +02:00
Heikki Linnakangas 14cddee9cc Split registration of Win32 deadchild callback to separate function
The next commit will move the internal_forkexec() function to a
different source file, but it makes sense to keep all the code related
to the win32 waitpid() emulation in postmaster.c. Split it off to a
separate function now, to make the next commit more mechanical.

Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
2024-03-18 11:35:01 +02:00
Nathan Bossart 949300402b Initialize variables to placate compiler.
Since commit 012460ee93, some compilers have been warning that a
couple of variables may be used uninitialized.  There doesn't
appear to be any actual risk, so let's just initialize these
variables to 0 to silence the compiler warnings.

Discussion: https://postgr.es/m/20240317192927.GA3978212%40nathanxps13
2024-03-17 20:16:15 -05:00
Tom Lane 33f13168cc Mark hash_corrupted() as pg_attribute_noreturn.
Coverity started complaining about this after cc5ef90ed.
The code's not really different from before, but might
as well clarify its intent.
2024-03-17 17:54:45 -04:00
Dean Rasheed c649fa24a4 Add RETURNING support to MERGE.
This allows a RETURNING clause to be appended to a MERGE query, to
return values based on each row inserted, updated, or deleted. As with
plain INSERT, UPDATE, and DELETE commands, the returned values are
based on the new contents of the target table for INSERT and UPDATE
actions, and on its old contents for DELETE actions. Values from the
source relation may also be returned.

As with INSERT/UPDATE/DELETE, the output of MERGE ... RETURNING may be
used as the source relation for other operations such as WITH queries
and COPY commands.

Additionally, a special function merge_action() is provided, which
returns 'INSERT', 'UPDATE', or 'DELETE', depending on the action
executed for each row. The merge_action() function can be used
anywhere in the RETURNING list, including in arbitrary expressions and
subqueries, but it is an error to use it anywhere outside of a MERGE
query's RETURNING list.

Dean Rasheed, reviewed by Isaac Morland, Vik Fearing, Alvaro Herrera,
Gurjeet Singh, Jian He, Jeff Davis, Merlin Moncure, Peter Eisentraut,
and Wolfgang Walther.

Discussion: http://postgr.es/m/CAEZATCWePEGQR5LBn-vD6SfeLZafzEm2Qy_L_Oky2=qw2w3Pzg@mail.gmail.com
2024-03-17 13:58:59 +00:00
Peter Eisentraut 6a004f1be8 Add attstattarget to FormExtraData_pg_attribute
This allows setting attstattarget when a relation is created.

We make use of this by having index_concurrently_create_copy() copy
over the attstattarget values when the new index is created, instead
of having index_concurrently_swap() fix it up later.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184c76@eisentraut.org
2024-03-17 12:38:27 +01:00
Peter Eisentraut d939cb2fd6 Generalize handling of nullable pg_attribute columns in DDL
DDL code uses tuple descriptors to pass around pg_attribute values
during table and index creation.  But tuple descriptors don't include
the variable-length/nullable columns of pg_attribute, so they have to
be handled separately.  Right now, the attoptions field is handled in
a one-off way with a separate argument passed to
InsertPgAttributeTuples().  The other affected fields of pg_attribute
are right now not needed at relation creation time.

The goal of this patch is to generalize this to allow handling
additional variable-length/nullable columns of pg_attribute in a
similar manner.  For that, create a new struct
FormExtraData_pg_attribute, which is to be passed around in parallel
to the tuple descriptor and optionally supplies the additional
columns.  Right now, this struct only contains one field for
attoptions, so no functionality is actually changed by this.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184c76@eisentraut.org
2024-03-17 12:30:51 +01:00
Peter Eisentraut 012460ee93 Make stxstattarget nullable
To match attstattarget change (commit 4f622503d6).  The logic inside
CreateStatistics() is clarified a bit compared to that previous patch,
and so here we also update ATExecSetStatistics() to match.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184c76@eisentraut.org
2024-03-17 12:26:26 +01:00
Dean Rasheed 33e729c514 Fix EXPLAIN output for subplans in MERGE.
Given a subplan in a MERGE query, EXPLAIN would sometimes fail to
properly display expressions involving Params referencing variables in
other parts of the plan tree.

This would affect subplans outside the topmost join plan node, for
which expansion of Params would go via the top-level ModifyTable plan
node.  The problem was that "inner_tlist" for the ModifyTable node's
deparse_namespace was set to the join node's targetlist, but
"inner_plan" was set to the ModifyTable node itself, rather than the
join node, leading to incorrect results when descending to the
referenced variable.

Fix and backpatch to v15, where MERGE was introduced.

Discussion: https://postgr.es/m/CAEZATCWAv-sZuH%2BwG5xJ-%2BGt7qGNGX8wUQd3XYydMFDKgRB9nw%40mail.gmail.com
2024-03-17 10:17:11 +00:00
Peter Eisentraut 20e58105ba Separate equalRowTypes() from equalTupleDescs()
This introduces a new function equalRowTypes() that is effectively a
subset of equalTupleDescs() but only compares the number of attributes
and attribute name, type, typmod, and collation.  This is enough for
most existing uses of equalTupleDescs(), which are changed to use the
new function.  The only remaining callers of equalTupleDescs() are
those that really want to check the full tuple descriptor as such,
without concern about record or row or record type semantics.

The existing function hashTupleDesc() is renamed to hashRowType(),
because it now corresponds more to equalRowTypes().

The purpose of this change is to be clearer about the semantics of the
equality asked for by each caller.  (At least one caller had a comment
that questioned whether equalTupleDescs() was too restrictive.)  For
example, 4f622503d6 removed attstattarget from the tuple descriptor
structure.  It was not fully clear at the time how this should affect
equalTupleDescs().  Now the answer is clear: By their own definitions,
equalRowTypes() does not care, and equalTupleDescs() just compares
whatever is in the tuple descriptor but does not care why it is in
there.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/f656d6d9-6660-4518-a006-2f65cafbebd1%40eisentraut.org
2024-03-17 05:58:04 +01:00
Daniel Gustafsson b783186515 Add destroyStringInfo function for cleaning up StringInfos
destroyStringInfo() is a counterpart to makeStringInfo(), freeing a
palloc'd StringInfo and its data. This is a convenience function to
align the StringInfo API with the PQExpBuffer API. Originally added
in the OAuth patchset, it was extracted and committed separately in
order to aid upcoming JSON work.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOYmi+mWdTd6ujtyF7MsvXvk7ToLRVG_tYAcaGbQLvf=N4KrQw@mail.gmail.com
2024-03-16 23:18:28 +01:00
Alexander Korotkov 4c2eda67f5 Fix race condition in transaction timeout TAP tests
The interruption handler within the injection point can get stuck in an
infinite loop while handling transaction timeout. To avoid this situation
we reset the timeout flag before invoking the injection point.

Author: Alexander Korotkov
Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/ZfPchPC6oNN71X2J%40paquier.xyz
2024-03-15 14:38:22 +02:00
Heikki Linnakangas a3f349c612 Improve log messages referring to background worker processes
"Worker" could also mean autovacuum worker or slot sync worker, so
let's be more explicit.

Per Tristan Partin's suggestion.

Discussion: https://www.postgresql.org/message-id/CZM6WDX5H4QI.NZG1YUCKWLA@neon.tech
2024-03-15 13:14:38 +02:00
Michael Paquier cc5ef90edd Refactor initial hash lookup in dynahash.c
The same pattern is used three times in dynahash.c to retrieve a bucket
number and a hash bucket from a hash value.  This has popped up while
discussing improvements for the type cache, where this piece of
refactoring would become useful.

Note that hash_search_with_hash_value() does not need the bucket number,
just the hash bucket.

Author: Teodor Sigaev
Reviewed-by: Aleksander Alekseev, Michael Paquier
Discussion: https://postgr.es/m/5812a6e5-68ae-4d84-9d85-b443176966a1@sigaev.ru
2024-03-15 07:57:17 +09:00
David Rowley 4169850f0b Trim ORDER BY/DISTINCT aggregate pathkeys in gather_grouping_paths
Similar to d8a295389, trim off any PathKeys which are for ORDER BY /
DISTINCT aggregate functions from the PathKey List for the Gather Merge
paths created by gather_grouping_paths().  These additional PathKeys are
not valid to use after grouping has taken place as these PathKeys belong
to columns which are inputs to an aggregate function and, therefore are
unavailable after aggregation.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/cf63174c-8c89-3953-cb49-48f41f74941a@gmail.com
Backpatch-through: 16, where 1349d2790 was added
2024-03-15 11:54:36 +13:00
Daniel Gustafsson 4665cebc8a Login event trigger documentation wordsmithing
Minor wordsmithing on the login trigger documentation and code
comments to improve readability, as well as fixing a few small
incorrect statements in the comments.

Author: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAJSLCQ0aMWUh1m6E9YdjeqV61baQ=EhteJX8XOxXg8H_2Lcr0Q@mail.gmail.com
2024-03-14 23:35:35 +01:00
Tom Lane b4a71cf65d Make INSERT-from-multiple-VALUES-rows handle domain target columns.
Commit a3c7a993d fixed some cases involving target columns that are
arrays or composites by applying transformAssignedExpr to the VALUES
entries, and then stripping off any assignment ArrayRefs or
FieldStores that the transformation added.  But I forgot about domains
over arrays or composites :-(.  Such cases would either fail with
surprising complaints about mismatched datatypes, or insert unexpected
coercions that could lead to odd results.  To fix, extend the
stripping logic to get rid of CoerceToDomain if it's atop an ArrayRef
or FieldStore.

While poking at this, I realized that there's a poorly documented and
not-at-all-tested behavior nearby: we coerce each VALUES column to
the domain type separately, and rely on the rewriter to merge those
operations so that the domain constraints are checked only once.
If that merging did not happen, it's entirely possible that we'd get
unexpected domain constraint failures due to checking a
partially-updated container value.  There's no bug there, but while
we're here let's improve the commentary about it and add some test
cases that explicitly exercise that behavior.

Per bug #18393 from Pablo Kharo.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/18393-65fedb1a0de9260d@postgresql.org
2024-03-14 14:57:16 -04:00
Nathan Bossart d1162cfda8 Add pg_column_toast_chunk_id().
This function returns the chunk_id of an on-disk TOASTed value.  If
the value is un-TOASTed or not on-disk, it returns NULL.  This is
useful for identifying which values are actually TOASTed and for
investigating "unexpected chunk number" errors.

Bumps catversion.

Author: Yugo Nagata
Reviewed-by: Jian He
Discussion: https://postgr.es/m/20230329105507.d764497456eeac1ca491b5bd%40sraoss.co.jp
2024-03-14 10:58:00 -05:00
Heikki Linnakangas 84c18acaf6 Remove redundant snapshot copying from parallel leader to workers
The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But BitmapHeapScan node also had
bespoken code to pass the snapshot from leader to the worker. That was
redundant, so remove it.

The removed code was analogous to the snapshot serialization in
table_parallelscan_initialize(), but that was the wrong role model. A
parallel bitmap heap scan is more like an independent non-parallel
bitmap heap scan in each parallel worker as far as the table AM is
concerned, because the coordination is done in nodeBitmapHeapscan.c,
and the table AM doesn't need to know anything about it.

This relies on the assumption that es_snapshot ==
GetActiveSnapshot(). That's not a new assumption, things would get
weird if you used the QueryDesc's snapshot for visibility checks in
the scans, but the active snapshot for evaluating quals, for
example. This could use some refactoring and cleanup, but for now,
just add some assertions.

Reviewed-by: Dilip Kumar, Robert Haas
Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf61a@iki.fi
2024-03-14 15:18:10 +02:00
Robert Haas 2346df6fc3 Allow a no-wait lock acquisition to succeed in more cases.
We don't determine the position at which a process waiting for a lock
should insert itself into the wait queue until we reach ProcSleep(),
and we may at that point discover that we must insert ourselves ahead
of everyone who wants a conflicting lock, in which case we obtain the
lock immediately. Up until now, a no-wait lock acquisition would fail
in such cases, erroneously claiming that the lock couldn't be obtained
immediately.  Fix that by trying ProcSleep even in the no-wait case.

No back-patch for now, because I'm treating this as an improvement to
the existing no-wait feature. It could instead be argued that it's a
bug fix, on the theory that there should never be any case whatsoever
where no-wait fails to obtain a lock that would have been obtained
immediately without no-wait, but I'm reluctant to interpret the
semantics of no-wait that strictly.

Robert Haas and Jingxian Li

Discussion: http://postgr.es/m/CA+TgmobCH-kMXGVpb0BB-iNMdtcNkTvcZ4JBxDJows3kYM+GDg@mail.gmail.com
2024-03-14 08:56:06 -04:00
Alexander Korotkov eeefd4280f Add TAP tests for timeouts
This commit adds new tests to verify that transaction_timeout,
idle_session_timeout, and idle_in_transaction_session_timeout work as expected.
We introduce new injection points in before throwing a timeout FATAL error
and check these injection points are reached.

Discussion: https://postgr.es/m/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com
Author: Andrey Borodin
Reviewed-by: Alexander Korotkov
2024-03-14 13:12:15 +02:00
Alexander Korotkov e85662df44 Fix false reports in pg_visibility
Currently, pg_visibility computes its xid horizon using the
GetOldestNonRemovableTransactionId().  The problem is that this horizon can
sometimes go backward.  That can lead to reporting false errors.

In order to fix that, this commit implements a new function
GetStrictOldestNonRemovableTransactionId().  This function computes the xid
horizon, which would be guaranteed to be newer or equal to any xid horizon
computed before.

We have to do the following to achieve this.

1. Ignore processes xmin's, because they consider connection to other databases
   that were ignored before.
2. Ignore KnownAssignedXids, because they are not database-aware. At the same
   time, the primary could compute its horizons database-aware.
3. Ignore walsender xmin, because it could go backward if some replication
   connections don't use replication slots.

As a result, we're using only currently running xids to compute the horizon.
Surely these would significantly sacrifice accuracy.  But we have to do so to
avoid reporting false errors.

Inspired by earlier patch by Daniel Shelepanov and the following discussion
with Robert Haas and Tom Lane.

Discussion: https://postgr.es/m/1649062270.289865713%40f403.i.mail.ru
Reviewed-by: Alexander Lakhin, Dmitry Koval
2024-03-14 13:12:05 +02:00
Amit Kapila 9c40db3b02 Fix typos in reorderbuffer.c.
Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20240314.132817.1496502692848380820.horikyota.ntt@gmail.com
2024-03-14 12:12:55 +05:30
Jeff Davis 2d819a08a1 Introduce "builtin" collation provider.
New provider for collations, like "libc" or "icu", but without any
external dependency.

Initially, the only locale supported by the builtin provider is "C",
which is identical to the libc provider's "C" locale. The libc
provider's "C" locale has always been treated as a special case that
uses an internal implementation, without using libc at all -- so the
new builtin provider uses the same implementation.

The builtin provider's locale is independent of the server environment
variables LC_COLLATE and LC_CTYPE. Using the builtin provider, the
database collation locale can be "C" while LC_COLLATE and LC_CTYPE are
set to "en_US", which is impossible with the libc provider.

By offering a new builtin provider, it clarifies that the semantics of
a collation using this provider will never depend on libc, and makes
it easier to document the behavior.

Discussion: https://postgr.es/m/ab925f69-5f9d-f85e-b87c-bd2a44798659@joeconway.com
Discussion: https://postgr.es/m/dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
2024-03-13 23:33:44 -07:00
Peter Eisentraut 6ab2e8385d Put genbki.pl output into src/include/catalog/ directly
With the makefile rules, the output of genbki.pl was written to
src/backend/catalog/, and then the header files were linked to
src/include/catalog/.

This changes it so that the output files are written directly to
src/include/catalog/.  This makes the logic simpler, and it also makes
the behavior consistent with the meson build system.  Also, the list
of catalog files is now kept in parallel in
src/include/catalog/{meson.build,Makefile}, while before the makefiles
had it in src/backend/catalog/Makefile.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/21b74bdc-183d-4dd5-9c27-9378d178f459@eisentraut.org
2024-03-14 07:11:21 +01:00
Nathan Bossart ecb0fd3372 Reintroduce MAINTAIN privilege and pg_maintain predefined role.
Roles with MAINTAIN on a relation may run VACUUM, ANALYZE, REINDEX,
REFRESH MATERIALIZE VIEW, CLUSTER, and LOCK TABLE on the relation.
Roles with privileges of pg_maintain may run those same commands on
all relations.

This was previously committed for v16, but it was reverted in
commit 151c22deee due to concerns about search_path tricks that
could be used to escalate privileges to the table owner.  Commits
2af07e2f74, 59825d1639, and c7ea3f4229 resolved these concerns by
restricting search_path when running maintenance commands.

Bumps catversion.

Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/20240305161235.GA3478007%40nathanxps13
2024-03-13 14:49:26 -05:00
Robert Haas 2041bc4276 Add the system identifier to backup manifests.
Before this patch, if you took a full backup on server A and then
tried to use the backup manifest to take an incremental backup on
server B, it wouldn't know that the manifest was from a different
server and so the incremental backup operation could potentially
complete without error. When you later tried to run pg_combinebackup,
you'd find out that your incremental backup was and always had been
invalid. That's poor timing, because nobody likes finding out about
backup problems only at restore time.

With this patch, you'll get an error when trying to take the (invalid)
incremental backup, which seems a lot nicer.

Amul Sul, revised by me. Review by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYLZzbSAMM3cAjV4Y+iCRZn-bR9H2+Mdz7NdaJFU1Zb5w@mail.gmail.com
2024-03-13 15:12:33 -04:00
Peter Eisentraut 97d85be365 Make the order of the header file includes consistent
Similar to commit 7e735035f2.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com
2024-03-13 15:07:00 +01:00
Michael Paquier 77cf6a78de Add some asserts based on LWLockHeldByMe() for replication slot statistics
Two assertions checking that ReplicationSlotAllocationLock is acquired
are added to pgstat_create_replslot() and pgstat_drop_replslot(),
corresponding to the routines in charge of the creation and the drop of
replication slot statistics.  The code previously relied on this
assumption and documented it in comments, but did not enforce this
policy at runtime.

Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Ze_p-hmD_yFeVYXg@paquier.xyz
2024-03-13 07:45:11 +09:00