Commit Graph

4914 Commits

Author SHA1 Message Date
Heikki Linnakangas 15916ffb04 Initialize ShmemVariableCache like other shmem areas
For sake of consistency.

Reviewed-by: Tristan Partin, Richard Guo
Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
2023-12-08 09:46:59 +02:00
Thomas Munro cd7f19da34 Fix potential pointer overflow in xlogreader.c.
While checking if a record could fit in the circular WAL decoding
buffer, the coding from commit 3f1ce973 used arithmetic that could
overflow.  64 bit systems were unaffected for various technical reasons,
which probably explains the lack of problem reports.  Likewise for 32
bit systems running known 32 bit kernels.  The systems at risk of
problems appear to be 32 bit processes running on 64 bit kernels, with
unlucky placement in memory.

Per complaint from GCC -fsanitize=undefined -m32, while testing
variations of 039_end_of_wal.pl.

Back-patch to 15.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKH0oRPOX7DhiQ_b51sM8HqcPp2J3WA-Oen%3DdXog%2BAGGQ%40mail.gmail.com
2023-12-08 16:09:03 +13:00
Michael Paquier 7636725b92 Fix compilation on Windows with WAL_DEBUG
This has been broken since b060dbe000 that has reworked the callback
mechanism of XLogReader, most likely unnoticed because any form of
development involving WAL happens on platforms where this compiles fine.

Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com
Backpatch-through: 13
2023-12-06 14:10:39 +09:00
Amit Kapila f66fcc5cd6 Fix an uninitialized access in hash_xlog_squeeze_page().
Commit 861f86beea changed hash_xlog_squeeze_page() to start reading
the write buffer conditionally but forgot to initialize it leading to an
uninitialized access.

Reported-by: Alexander Lakhin
Author: Hayato Kuroda
Reviewed-by: Alexander Lakhin, Amit Kapila
Discussion: http://postgr.es/m/62ed1a9f-746a-8e86-904b-51b9b806a1d9@gmail.com
2023-12-01 10:22:13 +05:30
Alexander Korotkov ae2ccf66a2 Fix typo in 5a1dfde833
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/55d8800f-4a80-5256-1e84-246fbe79acd0@gmail.com
2023-11-30 13:46:23 +02:00
Alexander Korotkov b589f211e0 Fix warning due non-standard inline declaration in 4ed8f0913b
Reported-by: Alexander Lakhin, Tom Lane
Author: Pavel Borisov
Discussion: https://postgr.es/m/55d8800f-4a80-5256-1e84-246fbe79acd0@gmail.com
2023-11-30 11:34:45 +02:00
Michael Paquier 8d9978a717 Apply quotes more consistently to GUC names in logs
Quotes are applied to GUCs in a very inconsistent way across the code
base, with a mix of double quotes or no quotes used.  This commit
removes double quotes around all the GUC names that are obviously
referred to as parameters with non-English words (use of underscore,
mixed case, etc).

This is the result of a discussion with Álvaro Herrera, Nathan Bossart,
Laurenz Albe, Peter Eisentraut, Tom Lane and Daniel Gustafsson.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
2023-11-30 14:11:45 +09:00
Alexander Korotkov 5a1dfde833 Make use FullTransactionId in 2PC filenames
Switch from using TransactionId to FullTransactionId in naming of 2PC files.
Transaction state file in the pg_twophase directory now have extra 8 bytes in
the name to address an epoch of a given xid.

Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev
Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov
Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov
Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund
Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev
Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com
Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
2023-11-29 01:43:36 +02:00
Alexander Korotkov 4ed8f0913b Index SLRUs by 64-bit integers rather than by 32-bit integers
We've had repeated bugs in the area of handling SLRU wraparound in the past,
some of which have caused data loss. Switching to an indexing system for SLRUs
that does not wrap around should allow us to get rid of a whole bunch
of problems and improve the overall reliability of the system.

This particular patch however only changes the indexing and doesn't address
the wraparound per se. This is going to be done in the following patches.

Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev
Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov
Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov
Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund
Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev
Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com
Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
2023-11-29 01:40:56 +02:00
Heikki Linnakangas 60f227316c Fix assertions with RI triggers in heap_update and heap_delete.
If the tuple being updated is not visible to the crosscheck snapshot,
we return TM_Updated but the assertions would not hold in that case.
Move them to before the cross-check.

Fixes bug #17893. Backpatch to all supported versions.

Author: Alexander Lakhin
Backpatch-through: 12
Discussion: https://www.postgresql.org/message-id/17893-35847009eec517b5%40postgresql.org
2023-11-28 12:00:14 +02:00
Tomas Vondra a82ee7ef3a Check if ii_AmCache is NULL in aminsertcleanup
Fix a bug introduced by c1ec02be1d. It may happen that the executor
opens indexes on the result relation, but no rows end up being inserted.
Then the index_insert_cleanup still gets executed, but passes down NULL
to the AM callback. The AM callback may not expect this, as is the case
of brininsertcleanup, leading to a crash.

Fixed by only calling the cleanup callback if (ii_AmCache != NULL). This
way the AM can simply assume to only see a valid cache.

Reported-by: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-w9qC-o9hQox9UHvdVZAYTp8OrPQOKtwbvzWaRejTT=Q@mail.gmail.com
2023-11-27 16:53:06 +01:00
Heikki Linnakangas 1f395354d8 Reduce rate of walwriter wakeups due to async commits.
XLogSetAsyncXactLSN(), called at asynchronous commit, would wake up
walwriter every time the LSN advances, but walwriter doesn't actually
do anything unless it has at least 'wal_writer_flush_after' full
blocks of WAL to write. Repeatedly waking up walwriter to do nothing
is a waste of CPU cycles in both walwriter and the backends doing the
wakeups. To fix, apply the same logic in XLogSetAsyncXactLSN() to
decide whether to wake up walwriter, as walwriter uses to determine if
it has any work to do.

In the passing, rename misleadingly named 'flushbytes' local variable
to 'flushblocks'.

Author: Andres Freund, Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/20231024230929.vsc342baqs7kmbte@awork3.anarazel.de
2023-11-27 17:42:39 +02:00
Tomas Vondra b2caf7c0e1 Fix brin.c indentation issues introduced by c1ec02be1d
Per buildfarm member koel.
2023-11-26 21:35:32 +01:00
Tomas Vondra c1ec02be1d Reuse BrinDesc and BrinRevmap in brininsert
The brininsert code used to initialize (and destroy) BrinDesc and
BrinRevmap for each tuple, which is not free. This patch initializes
these structures only once, and reuses them for all inserts in the same
command. The data is passed through indexInfo->ii_AmCache.

This also introduces an optional AM callback "aminsertcleanup" that
allows performing custom cleanup in case simply pfree-ing ii_AmCache is
not sufficient (which is the case when the cache contains TupleDesc,
Buffers, and so on).

Author: Soumyadeep Chakraborty
Reviewed-by: Alvaro Herrera, Matthias van de Meent, Tomas Vondra
Discussion: https://postgr.es/m/CAE-ML%2B9r2%3DaO1wwji1sBN9gvPz2xRAtFUGfnffpd0ZqyuzjamA%40mail.gmail.com
2023-11-25 20:27:28 +01:00
Bruce Momjian 8d981341a5 C comment: clarify that WAL files can be _recycled_ or removed
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/CAB7nPqSDdF0heotQU3gsepgqx+9c+6KjLd3R6aNYH7KKfDd2ig@mail.gmail.com

Author: Michael Paquier

Backpatch-through: master
2023-11-25 10:48:18 -05:00
Bruce Momjian 344afc7769 modify segno. for pg_walfile_name() and pg_walfile_name_offset()
Previously these functions returned the previous segment number if the
LSN was on a segment boundary.  We now always return the current segment
number for an LSN.

Docs updated to reflect this change.  Regression tests added, author
Andres Freund.

Also mentioned in thread https://postgr.es/m/flat/20220204225057.GA1535307%40nathanxps13#d964275c9540d8395e138efc0a75f7e8

BACKWARD INCOMPATIBILITY

Reported-by: Kyotaro Horiguchi

Discussion: https://postgr.es/m/20190726.172120.101752680.horikyota.ntt@gmail.com

Co-authored-by: Kyotaro Horiguchi

Backpatch-through: master
2023-11-24 19:44:09 -05:00
Andres Freund b2e237afdd Release lock on heap buffer before vacuuming FSM
When there are no indexes on a table, we vacuum each heap block after
pruning it and then update the freespace map. Periodically, we also
vacuum the freespace map. This was done while unnecessarily holding a
lock on the heap page. Release the lock before calling
FreeSpaceMapVacuumRange() and, while we're at it, ensure the range
includes the heap block we just vacuumed.

There are no known deadlocks or other similar issues, therefore don't
backpatch. It's certainly not good to do all this work under a lock, but it's
not frequently reached, making it not worth the risk of backpatching.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com
2023-11-17 12:46:55 -08:00
Nathan Bossart 6a72c42fd5 Retire MemoryContextResetAndDeleteChildren() macro.
As of commit eaa5808e8e, MemoryContextResetAndDeleteChildren() is
just a backwards compatibility macro for MemoryContextReset().  Now
that some time has passed, this macro seems more likely to create
confusion.

This commit removes the macro and replaces all remaining uses with
calls to MemoryContextReset().  Any third-party code that use this
macro will need to be adjusted to call MemoryContextReset()
instead.  Since the two have behaved the same way since v9.5, such
adjustments won't produce any behavior changes for all
currently-supported versions of PostgreSQL.

Reviewed-by: Amul Sul, Tom Lane, Alvaro Herrera, Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/20231113185950.GA1668018%40nathanxps13
2023-11-15 13:42:30 -06:00
Heikki Linnakangas c21e6e2fd4 Clear CurrentResourceOwner earlier in CommitTransaction.
Alexander reported a crash with repeated create + drop database, after
the ResourceOwner rewrite (commit b8bff07daa). That was fixed by the
previous commit, but it nevertheless seems like a good idea clear
CurrentResourceOwner earlier, because you're not supposed to use it
for anything after we start releasing it.

Reviewed-by: Alexander Lakhin
Discussion: https://www.postgresql.org/message-id/11b70743-c5f3-3910-8e5b-dd6c115ff829%40gmail.com
2023-11-15 11:03:49 +01:00
Tom Lane 5c62ecf6ec Don't release index root page pin in ginFindParents().
It's clearly stated in the comments that ginFindParents() must keep
the pin on the index's root page that's associated with the topmost
GinBtreeStack item.  However, the code path for the case that the
desired downlink has been pushed down to the next index level
ignored this proviso, and would release the pin anyway if we were
still examining the root level.  That led to an assertion failure
or "buffer NNNN is not owned by resource owner" error later, when
we try to release the pin again at the end of the insertion.

This is quite hard to reproduce, since it can only happen if an
index root page split occurs concurrently with our own insertion.
Thanks to Jeff Janes for finding a test case that triggers it
often enough to allow investigation.

This has been there since the beginning of GIN, so back-patch
to all supported branches.

Discussion: https://postgr.es/m/CAMkU=1yCAKtv86dMrD__Ja-7KzjE=uMeKX8y__cx5W-OEWy2ow@mail.gmail.com
2023-11-13 11:44:35 -05:00
Amit Kapila 861f86beea Use REGBUF_NO_CHANGE at one more place in the hash index.
Commit 00d7fb5e2e started to use REGBUF_NO_CHANGE at a few places in the
code where we register the buffer before marking it dirty but missed
updating one of the code flows in the hash index where we free the overflow
page without any live tuples on it.

Author: Amit Kapila and Hayato Kuroda
Discussion: http://postgr.es/m/f045c8f7-ee24-ead6-3679-c04a43d21351@gmail.com
2023-11-13 14:08:26 +05:30
Amit Kapila 8bfb231b43 Prohibit max_slot_wal_keep_size to value other than -1 during upgrade.
We don't want existing slots in the old cluster to get invalidated during
the upgrade. During an upgrade, we set this variable to -1 via the command
line in an attempt to prevent such invalidations, but users have ways to
override it. This patch ensures the value is not overridden by the user.

Author: Kyotaro Horiguchi
Reviewed-by: Peter Smith, Alvaro Herrera
Discussion: http://postgr.es/m/20231027.115759.2206827438943188717.horikyota.ntt@gmail.com
2023-11-10 08:45:01 +05:30
David Rowley 10d34fefc2 Ensure we use the correct spelling of "ensure"
We seem to have accidentally used "insure" in a few places.  Correct
that.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Pv0biqrhA3pMhu40aDsj343mTsD75khKnHsLqR8P04f=Q@mail.gmail.com
Backpatch-through: 12, oldest supported version
2023-11-10 00:15:54 +13:00
Heikki Linnakangas b8bff07daa Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.

The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.

All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.

The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.

Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget.  There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.

Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure if they could reach a codepath that needs
remembering extra resources.

There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer io triggered a PANIC. Now
everything prints a WARNING, including those cases.

Add tests in src/test/modules/test_resowner.

Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 13:30:50 +02:00
Michael Paquier 1b2c6b756e Enlarge assertion in bloom_init() for false_positive_rate
false_positive_rate is a parameter that can be set with the bloom
opclass in BRIN, and setting it to a value of exactly 0.25 would trigger
an assertion in the first INSERT done on the index with value set.

The assertion changed here relied on BLOOM_{MIN|MAX}_FALSE_POSITIVE_RATE
that are somewhat arbitrary values, and specifying an out-of-range value
would also trigger a failure when defining such an index.  So, as-is,
the assertion was just doubling on the min-max check of the reloption.
This is now enlarged to check that it is a correct percentage value,
instead, based on a suggestion by Tom Lane.

Author: Alexander Lakhin
Reviewed-by: Tom Lane, Shihao Zhong
Discussion: https://postgr.es/m/17969-a6c54de48026d694@postgresql.org
Backpatch-through: 14
2023-11-08 14:06:26 +09:00
Bruce Momjian 989adace3f doc: 1-byte varlena headers can be used for user PLAIN storage
This also updates some C comments.

Reported-by: suchithjn22@gmail.com

Discussion: https://postgr.es/m/167336599095.2667301.15497893107226841625@wrigleys.postgresql.org

Author: Laurenz Albe (doc patch)

Backpatch-through: 11
2023-10-31 09:10:35 -04:00
Noah Misch 13503eb590 Diagnose !indisvalid in more SQL functions.
pgstatindex failed with ERRCODE_DATA_CORRUPTED, of the "can't-happen"
class XX.  The other functions succeeded on an empty index; they might
have malfunctioned if the failed index build left torn I/O or other
complex state.  Report an ERROR in statistics functions pgstatindex,
pgstatginindex, pgstathashindex, and pgstattuple.  Report DEBUG1 and
skip all index I/O in maintenance functions brin_desummarize_range,
brin_summarize_new_values, brin_summarize_range, and
gin_clean_pending_list.  Back-patch to v11 (all supported versions).

Discussion: https://postgr.es/m/20231001195309.a3@google.com
2023-10-30 14:46:05 -07:00
Michael Paquier dc5bd38894 Delay recovery mode LOG after reading backup_label and/or checkpoint record
When beginning recovery, a LOG is displayed by the startup process to
show which recovery mode will be used depending on the .signal file(s)
set in the data folder, like "standby mode", recovery up to a given
target type and value, or archive recovery.

A different patch is under discussion to simplify the startup code by
requiring the presence of recovery.signal and/or standby.signal when a
backup_label file is read.  Delaying a bit this LOG ensures that the
correct recovery mode would be reported, and putting it at this position
does not make it lose its value.

While on it, this commit adds a few comments documenting a bit more the
initial recovery steps and their dependencies, and fixes an incorrect
comment format.  This introduces no behavior changes.

Extracted from a larger patch by me.

Reviewed-by: David Steele, Bowen Shi
Discussion: https://postgr.es/m/ZArVOMifjzE7f8W7@paquier.xyz
2023-10-30 15:28:20 +09:00
Michael Paquier 1ffdc03c21 Mention standby.signal in FATALs for checkpoint record missing at recovery
When beginning recovery from a base backup by reading a backup_label
file, it may be possible that no checkpoint record is available
depending on the method used when the case backup was taken, which would
prevent recovery from beginning.  In this case, the FATAL messages
issued, initially added by c900c15269, mentioned recovery.signal as
an option to do recovery but not standby.signal.  Let's add it as an
available option, for clarity.

Per suggestion from Bowen Shi, extracted from a larger patch by me.

Author: Michael Paquier
Discussion: https://postgr.es/m/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
2023-10-30 13:56:02 +09:00
Michael Paquier 96f052613f Introduce pg_stat_checkpointer
Historically, the statistics of the checkpointer have been always part
of pg_stat_bgwriter.  This commit removes a few columns from
pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent,
renamed columns (plus a new one for the reset timestamp):
- checkpoints_timed -> num_timed
- checkpoints_req -> num_requested
- checkpoint_write_time -> write_time
- checkpoint_sync_time -> sync_time
- buffers_checkpoint -> buffers_written

The fields of PgStat_CheckpointerStats and its SQL functions are renamed
to match with the new field names, for consistency.  Note that
background writer and checkpointer have been split into two different
processes in commits 806a2aee37 and bf405ba8e4.  The pgstat
structures were already split, making this change straight-forward.

Bump catalog version.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
2023-10-30 09:47:16 +09:00
Bruce Momjian 12cf3ac7f3 doc Improve C GUC-related comments
Discussion: https://postgr.es/m/CAEG8a3LZHTR5S+OPZCbZvECwsqdbx=pBRFZZyDjKaAtgoALOQQ@mail.gmail.com

Author: Junwang Zhao

Backpatch-through: master
2023-10-27 19:05:25 -04:00
Tomas Vondra c6cf6d353c Fix minmax-multi distance for extreme interval values
When calculating distance for interval values, the code mostly mimicked
interval_mi, i.e. it built a new interval value for the difference.
That however does not work for sufficiently distant interval values,
when the difference overflows the interval range.

Instead, we can calculate the distance directly, without constructing
the intermediate (and unnecessary) interval value.

Backpatch to 14, where minmax-multi indexes were introduced.

Reported-by: Dean Rasheed
Reviewed-by: Ashutosh Bapat, Dean Rasheed
Backpatch-through: 14
Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27 18:15:37 +02:00
Tomas Vondra 8da86d62a1 Fix minmax-multi on infinite date/timestamp values
Make sure that infinite values in date/timestamp columns are treated as
if in infinite distance. Infinite values should not be merged with other
values, leaving them as outliers. The code however returned distance 0
in this case, so that infinite values were merged first. While this does
not break the index (i.e. it still produces correct query results), it
may make it much less efficient.

We don't need explicit handling of infinite date/timestamp values when
calculating distances, because those values are represented as extreme
but regular values (e.g. INT64_MIN/MAX for the timestamp type).

We don't need an exact distance, just a value that is much larger than
distanced between regular values. With the added cast to double values,
we can simply subtract the values.

The regression test queries a value in the "gap" and checks the range
was properly eliminated by the BRIN index.

This only affects minmax-multi indexes on timestamp/date columns with
infinite values, which is not very common in practice. The affected
indexes may need to be rebuilt.

Backpatch to 14, where minmax-multi indexes were introduced.

Reported-by: Ashutosh Bapat
Reviewed-by: Ashutosh Bapat, Dean Rasheed
Backpatch-through: 14
Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27 18:15:37 +02:00
Tomas Vondra 394d517314 Fix calculation in brin_minmax_multi_distance_date
When calculating the distance between date values, make sure to subtract
them in the right order, i.e. (larger - smaller).

The distance is used to determine which values to merge, and is expected
to be a positive value. The code unfortunately did the subtraction in
the opposite order, i.e. (smaller - larger), thus producing negative
values and merging values the most distant values first.

The resulting index is correct (i.e. produces correct results), but may
be significantly less efficient. This affects all minmax-multi indexes
on date columns.

Backpatch to 14, where minmax-multi indexes were introduced.

Reported-by: Ashutosh Bapat
Reviewed-by: Ashutosh Bapat, Dean Rasheed
Backpatch-through: 14
Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27 18:15:37 +02:00
Tomas Vondra b5489b75c6 Fix overflow when calculating timestamp distance in BRIN
When calculating distances for timestamp values for BRIN minmax-multi
indexes, we need to be careful about overflows for extreme values. If
the value overflows into a negative value, the index may be inefficient.

The new regression test checks this for the timestamp type by adding a
table with enough values to force range compaction/merging. The values
are close to min/max, which means a risk of overflow.

Fixed by converting the int64 values to double first, before calculating
the distance. This prevents the overflow. We may lose some precision, of
course, but that's good enough. In the worst case we build a slightly
less efficient index, but for large distances this won't matter.

This only affects minmax-multi indexes on timestamp columns, with ranges
containing values sufficiently distant to cause an overflow. That seems
like a fairly rare case in practice.

Backpatch to 14, where minmax-multi indexes were introduced.

Reported-by: Ashutosh Bapat
Reviewed-by: Ashutosh Bapat, Dean Rasheed
Backpatch-through: 14
Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
2023-10-27 18:15:37 +02:00
Peter Eisentraut 611806cd72 Add trailing commas to enum definitions
Since C99, there can be a trailing comma after the last value in an
enum definition.  A lot of new code has been introducing this style on
the fly.  Some new patches are now taking an inconsistent approach to
this.  Some add the last comma on the fly if they add a new last
value, some are trying to preserve the existing style in each place,
some are even dropping the last comma if there was one.  We could
nudge this all in a consistent direction if we just add the trailing
commas everywhere once.

I omitted a few places where there was a fixed "last" value that will
always stay last.  I also skipped the header files of libpq and ecpg,
in case people want to use those with older compilers.  There were
also a small number of cases where the enum type wasn't used anywhere
(but the enum values were), which ended up confusing pgindent a bit,
so I left those alone.

Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
2023-10-26 09:20:54 +02:00
Jeff Davis 00d7fb5e2e Assert that buffers are marked dirty before XLogRegisterBuffer().
Enforce the rule from transam/README in XLogRegisterBuffer(), and
update callers to follow the rule.

Hash indexes sometimes register clean pages as a part of the locking
protocol, so provide a REGBUF_NO_CHANGE flag to support that use.

Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi
Reviewed-by: Heikki Linnakangas
2023-10-23 17:17:46 -07:00
Robert Haas 5b36e8f078 Change struct tablespaceinfo's oid member from 'char *' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are file in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.

In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.

This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.

Patch by me, reviewed by David Steele.

Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
2023-10-23 15:17:26 -04:00
Robert Haas afd12774ae During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.
This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream. There are other possible applications as
well; see the discussion links below.

Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.

While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.

Bump XLOG_PAGE_MAGIC.

Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de
Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
2023-10-19 14:47:29 -04:00
Robert Haas 2406c4e34c Reword messages about impending (M)XID exhaustion.
First, we shouldn't recommend switching to single-user mode, because
that's terrible advice. Especially on newer versions where VACUUM
will enter emergency mode when nearing (M)XID exhaustion, it's
perfectly fine to just VACUUM in multi-user mode. Doing it that way
is less disruptive and avoids disabling the safeguards that prevent
actual wraparound, so recommend that instead.

Second, be more precise about what is going to happen (when we're
nearing the limits) or what is happening (when we actually hit them).
The database doesn't shut down, nor does it refuse all commands. It
refuses commands that assign whichever of XIDs and MXIDs are nearly
exhausted.

No back-patch. The existing hint that advises going to single-user
mode is sufficiently awful advice that removing it or changing it
might be justifiable even though we normally avoid changing
user-facing messages in back-branches, but I (rhaas) felt that it
was better to be more conservative and limit this fix to master
only. Aside from the usual risk of breaking translations, people
might be used to the existing message, or even have monitoring
scripts that look for it.

Alexander Alekseev, John Naylor, Robert Haas, reviewed at various
times by Peter Geoghegan, Hannu Krosing, and Andres Freund.

Discussion: http://postgr.es/m/CA+TgmoZBg95FiR9wVQPAXpGPRkacSt2okVge+PKPPFppN7sfnQ@mail.gmail.com
2023-10-17 10:34:21 -04:00
Robert Haas a1a5da8cb7 Talk about assigning, rather than generating, new MultiXactIds.
The word "assign" is used in various places internally to describe what
GetNewMultiXactId does, but the user-facing messages have previously
said "generate". For consistency, standardize on "assign," which seems
(at least to me) to be slightly clearer.

Discussion: http://postgr.es/m/CA+TgmoaoE1_i3=4-7GCTtKLVZVQ2Gh6qESW2VG1OprtycxOHMA@mail.gmail.com
2023-10-17 10:23:31 -04:00
Nathan Bossart 8fb13dd6ab Move extra code out of the Pre/PostRestoreCommand() section.
If SIGTERM is received within this section, the startup process
will immediately proc_exit() in the signal handler, so it is
inadvisable to include any more code than is required there (as
such code is unlikely to be compatible with doing proc_exit() in a
signal handler).  This commit moves the code recently added to this
section (see 1b06d7bac9 and 7fed801135) to outside of the section.
This ensures that the startup process only calls proc_exit() in its
SIGTERM handler for the duration of the system() call, which is how
this code worked from v8.4 to v14.

Reported-by: Michael Paquier, Thomas Munro
Analyzed-by: Andres Freund
Suggested-by: Tom Lane
Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz
Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13
Backpatch-through: 15
2023-10-16 12:41:55 -05:00
Thomas Munro 01529c7040 Fix comment from commit 22655aa231.
Per automated complaint from BF animal koel this needed to be
re-indented, but there was also a typo.  Back-patch to 16.
2023-10-16 13:32:41 +13:00
Andres Freund 22655aa231 Fix bulk table extension when copying into multiple partitions
When COPYing into a partitioned table that does now permit the use of
table_multi_insert(), we could error out with
  ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes

because BulkInsertState->next_free was not reset between partitions. This
problem occurred only when not able to use table_multi_insert(), as a
dedicated BulkInsertState for each partition is used in that case.

The bug was introduced in 00d1e02be2, but it was hard to hit at that point,
as commonly bulk relation extension is not used when not using
table_multi_insert(). It became more likely after 82a4edabd2, which expanded
the use of bulk extension.

To fix the bug, reset the bulk relation extension state in BulkInsertState in
ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fc) to tackle a very
similar issue.  Obviously the name is not quite correct, but there might be
external callers, and bulk insert state needs to be reset in precisely in the
situations that ReleaseBulkInsertStatePin() already needed to be called.

Medium term the better fix likely is to disallow reusing BulkInsertState
across relations.

Add a test that, without the fix, reproduces #18130 in most
configurations. The test also catches the problem fixed in b1ecb9b3fc when
run with small shared_buffers.

Reported-by: Ivan Kolombet <enderstd@gmail.com>
Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us>
Analyzed-by: Andres Freund <andres@anarazel.de>
Bug: #18130
Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org
Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us
Backpatch: 16-
2023-10-13 19:16:44 -07:00
Nathan Bossart 8d140c5822 Improve the naming in wal_sync_method code.
* sync_method is renamed to wal_sync_method.

* sync_method_options[] is renamed to wal_sync_method_options[].

* assign_xlog_sync_method() is renamed to assign_wal_sync_method().

* The names of the available synchronization methods are now
  prefixed with "WAL_SYNC_METHOD_" and have been moved into a
  WalSyncMethod enum.

* PLATFORM_DEFAULT_SYNC_METHOD is renamed to
  PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is
  renamed to DEFAULT_WAL_SYNC_METHOD.

These more descriptive names help distinguish the code for
wal_sync_method from the code for DataDirSyncMethod (e.g., the
recovery_init_sync_method configuration parameter and the
--sync-method option provided by several frontend utilities).  This
change also prevents name collisions between the aforementioned
sets of code.  Since this only improves the naming of internal
identifiers, there should be no behavior change.

Author: Maxim Orlov
Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
2023-10-13 15:16:45 -05:00
Thomas Munro 0013ba290b Add wait events for checkpoint delay mechanism.
When MyProc->delayChkptFlags is set to temporarily block phase
transitions in a concurrent checkpoint, the checkpointer enters a
sleep-poll loop to wait for the flag to be cleared.  We should show that
as a wait event in the pg_stat_activity view.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
2023-10-13 16:43:22 +13:00
Robert Haas df9a3d4e99 Unify two isLogSwitch tests in XLogInsertRecord.
An upcoming patch wants to introduce an additional special case in
this function. To keep that as cheap as possible, minimize the amount
of branching that we do based on whether this is an XLOG_SWITCH
record.

Additionally, and also in the interest of keeping the overhead of
special-case code paths as low as possible, apply likely() to the
non-XLOG_SWITCH case, since only a very tiny fraction of WAL records
will be XLOG_SWITCH records.

Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund,
and Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
2023-10-12 13:48:21 -04:00
Tom Lane b6a77c6a6c Reindent comment in GenericXLogFinish().
Restore pgindent cleanliness, per buildfarm member koel.
2023-10-11 17:14:31 -04:00
Jeff Davis ef74c7197c Fix bug in GenericXLogFinish().
Mark the buffers dirty before writing WAL.

Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi
Reviewed-by: Heikki Linnakangas
Backpatch-through: 11
2023-10-10 11:01:13 -07:00
Peter Eisentraut 1d91d24d9a Add const to values and nulls arguments
This excludes any changes that would change the external AM APIs.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org
2023-10-10 07:50:43 +02:00
Amit Kapila 7cc2f59dd5 Remove duplicate words in docs and code comments.
Additionally, add a missing "the" in a couple of places.

Author: Vignesh C, Dagfinn Ilmari Mannsåker
Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
2023-10-09 09:18:47 +05:30
Alexander Korotkov 82a7132f53 Fix another typo in e0b1ee17dc
Reported-by: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4_kHMJDak75y1kBTirv-drS1-knT-7Mpg5LprAjqRJDVA%40mail.gmail.com
2023-10-07 20:36:47 +03:00
Alexander Korotkov e8c334c47a Fix typos in e0b1ee17dc
Reported-by: Alexander Lakhin
2023-10-07 11:55:55 +03:00
Alexander Korotkov e0b1ee17dc Skip checking of scan keys required for directional scan in B-tree
Currently, B-tree code matches every scan key to every item on the page.
Imagine the ordered B-tree scan for the query like this.

SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

The (col > 'a') scan key will be always matched once we find the location to
start the scan.  The (col < 'b') scan key will match every item on the page
as long as it matches the last item on the page.

This patch implements prechecking of the scan keys required for directional
scan on beginning of page scan.  If precheck is successful we can skip this
scan keys check for the items on the page.  That could lead to significant
acceleration especially if the comparison operator is expensive.

Idea from patch by Konstantin Knizhnik.

Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan, Pavel Borisov
2023-10-06 10:40:51 +03:00
Peter Eisentraut 04e485273b Move BuildDescForRelation() from tupdesc.c to tablecmds.c
BuildDescForRelation() main job is to convert ColumnDef lists to
pg_attribute/tuple descriptor arrays, which is really mostly an
internal subroutine of DefineRelation() and some related functions,
which is more the remit of tablecmds.c and doesn't have much to do
with the basic tuple descriptor interfaces in tupdesc.c.  This is also
supported by observing the header includes we can remove in tupdesc.c.
By moving it over, we can also (in the future) make
BuildDescForRelation() use more internals of tablecmds.c that are not
sensible to be exposed in tupdesc.c.

Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
2023-10-05 16:20:46 +02:00
Peter Eisentraut 6d341407a6 Push attidentity and attgenerated handling into BuildDescForRelation()
Previously, this was handled by the callers separately, but it can be
trivially moved into BuildDescForRelation() so that it is handled in a
central place.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
2023-10-05 16:20:46 +02:00
David Rowley 2075ba9dc9 Tidy-up some appendStringInfo*() usages
Make a few newish calls to appendStringInfo() which have no special
formatting use appendStringInfoString() instead.  Also, adjust usages of
appendStringInfoString() which only append a string containing a single
character to make use of appendStringInfoChar() instead.

This makes the code marginally faster, but primarily this change is so
we use the StringInfo type as it was intended to be used.

Discussion: https://postgr.es/m/CAApHDvpXKQmL+r=VDNS98upqhr9yGBhv2Jw3GBFFk_wKHcB39A@mail.gmail.com
2023-10-03 17:09:52 +13:00
Michael Paquier 6b18b3fe2c Fail hard on out-of-memory failures in xlogreader.c
This commit changes the WAL reader routines so as a FATAL for the
backend or exit(FAILURE) for the frontend is triggered if an allocation
for a WAL record decode fails in walreader.c, rather than treating this
case as bogus data, which would be equivalent to the end of WAL.  The
key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c,
relying on plain palloc() calls.

The previous behavior could make WAL replay finish too early than it
should.  For example, crash recovery finishing earlier may corrupt
clusters because not all the WAL available locally was replayed to
ensure a consistent state.  Out-of-memory failures would show up
randomly depending on the memory pressure on the host, but one simple
case would be to generate a large record, then replay this record after
downsizing a host, as Ethan Mertz originally reported.

This relies on bae868caf2, as the WAL reader routines now do the
memory allocation required for a record only once its header has been
fully read and validated, making xl_tot_len trustable.  Making the WAL
reader react differently on out-of-memory or bogus record data would
require ABI changes, so this is the safest choice for stable branches.
Also, it is worth noting that 3f1ce97346 has been using a plain
palloc() in this code for some time now.

Thanks to Noah Misch and Thomas Munro for the discussion.

Like the other commit, backpatch down to 12, leaving out v11 that will
be EOL'd soon.  The behavior of considering a failed allocation as bogus
data comes originally from 0ffe11abd3, where the record length
retrieved from its header was not entirely trustable.

Reported-by: Ethan Mertz
Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz
Backpatch-through: 12
2023-10-03 10:21:44 +09:00
Robert Haas 1ccc1e05ae Remove retry loop in heap_page_prune().
The retry loop is needed because heap_page_prune() calls
HeapTupleSatisfiesVacuum() and then lazy_scan_prune() does the same
thing again, and they might get different answers due to concurrent
clog updates.  But this patch makes heap_page_prune() return the
HeapTupleSatisfiesVacuum() results that it computed back to the
caller, which allows lazy_scan_prune() to avoid needing to recompute
those values in the first place. That's nice both because it eliminates
the need for a retry loop and also because it's cheaper.

Melanie Plageman, reviewed by David Geier, Andres Freund, and me.

Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
2023-10-02 11:40:07 -04:00
Heikki Linnakangas f0bd0b4489 Add rmgrdesc README
In the README, briefly explain what rmgrdesc functions are, and why
they are in a separate directory. Commit c03c2eae0a added some
guidelines on the preferred output format; move that to the README
too.

Reviewed-by: Melanie Plageman, Peter Geoghegan
Discussion: https://www.postgresql.org/message-id/9159daf7-f42d-781b-458f-1b2cf32cb256%40iki.fi
2023-10-02 12:18:57 +03:00
Noah Misch e1f95ec8cf Correct assertion and comments about XLogRecordMaxSize.
The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf().

Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com
2023-10-01 12:20:55 -07:00
Peter Geoghegan 714780dcdd Fix btmarkpos/btrestrpos array key wraparound bug.
nbtree's mark/restore processing failed to correctly handle an edge case
involving array key advancement and related search-type scan key state.
Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore
processing (for a merge join) could incorrectly conclude that an
affected array/scan key must not have advanced during the time between
marking and restoring the scan's position.

As a result of all this, array key handling within btrestrpos could skip
a required call to _bt_preprocess_keys().  This confusion allowed later
primitive index scans to overlook tuples matching the true current array
keys.  The scan's search-type scan keys would still have spurious values
corresponding to the final array element(s) -- not values matching the
first/now-current array element(s).

To fix, remember that "array key wraparound" has taken place during the
ongoing btrescan in a flag variable stored in the scan's state, and use
that information at the point where btrestrpos decides if another call
to _bt_preprocess_keys is required.

Oversight in commit 70bc5833, which taught nbtree to handle array keys
during mark/restore processing, but missed this subtlety.  That commit
was itself a bug fix for an issue in commit 9e8da0f7, which taught
nbtree to handle ScalarArrayOpExpr quals natively.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com
Backpatch: 11- (all supported branches).
2023-09-28 16:29:37 -07:00
Robert Haas 4e9fc3a976 Return data from heap_page_prune via a struct.
Previously, one of the values in the struct was returned as the return
value, and another was returned via an output parameter. In
preparation for returning more stuff, consolidate both values into a
struct returned via an output parameter.

Melanie Plageman, reviewed by Andres Freund and by me.

Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
2023-09-28 10:36:34 -04:00
Etsuro Fujita c68f78538f Fix typo in src/backend/access/transam/README. 2023-09-28 19:45:00 +09:00
Peter Eisentraut ebf76f2753 Add TupleDescGetDefault()
This unifies some repetitive code.

Note: I didn't push the "not found" error message into the new
function, even though all existing callers would be able to make use
of it.  Using the existing error handling as-is would probably require
exposing the Relation type via tupdesc.h, which doesn't seem
desirable.  (Or even if we changed it to just report the OID, it would
inject the concept of a relation containing the tuple descriptor into
tupdesc.h, which might be a layering violation.  Perhaps some further
improvements could be considered here separately.)

Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org
2023-09-27 18:52:40 +01:00
Peter Eisentraut b0ae29512c MergeAttributes() and related variable renaming
Mainly, rename "schema" to "columns" and related changes.  The
previous naming has long been confusing.

Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da%40eisentraut.org
2023-09-26 16:08:35 +01:00
Heikki Linnakangas 28d3c2ddcf Fix another bug in parent page splitting during GiST index build.
Yet another bug in the ilk of commits a7ee7c851 and 741b88435. In
741b88435, we took care to clear the memorized location of the
downlink when we split the parent page, because splitting the parent
page can move the downlink. But we missed that even *updating* a tuple
on the parent can move it, because updating a tuple on a gist page is
implemented as a delete+insert, so the updated tuple gets moved to the
end of the page.

This commit fixes the bug in two different ways (belt and suspenders):

1. Clear the downlink when we update a tuple on the parent page, even
   if it's not split. This the same approach as in commits a7ee7c851
   and 741b88435.

   I also noticed that gistFindCorrectParent did not clear the
   'downlinkoffnum' when it stepped to the right sibling. Fix that
   too, as it seems like a clear bug even though I haven't been able
   to find a test case to hit that.

2. Change gistFindCorrectParent so that it treats 'downlinkoffnum'
   merely as a hint. It now always first checks if the downlink is
   still at that location, and if not, it scans the page like before.
   That's more robust if there are still more cases where we fail to
   clear 'downlinkoffnum' that we haven't yet uncovered. With this,
   it's no longer necessary to meticulously clear 'downlinkoffnum',
   so this makes the previous fixes unnecessary, but I didn't revert
   them because it still seems nice to clear it when we know that the
   downlink has moved.

Also add the test case using the same test data that Alexander
posted. I tried to reduce it to a smaller test, and I also tried to
reproduce this with different test data, but I was not able to, so
let's just include what we have.

Backpatch to v12, like the previous fixes.

Reported-by: Alexander Lakhin
Discussion: https://www.postgresql.org/message-id/18129-caca016eaf0c3702@postgresql.org
2023-09-26 14:14:49 +03:00
Thomas Munro becfbdd6c1 Fix edge-case for xl_tot_len broken by bae868ca.
bae868ca removed a check that was still needed.  If you had an
xl_tot_len at the end of a page that was too small for a record header,
but not big enough to span onto the next page, we'd immediately perform
the CRC check using a bogus large length.  Because of arbitrary coding
differences between the CRC implementations on different platforms,
nothing very bad happened on common modern systems.  On systems using
the _sb8.c fallback we could segfault.

Restore that check, add a new assertion and supply a test for that case.
Back-patch to 12, like bae868ca.

Tested-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
2023-09-26 10:53:38 +13:00
Thomas Munro bae868caf2 Don't trust unvalidated xl_tot_len.
xl_tot_len comes first in a WAL record.  Usually we don't trust it to be
the true length until we've validated the record header.  If the record
header was split across two pages, previously we wouldn't do the
validation until after we'd already tried to allocate enough memory to
hold the record, which was bad because it might actually be garbage
bytes from a recycled WAL file, so we could try to allocate a lot of
memory.  Release 15 made it worse.

Since 70b4f82a4b, we'd at least generate an end-of-WAL condition if the
garbage 4 byte value happened to be > 1GB, but we'd still try to
allocate up to 1GB of memory bogusly otherwise.  That was an
improvement, but unfortunately release 15 tries to allocate another
object before that, so you could get a FATAL error and recovery could
fail.

We can fix both variants of the problem more fundamentally using
pre-existing page-level validation, if we just re-order some logic.

The new order of operations in the split-header case defers all memory
allocation based on xl_tot_len until we've read the following page.  At
that point we know that its first few bytes are not recycled data, by
checking its xlp_pageaddr, and that its xlp_rem_len agrees with
xl_tot_len on the preceding page.  That is strong evidence that
xl_tot_len was truly the start of a record that was logged.

This problem was most likely to occur on a standby, because
walreceiver.c recycles WAL files without zeroing out trailing regions of
each page.  We could fix that too, but it wouldn't protect us from rare
crash scenarios where the trailing zeroes don't make it to disk.

With reliable xl_tot_len validation in place, the ancient policy of
considering malloc failure to indicate corruption at end-of-WAL seems
quite surprising, but changing that is left for later work.

Also included is a new TAP test to exercise various cases of end-of-WAL
detection by writing contrived data into the WAL from Perl.

Back-patch to 12.  We decided not to put this change into the final
release of 11.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code)
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
2023-09-23 10:26:24 +12:00
Tom Lane 48e2b234f8 Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.
In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate
the current transaction's properties to the new transaction if
there was any open subtransaction (unreleased savepoint).
Instead, some previous transaction's properties would be restored.
This is because the "if (s->chain)" check in CommitTransactionCommand
examined the wrong instance of the "chain" flag and falsely
concluded that it didn't need to save transaction properties.

Our regression tests would have noticed this, except they used
identical transaction properties for multiple tests in a row,
so that the faulty behavior was not distinguishable from correct
behavior.

Commit 12d768e70 fixed the problem in v15 and later, but only rather
accidentally, because I removed the "if (s->chain)" test to avoid a
compiler warning, while not realizing that the warning was flagging a
real bug.

In v14 and before, remove the if-test and save transaction properties
unconditionally; just as in the newer branches, that's not expensive
enough to justify thinking harder.

Add the comment and extra regression test to v15 and later to
forestall any future recurrence, but there's no live bug in those
branches.

Patch by me, per bug #18118 from Liu Xiang.  Back-patch to v12 where
the AND CHAIN feature was added.

Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
2023-09-21 23:11:30 -04:00
Heikki Linnakangas bf094372d1 Fix GiST README's explanation of the NSN cross-check.
The text got the condition backwards, it's "NSN > LSN", not "NSN < LSN".
While we're at it, expand it a little for clarity.

Reviewed-by: Daniel Gustafsson
Discussion: https://www.postgresql.org/message-id/4cb46e18-e688-524a-0f73-b1f03ed5d6ee@iki.fi
2023-09-19 11:53:51 +03:00
Daniel Gustafsson a396e20ad0 Rename variable for code clarity
When tracking IO timing for WAL, the duration is what we calculate
based on the start and end timestamps, it's not what the variable
contains. Rename the timestamp variable to end to better communicate
what it contains.  Original patch by Krishnakumar with additional
hacking to fix another occurrence by me.

Author: Krishnakumar R <kksrcv001@gmail.com>
Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com
2023-09-15 19:05:57 +02:00
Heikki Linnakangas 18724af9e8 Remove unnecessary smgrimmedsync() when creating unlogged table.
This became safe after commit 4b4798e138. The smgrcreate() call will
now register the segment for syncing at the next checkpoint, so we
don't need to sync it here. If a checkpoint happens before the
creation is WAL-logged, the records will be replayed when starting
recovery from the checkpoint. If a checkpoint happens after the WAL
logging, the checkpoint will fsync() it.

In the passing, clarify a comment in smgrDoPendingSyncs().

Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9%40iki.fi
Reviewed-by: Robert Haas
2023-09-15 17:29:37 +03:00
Daniel Gustafsson b0ec61c9c2 Quote filenames in error messages
The majority of all filenames are quoted in user facing error and
log messages, but a few were still printed without quotes.  While
these filenames do not risk causing any ambiguity as their format
is strict, quote them anyways to be consistent across all logs.

Also concatenate a message to keep it one line to make it easier
to grep for in the code.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/080EEABE-6645-4A46-AB20-6285ADAC44FE@yesql.se
2023-09-14 11:17:33 +02:00
Amit Kapila e0b2eed047 Flush logical slots to disk during a shutdown checkpoint if required.
It's entirely possible for a logical slot to have a confirmed_flush LSN
higher than the last value saved on disk while not being marked as dirty.
Currently, it is not a major problem but a later patch adding support for
the upgrade of slots relies on that value being properly flushed to disk.

It can also help avoid processing the same transactions again in some
boundary cases after the clean shutdown and restart.  Say, we process
some transactions for which we didn't send anything downstream (the
changes got filtered) but the confirm_flush LSN is updated due to
keepalives.  As we don't flush the latest value of confirm_flush LSN, it
may lead to processing the same changes again without this patch.

The approach taken by this patch has been suggested by Ashutosh Bapat.

Author: Vignesh C, Julien Rouhaud, Kuroda Hayato
Reviewed-by: Amit Kapila, Dilip Kumar, Michael Paquier, Ashutosh Bapat, Peter Smith, Hou Zhijie
Discussion: http://postgr.es/m/CAA4eK1JzJagMmb_E8D4au=GYQkxox0AfNBm1FbP7sy7t4YWXPQ@mail.gmail.com
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
2023-09-14 08:57:05 +05:30
Thomas Munro 9f0602539d Remove some more "snapshot too old" vestiges.
Commit f691f5b8 removed the logic, but left behind some now-useless
Snapshot arguments to various AM-internal functions, and missed a couple
of comments.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznj9qSNXZ1P1uWTUD_FeaTezbUazb416EPwi4Qr_jR_6A%40mail.gmail.com
2023-09-08 17:12:12 +12:00
Thomas Munro f691f5b80a Remove the "snapshot too old" feature.
Remove the old_snapshot_threshold setting and mechanism for producing
the error "snapshot too old", originally added by commit 848ef42b.
Unfortunately it had a number of known problems in terms of correctness
and performance, mostly reported by Andres in the course of his work on
snapshot scalability.  We agreed to remove it, after a long period
without an active plan to fix it.

This is certainly a desirable feature, and someone might propose a new
or improved implementation in the future.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com
Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de
Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
2023-09-05 19:53:43 +12:00
Heikki Linnakangas e8d74ad625 Report syncscan position at end of scan.
The comment in heapgettup_advance_block() says that it reports the
scan position before checking for end of scan, but that didn't match
the code. The code was refactored in commit 7ae0ab0ad9, which
inadvertently changed the order of the check and reporting. Change it
back.

This caused a few regression test failures with a small shared_buffers
setting like 10 MB. The 'portals' and 'cluster' tests perform seqscans
that are large enough that sync seqscans kick in. When the sync scan
position is not updated at end of scan, the next seq scan doesn't
start at the beginning of the table, and the test queries are
sensitive to that.

Reviewed-by: Melanie Plageman, David Rowley
Discussion: https://www.postgresql.org/message-id/6f991389-ae22-d844-a9d8-9aceb7c01a9a@iki.fi
Backpatch-through: 16
2023-08-31 13:02:15 +03:00
Michael Paquier f593c5517d Tweak pg_promote() to report failures on kill() or postmaster failures
Since its introduction in 10074651e3, pg_promote() has been returning
a false status in three cases:
- SIGUSR1 not sent to the postmaster process.
- Postmaster death during standby promotion.
- Standby not promoted within the specified wait time.

An application calling this function will have a hard time understanding
what a false state returned actually means.

Per discussion, this switches the two first states to fail rather than
return a "false" status, making the second case more consistent with the
existing CHECK_FOR_INTERRUPTS in the wait loop.  False is only returned
when the promotion is not completed within the specified time (60s by
default).

Author: Ashutosh Sharma
Reviewed-by: Fujii Masao, Laurenz Albe, Michael Paquier
Discussion: https://postgr.es/m/CAE9k0P=QTrwptL0t4J0fuBRDDjgsT-0PVKd-ikd96i1hyL7Bcg@mail.gmail.com
2023-08-29 08:45:04 +09:00
Peter Eisentraut 36e4419d1f Make error messages about WAL segment size more consistent
Make the primary messages more compact and make the detail messages
uniform.  In initdb.c and pg_resetwal.c, use the newish
option_parse_int() to simplify some of the option parsing.  For the
backend GUC wal_segment_size, add a GUC check hook to do the
verification instead of coding it in bootstrap.c.  This might be
overkill, but that way the check is in the right place and it becomes
more self-documenting.

In passing, make pg_controldata use the logging API for warning
messages.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/9939aa8a-d7be-da2c-7715-0a0b5535a1f7@eisentraut.org
2023-08-28 15:17:04 +02:00
Heikki Linnakangas bf8bf6d0bd Fix _bt_allequalimage() call within critical section.
_bt_allequalimage() does complicated things, so it's not OK to call it
in a critical section. Per buildfarm failure on 'prion', which uses
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options.

Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
Backpatch-through: 16, like commit ccadf73163 that introduced this
2023-08-23 18:12:41 +03:00
Heikki Linnakangas ccadf73163 Use the buffer cache when initializing an unlogged index.
Some of the ambuildempty functions used smgrwrite() directly, followed
by smgrimmedsync(). A few small problems with that:

Firstly, one is supposed to use smgrextend() when extending a
relation, not smgrwrite(). It doesn't make much difference in
production builds. smgrextend() updates the relation size cache, so
you miss that, but that's harmless because we never use the cached
relation size of an init fork. But if you compile with
CHECK_WRITE_VS_EXTEND, you get an assertion failure.

Secondly, the smgrwrite() calls were performed before WAL-logging, so
the page image written to disk had 0/0 as the LSN, not the LSN of the
WAL record. That's also harmless in practice, but seems sloppy.

Thirdly, it's better to use the buffer cache, because then you don't
need to smgrimmedsync() the relation to disk, which adds latency.
Bypassing the cache makes sense for bulk operations like index
creation, but not when you're just initializing an empty index.
Creation of unlogged tables is hardly performance bottleneck in any
real world applications, but nevertheless.

Backpatch to v16, but no further. These issues should be harmless in
practice, so better to not rock the boat in older branches.

Reviewed-by: Robert Haas
Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9@iki.fi
2023-08-23 17:21:31 +03:00
Nathan Bossart f4b54e1ed9 Introduce macros for protocol characters.
This commit introduces descriptively-named macros for the
identifiers used in wire protocol messages.  These new macros are
placed in a new header file so that they can be easily used by
third-party code.

Author: Dave Cramer
Reviewed-by: Alvaro Herrera, Tatsuo Ishii, Peter Smith, Robert Haas, Tom Lane, Peter Eisentraut, Michael Paquier
Discussion: https://postgr.es/m/CADK3HHKbBmK-PKf1bPNFoMC%2BoBt%2BpD9PH8h5nvmBQskEHm-Ehw%40mail.gmail.com
2023-08-22 19:16:12 -07:00
Thomas Munro 7114791158 ExtendBufferedWhat -> BufferManagerRelation.
Commit 31966b15 invented a way for functions dealing with relation
extension to accept a Relation in online code and an SMgrRelation in
recovery code.  It seems highly likely that future bufmgr.c interfaces
will face the same problem, and need to do something similar.
Generalize the names so that each interface doesn't have to re-invent
the wheel.

Back-patch to 16.  Since extension AM authors might start using the
constructor macros once 16 ships, we agreed to do the rename in 16
rather than waiting for 17.

Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
2023-08-23 12:31:23 +12:00
Andrew Dunstan a684581085 Cache by-reference missing values in a long lived context
Attribute missing values might be needed past the lifetime of the tuple
descriptors from which they are extracted. To avoid possibly using
pointers for by-reference values which might thus be left dangling, we
cache a datumCopy'd version of the datum in the TopMemoryContext. Since
we first search for the value this only needs to be done once per
session for any such value.

Original complaint from Tom Lane, idea for mitigation by Andrew Dunstan,
tweaked by Tom Lane.

Backpatch to version 11 where missing values were introduced.

Discussion: https://postgr.es/m/1306569.1687978174@sss.pgh.pa.us
2023-08-22 15:17:05 -04:00
Andres Freund 82a4edabd2 hio: Take number of prior relation extensions into account
The new relation extension logic, introduced in 00d1e02be2, could lead to
slowdowns in some scenarios. E.g., when loading narrow rows into a table using
COPY, the caller of RelationGetBufferForTuple() will only request a small
number of pages. Without concurrency, we just extended using pwritev() in that
case. However, if there is *some* concurrency, we switched between extending
by a small number of pages and a larger number of pages, depending on the
number of waiters for the relation extension logic.  However, some
filesystems, XFS in particular, do not perform well when switching between
extending files using fallocate() and pwritev().

To avoid that issue, remember the number of prior relation extensions in
BulkInsertState and extend more aggressively if there were prior relation
extensions. That not just avoids the aforementioned slowdown, but also leads
to noticeable performance gains in other situations, primarily due to
extending more aggressively when there is no concurrency. I should have done
it this way from the get go.

Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
Backpatch: 16-, where the new relation extension code was added
2023-08-14 11:33:09 -07:00
Noah Misch c36b636096 Fix off-by-one in XLogRecordMaxSize check.
pg_logical_emit_message(false, '_', repeat('x', 1069547465)) failed with
self-contradictory message "WAL record would be 1069547520 bytes (of
maximum 1069547520 bytes)".  There's no particular benefit from allowing
or denying one byte in either direction; XLogRecordMaxSize could rise a
few megabytes without trouble.  Hence, this is just for cleanliness.
Back-patch to v16, where this check first appeared.
2023-08-12 14:37:05 -07:00
Michael Paquier 66d86d4201 Document more assumptions of LWLock variable changes with WAL inserts
This commit adds a few comments about what LWLockWaitForVar() relies on
when a backend waits for a variable update on its LWLocks for WAL
insertions up to an expected LSN.

First, LWLockWaitForVar() does not include a memory barrier, relying on
a spinlock taken at the beginning of WaitXLogInsertionsToFinish().  This
was hidden behind two layers of routines in lwlock.c.  This assumption
is now documented at the top of LWLockWaitForVar(), and detailed at bit
more within LWLockConflictsWithVar().

Second, document why WaitXLogInsertionsToFinish() does not include
memory barriers, relying on a spinlock at its top, which is, per Andres'
input, fine for two different reasons, both depending on the fact that
the caller of WaitXLogInsertionsToFinish() is waiting for a LSN up to a
certain value.

This area's documentation and assumptions could be improved more in the
future, but at least that's a beginning.

Author: Bharath Rupireddy, Andres Freund
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-07-26 12:06:04 +09:00
Michael Paquier 71e4cc6b8e Optimize WAL insertion lock acquisition and release with some atomics
The WAL insertion lock variable insertingAt is currently being read
and written with the help of the LWLock wait list lock to avoid any read
of torn values.  This wait list lock can become a point of contention on
a highly concurrent write workloads.

This commit switches insertingAt to a 64b atomic variable that provides
torn-free reads/writes.  On platforms without 64b atomic support, the
fallback implementation uses spinlocks to provide the same guarantees
for the values read.  LWLockWaitForVar(), through
LWLockConflictsWithVar(), reads the new value to check if it still needs
to wait with a u64 atomic operation.  LWLockUpdateVar() updates the
variable before waking up the waiters with an exchange_u64 (full memory
barrier).  LWLockReleaseClearVar() now uses also an exchange_u64 to
reset the variable.  Before this commit, all these steps relied on
LWLockWaitListLock() and LWLockWaitListUnlock().

This reduces contention on LWLock wait list lock and improves
performance of highly-concurrent write workloads.  Here are some
numbers using pg_logical_emit_message() (HEAD at d6677b93) with various
arbitrary record lengths and clients up to 1k on a rather-large machine
(64 vCPUs, 512GB of RAM, 16 cores per sockets, 2 sockets), in terms of
TPS numbers coming from pgbench:
 message_size_b     |     16 |     64 |    256 |   1024
--------------------+--------+--------+--------+-------
 patch_4_clients    |  83830 |  82929 |  80478 |  73131
 patch_16_clients   | 267655 | 264973 | 250566 | 213985
 patch_64_clients   | 380423 | 378318 | 356907 | 294248
 patch_256_clients  | 360915 | 354436 | 326209 | 263664
 patch_512_clients  | 332654 | 321199 | 287521 | 240128
 patch_1024_clients | 288263 | 276614 | 258220 | 217063
 patch_2048_clients | 252280 | 243558 | 230062 | 192429
 patch_4096_clients | 212566 | 213654 | 205951 | 166955
 head_4_clients     |  83686 |  83766 |  81233 |  73749
 head_16_clients    | 266503 | 265546 | 249261 | 213645
 head_64_clients    | 366122 | 363462 | 341078 | 261707
 head_256_clients   | 132600 | 132573 | 134392 | 165799
 head_512_clients   | 118937 | 114332 | 116860 | 150672
 head_1024_clients  | 133546 | 115256 | 125236 | 151390
 head_2048_clients  | 137877 | 117802 | 120909 | 138165
 head_4096_clients  | 113440 | 115611 | 120635 | 114361

Bharath has been measuring similar improvements, where the limit of the
WAL insertion lock begins to be felt when more than 256 concurrent
clients are involved in this specific workload.

An extra patch has been discussed to introduce a fast-exit path in
LWLockUpdateVar() when there are no waiters, still this does not
influence the write-heavy workload cases discussed as there are always
waiters.  This will be considered separately.

Author: Bharath Rupireddy
Reviewed-by: Nathan Bossart, Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACVF+6jLvqKe6xhDzCCkr=rfd6upaGc3477Pji1Ke9G7Bg@mail.gmail.com
2023-07-25 13:38:58 +09:00
Michael Paquier 4e465aac36 Fix indentation in twophase.c
This has been missed in cb0cca1, noticed before buildfarm member koel
has been able to complain while poking at a different patch.  Like the
other commit, backpatch all the way down to limit the odds of merge
conflicts.

Backpatch-through: 11
2023-07-18 14:04:31 +09:00
Michael Paquier cb0cca1880 Fix recovery of 2PC transaction during crash recovery
A crash in the middle of a checkpoint with some two-phase state data
already flushed to disk by this checkpoint could cause a follow-up crash
recovery to recover twice the same transaction, once from what has been
found in pg_twophase/ at the beginning of recovery and a second time
when replaying its corresponding record.

This would lead to FATAL failures in the startup process during
recovery, where the same transaction would have a state recovered twice
instead of once:
LOG:  recovering prepared transaction 731 from shared memory
LOG:  recovering prepared transaction 731 from shared memory
FATAL:  lock ExclusiveLock on object 731/0/0 is already held

This issue is fixed by skipping the addition of any 2PC state coming
from a record whose equivalent 2PC state file has already been loaded in
TwoPhaseState at the beginning of recovery by restoreTwoPhaseData(),
which is OK as long as the system has not reached a consistent state.

The timing to get a messed up recovery processing is very racy, and
would very unlikely happen.  The thread that has reported the issue has
demonstrated the bug using injection points to force a PANIC in the
middle of a checkpoint.

Issue introduced in 728bd99, so backpatch all the way down.

Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: Michael Paquier
Discussion: https://postgr.es/m/109e6994-b971-48cb-84f6-829646f18b4c.mengjuan.cmj@alibaba-inc.com
Backpatch-through: 11
2023-07-18 13:43:44 +09:00
Nathan Bossart 0fef877538 Rename session_auth_is_superuser to current_role_is_superuser.
This variable might've been accurately named when it was added in
ea886339b8, but the name hasn't been accurate since at least the
introduction of SET ROLE in e5d6b91220.  The corresponding
documentation was fixed in eedb068c0a.  This commit renames the
variable accordingly.

Suggested-by: Joseph Koshakow
Discussion: https://postgr.es/m/CAAvxfHc-HHzONQ2oXdvhFF9ayRnidPwK%2BfVBhRzaBWYYLVQL-g%40mail.gmail.com
2023-07-12 21:28:54 -07:00
Masahiko Sawada 46ebdfe164 Report index vacuum progress.
This commit adds two columns: indexes_total and indexes_processed, to
pg_stat_progress_vacuum system view to show the index vacuum
progress. These numbers are reported in the "vacuuming indexes" and
"cleaning up indexes" phases.

This uses the new parallel message type for progress reporting added
by be06506e7.

Bump catversion because this changes the definition of
pg_stat_progress_vacuum.

Author: Sami Imseih
Reviewed by: Masahiko Sawada, Michael Paquier, Nathan Bossart, Andres Freund
Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com
2023-07-11 12:34:01 +09:00
Masahiko Sawada f1889729dd Add new parallel message type to progress reporting.
This commit adds a new type of parallel message 'P' to allow a
parallel worker to poke at a leader to update the progress.

Currently it supports only incremental progress reporting but it's
possible to allow for supporting of other backend progress APIs in the
future.

There are no users of this new message type as of this commit. That
will follow in future commits.

Idea from Andres Freund.

Author: Sami Imseih
Reviewed by: Michael Paquier, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com
2023-07-11 12:33:54 +09:00
Peter Eisentraut e53a611523 Message wording improvements 2023-07-10 10:47:24 +02:00
Tomas Vondra ec99d6e9c8 Document relaxed HOT for summarizing indexes
Commit 19d8e2308b allowed a weaker check for HOT with summarizing
indexes, but it did not update README.HOT. So do that now.

Patch by Matthias van de Meent, minor changes by me. Backpatch to 16,
where the optimization was introduced.

Author: Matthias van de Meent
Reviewed-by: Tomas Vondra
Backpatch-through: 16
Discussion: https://postgr.es/m/CAEze2WiEOm8V+c9kUeYp2BPhbEc5s473fUf51xNeqvSFGv44Ew@mail.gmail.com
2023-07-07 19:04:53 +02:00
Thomas Munro 126552c85c Fix race in SSI interaction with gin fast path.
The ginfast.c code previously checked for conflicts in before locking
the relevant buffer, leaving a window where a RW conflict could be
missed.  Re-order.

There was also a place where buffer ID and block number were confused
while trying to predicate-lock a page, noted by visual inspection.

Back-patch to all supported releases.  Fixes one more problem discovered
with the reproducer from bug #17949, in this case when Dmitry tried
other index types.

Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reported-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04 09:07:31 +12:00
Thomas Munro bcc93a389c Fix race in SSI interaction with bitmap heap scan.
When performing a bitmap heap scan, we don't want to miss concurrent
writes that occurred after we observed the heap's rs_nblocks, but before
we took predicate locks on index pages.  Therefore, we can't skip
fetching any heap tuples that are referenced by the index, because we
need to test them all with CheckForSerializableConflictOut().  The
old optimization that would ignore any references to blocks >=
rs_nblocks gets in the way of that requirement, because it means that
concurrent writes in that window are ignored.

Removing that optimization shouldn't affect correctness at any isolation
level, because any new tuples shouldn't be visible to an MVCC snapshot.
There also shouldn't be any error-causing references to heap blocks past
the end, because we should have held at least an AccessShareLock on the
table before the index scan.  It can't get smaller while our transaction
is running.  For now, though, we'll keep the optimization at lower
levels to avoid making unnecessary changes in a bug fix.

Back-patch to all supported releases.  In release 11, the code is in a
different place but not fundamentally different.  Fixes one aspect of
bug #17949.

Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04 09:07:31 +12:00
Thomas Munro f9b7fc651a Fix race in SSI interaction with empty btrees.
When predicate-locking btrees, we have a special case for completely
empty btrees, since there is no page to lock.  This was racy, because,
without buffer lock held, a matching key could be inserted between the
_bt_search() and the PredicateLockRelation() calls.

Fix, by rechecking _bt_search() after taking the relation-level SIREAD
lock, if using SERIALIZABLE isolation and an empty btree is discovered.

Back-patch to all supported releases.  Fixes one aspect of bug #17949.

Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04 09:07:31 +12:00
Tomas Vondra 29cf61ade3 Consider fillfactor when estimating relation size
When table_block_relation_estimate_size() estimated the number of tuples
in a relation without statistics (e.g. right after load), it did not
consider fillfactor when calculating density. With non-default
fillfactor values, this may result in significant overestimate of the
number of tuples - up to 10x with the minimum 10% fillfactor. This may
have unexpected consequences, e.g. when creating hash indexes.

This considers the current fillfactor value in the "no statistics" code
path.  If the fillfactor changes after loading data into the table, the
estimate may be off. But that seems much less likely than changing the
fillfactor before the data load.

Reviewed-by: Corey Huinker, Peter Eisentraut
Discussion: https://postgr.es/m/cf154ef9-6bac-d268-b735-67a3443debba@enterprisedb.com
2023-07-03 18:55:31 +02:00
Tomas Vondra a4cfeeca5a Fix code indentation violations
Commits ce5aaea8cd, 2b8b2852bb and 28d03feac3 violated the expected code
indentation rules, upsetting the new buildfarm member "koel."

Discussion: https://postgr.es/m/ZKIU4mhWpgJOM0W0%40paquier.xyz
2023-07-03 12:47:49 +02:00
Thomas Munro 4637a6ac0b Silence "missing contrecord" error.
Commit dd38ff28ad added a new error message "missing contrecord" when
we fail to reassemble a record.  Unfortunately that caused noisy
messages to be logged by pg_waldump at end of segment, and by walsender
when asked to shut down on a segment boundary.

Remove the new error message, so that this condition signals end-of-
WAL without a message.  It's arguably a reportable condition that should
not be silenced while performing crash recovery, but fixing that without
introducing noise in the other cases will require more research.

Back-patch to 15.

Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com
2023-07-03 11:16:27 +12:00
Tomas Vondra 0457109344 Improve BRIN minmax-multi opclass test coverage
Per the code coverage report, the existing regression tests did not
exercice some a couple important BRIN minmax-multi code paths.

- The tests focused on testing planning with a range of scan key
  strategies, but not the execution. Fixed by adding queries that
  actually test query execution for both equality and inequality.

- All tests created indexes after inserting data, but this only
  exercises the CREATE INDEX strategy that sees all values at once, not
  incremental summary updates. The new tests flip the order and create
  the index before adding data.

- The assert check(s) validating correctness of expanded ranges were
  present only in the "union" code path, which is not covered by
  regression tests at all (as it requires concurrency etc.). Fixed by
  adding the asserts to a couple more places.

Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/57020b2e-d9c9-9bc7-4892-b36d9bb07563%40enterprisedb.com
2023-07-02 10:33:38 +02:00
Tomas Vondra 2b8b2852bb Introduce bloom_filter_size for BRIN bloom opclass
Move the calculation of Bloom filter parameters (for BRIN indexes) into
a separate function to make reuse easier. At the moment we only call it
from one place, but that may change and it's easier to read anyway.

Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/0e1f3350-c9cf-ab62-43a5-5dae314de89c%40enterprisedb.com
2023-07-02 10:24:29 +02:00
Tomas Vondra 28d03feac3 Minor cleanups in the BRIN code
BRIN bloom and minmax-multi opclasses were somewhat inconsistent when
dealing with bool variables, assigning to them Datum values etc. While
not a bug, it makes the code harder to understand, so fix that.

While at it, update an incorrect comment copied to bloom opclass from
minmax, talking about strategies not supported by bloom.

Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/0e1f3350-c9cf-ab62-43a5-5dae314de89c%40enterprisedb.com
2023-07-02 10:21:17 +02:00
Peter Geoghegan 5f0762f147 nbtree VACUUM: cope with topparent inconsistencies.
Avoid "right sibling %u of block %u is not next child" errors when
vacuuming a corrupt nbtree index.  Just LOG the issue and press on.
That way VACUUM will have a decent chance of finishing off all required
processing for the index (and for the table as a whole).

This is similar to recent work from commit 5abff197, as well as work
from commit 5b861baa (later backpatched as commit 43e409ce), which
taught nbtree VACUUM to keep going when its "re-find" check fails.  The
hardening added by this commit takes place directly after the "re-find"
check, right before the critical section for the first stage of page
deletion.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=dayg0vjs4+er84TS9ami=csdzjpuiCGbEw=idhwqhzQ@mail.gmail.com
Backpatch: 11- (all supported versions).
2023-06-21 17:41:58 -07:00
Michael Paquier 68cb5af46c Enable archiving in recovery TAP test 009_twophase.pl
This is a follow-up of f663b00, that has been committed to v13 and v14,
tweaking the TAP test for two-phase transactions so as it provides
coverage for the bug that has been fixed.  This change is done in its
own commit for clarity, as v15 and HEAD did not show the problematic
behavior, still missed coverage for it.

While on it, this adds a comment about the dependency of the last
partial segment rename and RecoverPreparedTransactions() at the end of
recovery, as that can be easy to miss.

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Backpatch-through: 13
2023-06-20 10:25:27 +09:00
Peter Geoghegan d088ba5a5a nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function is two: code that read locks or write
locks existing pages remains in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function called
_bt_allocbuf.  This simplifies most _bt_getbuf callers, since it is no
longer necessary for them to pass a heaprel argument.  Many of the
changes to nbtree from commit 61b313e4 can be reverted.  This minimizes
the divergence between HEAD/PostgreSQL 16 and earlier release branches.

_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to
_bt_getbuf.  There are only 3 affected call sites, all of which continue
to pass a heaprel for recovery conflict purposes.  Note that nbtree's
use of P_NEW was superficial; nbtree never actually relied on the P_NEW
code paths in bufmgr.c, so this change is strictly mechanical.

GiST already took the same approach; it has a dedicated function for
allocating new pages called gistNewBuffer().  That factor allowed commit
61b313e4 to make much more targeted changes to GiST.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-06-10 14:08:25 -07:00
Jeff Davis 2fcc7ee7af Revert "Fix search_path to a safe value during maintenance operations."
This reverts commit 05e1737351.
2023-06-10 08:11:41 -07:00
Jeff Davis 05e1737351 Fix search_path to a safe value during maintenance operations.
While executing maintenance operations (ANALYZE, CLUSTER, REFRESH
MATERIALIZED VIEW, REINDEX, or VACUUM), set search_path to
'pg_catalog, pg_temp' to prevent inconsistent behavior.

Functions that are used for functional indexes, in index expressions,
or in materialized views and depend on a different search path must be
declared with CREATE FUNCTION ... SET search_path='...'.

This change addresses a security risk introduced in commit 60684dd834,
where a role with MAINTAIN privileges on a table may be able to
escalate privileges to the table owner. That commit is not yet part of
any release, so no need to backpatch.

Discussion: https://postgr.es/m/e44327179e5c9015c8dda67351c04da552066017.camel%40j-davis.com
Reviewed-by: Greg Stark
Reviewed-by: Nathan Bossart
2023-06-09 11:20:47 -07:00
Nathan Bossart 9aee26a491 Fix missing word in nbtree/README.
Reported-by: Daniel Westermann
Author: Gurjeet Singh
Reviewed-by: Richard Guo
Discussion: https://postgr.es/m/ZR0P278MB0427F0E0CE4ED140F52D1923D250A%40ZR0P278MB0427.CHEP278.PROD.OUTLOOK.COM
2023-06-08 21:20:24 -07:00
Heikki Linnakangas 95f0340c3b Initialize 'recordXtime' to silence compiler warning.
In reality, recordXtime will always be set by the getRecordTimestamp
call, but the compiler doesn't necessarily see that.

Back-patch to all supported versions.

Author: Tristan Partin
Discussion: https://www.postgresql.org/message-id/CT5MN8E11U0M.1NYNCHXYUHY41@gonk
2023-06-06 20:30:53 +03:00
Peter Geoghegan 5abff197cc nbtree VACUUM: cope with right sibling link corruption.
Avoid "right sibling's left-link doesn't match" errors when vacuuming a
corrupt nbtree index.  Just LOG the issue and press on.  That way VACUUM
will have a decent chance of finishing off all required processing for
the index (and for the table as a whole).

This error was seen in the field from time to time (it's more than a
theoretical risk), so giving VACUUM the ability to press on like this
has real value.  Nothing short of a REINDEX is expected to fix the
underlying index corruption, so giving up (by throwing an error) risks
making a bad situation far worse.  Anything that blocks forward progress
by VACUUM like this might go unnoticed for a long time.  This could
eventually lead to a wraparound/xidStopLimit outage.

Note that _bt_unlink_halfdead_page() has always been able to bail on
page deletion when the target page's left sibling page was in an
inconsistent state.  It now does the same thing (returns false to back
out of the second phase of deletion) when it notices sibling link
corruption in the target page's right sibling page.

This is similar to the work from commit 5b861baa (later backpatched as
commit 43e409ce), which taught nbtree to press on with vacuuming an
index when page deletion fails to "re-find" a downlink in the target
page's parent page.  The "re-find" check seems to make VACUUM bail on
page deletion more often in practice, but there is no reason to take any
chances here.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wzko2q2kP1+UvgJyP9g0mF4hopK0NtQZcxwvMv9_ytGhkQ@mail.gmail.com
Backpatch: 11- (all supported versions).
2023-05-25 15:33:00 -07:00
Tom Lane 0245f8db36 Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical.  We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop).  We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up.  Going
forward, that should make for fewer random-seeming changes to existing
code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-19 17:24:48 -04:00
Peter Eisentraut 8e7912e73d Message style improvements 2023-05-19 18:45:29 +02:00
Tomas Vondra 3581cbdcd6 Fix handling of empty ranges and NULLs in BRIN
BRIN indexes did not properly distinguish between summaries for empty
(no rows) and all-NULL ranges, treating them as essentially the same
thing. Summaries were initialized with allnulls=true, and opclasses
simply reset allnulls to false when processing the first non-NULL value.
This however produces incorrect results if the range starts with a NULL
value (or a sequence of NULL values), in which case we forget the range
contains NULL values when adding the first non-NULL value.

This happens because the allnulls flag is used for two separate
purposes - to mark empty ranges (not representing any rows yet) and
ranges containing only NULL values.

Opclasses don't know which of these cases it is, and so don't know
whether to set hasnulls=true. Setting the flag in both cases would make
it correct, but it would also make BRIN indexes useless for queries with
IS NULL clauses. All ranges start empty (and thus allnulls=true), so all
ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only
happens when adding values to an existing summary with allnulls=true.
This can happen e.g. for small tables (because a summary for the first
range exists for all BRIN indexes), or for tables with large fraction of
NULL values in the indexed columns.

Bulk summarization (e.g. during CREATE INDEX or automatic summarization)
that processes all values at once is not affected by this issue. In this
case the flags were updated in a slightly different way, not forgetting
the NULL values.

To identify empty ranges we use a new flag, stored in an unused bit in
the BRIN tuple header so the on-disk format remains the same. A matching
flag is added to BrinMemTuple, into a 3B gap after bt_placeholder.
That means there's no risk of ABI breakage, although we don't actually
pass the BrinMemTuple to any public API.

We could also skip storing index tuples for empty summaries, but then
we'd have to always process such ranges - even if there are no rows in
large parts of the table (e.g. after a bulk DELETE), it would still
require reading the pages etc. So we store them, but ignore them when
building the bitmap.

Backpatch to 11. The issue exists since BRIN indexes were introduced in
9.5, but older releases are already EOL.

Backpatch-through: 11
Reviewed-by: Justin Pryzby, Matthias van de Meent, Alvaro Herrera
Discussion: https://postgr.es/m/402430e4-7d9d-6cf1-09ef-464d80afff3b@enterprisedb.com
2023-05-19 01:29:44 +02:00
Tomas Vondra 3ec8a3bfb5 Fix handling of NULLs when merging BRIN summaries
When merging BRIN summaries, union_tuples() did not correctly update the
target hasnulls/allnulls flags. When merging all-NULL summary into a
summary without any NULL values, the result had both flags set to false
(instead of having hasnulls=true).

This happened because the code only considered the hasnulls flags,
ignoring the possibility the source summary has allnulls=true.

Discovered while investigating issues with handling empty BRIN ranges
and handling of NULL values, but it's a separate problem (has nothing to
do with empty ranges).

Fixed by considering both flags on the source summary, and updating the
hasnulls flag on the target summary.

Backpatch to 11. The bug exists since 9.5 (where BRIN indexes were
introduced), but those releases are EOL already.

Discussion: https://postgr.es/m/9d993d0d-e431-2196-9ccc-0554d0e60154%40enterprisedb.com
2023-05-18 23:33:23 +02:00
Michael Paquier 8961cb9a03 Fix typos in comments
The changes done in this commit impact comments with no direct
user-visible changes, with fixes for incorrect function, variable or
structure names.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
2023-05-02 12:23:08 +09:00
Daniel Gustafsson 4a6603cd46 Fix assertion failure in heap_vacuum_rel
Commit 7d71d3dd08 changed resetting the VacuumFailsafeActive flag to an
assertion since the flag is reset before starting vacuuming a relation.
This however failed to take recursive calls of vacuum_rel() and vacuum
of TOAST tables into consideration. Fix by reverting back to resettting
the flag.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAFBsxsFz=GqaG5Ens5aNgVYoV2Y+pfMUijX0ku+CCkWfALwiqg@mail.gmail.com
2023-04-28 10:30:05 +02:00
Nathan Bossart b72623671d Prevent underflow in KeepLogSeg().
The call to XLogGetReplicationSlotMinimumLSN() might return a
greater LSN than the one given to the function.  Subsequent segment
number calculations might then underflow, which could result in
unexpected behavior when removing or recyling WAL files.  This was
introduced with max_slot_wal_keep_size in c655077639.  To fix, skip
the block of code for replication slots if the LSN is greater.

Reported-by: Xu Xingwang
Author: Kyotaro Horiguchi
Reviewed-by: Junwang Zhao
Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org
Backpatch-through: 13
2023-04-27 14:31:17 -07:00
Michael Paquier 84cc142674 Re-add tracking of wait event SLRUFlushSync
SLRUFlushSync has been accidently removed during dee663f, that has moved
the flush of the SLRU files to the checkpointer, so add it back.  The
issue has been noticed by Thomas when checking for orphaned wait
events.

Author: Thomas Munro
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/CA+hUKGK6tqm59KuF1z+h5Y8fsWcu5v8+84kduSHwRzwjB2aa_A@mail.gmail.com
2023-04-26 07:10:06 +09:00
Alexander Korotkov cd115c3530 Fix custom validators call in build_local_reloptions()
We need to call them only when validate == true.

Backpatch to 13, where opclass options were introduced.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/2656633.1681831542%40sss.pgh.pa.us
Reviewed-by: Tom Lane, Pavel Borisov
Backpatch-through: 13
2023-04-23 13:58:41 +03:00
Peter Geoghegan 2584639653 Use nbtdesc "level" field name consistently.
The "lev" name that appeared in NEWROOT nbtree record desc output was
inconsistent with the symbol name from the underlying C struct.  It was
also inconsistent with nbtdesc output for other nearby record types with
similar level fields.

Standardize on "level" to make everything consistent.

Follow-up to commit 1c453cfd.
2023-04-19 12:15:15 -07:00
Peter Geoghegan 50547a3fae Fix wal_consistency_checking enhanced desc output.
Recent enhancements to rmgr desc routines that made the output summarize
certain block data (added by commits 7d8219a4 and 1c453cfd) dealt with
records that lack relevant block data (and so have nothing to give a
more detailed summary of) by testing !DecodedBkpBlock.has_image.  As a
result, more detailed descriptions of block data were not output when
wal_consistency_checking was enabled.

This bug affected records with summarizable block data that also
happened to have an FPI that the REDO routine isn't supposed to apply
(FPIs used for consistency checking purposes only).  The presence of
such an FPI was incorrectly taken to indicate the absence of block data.

To fix, test DecodedBkpBlock.has_data, not !DecodedBkpBlock.has_image.
This is the exact condition that we care about, not an inexact proxy.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzm5Sc9cBg1qWV_cEBfLNJCrW9FjS-SoHVt8FLA7Ldn8yg@mail.gmail.com
2023-04-19 10:42:39 -07:00
David Rowley 3f58a4e296 Fix various typos and incorrect/outdated name references
Author: Alexander Lakhin
Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com
2023-04-19 13:50:33 +12:00
Peter Geoghegan 06e0652750 Remove useless argument from nbtree dedup function.
_bt_dedup_pass()'s heapRel argument hasn't been needed or used since
commit cf2acaf4dc made deleting any existing LP_DEAD index tuples the
caller's responsibility.
2023-04-18 10:33:15 -07:00
Robert Haas 363e8f9115 Fix pg_basebackup with in-place tablespaces some more.
Commit c6f2f01611 purported to make
this work, but problems remained. In a plain-format backup, the
files from an in-place tablespace got included in the tar file for
the main tablespace, which is wrong but it's not clear that it
has any user-visible consequences. In a tar-format backup, the
TABLESPACE_MAP option is used, and so we never iterated over
pg_tblspc and thus never backed up the in-place tablespaces
anywhere at all.

To fix this, reverse the changes in that commit, so that when we scan
pg_tblspc during a backup, we create tablespaceinfo objects even for
in-place tablespaces. We set the field that would normally contain the
absolute pathname to the relative path pg_tblspc/${TSOID}, and that's
good enough to make basebackup.c happy without any further changes.

However, pg_basebackup needs a couple of adjustments to make it work.
First, it needs to understand that a relative path for a tablespace
means it's an in-place tablespace.  Second, it needs to tolerate the
situation where restoring the main tablespace tries to create
pg_tblspc or a subdirectory and finds that it already exists, because
we restore user-defined tablespaces before the main tablespace.

Since in-place tablespaces are only intended for use in development
and testing, no back-patch.

Patch by me, reviewed by Thomas Munro and Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmobwvbEp+fLq2PykMYzizcvuNv0a7gPMJtxOTMOuuRLMHg@mail.gmail.com
2023-04-18 11:23:34 -04:00
Tom Lane 3e383f9b68 Avoid trying to write an empty WAL record in log_newpage_range().
If the last few pages in the specified range are empty (all zero),
then log_newpage_range() could try to emit an empty WAL record
containing no FPIs.  This at least upsets an Assert in
ReserveXLogInsertLocation, and might perhaps have bad real-world
consequences in non-assert builds.

This has been broken since log_newpage_range() was introduced,
but the case was hard if not impossible to hit before commit 3d6a98457
decided it was okay to leave VM and FSM pages intentionally zero.
Nonetheless, it seems prudent to back-patch.  log_newpage_range()
was added in v12 but later back-patched, so this affects all
supported branches.

Matthias van de Meent, per report from Justin Pryzby

Discussion: https://postgr.es/m/ZD1daibg4RF50IOj@telsasoft.com
2023-04-17 14:22:26 -04:00
Peter Eisentraut 5f38a2034e Fix incorrect format placeholders 2023-04-12 10:05:50 +02:00
Peter Geoghegan c03c2eae0a Refine the guidelines for rmgrdesc authors.
Clarify the goals of the recently added guidelines for rmgrdesc authors:
to avoid gratuitous inconsistencies across resource managers, and to
make it reasonably easy to write a reusable custom parser.

Beyond that, the guidelines leave rmgrdesc authors with a significant
amount of leeway.  This even includes the leeway to invent custom
conventions (in cases where it's warranted).

Follow-up to commit 7d8219a4.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-11 15:26:24 -07:00
Peter Geoghegan 96149a180d Fix Heap rmgr's desc output for infobits arrays.
Make heap desc routines that output status bit as arrays of constants
avoid outputting array literals that contain superfluous punctuation
characters that complicate parsing the output.  Also make sure that no
heap desc routine repeats the same key name (at the same nesting level),
for the same reason.  Arguably, these were both oversights in commit
7d8219a4.

In passing, make the desc output code (which covers Heap's DELETE,
UPDATE, HOT_UPDATE, LOCK, and LOCK_UPDATED record types) consistent in
terms of the output order of each field.  This order also matches WAL
record struct order.  Heap's DELETE desc output now shows the record's
xmax field for the first time (just like UPDATE/HOT_UPDATE records).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=pNYtxiJ2Jx5Lj=fKo1OEZ4GE0p_kct+ugAUTqBwU46g@mail.gmail.com
2023-04-11 15:25:02 -07:00
Peter Geoghegan e944063294 Fix xl_heap_lock WAL record field's data type.
Make xl_heap_lock's infobits_set field of type uint8, not int8.  Using
int8 isn't appropriate given that the field just holds status bits.
This fixes an oversight in commit 0ac5ad5134.

In passing rename the nearby TransactionId field to "xmax" to make
things consistency with related records, such as xl_heap_lock_updated.

Deliberately avoid a bump in XLOG_PAGE_MAGIC.  No backpatch, either.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkCd3kOS8b7Rfxw7Mh1_6jvX=Nzo-CWR1VBTiOtVZkWHA@mail.gmail.com
2023-04-11 14:07:54 -07:00
Peter Geoghegan 26e65ebdb2 Clarify nbtree posting list update desc issue.
Per complaint from Melanie Plageman.

Follow-up to commit 5d6728e5.

Reported-By: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20230411002315.oyaicmcqrq2hb3ek@liskov
2023-04-10 17:55:23 -07:00
Peter Geoghegan 5d6728e588 Fix nbtree posting list update desc output.
We cannot use the generic array_desc approach with per-tuple nbtree
posting list update metadata because array_desc can only deal with fixed
width elements (e.g., page offset numbers).  Using array_desc led to
incorrect rmgr descriptions for updates from nbtree DELETE/VACUUM WAL
records.

To fix, add specialized code to describe the update metadata as array
elements in desc output.  We now iterate over the update metadata using
an approach that matches related REDO routines.

Also stop showing the updates offset number array separately in nbtree
DELETE/VACUUM desc output.  It's redundant information, since the same
page offset numbers appear in the description of each individual update
element.  Also make some small tweaks to the way that we format arrays
in all desc routines (not just nbtree desc routines) to make arrays a
little less verbose.

Oversight in commit 1c453cfd, which enhanced the nbtree rmgr desc
routines.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-10 11:15:41 -07:00
Andres Freund 0fdab27ad6 Allow logical decoding on standbys
Unsurprisingly, this requires wal_level = logical to be set on the primary and
standby. The infrastructure added in 26669757b6 ensures that slots are
invalidated if the primary's wal_level is lowered.

Creating a slot on a standby waits for a xl_running_xact record to be
processed. If the primary is idle (and thus not emitting xl_running_xact
records), that can take a while.  To make that faster, this commit also
introduces the pg_log_standby_snapshot() function. By executing it on the
primary, completion of slot creation on the standby can be accelerated.

Note that logical decoding on a standby does not itself enforce that required
catalog rows are not removed. The user has to use physical replication slots +
hot_standby_feedback or other measures to prevent that. If catalog rows
required for a slot are removed, the slot is invalidated.

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion, for the addition of the pg_log_standby_snapshot() function.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
2023-04-08 02:20:05 -07:00
Andres Freund e101dfac3a For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.

Per idea from Jeff Davis and Amit Kapila.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
2023-04-08 01:06:00 -07:00
Andres Freund 26669757b6 Handle logical slot conflicts on standby
During WAL replay on the standby, when a conflict with a logical slot is
identified, invalidate such slots. There are two sources of conflicts:
1) Using the information added in 6af1793954, logical slots are invalidated if
   required rows are removed
2) wal_level on the primary server is reduced to below logical

Uses the infrastructure introduced in the prior commit. FIXME: add commit
reference.

Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to
interrupt use of a slot, if called in the startup process. The new recovery
conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot.

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion for the addition of the pg_stat_database_conflicts column.
Bumps PGSTAT_FILE_FORMAT_ID for the same reason.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-08 00:05:44 -07:00
Andres Freund be87200efd Support invalidating replication slots due to horizon and wal_level
Needed for logical decoding on a standby. Slots need to be invalidated because
of the horizon if rows required for logical decoding are removed. If the
primary's wal_level is lowered from 'logical', logical slots on the standby
need to be invalidated.

The new invalidation methods will be used in a subsequent commit.

Logical slots that have been invalidated can be identified via the new
pg_replication_slots.conflicting column.

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion for the addition of the new pg_replication_slots column.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-07 22:40:27 -07:00
Thomas Munro d4e71df6d7 Add io_direct setting (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
where available for data and WAL files, to avoid or minimize kernel
caching.  This hurts performance currently and is not intended for end
users yet.  Later proposed work would introduce our own I/O clustering,
read-ahead, etc to replace the facilities the kernel disables with this
option.

The only user-visible change, if the developer-only GUC is not used, is
that this commit also removes the obscure logic that would activate
O_DIRECT for the WAL when wal_sync_method=open_[data]sync and
wal_level=minimal (which also requires max_wal_senders=0).  Those are
non-default and unlikely settings, and this behavior wasn't (correctly)
documented.  The same effect can be achieved with io_direct=wal.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
2023-04-08 16:35:07 +12:00
Thomas Munro faeedbcefd Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a
later commit, we need the addresses of user space buffers to be well
aligned.  The exact requirements vary by OS and file system (typically
sectors and/or memory pages).  The address alignment size is set to
4096, which is enough for currently known systems: it matches modern
sectors and common memory page size.  There is no standard governing
O_DIRECT's requirements so we might eventually have to reconsider this
with more information from the field or future systems.

Aligning I/O buffers on memory pages is also known to improve regular
buffered I/O performance.

Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are now allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  (3) The buffer
pool is also aligned in shared memory.

WAL buffers were already aligned on XLOG_BLCKSZ.  It's possible for
XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus
for O_DIRECT WAL writes to fail to be well aligned, but that's a
pre-existing condition and will be addressed by a later commit.

BufFiles are not yet addressed (there's no current plan to use O_DIRECT
for those, but they could potentially get some incidental speedup even
in plain buffered I/O operations through better alignment).

If we can't align stack objects suitably using the compiler extensions
we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to
0.  This avoids the need to consider systems that have O_DIRECT but
can't align stack objects the way we want; such systems could in theory
be supported with more work but we don't currently know of any such
machines, so it's easier to pretend there is no O_DIRECT support
instead.  That's an existing and tested class of system.

Add assertions that all buffers passed into smgrread(), smgrwrite() and
smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack
alignment tricks may be unavailable) or the block size has been set too
small to allow arrays of buffers to be all aligned.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
2023-04-08 16:34:50 +12:00
Peter Geoghegan 1c453cfd89 Show more detail in nbtree rmgr descriptions.
Show a detailed description of the page offset number arrays that appear
in certain nbtree WAL records.

Also brings nbtree desc routines in line with the guidelines established
by recent commit 7d8219a4.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:46:23 -07:00
Peter Geoghegan 7d8219a444 Show more detail in heapam rmgr descriptions.
Add helper functions that output arrays in a standard format, and use
the functions inside heapdesc routines.  This allows tools like
pg_walinspect to show a detailed description of the page offset number
arrays for records like PRUNE and VACUUM (unless there was an FPI).

Also document the conventions that desc routines should follow.  Only
the heapdesc routines follow the conventions for now, so they're just
guidelines for the time being.

Based on a suggestion from Andres Freund.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:08:52 -07:00
Michael Paquier 8fcb32db98 Add more protections in WAL record APIs against overflows
This commit adds a limit to the size of an XLogRecord at 1020MB, based
on a suggestion by Heikki Linnakangas.  This counts for the overhead
needed by the XLogReader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace(), based on the record size.  An
assertion based on that is added to detect that any additions in the
XLogReader facilities would not cause any overflows.  If that's ever the
case, the upper bound allowed would need to be adjusted.

Before this, it was possible for an external module to create WAL
records large enough to be assembled but not replayable, causing
failures when replaying such WAL records on standbys.  One case
mentioned where this is possible is the in-core function
pg_logical_emit_message() (wrapper for LogLogicalMessage), that allows
to emit WAL records with an arbitrary amount of data potentially higher
than the replay limit of approximately 1GB (limit of a palloc, minus the
overhead needed by a XLogReader).

This commit is a follow-up of ffd1b6b that has added similar protections
for the block-level data.  Here, the checks are extended to the whole
record length, mainrdata_len being extended from uint32 to uint64 with
the routines registering buffer and record data still limited to uint32
to minimize the checks when assembling a record.  All the error messages
related to overflow checks are improved to provide more context about
the error happening.

Author: Matthias van de Meent
Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
2023-04-07 10:10:17 +09:00
Andres Freund 26158b852d Use ExtendBufferedRelTo() in XLogReadBufferExtended()
Instead of extending the relation block-by-block, use ExtendBufferedRelTo(),
introduced in 31966b151e. This is faster and simpler.

This also somewhat reduces the danger that disconnected segments pose (which
can be "discovered" once the previous segment reaches SEGSIZE), as
ExtendBufferedRelTo() won't extend past the block it has been asked. However,
the risk of the content of such a disconnected segment being invalid
remains.

Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
2023-04-06 17:56:17 -07:00
Andres Freund 00d1e02be2 hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited. It only amortized the cost of acquiring the extension lock, the
relation itself was still extended one-by-one. Bulk extension was also solely
triggered by contention, not by the amount of data inserted.

To address this, use ExtendBufferedRelBy(), introduced in 31966b151e, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:

1) The caller tells RelationGetBufferForTuple() that it will need multiple
   pages. For now that's only used by heap_multi_insert(), see commit FIXME.

2) If there is contention on the extension lock, use the number of waiters for
   the lock as a multiplier for the number of blocks to extend by. This is
   similar to what we already did. Previously we additionally multiplied the
   numbers of waiters by 20, but with the new relation extension
   infrastructure I could not see a benefit in doing so.

Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.

One complication with using the FSM to record empty pages, is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.

This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:53:17 -07:00
Andres Freund 5279e9db8e heapam: Pass number of required pages to RelationGetBufferForTuple()
A future commit will use this information to determine how aggressively to
extend the relation by. In heap_multi_insert() we know accurately how many
pages we need once we need to extend the relation, providing an accurate lower
bound for how much to extend.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:17:16 -07:00
Daniel Gustafsson 7d71d3dd08 Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.

Now, check if a reload is pending roughly once per block, when checking
if we need to delay.

In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.

Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.

Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
Daniel Gustafsson 71a825194f Make vacuum failsafe_active globally visible
While vacuuming a table in failsafe mode, VacuumCostActive should
not be re-enabled.  This currently isn't a problem because vacuum
cost parameters are only refreshed in between vacuuming tables and
failsafe status is reset for every table.

In preparation for allowing vacuum cost parameters to be updated
more frequently, elevate LVRelState->failsafe_active to a global,
VacuumFailsafeActive, which will be checked when determining whether
or not to re-enable vacuum cost-related delays.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 00:54:08 +02:00
Andres Freund 18103b7c5f hio: Don't pin the VM while holding buffer lock while extending
Starting with commit 7db0cd2145, RelationGetBufferForTuple() did a
visibilitymap_pin() while holding an exclusive buffer content lock on a newly
extended page, when using COPY FREEZE. We elsewhere try hard to avoid to doing
IO while holding a content lock. And until 14f98e0af9, that happened while
holding the relation extension lock.

Practically, this isn't a huge issue, because COPY FREEZE is restricted to
relations created or truncated in the current session, so it's unlikely
there's a lot of contention.

We can't avoid doing IO while holding the content lock by pinning the VM
earlier, because we don't know which page it will be on.

While we could just ignore the issue in this case, a future commit will add
bulk relation extension, which needs to enter pages into the FSM while also
trying to hold onto a buffer lock.

To address this issue, use visibilitymap_pin_ok() to see if the relevant
buffer is already pinned. If not, release the buffer, pin the VM buffer, and
acquire the lock again. This opens up a small window for other backends to
insert data onto the page - as the page is not entered into the freespacemap,
other backends won't see it normally, but a concurrent vacuum could enter the
page, if it started just after the relation is extended. In case the page is
used by another backend, retry. This is very similar to how locking
"otherBuffer" is already dealt with.

Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
2023-04-06 11:11:13 -07:00
Andres Freund bba9003b62 hio: Relax rules for calling GetVisibilityMapPins()
GetVisibilityMapPins() insisted on the buffer1/buffer2 being in a specific
order. This required checks at the callsite. As a subsequent patch will add
another callsite, move related logic into GetVisibilityMapPins().

Discussion: https://postgr.es/m/20230403190030.fk2frxv6faklrseb@awork3.anarazel.de
2023-04-06 10:36:30 -07:00
Andres Freund acab1b0914 Convert many uses of ReadBuffer[Extended](P_NEW) to ExtendBufferedRel()
A few places are not converted. Some because they are tackled in later
commits (e.g. hio.c, xlogutils.c), some because they are more
complicated (e.g. brin_pageops.c).  Having a few users of ReadBuffer(P_NEW) is
good anyway, to ensure the backward compat path stays working.

Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-05 18:57:29 -07:00
Andres Freund fcdda1e4b5 Use ExtendBufferedRelTo() in {vm,fsm}_extend()
This uses ExtendBufferedRelTo(), introduced in 31966b151e, to extend the
visibilitymap and freespacemap to the size needed.

It also happens to fix a warning introduced in 3d6a98457d, reported by Tom
Lane.

Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/2194723.1680736788@sss.pgh.pa.us
2023-04-05 17:50:09 -07:00
Andres Freund 12f3867f55 bufmgr: Support multiple in-progress IOs by using resowner
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.

This commit increases the size of the ResourceOwnerData struct, which appears
to have a just about measurable overhead in very extreme workloads. Medium
term we are planning to substantially shrink the size of
ResourceOwnerData. Short term the increase is small enough to not worry about
it for now.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
2023-04-05 14:17:55 -07:00
Andres Freund 3d6a98457d Don't initialize page in {vm,fsm}_extend(), not needed
The read path needs to be able to initialize pages anyway, as relation
extensions are not durable. By avoiding initializing pages, we can, in a
future patch, extend the relation by multiple blocks at once.

Using smgrextend() for {vm,fsm}_extend() is not a good idea in general, as at
least one page of the VM/FSM will be read immediately after, always causing a
cache miss, requiring us to read content we just wrote.

Discussion: https://postgr.es/m/20230301223515.pucbj7nb54n4i4nv@awork3.anarazel.de
2023-04-05 08:19:39 -07:00
Peter Geoghegan 05a304a855 Make SP-GiST redirect cleanup more aggressive.
Commit 61b313e4 made VACUUM pass down a heaprel to index AM bulkdelete
and vacuumcleanup routines.  Although this was primarily intended as
preparation for logical decoding on standbys, it also made it easy to
correct an old deficiency in how we determine how to cleanup SP-GiST
redirect and placeholder tuples.

Pass the heaprel to GlobalVisTestFor() during cleanup of redirect and
placeholder tuples, rather than pessimistically passing NULL.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/02392033-f030-a3c8-c7d0-5c27eb529fec@gmail.com
2023-04-03 11:47:48 -07:00
Peter Geoghegan e48c817395 Recycle deleted nbtree pages more aggressively.
Commit 61b313e4 made nbtree consistently pass down a heaprel to low
level routines like _bt_getbuf().  Although this was primarily intended
as preparation for logical decoding on standbys, it also made it easy to
correct an old deficiency in how nbtree VACUUM determines whether or not
it's now safe to recycle deleted pages.

Pass the heaprel to GlobalVisTestFor() in nbtree routines that deal with
recycle safety.  nbtree now makes less pessimistic assumptions about
recycle safety within non-catalog relations.  This enhancement
complements the recycling enhancement added by commit 9dd963ae25.

nbtree remains just as pessimistic as ever when it comes to recycle
safety within indexes on catalog relations.  There is no fundamental
reason why we need to treat catalog relations differently, though.  The
behavioral inconsistency is a consequence of the way that nbtree uses
nextXID values to implement what Lanin and Shasha call "the drain
technique".  Note in particular that it has nothing to do with whether
or not index tuples might still be required for an older MVCC snapshot.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkaiDxCje0yPuH=3Uh2p1V_2pFGY==xfbZoZu7Ax_NB8g@mail.gmail.com
2023-04-03 11:31:43 -07:00
Peter Geoghegan a349b86603 Move heaprel struct field next to index rel field.
Commit 61b313e4 added a heaprel struct member to IndexVacuumInfo, but
placed it last.  Move the heaprel struct member next to the index struct
member to improve the code's readability.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznG=TV6S9d3VA=y0vBHbXwnLs9_LLdiML=aNJuHeriwxg@mail.gmail.com
2023-04-03 11:01:11 -07:00
Alexander Korotkov 2b65bf046d Revert 11470f544e
Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
2023-04-03 16:54:31 +03:00
David Rowley 4830f10243 Disable vacuum's use of a buffer access strategy during failsafe
Traditionally, vacuum always makes use of a buffer access strategy 32
buffers in size.  This means that running vacuums tend not to cause too
many shared buffers to become dirty, however, this can cause vacuums to
run much more slowly than they otherwise could as WAL flushes will occur
more frequently due to having to flush WAL out to the LSN of the dirty
page before that page can be written to disk.

When we are performing failsafe VACUUMs (as added in 1e55e7d17), we really
want to make the vacuum work go as quickly as possible, so here we disable
the buffer access strategy when entering failsafe mode while vacuuming a
relation.

Per idea and analyis from Andres Freund.

In passing, also include some changes I had intended for 32fbe0239.

Author: Melanie Plageman
Reviewed-by: Justin Pryzby, David Rowley
Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb%40awork3.anarazel.de
2023-04-03 23:05:58 +12:00
Andres Freund 6af1793954 Add info in WAL records in preparation for logical slot conflict handling
This commit only implements one prerequisite part for allowing logical
decoding. The commit message contains an explanation of the overall design,
which later commits will refer back to.

Overall design:

1. We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing error(s) on the standby. To prevent those errors, a new replication
conflict scenario needs to be addressed (as much as hot standby does).

2. Our chosen strategy for dealing with this type of replication slot
is to invalidate logical slots for which needed data has been removed.

3. To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access. That way, during WAL replay, we know when there is a risk of
conflict and, if so, if there is a conflict.

4. We can't rely on the standby's relcache entries for this purpose in
any way, because the startup process can't access catalog contents.

5. Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Why do we need this for logical decoding on standby?

First, let's forget about logical decoding on standby and recall that
on a primary database, any catalog rows that may be needed by a logical
decoding replication slot are not removed.

This is done thanks to the catalog_xmin associated with the logical
replication slot.

But, with logical decoding on standby, in the following cases:

- hot_standby_feedback is off
- hot_standby_feedback is on but there is no a physical slot between
  the primary and the standby. Then, hot_standby_feedback will work,
  but only while the connection is alive (for example a node restart
  would break it)

Then, the primary may delete system catalog rows that could be needed
by the logical decoding on the standby (as it does not know about the
catalog_xmin on the standby).

So, it’s mandatory to identify those rows and invalidate the slots
that may need them if any. Identifying those rows is the purpose of
this commit.

Implementation:

When a WAL replay on standby indicates that a catalog table tuple is
to be deleted by an xid that is greater than a logical slot's
catalog_xmin, then that means the slot's catalog_xmin conflicts with
the xid, and we need to handle the conflict. While subsequent commits
will do the actual conflict handling, this commit adds a new field
isCatalogRel in such WAL records (and a new bit set in the
xl_heap_visible flags field), that is true for catalog tables, so as to
arrange for conflict handling.

The affected WAL records are the ones that already contain the
snapshotConflictHorizon field, namely:

- gistxlogDelete
- gistxlogPageReuse
- xl_hash_vacuum_one_page
- xl_heap_prune
- xl_heap_freeze_page
- xl_heap_visible
- xl_btree_reuse_page
- xl_btree_delete
- spgxlogVacuumRedirect

Due to this new field being added, xl_hash_vacuum_one_page and
gistxlogDelete do now contain the offsets to be deleted as a
FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment.
It's not needed on the others struct where isCatalogRel has
been added.

This commit just introduces the WAL format changes mentioned above. Handling
the actual conflicts will follow in future commits.

Bumps XLOG_PAGE_MAGIC as the several WAL records are changed.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com>  (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
2023-04-02 12:32:19 -07:00
Andres Freund 61b313e47e Pass down table relation into more index relation functions
This is done in preparation for logical decoding on standby, which needs to
include whether visibility affecting WAL records are about a (user) catalog
table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-04-01 20:18:29 -07:00
Andres Freund a88a18b125 Assert only valid flag bits are passed to visibilitymap_set()
If visibilitymap_set() is called with flags containing a higher bit than
VISIBILITYMAP_ALL_FROZEN, the state of neighboring pages is affected. While
there was an assertion that *some* valid bits were set, it did not check
that *only* valid bits were. Change that.

Discussion: https://postgr.es/m/20230331043300.gux3s5wzrapqi4oe@awork3.anarazel.de
2023-04-01 18:00:19 -07:00
Andres Freund 14f98e0af9 hio: Release extension lock before initializing page / pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.

Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out.  We also are doing the visibilitymap_pin() while
holding the buffer lock, which will be fixed in a separate commit.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
2023-04-01 17:50:18 -07:00
Andres Freund ca7b3c4c00 pg_stat_wal: Accumulate time as instr_time instead of microseconds
In instr_time.h it is stated that:

* When summing multiple measurements, it's recommended to leave the
* running sum in instr_time form (ie, use INSTR_TIME_ADD or
* INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end.

The reason for that is that converting to microseconds is not cheap, and can
loose precision.  Therefore this commit changes 'PendingWalStats' to use
'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time'
and 'wal_sync_time'.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com
2023-03-30 14:23:14 -07:00
Tom Lane 8e5eef50c5 Fix dereference of dangling pointer in GiST index buffering build.
gistBuildCallback tried to fetch the size of an index tuple that
might have already been freed by gistProcessEmptyingQueue.
While this seems to usually be harmless in production builds,
in principle it could result in a SIGSEGV, or more likely a bogus
value for indtuplesSize leading to poor page-split decisions later
in the build.

The memory management here is confusing and could stand to be
refactored, but for the moment it seems to be enough to fetch
the tuple size sooner.  AFAICT the indtuples[Size] totals aren't
used in between these places; even if they were, the updated
values shouldn't be any worse to use.  So just move the
incrementing of the totals up.

It's not very clear why our valgrind-using buildfarm animals
haven't noticed this problem, because the relevant code path
does seem to be exercised according to the code coverage report.
I think the reason that we didn't fix this bug after the first
report is that I'd wanted to try to understand that better.
However, now that it's been re-discovered let's just be pragmatic
and fix it already.

Original report by Alexander Lakhin (bug #16329),
later rediscovered by Egor Chindyaskin (bug #17874).

Patch by Alexander Lakhin (commentary by Pavel Borisov and me).
Back-patch to all supported branches.

Discussion: https://postgr.es/m/16329-7a6aa9b6fa1118a1@postgresql.org
Discussion: https://postgr.es/m/17874-63ca6c7ce42d2103@postgresql.org
2023-03-29 11:31:30 -04:00
Tom Lane 58c9600a9f Remove empty function BufmgrCommit().
This function has been a no-op for over a decade.  Even if bufmgr
regains a need to be called during commit, it seems unlikely that
the most appropriate call points would be precisely here, so it's not
doing us much good as a placeholder either.  Now, removing it probably
doesn't save any noticeable number of cycles --- but the main call is
inside the commit critical section, and the less work done there the
better.

Matthias van de Meent

Discussion: https://postgr.es/m/CAEze2Wi1=tLKbxZnXzcD+8fYKyKqBtivVakLQC_mYBsP4Y8qVA@mail.gmail.com
2023-03-29 09:13:57 -04:00
Peter Eisentraut 0d15afc875 Simplify useless 0L constants
In ancient times, these belonged to arguments or fields that were
actually of type long, but now they are not anymore, so this "L"
decoration is just confusing.  (Some other 0L and other "L" constants
remain, where they are actually associated with a long type.)
2023-03-29 08:25:12 +02:00
Peter Eisentraut 90189eefc1 Save a few bytes in pg_attribute
Change the columns attndims, attstattarget, and attinhcount from int32
to int16, and reorder a bit.  This saves some space (currently 4
bytes) in pg_attribute and tuple descriptors, which translates into
small performance benefits and/or room for new columns in pg_attribute
needed by future features.

attndims and attinhcount are never realistically used with values
larger than int16.  Just to be sure, add some overflow checks.
attstattarget is currently limited explicitly to 10000.

For consistency, pg_constraint.coninhcount is also changed like
attinhcount.

Discussion: https://www.postgresql.org/message-id/flat/d07ffc2b-e0e8-77f7-38fb-be921dff71af%40enterprisedb.com
2023-03-28 10:05:56 +02:00
Daniel Gustafsson d435f15fff Add SysCacheGetAttrNotNull for guaranteed not-null attrs
When extracting an attr from a cached tuple in the syscache with
SysCacheGetAttr the isnull parameter must be checked in case the
attr cannot be NULL.  For cases when this is known beforehand, a
wrapper is introduced which perform the errorhandling internally
on behalf of the caller, invoking an elog in case of a NULL attr.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/AD76405E-DB45-46B6-941F-17B1EB3A9076@yesql.se
2023-03-25 22:49:33 +01:00
Peter Geoghegan ae4fdde135 Count updates that move row to a new page.
Add pgstat counter to track row updates that result in the successor
version going to a new heap page, leaving behind an original version
whose t_ctid points to the new version.  The current count is shown by
the n_tup_newpage_upd column of each of the pg_stat_*_tables views.

The new n_tup_newpage_upd column complements the existing n_tup_hot_upd
and n_tup_upd columns.  Tables that have high n_tup_newpage_upd values
(relative to n_tup_upd) are good candidates for tuning heap fillfactor.

Corey Huinker, with small tweaks by me.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CADkLM=ded21M9iZ36hHm-vj2rE2d=zcKpUQMds__Xm2pxLfHKA@mail.gmail.com
2023-03-23 11:16:17 -07:00
Alexander Korotkov 11470f544e Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during first
tuple_update()/tuple_delete() call.  Heap table access method saves some
efforts by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2023-03-23 00:26:59 +03:00
Tomas Vondra 19d8e2308b Ignore BRIN indexes when checking for HOT updates
When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed by block summarizing indexes without
references to individual tuples that need to be cleaned up.

A new type TU_UpdateIndexes provides a signal to the executor to
determine which indexes to update - no indexes, all indexes, or only the
summarizing indexes.

This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.

This was originally committed as 5753d4ee32, but then got reverted by
e3fcca0d0d because of correctness issues.

Original patch by Josef Simanek, various fixes and improvements by Tomas
Vondra and me.

Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra
Reviewed-by: Tomas Vondra, Alvaro Herrera
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
2023-03-20 11:02:42 +01:00
Tomas Vondra e858312683 Fix netmask handling in inet_minmax_multi_ops
When calculating distance in brin_minmax_multi_distance_inet(), the
netmask was applied incorrectly. This results in (seemingly) incorrect
ordering of values, triggering an assert.

For builds without asserts this is mostly harmless - we may merge other
ranges, possibly resulting in slightly less efficient index. But it's
still correct and the greedy algorithm doesn't guarantee optimality
anyway.

Backpatch to 14, where minmax-multi indexes were introduced.

Reported by Dmitry Dolgov, investigation and fix by me.

Reported-by: Dmitry Dolgov
Backpatch-through: 14
Discussion: https://postgr.es/m/17774-c6f3e36dd4471e67@postgresql.org
2023-03-20 10:24:14 +01:00
Michael Paquier ce340e530d Revise pg_pwrite_zeros()
The following changes are made to pg_write_zeros(), the API able to
write series of zeros using vectored I/O:
- Add of an "offset" parameter, to write the size from this position
(the 'p' of "pwrite" seems to mean position, though POSIX does not
outline ythat directly), hence the name of the routine is incorrect if
it is not able to handle offsets.
- Avoid memset() of "zbuffer" on every call.
- Avoid initialization of the whole IOV array if not needed.
- Group the trailing write() call with the main write() call,
simplifying the function logic.

Author: Andres Freund
Reviewed-by: Michael Paquier, Bharath Rupireddy
Discussion: https://postgr.es/m/20230215005525.mrrlmqrxzjzhaipl@awork3.anarazel.de
2023-03-06 13:21:33 +09:00
Thomas Munro 1da569ca1f Don't leak descriptors into subprograms.
Open long-lived data and WAL file descriptors with O_CLOEXEC.  This flag
was introduced by SUSv4 (POSIX.1-2008), and by now all of our target
Unix systems have it.  Our open() implementation for Windows already had
that behavior, so provide a dummy O_CLOEXEC flag on that platform.

For now, callers of open() and the "thin" wrappers in fd.c that deal in
raw descriptors need to pass in O_CLOEXEC explicitly if desired.  This
commit does that for WAL files, and automatically for everything
accessed via VFDs including SMgrRelation and BufFile.  (With more
discussion we might decide to turn it on automatically for the thin
open()-wrappers too to avoid risk of missing places that need it, but
these are typically used for short-lived descriptors where we don't
expect to fork/exec, and it's remotely possible that extensions could be
using these APIs and passing descriptors to subprograms deliberately, so
that hasn't been done here.)

Do the same for sockets and the postmaster pipe with FD_CLOEXEC.  (Later
commits might use modern interfaces to remove these extra fcntl() calls
and more where possible, but we'll need them as a fallback for a couple
of systems, so do it that way in this initial commit.)

With this change, subprograms executed for archiving, copying etc will
no longer have access to the server's descriptors, other than the ones
that we decide to pass down.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
2023-03-03 10:43:33 +13:00
Tom Lane 462bb7f128 Remove bms_first_member().
This function has been semi-deprecated ever since we invented
bms_next_member().  Its habit of scribbling on the input bitmapset
isn't great, plus for sufficiently large bitmapsets it would take
O(N^2) time to complete a loop.  Now we have the additional problem
that reducing the input to empty while leaving it still accessible
would violate a planned invariant.  So let's just get rid of it,
after updating the few extant callers to use bms_next_member().

Patch by me; thanks to Nathan Bossart and Richard Guo for review.

Discussion: https://postgr.es/m/1159933.1677621588@sss.pgh.pa.us
2023-03-02 11:34:29 -05:00
Peter Eisentraut 4ac30ba4f2 Make some xlogreader messages more accurate
When you have some invalid WAL, you often get a message like "wanted
24, got 0".  This is a bit incorrect, since it really wanted *at
least* 24, not exactly 24.  This updates the messages to that effect,
and also adds that detail to one message where it was available but
not printed.

Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Jeevan Ladhe <jeevanladhe.os@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/726d782b-5e45-0c3e-d775-6686afe9aa83%40enterprisedb.com
2023-03-02 07:46:12 +01:00
Peter Eisentraut b9f0e54bc9 Update types in smgr API
Change data buffer to void *, from char *, and add const where
appropriate.  This makes it match the File API (see also
2d4f1ba6cf) and stdio.

Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
2023-02-27 07:47:46 +01:00
Jeff Davis e0b3074e89 Remove unnecessary #ifdef USE_ICU and branch.
Now that the provider-independent API pg_strnxfrm() is available, we
no longer need the special cases for ICU in hashfunc.c and varchar.c.

Reviewed-by: Peter Eisentraut, Peter Geoghegan
Discussion: https://postgr.es/m/a581136455c940d7bd0ff482d3a2bd51af25a94f.camel%40j-davis.com
2023-02-23 11:20:00 -08:00
Jeff Davis 6974a8f768 Refactor to introduce pg_locale_deterministic().
Avoids the need of callers to test for NULL, and also avoids the need
to access the pg_locale_t structure directly.

Reviewed-by: Peter Eisentraut, Peter Geoghegan
Discussion: https://postgr.es/m/a581136455c940d7bd0ff482d3a2bd51af25a94f.camel%40j-davis.com
2023-02-23 11:17:41 -08:00
Jeff Davis d87d548cd0 Refactor to add pg_strcoll(), pg_strxfrm(), and variants.
Offers a generally better separation of responsibilities for collation
code. Also, a step towards multi-lib ICU, which should be based on a
clean separation of the routines required for collation providers.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().

Reviewed-by: Peter Eisentraut, Peter Geoghegan
Discussion: https://postgr.es/m/a581136455c940d7bd0ff482d3a2bd51af25a94f.camel%40j-davis.com
2023-02-23 10:55:20 -08:00
Tomas Vondra e72910f808 Fix handling of multi-column BRIN indexes
When evaluating clauses on multiple scan keys of a multi-column BRIN
index, we can stop processing as soon as we find a scan key eliminating
the range, and the range should not be added to tbe bitmap.

That's how it worked before 14, but since a681e3c107 the code treated
the range as matching if it matched at least the last scan key.

Backpatch to 14, where this code was introduced.

Backpatch-through: 14
Discussion: https://postgr.es/m/ebc18613-125e-60df-7520-fcbe0f9274fc%40enterprisedb.com
2023-02-19 01:46:58 +01:00
David Rowley 5352ca22e0 Rename force_parallel_mode to debug_parallel_query
force_parallel_mode is meant to be used to allow us to exercise the
parallel query infrastructure to ensure that it's working as we expect.
It seems some users think this GUC is for forcing the query planner into
picking a parallel plan regardless of the costs.  A quick look at the
documentation would have made them realize that they were wrong, but the
GUC is likely too conveniently named which, evidently, seems to often
result in users expecting that it forces the planner into usefully
parallelizing queries.

Here we rename the GUC to something which casual users are less likely to
mistakenly think is what they need to make their query run more quickly.

For now, the old name can still be used.  We'll revisit if the old name
mapping can be removed once the buildfarm configs are all updated.

Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvrsOi92_uA7PEaHZMH-S4Xv+MGhQWA+GrP8b1kjpS1HjQ@mail.gmail.com
2023-02-15 21:21:59 +13:00
Peter Eisentraut aa69541046 Remove useless casts to (void *) in arguments of some system functions
The affected functions are: bsearch, memcmp, memcpy, memset, memmove,
qsort, repalloc

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
2023-02-07 06:57:59 +01:00
David Rowley cfcf56f923 More refactoring of heapgettup() and heapgettup_pagemode()
Here we further simplify the code in heapgettup() and
heapgettup_pagemode() to make better use of the helper functions added in
the previous recent refactors in this area.

In passing, remove an unneeded cast added in 8ca6d49f6.

Author: Melanie Plageman
Reviewed-by: Andres Freund, David Rowley
Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
2023-02-07 17:24:07 +13:00
Robert Haas 8a2f783cc4 Disable STARTUP_PROGRESS_TIMEOUT in standby mode.
In standby mode, we don't actually report progress of recovery,
but up until now, startup_progress_timeout_handler() nevertheless
got called every log_startup_progress_interval seconds. That's
an unnecessary expense, so avoid it.

Report by Thomas Munro. Patch by Bharath Rupireddy, reviewed by
Simon Riggs, Thomas Munro, and me. Back-patch to v15, where
the problem was introduced.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
2023-02-06 10:51:08 -05:00
Peter Eisentraut 54a177a948 Remove useless casts to (void *) in hash_search() calls
Some of these appear to be leftovers from when hash_search() took a
char * argument (changed in 5999e78fc4).

Since after this there is some more horizontal space available, do
some light reformatting where suitable.

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
2023-02-06 09:41:01 +01:00
Michael Paquier 2f6e15ac93 Revert refactoring of restore command code to shell_restore.c
This reverts commits 24c35ec and 57169ad.  PreRestoreCommand() and
PostRestoreCommand() need to be put closer to the system() call calling
a restore_command, as they enable in_restore_command for the startup
process which would in turn trigger an immediate proc_exit() in the
SIGTERM handler.  Perhaps we could get rid of this behavior entirely,
but 24c35ec has made the window where the flag is enabled much larger
than it was, and any Postgres-like actions (palloc, etc.) taken by code
paths while the flag is enabled could lead to more severe issues in the
shutdown processing.

Note that curculio has showed that there are much more problems in this
area, unrelated to this change, actually, hence the issues related to
that had better be addressed first.  Keeping the code of HEAD in line
with the stable branches should make that a bit easier.

Per discussion with Andres Freund and Nathan Bossart.

Discussion: https://postgr.es/m/Y979NR3U5VnWrTwB@paquier.xyz
2023-02-06 08:28:42 +09:00
David Rowley 7ae0ab0ad9 Reduce code duplication between heapgettup and heapgettup_pagemode
The code to get the next block number was exactly the same between these
two functions, so let's just put it into a helper function and call that
from both locations.

Author: Melanie Plageman
Reviewed-by: Andres Freund, David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-03 16:20:43 +13:00
David Rowley 8ca6d49f63 Add helper functions to simplify heapgettup code
Here we add heapgettup_start_page() and heapgettup_continue_page() to
simplify the code in the heapgettup() function.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-03 12:38:42 +13:00
David Rowley f9bc34fcb6 Further refactor of heapgettup and heapgettup_pagemode
Backward and forward scans share much of the same page acquisition code.
Here we consolidate that code to reduce some duplication.

Additionally, add a new rs_coffset field to HeapScanDescData to track the
offset of the current tuple.  The new field fits nicely into the padding
between a bool and BlockNumber field and saves having to look at the last
returned tuple to figure out which offset we should be looking at for the
current tuple.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-03 11:48:39 +13:00
Thomas Munro cdf6518ef0 Retire PG_SETMASK() macro.
In the 90s we needed to deal with computers that still had the
pre-standard signal masking APIs.  That hasn't been relevant for a very
long time on Unix systems, and c94ae9d8 got rid of a remaining
dependency in our Windows porting code.  PG_SETMASK didn't expose
save/restore functionality, so we'd already started using sigprocmask()
directly in places, creating the visual distraction of having two ways
to spell it.  It's not part of the API that extensions are expected to
be using (but if they are, the change will be trivial).  It seems like a
good time to drop the old macro and just call the standard POSIX
function.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BKfQgrhHP2DLTohX1WwubaCBHmTzGnAEDPZ-Gug-Xskg%40mail.gmail.com
2023-02-03 11:29:46 +13:00
David Rowley fb1a59de0c Refactor heapam.c adding heapgettup_initial_block function
Here we adjust heapgettup() and heapgettup_pagemode() to move the code
that fetches the first block number to scan out into a helper function.
This removes some code duplication.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-02 14:17:15 +13:00
David Rowley e9aaf06328 Remove dead NoMovementScanDirection code
Here remove some dead code from heapgettup() and heapgettup_pagemode()
which was trying to support NoMovementScanDirection scans.  This code can
never be reached as standard_ExecutorRun() never calls ExecutePlan with
NoMovementScanDirection.

Additionally, plans which were scanning an unordered index would use
NoMovementScanDirection rather than ForwardScanDirection.  There was no
real need for this, so here we adjust this so we use ForwardScanDirection
for unordered index scans.  A comment in pathnodes.h claimed that
NoMovementScanDirection was used for PathKey reasons, but if that was
true, it no longer is, per code in build_index_paths().

This does change the non-text format of the EXPLAIN output so that
unordered index scans now have a "Forward" scan direction rather than
"NoMovement".  The text format of EXPLAIN has not changed.

Author: Melanie Plageman
Reviewed-by: Tom Lane, David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-01 10:52:41 +13:00
Peter Geoghegan 6c6b497266 Revert "Add eager and lazy freezing strategies to VACUUM."
This reverts commit 4d41799261.  Broad
concerns about regressions caused by eager freezing strategy have been
raised.  Whether or not these concerns can be worked through in any time
frame is far from certain.

Discussion: https://postgr.es/m/20230126004347.gepcmyenk2csxrri@awork3.anarazel.de
2023-01-25 22:22:27 -08:00
Peter Geoghegan 4d41799261 Add eager and lazy freezing strategies to VACUUM.
Eager freezing strategy avoids large build-ups of all-visible pages.  It
makes VACUUM trigger page-level freezing whenever doing so will enable
the page to become all-frozen in the visibility map.  This is useful for
tables that experience continual growth, particularly strict append-only
tables such as pgbench's history table.  Eager freezing significantly
improves performance stability by spreading out the cost of freezing
over time, rather than doing most freezing during aggressive VACUUMs.
It complements the insert autovacuum mechanism added by commit b07642db.

VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables.
Tables that exceed the size threshold use the eager freezing strategy.
Unlogged tables and temp tables always use eager freezing strategy,
since the added cost is negligible there.  Non-permanent relations won't
incur any extra overhead in WAL written (for the obvious reason), nor in
pages dirtied (since any extra freezing will only take place on pages
whose PD_ALL_VISIBLE bit needed to be set either way).

VACUUM uses lazy freezing strategy for logged tables that fall under the
GUC size threshold.  Page-level freezing triggers based on the criteria
established in commit 1de58df4, which added basic page-level freezing.

Eager freezing is strictly more aggressive than lazy freezing.  Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use.  The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.
Note that even lazy freezing strategy will trigger freezing whenever a
page happens to have required that an FPI be written during pruning,
provided that the page will thereby become all-frozen in the visibility
map afterwards (due to the FPI optimization from commit 1de58df4).

The vacuum_freeze_strategy_threshold default setting is 4GB.  This is a
relatively low setting that prioritizes performance stability.  It will
be reviewed at the end of the Postgres 16 beta period.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
2023-01-25 14:15:38 -08:00
Peter Geoghegan 8f8f115932 Update more obsolete multixact.c comments.
Update some remaining comments in multixact.c that still described SLRU
truncation as happening in the checkpointer, rather than during VACUUM.

Follow-up to commit 5212d447.

Shi yu, with tweaks by me.

Author: Shi yu <shiy.fnst@fujitsu.com>
Discussion: https://postgr.es/m/OSZPR01MB631066BF246F8F74E83222FCFDC69@OSZPR01MB6310.jpnprd01.prod.outlook.com
2023-01-24 15:15:33 -08:00
Peter Eisentraut 8dd43894b1 Fix XLogPageRead() comment
7fcbf6a and 2ff6555 changed the function signature of XLogPageRead()
but did not update the comment.

XLogReaderRoutine contains up to date information about the API, so no
need to repeat all that at XLogPageRead(), but fix the mentions of the
no longer existing function arguments.
2023-01-23 21:46:30 +01:00
David Rowley 009dbdea02 Run pgindent on heapam.c
An upcoming patch by Melanie Plageman does some refactoring work in this
area.  Run pgindent on that file now before making any changes so that
it's easier to maintain/evolve each of the individual patches doing the
refactor work.  Additionally, add a few new required typedefs to the list
to make it easier to do future pgindent runs on this file during the
refactor work.

Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
2023-01-23 23:08:38 +13:00
Andres Freund 25b2aba0c3 Zero initialize uses of instr_time about to trigger compiler warnings
These are all not necessary from a correctness POV. However, in the near
future instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfciua@awork3.anarazel.de
2023-01-20 21:16:47 -08:00