Commit Graph

11020 Commits

Author SHA1 Message Date
Peter Geoghegan c03c2eae0a Refine the guidelines for rmgrdesc authors.
Clarify the goals of the recently added guidelines for rmgrdesc authors:
to avoid gratuitous inconsistencies across resource managers, and to
make it reasonably easy to write a reusable custom parser.

Beyond that, the guidelines leave rmgrdesc authors with a significant
amount of leeway.  This even includes the leeway to invent custom
conventions (in cases where it's warranted).

Follow-up to commit 7d8219a4.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-11 15:26:24 -07:00
Peter Geoghegan e944063294 Fix xl_heap_lock WAL record field's data type.
Make xl_heap_lock's infobits_set field of type uint8, not int8.  Using
int8 isn't appropriate given that the field just holds status bits.
This fixes an oversight in commit 0ac5ad5134.

In passing rename the nearby TransactionId field to "xmax" to make
things consistent with related records, such as xl_heap_lock_updated.

Deliberately avoid a bump in XLOG_PAGE_MAGIC.  No backpatch, either.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkCd3kOS8b7Rfxw7Mh1_6jvX=Nzo-CWR1VBTiOtVZkWHA@mail.gmail.com
2023-04-11 14:07:54 -07:00
Peter Geoghegan 5d6728e588 Fix nbtree posting list update desc output.
We cannot use the generic array_desc approach with per-tuple nbtree
posting list update metadata because array_desc can only deal with fixed
width elements (e.g., page offset numbers).  Using array_desc led to
incorrect rmgr descriptions for updates from nbtree DELETE/VACUUM WAL
records.

To fix, add specialized code to describe the update metadata as array
elements in desc output.  We now iterate over the update metadata using
an approach that matches related REDO routines.

Also stop showing the updates' offset number array separately in nbtree
DELETE/VACUUM desc output.  It's redundant information, since the same
page offset numbers appear in the description of each individual update
element.  Also make some small tweaks to the way that we format arrays
in all desc routines (not just nbtree desc routines) to make arrays a
little less verbose.

Oversight in commit 1c453cfd, which enhanced the nbtree rmgr desc
routines.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkbYuvwYKm-Y-72QEh6SPMQcAo9uONv+mR3bMGcu9E_Cg@mail.gmail.com
2023-04-10 11:15:41 -07:00
Stephen Frost 3d03b24c35 Revert "Add support for Kerberos credential delegation"
This reverts commit 3d4fa227bc.

Per discussion and buildfarm, this depends on APIs that seem not to be
available on at least one platform (NetBSD).  It should certainly be
possible to rework this to be optional on that platform if necessary, but
it's a bit late for that at this point.

Discussion: https://postgr.es/m/3286097.1680922218@sss.pgh.pa.us
2023-04-08 07:21:35 -04:00
Thomas Munro db4f21e4a3 Redesign interrupt/cancel API for regex engine.
Previously, a PostgreSQL-specific callback checked by the regex engine
had a way to trigger a special error code REG_CANCEL if it detected that
the next call to CHECK_FOR_INTERRUPTS() would certainly throw via
ereport().

A later proposed bugfix aims to move some complex logic out of signal
handlers, so that it won't run until the next CHECK_FOR_INTERRUPTS(),
which makes the above design impossible unless we split
CHECK_FOR_INTERRUPTS() into two phases, one to run logic and another to
ereport().  We may develop such a system in the future, but for the
regex code it is no longer necessary.

An earlier commit moved regex memory management over to our
MemoryContext system.  Given that the purpose of the two-phase interrupt
checking was to free memory before throwing, something we don't need to
worry about anymore, it seems simpler to inject CHECK_FOR_INTERRUPTS()
directly into cancelation points, and just let it throw.

Since the plan is to keep PostgreSQL-specific concerns separate from the
main regex engine code (with a view to being able to stay in sync with
other projects), do this with a new macro INTERRUPT(), customizable in
regcustom.h and defaulting to nothing.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
2023-04-08 22:10:39 +12:00
Thomas Munro 4f51429dd7 Update tsearch regex memory management.
Now that our regex engine uses palloc(), it's not necessary to set up a
special memory context callback to free compiled regexes.  The regex has
no resources other than the memory that is already going to be freed in
bulk.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
2023-04-08 22:09:17 +12:00
Thomas Munro bea3d7e383 Use MemoryContext API for regex memory management.
Previously, regex_t objects' memory was managed with malloc() and free()
directly.  Switch to palloc()-based memory management instead.
Advantages:

 * memory used by cached regexes is now visible with MemoryContext
   observability tools

 * cleanup can be done automatically in certain failure modes
   (something that later commits will take advantage of)

 * cleanup can be done in bulk

On the downside, there may be more fragmentation (wasted memory) due to
per-regex MemoryContext objects.  This is a problem shared with other
cached objects in PostgreSQL and can probably be improved with later
tuning.

Thanks to Noah Misch for suggesting this general approach, which
unblocks later work on interrupts.

Suggested-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
2023-04-08 22:08:41 +12:00
Andres Freund 0fdab27ad6 Allow logical decoding on standbys
Unsurprisingly, this requires wal_level = logical to be set on the primary and
standby. The infrastructure added in 26669757b6 ensures that slots are
invalidated if the primary's wal_level is lowered.

Creating a slot on a standby waits for an xl_running_xact record to be
processed. If the primary is idle (and thus not emitting xl_running_xact
records), that can take a while.  To make that faster, this commit also
introduces the pg_log_standby_snapshot() function. By executing it on the
primary, completion of slot creation on the standby can be accelerated.
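
For illustration (slot and plugin names invented), the workflow looks
roughly like this:

    -- on the primary: log a running-transactions snapshot right away
    SELECT pg_log_standby_snapshot();

    -- on the standby (wal_level = logical on both nodes):
    SELECT * FROM pg_create_logical_replication_slot('standby_slot', 'test_decoding');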

Note that logical decoding on a standby does not itself enforce that required
catalog rows are not removed. The user has to use physical replication slots +
hot_standby_feedback or other measures to prevent that. If catalog rows
required for a slot are removed, the slot is invalidated.

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion, for the addition of the pg_log_standby_snapshot() function.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
2023-04-08 02:20:05 -07:00
Andres Freund e101dfac3a For cascading replication, wake physical and logical walsenders separately
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would not have been applied yet, so logical walsenders were
awakened too early.

Per idea from Jeff Davis and Amit Kapila.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
2023-04-08 01:06:00 -07:00
Andres Freund 26669757b6 Handle logical slot conflicts on standby
During WAL replay on the standby, when a conflict with a logical slot is
identified, invalidate such slots. There are two sources of conflicts:
1) Using the information added in 6af1793954, logical slots are invalidated if
   required rows are removed
2) wal_level on the primary server is reduced to below logical

Uses the infrastructure introduced in the prior commit. FIXME: add commit
reference.

Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to
interrupt use of a slot, if called in the startup process. The new recovery
conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot.
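
For illustration, the new counter can be watched on the standby like any
other recovery conflict counter:

    SELECT datname, confl_active_logicalslot
    FROM pg_stat_database_conflicts;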

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion for the new column added to pg_stat_database_conflicts.
Bumps PGSTAT_FILE_FORMAT_ID for the same reason.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-08 00:05:44 -07:00
Andres Freund be87200efd Support invalidating replication slots due to horizon and wal_level
Needed for logical decoding on a standby. Slots need to be invalidated because
of the horizon if rows required for logical decoding are removed. If the
primary's wal_level is lowered from 'logical', logical slots on the standby
need to be invalidated.

The new invalidation methods will be used in a subsequent commit.

Logical slots that have been invalidated can be identified via the new
pg_replication_slots.conflicting column.
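
For illustration:

    -- on the standby: which logical slots have been invalidated?
    SELECT slot_name, slot_type, conflicting
    FROM pg_replication_slots;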

See 6af1793954 for an overall design of logical decoding on a standby.

Bumps catversion for the addition of the new pg_replication_slots column.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-07 22:40:27 -07:00
Andres Freund 15f8203a59 Replace replication slot's invalidated_at LSN with an enum
This is mainly useful because the upcoming logical-decoding-on-standby feature
adds further reasons for invalidating slots, and we don't want to end up with
multiple invalidated_* fields, or check different attributes.

Eventually we should consider not resetting restart_lsn when invalidating a
slot due to max_slot_wal_keep_size. But that's a user visible change, so left
for later.

Increases SLOT_VERSION, due to the changed field (with a different alignment,
no less).

Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-07 21:47:25 -07:00
Thomas Munro d4e71df6d7 Add io_direct setting (developer-only).
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
where available for data and WAL files, to avoid or minimize kernel
caching.  This hurts performance currently and is not intended for end
users yet.  Later proposed work would introduce our own I/O clustering,
read-ahead, etc to replace the facilities the kernel disables with this
option.

The only user-visible change, if the developer-only GUC is not used, is
that this commit also removes the obscure logic that would activate
O_DIRECT for the WAL when wal_sync_method=open_[data]sync and
wal_level=minimal (which also requires max_wal_senders=0).  Those are
non-default and unlikely settings, and this behavior wasn't (correctly)
documented.  The same effect can be achieved with io_direct=wal.
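
For illustration (the setting is developer-only, takes a comma-separated
list of "data", "wal" and "wal_init", and needs a server restart):

    -- postgresql.conf:
    --   io_direct = 'data,wal'
    SHOW io_direct;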

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
2023-04-08 16:35:07 +12:00
Thomas Munro faeedbcefd Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a
later commit, we need the addresses of user space buffers to be well
aligned.  The exact requirements vary by OS and file system (typically
sectors and/or memory pages).  The address alignment size is set to
4096, which is enough for currently known systems: it matches modern
sectors and common memory page size.  There is no standard governing
O_DIRECT's requirements so we might eventually have to reconsider this
with more information from the field or future systems.

Aligning I/O buffers on memory pages is also known to improve regular
buffered I/O performance.

Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are now allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  (3) The buffer
pool is also aligned in shared memory.

WAL buffers were already aligned on XLOG_BLCKSZ.  It's possible for
XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGN_SIZE and thus
for O_DIRECT WAL writes to fail to be well aligned, but that's a
pre-existing condition and will be addressed by a later commit.

BufFiles are not yet addressed (there's no current plan to use O_DIRECT
for those, but they could potentially get some incidental speedup even
in plain buffered I/O operations through better alignment).

If we can't align stack objects suitably using the compiler extensions
we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to
0.  This avoids the need to consider systems that have O_DIRECT but
can't align stack objects the way we want; such systems could in theory
be supported with more work but we don't currently know of any such
machines, so it's easier to pretend there is no O_DIRECT support
instead.  That's an existing and tested class of system.

Add assertions that all buffers passed into smgrread(), smgrwrite() and
smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack
alignment tricks may be unavailable) or the block size has been set too
small to allow arrays of buffers to be all aligned.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
2023-04-08 16:34:50 +12:00
Stephen Frost 3d4fa227bc Add support for Kerberos credential delegation
Support GSSAPI/Kerberos credentials being delegated to the server by a
client.  With this, a user authenticating to PostgreSQL using Kerberos
(GSSAPI) credentials can choose to delegate their credentials to the
PostgreSQL server (which can choose to accept them, or not), allowing
the server to then use those delegated credentials to connect to
another service, such as with postgres_fdw or dblink or theoretically
any other service which is able to be authenticated using Kerberos.

Both postgres_fdw and dblink are changed to allow non-superuser
password-less connections but only when GSSAPI credentials have been
delegated to the server by the client and GSSAPI is used to
authenticate to the remote system.

Authors: Stephen Frost, Peifeng Qiu
Reviewed-By: David Christensen
Discussion: https://postgr.es/m/CO1PR05MB8023CC2CB575E0FAAD7DF4F8A8E29@CO1PR05MB8023.namprd05.prod.outlook.com
2023-04-07 21:58:04 -04:00
Andres Freund ac8d53dae5 Track IO times in pg_stat_io
a9c70b46db and 8aaa04b32d added counting of IO operations to a new view,
pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to
pg_stat_io as well.

This combines the tracking for pgBufferUsage with the tracking for pg_stat_io
into a new function pgstat_count_io_op_time(). This should make it a bit
easier to avoid the somewhat costly instr_time conversion done for
pgBufferUsage.
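
For illustration (column names as added by this and the earlier pg_stat_io
commits; the timings are only collected when track_io_timing is enabled):

    SELECT backend_type, object, context,
           reads, read_time, writes, write_time, extends, extend_time
    FROM pg_stat_io
    ORDER BY read_time DESC;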

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
2023-04-07 17:04:56 -07:00
Peter Geoghegan 1c453cfd89 Show more detail in nbtree rmgr descriptions.
Show a detailed description of the page offset number arrays that appear
in certain nbtree WAL records.

Also brings nbtree desc routines in line with the guidelines established
by recent commit 7d8219a4.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:46:23 -07:00
Peter Geoghegan 7d8219a444 Show more detail in heapam rmgr descriptions.
Add helper functions that output arrays in a standard format, and use
the functions inside heapdesc routines.  This allows tools like
pg_walinspect to show a detailed description of the page offset number
arrays for records like PRUNE and VACUUM (unless there was an FPI).
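
For illustration (the LSN range is made up; substitute a range that exists
in your WAL):

    CREATE EXTENSION IF NOT EXISTS pg_walinspect;
    SELECT resource_manager, description
    FROM pg_get_wal_records_info('0/1000000', '0/1FFFFFF')
    WHERE resource_manager IN ('Heap', 'Heap2');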

Also document the conventions that desc routines should follow.  Only
the heapdesc routines follow the conventions for now, so they're just
guidelines for the time being.

Based on a suggestion from Andres Freund.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/flat/20230109215842.fktuhesvayno6o4g%40awork3.anarazel.de
2023-04-07 16:08:52 -07:00
Alvaro Herrera e056c557ae Catalog NOT NULL constraints
We now create pg_constraint rows for NOT NULL constraints with
contype='n'.

We propagate these constraints during operations such as adding
inheritance relationships, creating and attaching partitions, creating
tables LIKE other tables.  We mostly follow the well-known rules of
conislocal and coninhcount that we have for CHECK constraints, with some
adaptations; for example, as opposed to CHECK constraints, we don't
match NOT NULL ones by name when descending a hierarchy to alter it;
instead we match by column number.  This means we don't require the
constraint names to be identical across a hierarchy.
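
For illustration (table name invented), the new rows can be inspected
directly:

    CREATE TABLE t (a int NOT NULL);
    SELECT conname, contype, conrelid::regclass
    FROM pg_constraint
    WHERE conrelid = 't'::regclass AND contype = 'n';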

For now, the system catalogs themselves don't get such constraint rows.
Maybe this is worth reconsidering.  We don't support NOT VALID or
DEFERRABLE clauses either; these can be added as separate features later
(this patch is already large and complicated enough).

This has been very long in the making.  The first patch was written by
Bernd Helmle in 2010 to add a new pg_constraint.contype value ('n'),
which I (Álvaro) then hijacked in 2011 and 2012, until that one was
killed by the realization that we ought to use contype='c' instead:
manufactured CHECK constraints.  However, later SQL standard
development, as well as nonobvious emergent properties of that design
(mostly, failure to distinguish them from "normal" CHECK constraints as
well as the performance implication of having to test the CHECK
expression) led us to reconsider this choice, so now the current
implementation uses contype='n' again.

In 2016 Vitaly Burovoy also worked on this feature[1] but found no
consensus for his proposed approach, which was claimed to be closer to
the letter of the standard, requiring additional pg_attribute columns to
track the OID of the NOT NULL constraint for that column.
[1] https://postgr.es/m/CAKOSWNkN6HSyatuys8xZxzRCR-KL1OkHS5-b9qd9bf1Rad3PLA@mail.gmail.com

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Author: Bernd Helmle <mailings@oopsware.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>

Discussion: https://postgr.es/m/CACA0E642A0267EDA387AF2B%40%5B172.26.14.62%5D
Discussion: https://postgr.es/m/AANLkTinLXMOEMz+0J29tf1POokKi4XDkWJ6-DDR9BKgU@mail.gmail.com
Discussion: https://postgr.es/m/20110707213401.GA27098@alvh.no-ip.org
Discussion: https://postgr.es/m/1343682669-sup-2532@alvh.no-ip.org
Discussion: https://postgr.es/m/CAKOSWNkN6HSyatuys8xZxzRCR-KL1OkHS5-b9qd9bf1Rad3PLA@mail.gmail.com
Discussion: https://postgr.es/m/20220817181249.q7qvj3okywctra3c@alvherre.pgsql
2023-04-07 19:59:57 +02:00
Tom Lane 888f2ea0a8 Add array_sample() and array_shuffle() functions.
These are useful in Monte Carlo applications.
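
For illustration (the results are random, so the outputs shown are just one
possibility):

    SELECT array_sample(ARRAY[1,2,3,4,5,6], 3);    -- e.g. {5,1,4}
    SELECT array_shuffle(ARRAY['a','b','c','d']);  -- e.g. {c,a,d,b}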

Martin Kalcher, reviewed/adjusted by Daniel Gustafsson and myself

Discussion: https://postgr.es/m/9d160a44-7675-51e8-60cf-6d64b76db831@aboutsource.net
2023-04-07 11:47:07 -04:00
Michael Paquier 8fcb32db98 Add more protections in WAL record APIs against overflows
This commit adds a limit to the size of an XLogRecord at 1020MB, based
on a suggestion by Heikki Linnakangas.  This accounts for the overhead
needed by the XLogReader when allocating the memory it needs to read a
record in DecodeXLogRecordRequiredSpace(), based on the record size.  An
assertion based on that is added to detect that any additions in the
XLogReader facilities would not cause any overflows.  If that's ever the
case, the upper bound allowed would need to be adjusted.

Before this, it was possible for an external module to create WAL
records large enough to be assembled but not replayable, causing
failures when replaying such WAL records on standbys.  One case
mentioned where this is possible is the in-core function
pg_logical_emit_message() (a wrapper for LogLogicalMessage), which allows
emitting WAL records with an arbitrary amount of data, potentially higher
than the replay limit of approximately 1GB (the limit of a palloc, minus the
overhead needed by an XLogReader).
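
For illustration, a call along these lines builds a single large record;
with this commit, a record that would exceed the limit is rejected while it
is being assembled instead of breaking replay later:

    SELECT pg_logical_emit_message(false, 'demo', repeat('x', 100 * 1024 * 1024));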

This commit is a follow-up of ffd1b6b, which added similar protections
for the block-level data.  Here, the checks are extended to the whole
record length, mainrdata_len being extended from uint32 to uint64 with
the routines registering buffer and record data still limited to uint32
to minimize the checks when assembling a record.  All the error messages
related to overflow checks are improved to provide more context about
the error happening.

Author: Matthias van de Meent
Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier
Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
2023-04-07 10:10:17 +09:00
David Rowley ae78cae3be Add --buffer-usage-limit option to vacuumdb
1cbbee033 added BUFFER_USAGE_LIMIT to the VACUUM and ANALYZE commands, so
here we permit that option to be specified in vacuumdb.

In passing, adjust the documentation for vacuum_buffer_usage_limit and the
BUFFER_USAGE_LIMIT VACUUM option to mention "kB" rather than "KB".  Do the
same for the ERROR message in ExecVacuum() and
check_vacuum_buffer_usage_limit().  Without that we might tell a user that
the valid minimum value is 128 KB only to reject that because we accept
only "kB" and not "KB".

Also, add a small reminder comment in vacuum.h to try to trigger the
memory of anyone adding new fields to VacuumParams that they might want to
consider if vacuumdb needs to grow a new option too.

Author: Melanie Plageman
Reviewed-by: Justin Pryzby
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/ZAzTg3iEnubscvbf@telsasoft.com
2023-04-07 12:47:10 +12:00
Andres Freund 00d1e02be2 hio: Use ExtendBufferedRelBy() to extend tables more efficiently
While we already had some form of bulk extension for relations, it was fairly
limited: it only amortized the cost of acquiring the extension lock, while the
relation itself was still extended one block at a time. Bulk extension was
also solely triggered by contention, not by the amount of data inserted.

To address this, use ExtendBufferedRelBy(), introduced in 31966b151e, to
extend the relation. We try to extend the relation by multiple blocks in two
situations:

1) The caller tells RelationGetBufferForTuple() that it will need multiple
   pages. For now that's only used by heap_multi_insert(), see commit FIXME.

2) If there is contention on the extension lock, use the number of waiters for
   the lock as a multiplier for the number of blocks to extend by. This is
   similar to what we already did. Previously we additionally multiplied the
   number of waiters by 20, but with the new relation extension
   infrastructure I could not see a benefit in doing so.

Using the freespacemap to provide empty pages can cause significant
contention, and adds measurable overhead, even if there is no contention. To
reduce that, remember the blocks the relation was extended by in the
BulkInsertState, in the extending backend. In case 1) from above, the blocks
the extending backend needs are not entered into the FSM, as we know that we
will need those blocks.

One complication with using the FSM to record empty pages is that we need to
insert blocks into the FSM, when we already hold a buffer content lock. To
avoid doing IO while holding a content lock, release the content lock before
recording free space. Currently that opens a small window in which another
backend could fill the block, if a concurrent VACUUM records the free
space. If that happens, we retry, similar to the already existing case when
otherBuffer is provided. In the future it might be worth closing the race by
preventing VACUUM from recording the space in newly extended pages.

This change provides very significant wins (3x at 16 clients, on my
workstation) for concurrent COPY into a single relation. Even single threaded
COPY is measurably faster, primarily due to not dirtying pages while
extending, if supported by the operating system (see commit 4d330a61bb). Even
single-row INSERTs benefit, although to a much smaller degree, as the relation
extension lock rarely is the primary bottleneck.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:53:17 -07:00
David Rowley 1cbbee0338 Add VACUUM/ANALYZE BUFFER_USAGE_LIMIT option
Add new options to the VACUUM and ANALYZE commands called
BUFFER_USAGE_LIMIT to allow users more control over how large to make the
buffer access strategy that is used to limit the usage of buffers in
shared buffers.  Larger rings can allow VACUUM to run more quickly but
have the drawback of VACUUM possibly evicting more buffers from shared
buffers that might be useful for other queries running on the database.

Here we also add a new GUC named vacuum_buffer_usage_limit which controls
how large to make the access strategy when it's not specified in the
VACUUM/ANALYZE command.  This defaults to 256KB, which is the same size as
the access strategy was prior to this change.  This setting also
controls how large to make the buffer access strategy for autovacuum.
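
For illustration (table name invented):

    -- use a larger ring for one manual run
    VACUUM (BUFFER_USAGE_LIMIT '8MB', VERBOSE) my_table;

    -- or change the default used when the option isn't given
    SET vacuum_buffer_usage_limit = '2MB';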

Per idea by Andres Freund.

Author: Melanie Plageman
Reviewed-by: David Rowley
Reviewed-by: Andres Freund
Reviewed-by: Justin Pryzby
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb@awork3.anarazel.de
2023-04-07 11:40:31 +12:00
Andres Freund 5279e9db8e heapam: Pass number of required pages to RelationGetBufferForTuple()
A future commit will use this information to determine how aggressively to
extend the relation. In heap_multi_insert() we know accurately how many
pages we need once we need to extend the relation, providing an accurate lower
bound for how much to extend.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-06 16:17:16 -07:00
Daniel Gustafsson 7d71d3dd08 Refresh cost-based delay params more frequently in autovacuum
Allow autovacuum to reload the config file more often so that cost-based
delay parameters can take effect while VACUUMing a relation. Previously,
autovacuum workers only reloaded the config file once per relation
vacuumed, so config changes could not take effect until beginning to
vacuum the next table.

Now, check if a reload is pending roughly once per block, when checking
if we need to delay.
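
For illustration, changes such as the following now reach autovacuum
workers that are already in the middle of vacuuming a large table:

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();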

In order for autovacuum workers to safely update their own cost delay
and cost limit parameters without impacting performance, we had to
rethink when and how these values were accessed.

Previously, an autovacuum worker's wi_cost_limit was set only at the
beginning of vacuuming a table, after reloading the config file.
Therefore, at the time that autovac_balance_cost() was called, workers
vacuuming tables with no cost-related storage parameters could still
have different values for their wi_cost_limit_base and wi_cost_delay.

Now that the cost parameters can be updated while vacuuming a table,
workers will (within some margin of error) have no reason to have
different values for cost limit and cost delay (in the absence of
cost-related storage parameters). This removes the rationale for keeping
cost limit and cost delay in shared memory. Balancing the cost limit
requires only the number of active autovacuum workers vacuuming a table
with no cost-based storage parameters.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 01:00:21 +02:00
Daniel Gustafsson a85c60a945 Separate vacuum cost variables from GUCs
Vacuum code run both by autovacuum workers and a backend doing
VACUUM/ANALYZE previously inspected VacuumCostLimit and VacuumCostDelay,
which are the global variables backing the GUCs vacuum_cost_limit and
vacuum_cost_delay.

Autovacuum workers needed to override these variables with their
own values, derived from autovacuum_vacuum_cost_limit and
autovacuum_vacuum_cost_delay and worker cost limit balancing logic.
This led to confusing code which, in some cases, both derived and
set a new value of VacuumCostLimit from VacuumCostLimit.

In preparation for refreshing these GUC values more often, introduce
new, independent global variables and add a function to update them
using the GUCs and existing logic.

Per suggestion by Kyotaro Horiguchi

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 00:54:53 +02:00
Daniel Gustafsson 71a825194f Make vacuum failsafe_active globally visible
While vacuuming a table in failsafe mode, VacuumCostActive should
not be re-enabled.  This currently isn't a problem because vacuum
cost parameters are only refreshed in between vacuuming tables and
failsafe status is reset for every table.

In preparation for allowing vacuum cost parameters to be updated
more frequently, elevate LVRelState->failsafe_active to a global,
VacuumFailsafeActive, which will be checked when determining whether
or not to re-enable vacuum cost-related delays.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ZngzqnEODc7LmS1NH04Kt6Y9huSjz5pp7%2BDXhrjDA0gw%40mail.gmail.com
2023-04-07 00:54:08 +02:00
Tomas Vondra 2820adf775 Support long distance matching for zstd compression
zstd compression supports a special mode for finding matches in the distant
past, which may result in a better compression ratio, at the expense of
using more memory (the window size is 128MB).

To enable this optional mode, use the "long" keyword when specifying the
compression method (--compress=zstd:long).

Author: Justin Pryzby
Reviewed-by: Tomas Vondra, Jacob Champion
Discussion: https://postgr.es/m/20230224191840.GD1653@telsasoft.com
Discussion: https://postgr.es/m/20220327205020.GM28503@telsasoft.com
2023-04-06 17:18:42 +02:00
David Rowley b9b125b9c1 Move various prechecks from vacuum() into ExecVacuum()
vacuum() is used for both the VACUUM command and for autovacuum. There
were many prechecks being done inside vacuum() that were just not relevant
to autovacuum.  Let's move the bulk of these into ExecVacuum() so that
they're only executed when running the VACUUM command.  This removes a
small amount of overhead when autovacuum vacuums a table.

While we are at it, allocate VACUUM's BufferAccessStrategy in ExecVacuum()
and pass it into vacuum() instead of expecting vacuum() to make it if it's
not already made by the calling function.  To make this work, we need to
create the vacuum memory context slightly earlier, so we now need to pass
that down to vacuum() so that it's available for use in other memory
allocations.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20230405211534.4skgskbilnxqrmxg@awork3.anarazel.de
2023-04-06 15:44:52 +12:00
Michael Paquier 1d477a907e Fix row tracking in pg_stat_statements with extended query protocol
pg_stat_statements relies on EState->es_processed to count the number of
rows processed by ExecutorRun().  This proves to be a problem under the
extended query protocol when the result of a query is fetched through
more than one call of ExecutorRun(), as es_processed is reset each time
ExecutorRun() is called.  This causes pg_stat_statements to report the
number of rows calculated in the last execute fetch, rather than the
global sum of all the rows processed.
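
For illustration, the affected counter is the one reported here:

    SELECT query, calls, rows
    FROM pg_stat_statements
    ORDER BY rows DESC
    LIMIT 5;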

As pquery.c explains, this is a problem when a portal does not use
holdStore.  For example, DMLs with RETURNING would report a correct
tuple count as these do one execution cycle when the query is first
executed to fill in the portal's store with one ExecutorRun(), feeding
on the portal's store for each follow-up execute fetch depending on the
fetch size requested by the client.

The fix proposed for this issue is simple with the addition of an extra
counter in EState that's preserved across multiple ExecutorRun() calls,
incremented with the value calculated in es_processed.  This approach is
not back-patchable, unfortunately.

Note that libpq does not currently give any way to control the fetch
size when using the extended v3 protocol, meaning that in-core testing
is not possible yet.  This issue can be easily verified with the JDBC
driver, though, with *autocommit disabled*.  Hence, having in-core tests
requires more features, left for future discussion:
- At least two new libpq routines splitting PQsendQueryGuts(), one for
the bind/describe and a second for a series of execute fetches with a
custom fetch size, likely in a fashion similar to what JDBC does.
- A psql meta-command for the execute phase.  This part is not strictly
mandatory, still it could be handy.

Reported-by: Andrew Dunstan (original discovery by Simon Siggs)
Author: Sami Imseih
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/EBE6C507-9EB6-4142-9E4D-38B1673363A7@amazon.com
Discussion: https://postgr.es/m/c90890e7-9c89-c34f-d3c5-d5c763a34bd8@dunslane.net
2023-04-06 09:29:03 +09:00
Andres Freund 31966b151e bufmgr: Introduce infrastructure for faster relation extension
The primary bottlenecks for relation extension are:

1) The extension lock is held while acquiring a victim buffer for the new
   page. Acquiring a victim buffer can require writing out the old page
   contents including possibly needing to flush WAL.

2) When extending via ReadBuffer() et al, we write a zero page during the
   extension, and then later write out the actual page contents. This can
   nearly double the write rate.

3) The existing bulk relation extension infrastructure in hio.c just amortized
   the cost of acquiring the relation extension lock, but none of the other
   costs.

Unfortunately 1) cannot currently be addressed in a central manner as the
callers of ReadBuffer() need to acquire the extension lock. To address that,
this commit moves the responsibility for acquiring the extension lock into
bufmgr.c functions. That allows the relation extension lock to be acquired
for just the required time. This will also allow us to improve relation
extension further, without changing callers.

The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (this largely works, except on CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2), by using the
recently introduced smgrzeroextend(), which extends the relation, without
dirtying the kernel page cache for all the extended pages.

To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.

There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.

To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.

Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that, uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-05 16:21:09 -07:00
Andres Freund 12f3867f55 bufmgr: Support multiple in-progress IOs by using resowner
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.

This commit increases the size of the ResourceOwnerData struct, which appears
to have a just-about-measurable overhead in very extreme workloads. Medium
term, we are planning to substantially shrink the size of
ResourceOwnerData. Short term, the increase is small enough not to worry about
it for now.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
2023-04-05 14:17:55 -07:00
Tom Lane 16dc2703c5 Support "Right Anti Join" plan shapes.
Merge and hash joins can support antijoin with the non-nullable input
on the right, using very simple combinations of their existing logic
for right join and anti join.  This gives the planner more freedom
about how to order the join.  It's particularly useful for hash join,
since we may now have the option to hash the smaller table instead
of the larger.
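
For illustration (table names invented; the plan shape is just one possible
outcome):

    EXPLAIN (COSTS OFF)
    SELECT *
    FROM small s
    WHERE NOT EXISTS (SELECT 1 FROM big b WHERE b.id = s.id);
    -- may now hash the smaller "small" side, showing e.g. a Hash Right Anti Join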

Richard Guo, reviewed by Ronan Dunklau and myself

Discussion: https://postgr.es/m/CAMbWs48xh9hMzXzSy3VaPzGAz+fkxXXTUbCLohX1_L8THFRm2Q@mail.gmail.com
2023-04-05 16:59:09 -04:00
Andres Freund 794f259447 bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-05 10:42:17 -07:00
Andres Freund 819b69a81d bufmgr: Add some more error checking [infrastructure] around pinning
This adds a few more assertions against a buffer being local in places we
don't expect, and extracts the check for a buffer being pinned exactly once
from LockBufferForCleanup() into its own function. Later commits will use this
function.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
2023-04-05 10:42:17 -07:00
Andres Freund 4d330a61bb Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.

A big advantage of using posix_fallocate() is that it typically won't cause
dirty buffers in the kernel pagecache. So far the most common pattern in our
code is that we smgrextend() a page full of zeroes and put the corresponding
page into shared buffers, from where we later write out the actual contents of
the page. If the kernel, e.g. due to memory pressure or elapsed time, already
wrote back the all-zeroes page, this can lead to doubling the amount of writes
reaching storage.

There are no users of smgrzeroextend() as of this commit. That will follow in
future commits.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-05 10:06:39 -07:00
Peter Eisentraut c9f57541d9 doc: Update SQL features/conformance information to SQL:2023
Optional subfeatures have been changed to top-level features, so there
is a bit of churn in the list because of that.

Some existing functions have been added to the standard, so they are
moved from the "other" to the "standard" lists in their sections.

Discussion: https://www.postgresql.org/message-id/flat/63f285d9-4ec8-0c9e-4bf5-e76334ddc0af@enterprisedb.com
2023-04-05 09:20:25 +02:00
Jeff Davis ea1db8ae70 Canonicalize ICU locale names to language tags.
Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

The process of canonicalizing to a language tag also understands more
input locale string formats than ucol_open(). For instance,
"fr_CA.UTF-8" is misinterpreted by ucol_open() and the region is
ignored; effectively treating it the same as the locale "fr" and
opening the wrong collator. Canonicalization properly interprets the
language and region, resulting in the language tag "fr-CA", which can
then be understood by ucol_open().
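
For illustration (requires an ICU build; the collation name is invented,
and colliculocale is the catalog column as of this commit):

    CREATE COLLATION frca (provider = icu, locale = 'fr_CA.UTF-8');
    SELECT collname, colliculocale
    FROM pg_collation
    WHERE collname = 'frca';
    -- the stored locale is now the canonical tag 'fr-CA', not the raw input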

This commit fixes a problem in prior versions due to ucol_open()
misinterpreting locale strings as described above. For instance,
creating an ICU collation with locale "fr_CA.UTF-8" would store that
string directly in the catalog, which would later be passed to (and
misinterpreted by) ucol_open(). After this commit, the locale string
will be canonicalized to language tag "fr-CA" in the catalog, which
will be properly understood by ucol_open(). Because this fix affects
the resulting collator, we cannot change the locale string stored in
the catalog for existing databases or collations; otherwise we'd risk
corrupting indexes. Therefore, only canonicalize locales for
newly-created (not upgraded) collations/databases. For similar
reasons, do not backport.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
Reviewed-by: Peter Eisentraut
2023-04-04 10:38:58 -07:00
Robert Haas 482675987b Add a run_as_owner option to subscriptions.
This option is normally false, but can be set to true to obtain
the legacy behavior where the subscription runs with the permissions
of the subscription owner rather than the permissions of the
table owner. The advantages of this mode are (1) it doesn't require
that the subscription owner have permission to SET ROLE to each
table owner and (2) since no role switching occurs, the
SECURITY_RESTRICTED_OPERATION restrictions do not apply.
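
For illustration (connection string, subscription and publication names
invented):

    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=app'
        PUBLICATION mypub
        WITH (run_as_owner = true);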

On the downside, it allows any table owner to easily usurp
the privileges of the subscription owner - basically, to take
over their account. Because that's generally quite undesirable,
we don't make this mode the default, but we do make it available,
just in case the new behavior causes too many problems for someone.

Discussion: http://postgr.es/m/CA+TgmoZ-WEeG6Z14AfH7KhmpX2eFh+tZ0z+vf0=eMDdbda269g@mail.gmail.com
2023-04-04 12:03:03 -04:00
Robert Haas 1e10d49b65 Perform logical replication actions as the table owner.
Up until now, logical replication actions have been performed as the
subscription owner, who will generally be a superuser.  Commit
cec57b1a0f documented hazards
associated with that situation, namely, that any user who owns a
table on the subscriber side could assume the privileges of the
subscription owner by attaching a trigger, expression index, or
some other kind of executable code to it. As a remedy, it suggested
not creating configurations where users who are not fully trusted
own tables on the subscriber.

Although that will work, it basically precludes using logical
replication in the way that people typically want to use it,
namely, to replicate a database from one node to another
without necessarily having any restrictions on which database
users can own tables. So, instead, change logical replication to
execute INSERT, UPDATE, DELETE, and TRUNCATE operations as the
table owner when they are replicated.

Since this involves switching the active user frequently within
a session that is authenticated as the subscription user, also
impose SECURITY_RESTRICTED_OPERATION restrictions on logical
replication code. As an exception, if the table owner can SET
ROLE to the subscription owner, these restrictions have no
security value, so don't impose them in that case.

Subscription owners are now required to have the ability to
SET ROLE to every role that owns a table that the subscription
is replicating. If they don't, replication will fail. Superusers,
who normally own subscriptions, satisfy this property by default.
Non-superusers who own subscriptions will need to be
granted the roles that own relevant tables.
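
For illustration (role names invented), the required grant for a
non-superuser subscription owner looks like:

    GRANT table_owner_role TO subscription_owner_role;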

Patch by me, reviewed (but not necessarily in its entirety) by
Jelte Fennema, Jeff Davis, and Noah Misch.

Discussion: http://postgr.es/m/CA+TgmoaSCkg9ww9oppPqqs+9RVqCexYCE6Aq=UsYPfnOoDeFkw@mail.gmail.com
2023-04-04 11:25:23 -04:00
Alvaro Herrera 71bfd1543f Code review for recent SQL/JSON commits
- At the last minute and for no particularly good reason, I changed the
  WITHOUT token to be marked especially for lookahead, from the one in
  WITHOUT TIME to the one in WITHOUT UNIQUE.  Study of upcoming patches
  (where a new WITHOUT ARRAY WRAPPER clause is added) showed me that the
  former was better, so put it back the way the original patch had it.

- update exprTypmod() for JsonConstructorExpr to return the typmod of
  the RETURNING clause, as a comment there suggested.  Perhaps it's
  possible for this to make a difference with datetime types, but I
  didn't try to build a test case.

- The nodeFuncs.c support code for new nodes was calling walker()
  directly instead of the WALK() macro as introduced by commit 1c27d16e6e.
  Modernize that.  Also add exprLocation() support for a couple of nodes
  that missed it.  Lastly, reorder the code more sensibly.

The WITHOUT_LA -> WITHOUT change means that stored rules containing
either WITHOUT TIME ZONE or WITHOUT UNIQUE KEYS would change
representation.  Therefore, bump catversion.

Discussion: https://postgr.es/m/20230329181708.e64g2tpy7jyufqkr@alvherre.pgsql
2023-04-04 14:04:30 +02:00
Peter Geoghegan e48c817395 Recycle deleted nbtree pages more aggressively.
Commit 61b313e4 made nbtree consistently pass down a heaprel to low
level routines like _bt_getbuf().  Although this was primarily intended
as preparation for logical decoding on standbys, it also made it easy to
correct an old deficiency in how nbtree VACUUM determines whether or not
it's now safe to recycle deleted pages.

Pass the heaprel to GlobalVisTestFor() in nbtree routines that deal with
recycle safety.  nbtree now makes less pessimistic assumptions about
recycle safety within non-catalog relations.  This enhancement
complements the recycling enhancement added by commit 9dd963ae25.

nbtree remains just as pessimistic as ever when it comes to recycle
safety within indexes on catalog relations.  There is no fundamental
reason why we need to treat catalog relations differently, though.  The
behavioral inconsistency is a consequence of the way that nbtree uses
nextXID values to implement what Lanin and Shasha call "the drain
technique".  Note in particular that it has nothing to do with whether
or not index tuples might still be required for an older MVCC snapshot.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkaiDxCje0yPuH=3Uh2p1V_2pFGY==xfbZoZu7Ax_NB8g@mail.gmail.com
2023-04-03 11:31:43 -07:00
Peter Geoghegan a349b86603 Move heaprel struct field next to index rel field.
Commit 61b313e4 added a heaprel struct member to IndexVacuumInfo, but
placed it last.  Move the heaprel struct member next to the index struct
member to improve the code's readability.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznG=TV6S9d3VA=y0vBHbXwnLs9_LLdiML=aNJuHeriwxg@mail.gmail.com
2023-04-03 11:01:11 -07:00
Alexander Korotkov 2b65bf046d Revert 11470f544e
Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
2023-04-03 16:54:31 +03:00
Andres Freund 6af1793954 Add info in WAL records in preparation for logical slot conflict handling
This commit only implements one prerequisite part for allowing logical
decoding. The commit message contains an explanation of the overall design,
which later commits will refer back to.

Overall design:

1. We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing error(s) on the standby. To prevent those errors, a new replication
conflict scenario needs to be addressed (as much as hot standby does).

2. Our chosen strategy for dealing with this type of replication slot
is to invalidate logical slots for which needed data has been removed.

3. To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access. That way, during WAL replay, we know when there is a risk of
conflict and, if so, if there is a conflict.

4. We can't rely on the standby's relcache entries for this purpose in
any way, because the startup process can't access catalog contents.

5. Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Why do we need this for logical decoding on standby?

First, let's forget about logical decoding on standby and recall that
on a primary database, any catalog rows that may be needed by a logical
decoding replication slot are not removed.

This is done thanks to the catalog_xmin associated with the logical
replication slot.

But, with logical decoding on standby, in the following cases:

- hot_standby_feedback is off
- hot_standby_feedback is on but there is no physical slot between
  the primary and the standby. Then, hot_standby_feedback will work,
  but only while the connection is alive (for example a node restart
  would break it)

Then, the primary may delete system catalog rows that could be needed
by the logical decoding on the standby (as it does not know about the
catalog_xmin on the standby).

So, it’s mandatory to identify those rows and invalidate the slots
that may need them, if any. Identifying those rows is the purpose of
this commit.

Implementation:

When a WAL replay on standby indicates that a catalog table tuple is
to be deleted by an xid that is greater than a logical slot's
catalog_xmin, then that means the slot's catalog_xmin conflicts with
the xid, and we need to handle the conflict. While subsequent commits
will do the actual conflict handling, this commit adds a new field
isCatalogRel in such WAL records (and a new bit set in the
xl_heap_visible flags field), that is true for catalog tables, so as to
arrange for conflict handling.

The affected WAL records are the ones that already contain the
snapshotConflictHorizon field, namely:

- gistxlogDelete
- gistxlogPageReuse
- xl_hash_vacuum_one_page
- xl_heap_prune
- xl_heap_freeze_page
- xl_heap_visible
- xl_btree_reuse_page
- xl_btree_delete
- spgxlogVacuumRedirect

Due to this new field being added, xl_hash_vacuum_one_page and
gistxlogDelete now contain the offsets to be deleted as a
FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment.
It's not needed in the other structs where isCatalogRel has
been added.

This commit just introduces the WAL format changes mentioned above. Handling
the actual conflicts will follow in future commits.

Bumps XLOG_PAGE_MAGIC, as several WAL records are changed.

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com>  (in an older version)
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
2023-04-02 12:32:19 -07:00
Andres Freund 61b313e47e Pass down table relation into more index relation functions
This is done in preparation for logical decoding on standby, which needs to
include whether visibility affecting WAL records are about a (user) catalog
table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-04-01 20:18:29 -07:00
Alvaro Herrera 6ee30209a6 SQL/JSON: support the IS JSON predicate
This patch introduces the SQL standard IS JSON predicate. It operates
on text and bytea values representing JSON, as well as on the json and
jsonb types. Each test has IS and IS NOT variants and supports a WITH
UNIQUE KEYS flag. The tests are:

IS JSON [VALUE]
IS JSON ARRAY
IS JSON OBJECT
IS JSON SCALAR

These should be self-explanatory.

The WITH UNIQUE KEYS flag makes these return false when duplicate keys
exist in any object within the value, not necessarily directly contained
in the outermost object.
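
Illustrative examples (the sample values are invented):

    SELECT js IS JSON AS is_json,
           js IS JSON ARRAY AS is_array,
           js IS JSON OBJECT WITH UNIQUE KEYS AS is_unique_object
    FROM (VALUES ('[1, 2, 3]'), ('{"a": 1, "a": 2}'), ('not json')) AS t(js);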

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/CAF4Au4w2x-5LTnN_bxky-mq4=WOqsGsxSpENCzHRAzSnEd8+WQ@mail.gmail.com
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
2023-03-31 22:34:04 +02:00
Alvaro Herrera 9b058f6b0d
Move the new function ExecEvalJsonConstructor to a more natural place
Commit 7081ac46ac put it at the end of the file, but that doesn't look
very nice.
2023-03-31 12:55:25 +02:00
Andres Freund f95c1cd6b2 Bump PGSTAT_FILE_FORMAT_ID, omitted in 8aaa04b32d
I forgot to do so in the referenced commit. While the consequences of omitting
the version change are likely to be harmless (besides discarding stats, as a
PGSTAT_FILE_FORMAT_ID bump also does), it still seems worth doing.
2023-03-30 19:48:01 -07:00
Andres Freund 8aaa04b32d Track shared buffer hits in pg_stat_io
Among other things, this should make it easier to calculate a useful cache hit
ratio by excluding buffer reads via buffer access strategies. As buffer access
strategies reuse buffers (and thus evict the prior buffer contents), it is
normal to see reads on repeated scans of the same data.
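
For illustration only, per-context hit counts can now be read straight
from the view, e.g.:

    SELECT backend_type, object, context, reads, hits,
           round(hits * 100.0 / nullif(hits + reads, 0), 2) AS hit_pct
    FROM pg_stat_io
    ORDER BY hit_pct DESC NULLS LAST;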

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAAKRu_beMa9Hzih40%3DXPYqhDVz6tsgUGTrhZXRo%3Dunp%2Bszb%3DUA%40mail.gmail.com
2023-03-30 19:24:21 -07:00
Thomas Munro 11c2d6fdf5 Parallel Hash Full Join.
Full and right outer joins were not supported in the initial
implementation of Parallel Hash Join because of deadlock hazards (see
discussion).  Therefore FULL JOIN inhibited parallelism, as the other
join strategies can't do that in parallel either.

Add a new PHJ phase PHJ_BATCH_SCAN that scans for unmatched tuples on
the inner side of one batch's hash table.  For now, sidestep the
deadlock problem by terminating parallelism there.  The last process to
arrive at that phase emits the unmatched tuples, while others detach and
are free to go and work on other batches, if there are any, but
otherwise they finish the join early.

That unfairness is considered acceptable for now, because it's better
than no parallelism at all.  The build and probe phases are run in
parallel, and the new scan-for-unmatched phase, while serial, is usually
applied to the smaller of the two relations and is either limited by
some multiple of work_mem, or it's too big and is partitioned into
batches and then the situation is improved by batch-level parallelism.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
2023-03-31 11:34:03 +13:00
Andres Freund ca7b3c4c00 pg_stat_wal: Accumulate time as instr_time instead of microseconds
In instr_time.h it is stated that:

* When summing multiple measurements, it's recommended to leave the
* running sum in instr_time form (ie, use INSTR_TIME_ADD or
* INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end.

The reason for that is that converting to microseconds is not cheap, and can
lose precision.  Therefore this commit changes 'PendingWalStats' to use
'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time'
and 'wal_sync_time'.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com
2023-03-30 14:23:14 -07:00
Alvaro Herrera 60966f56c3
Fix inconsistencies and style issues in new SQL/JSON code
Reported by Alexander Lakhin.

Discussion: https://postgr.es/m/60483139-5c34-851d-baee-6c0d014e1710@gmail.com
2023-03-30 21:06:31 +02:00
Robert Haas c3afe8cf5a Add new predefined role pg_create_subscription.
This role can be granted to non-superusers to allow them to issue
CREATE SUBSCRIPTION. The non-superuser must additionally have CREATE
permissions on the database in which the subscription is to be
created.
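
For illustration (role, database, and connection parameters are invented):

    GRANT pg_create_subscription TO deploy_user;
    GRANT CREATE ON DATABASE appdb TO deploy_user;
    -- then, connected to appdb as deploy_user:
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=primary.example.com dbname=src user=rep password=secret'
        PUBLICATION app_pub;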

Most forms of ALTER SUBSCRIPTION, including ALTER SUBSCRIPTION .. SKIP,
now require only that the role performing the operation own the
subscription, or inherit the privileges of the owner. However, to
use ALTER SUBSCRIPTION ... RENAME or ALTER SUBSCRIPTION ... OWNER TO,
you also need CREATE permission on the database. This is similar to
what we do for schemas. To change the owner of a schema, you must also
have permission to SET ROLE to the new owner, similar to what we do
for other object types.

Non-superusers are required to specify a password for authentication
and the remote side must use the password, similar to what is required
for postgres_fdw and dblink.  A superuser who wants a non-superuser to
own a subscription that does not rely on password authentication may
set the new password_required=false property on that subscription. A
non-superuser may not set password_required=false and may not modify a
subscription that already has password_required=false.

This new password_required subscription property works much like the
eponymous postgres_fdw property.  In both cases, the actual semantics
are that a password is not required if either (1) the property is set
to false or (2) the relevant user is the superuser.

Patch by me, reviewed by Andres Freund, Jeff Davis, Mark Dilger,
and Stephen Frost (but some of those people did not fully endorse
all of the decisions that the patch makes).

Discussion: http://postgr.es/m/CA+TgmoaDH=0Xj7OBiQnsHTKcF2c4L+=gzPBUKSJLh8zed2_+Dg@mail.gmail.com
2023-03-30 11:37:19 -04:00
David Rowley 902ecd3bd4 Fix outdated comments regarding TupleTableSlots
The tts_flag is named TTS_FLAG_SHOULDFREE, so use that instead of
TTS_SHOULDFREE, which is the name of the macro that checks for that flag.

Additionally, 4da597edf got rid of the TupleTableSlot.tts_tuple field but
forgot to update a comment which referenced that field.  Fix that.

Reported-by: Zhen Mingyang <zhenmingyang@yeah.net>
Reported-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1a96696c.9d3.187193989c3.Coremail.zhenmingyang@yeah.net
2023-03-30 16:37:03 +13:00
Daniel Gustafsson 44d85ba5a3 Copy and store addrinfo in libpq-owned private memory
This refactors libpq to copy addrinfos returned by getaddrinfo into
memory owned by libpq, so that future improvements can, for example,
alter the order of entries.

As a nice side effect of this refactor the mechanism for iteration
over addresses in PQconnectPoll is now identical to its iteration
over hosts.

Author: Jelte Fennema <postgres@jeltef.nl>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Reviewed-by: Andrey Borodin <amborodin86@gmail.com>
Discussion: https://postgr.es/m/PR3PR83MB04768E2FF04818EEB2179949F7A69@PR3PR83MB0476.EURPRD83.prod.outlook.com
2023-03-29 21:41:27 +02:00
Tom Lane 58c9600a9f Remove empty function BufmgrCommit().
This function has been a no-op for over a decade.  Even if bufmgr
regains a need to be called during commit, it seems unlikely that
the most appropriate call points would be precisely here, so it's not
doing us much good as a placeholder either.  Now, removing it probably
doesn't save any noticeable number of cycles --- but the main call is
inside the commit critical section, and the less work done there the
better.

Matthias van de Meent

Discussion: https://postgr.es/m/CAEze2Wi1=tLKbxZnXzcD+8fYKyKqBtivVakLQC_mYBsP4Y8qVA@mail.gmail.com
2023-03-29 09:13:57 -04:00
Alvaro Herrera 7081ac46ac
SQL/JSON: add standard JSON constructor functions
This commit introduces the SQL/JSON standard-conforming constructors for
JSON types:

JSON_ARRAY()
JSON_ARRAYAGG()
JSON_OBJECT()
JSON_OBJECTAGG()

Most of the functionality was already present in PostgreSQL-specific
functions, but these include some new functionality such as the ability
to skip or include NULL values, and to allow duplicate keys or throw
error when they are found, as well as the standard specified syntax to
specify output type and format.
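
A few illustrative calls (values are invented):

    SELECT JSON_OBJECT('code': 'P123', 'title': NULL ABSENT ON NULL);
    SELECT JSON_ARRAY(1, NULL, 2 NULL ON NULL RETURNING jsonb);
    SELECT JSON_OBJECTAGG(k: v) FROM (VALUES ('a', 1), ('b', 2)) AS t(k, v);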

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/CAF4Au4w2x-5LTnN_bxky-mq4=WOqsGsxSpENCzHRAzSnEd8+WQ@mail.gmail.com
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
2023-03-29 12:11:36 +02:00
Peter Eisentraut 563f21cda8 Move definition of standard collations from initdb to pg_collation.dat
The standard collations "ucs_basic" and "unicode" were defined in
initdb, even though pg_collation.dat seems like the correct place for
them.  It seems this was just forgotten during various reorganizations
of initdb and pg_collation.dat/.h over time.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/08b58ecd-0d50-9395-ed51-dc8294e3fd2b%40enterprisedb.com
2023-03-29 09:45:21 +02:00
Amit Kapila 062a844424 Avoid syncing data twice for the 'publish_via_partition_root' option.
When there are multiple publications for a subscription and one of those
publishes via the parent table by using publish_via_partition_root and the
other one directly publishes the child table, we end up copying the same
data twice during initial synchronization. The reason for this was that we
get both the parent and child tables from the publisher and try to copy
the data for both of them.

This patch extends the function pg_get_publication_tables() to take a
publication list as its input parameter. This allows us to exclude a
partition table whose ancestor is published by the same publication list.

This problem also exists in back-branches, but we have decided to fix it
there in a separate commit if required. The fix for back-branches requires quite
complicated changes to fetch the required table information from the
publisher as we can't update the function pg_get_publication_tables() in
back-branches. We are not sure whether we want to deviate and complicate
the code in back-branches for this problem as there are no field reports
yet.

Author: Wang wei
Reviewed-by: Peter Smith, Jacob Champion, Kuroda Hayato, Vignesh C, Osumi Takamichi, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB57167F45D481F78CDC5986F794B99@OS0PR01MB5716.jpnprd01.prod.outlook.com
2023-03-29 10:46:58 +05:30
Jeff Davis 1671f990dd Validate ICU locales.
For ICU collations, ensure that the locale's language exists in ICU,
and that the locale can be opened.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important to avoid such mistakes in ICU
versions 54 and earlier, where the same (misspelled) locale string
could fall back to different locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
Reviewed-by: Peter Eisentraut
2023-03-28 16:34:29 -07:00
Peter Eisentraut 90189eefc1 Save a few bytes in pg_attribute
Change the columns attndims, attstattarget, and attinhcount from int32
to int16, and reorder a bit.  This saves some space (currently 4
bytes) in pg_attribute and tuple descriptors, which translates into
small performance benefits and/or room for new columns in pg_attribute
needed by future features.

attndims and attinhcount are never realistically used with values
larger than int16.  Just to be sure, add some overflow checks.
attstattarget is currently limited explicitly to 10000.

For consistency, pg_constraint.coninhcount is also changed like
attinhcount.

Discussion: https://www.postgresql.org/message-id/flat/d07ffc2b-e0e8-77f7-38fb-be921dff71af%40enterprisedb.com
2023-03-28 10:05:56 +02:00
Daniel Gustafsson b577743000 Make SCRAM iteration count configurable
Replace the hardcoded value with a GUC such that the iteration
count can be raised in order to increase protection against
brute-force attacks.  The hardcoded value for SCRAM iteration
count was defined to be 4096, which is taken from RFC 7677, so
set the default for the GUC to 4096 to match.  In RFC 7677 the
recommendation is at least 15000 iterations but 4096 is listed
as a SHOULD requirement given that it's estimated to yield a
0.5s processing time on a mobile handset at the time the RFC was
written (late 2015).

Raising the iteration count of SCRAM will make stored passwords
more resilient to brute-force attacks at a higher computational
cost during connection establishment.  Lowering the count will
reduce computational overhead during connections at the tradeoff
of reducing strength against brute-force attacks.
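
A minimal usage sketch, assuming the GUC name scram_iterations and an
invented role:

    SET scram_iterations = 15000;
    ALTER ROLE app_user PASSWORD 'new-secret';
    -- with password_encryption = 'scram-sha-256', the stored verifier
    -- for app_user is now computed with 15000 iterations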

There are however platforms where even a modest iteration count
yields a too high computational overhead, with weaker password
encryption schemes chosen as a result.  In these situations,
SCRAM with a very low iteration count still gives benefits over
weaker schemes like md5, so we allow the iteration count to be
set to one at the low end.

The new GUC is intentionally generically named such that it can
be made to support future SCRAM standards should they emerge.
At that point the value can be made into key:value pairs with
an undefined key as a default which will be backwards compatible
with this.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Jonathan S. Katz <jkatz@postgresql.org>
Discussion: https://postgr.es/m/F72E7BC7-189F-4B17-BF47-9735EB72C364@yesql.se
2023-03-27 09:46:29 +02:00
Daniel Gustafsson d435f15fff Add SysCacheGetAttrNotNull for guaranteed not-null attrs
When extracting an attr from a cached tuple in the syscache with
SysCacheGetAttr, the isnull parameter must be checked even when the
attr cannot be NULL.  For cases when this is known beforehand, a
wrapper is introduced which performs the error handling internally
on behalf of the caller, invoking an elog in case of a NULL attr.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/AD76405E-DB45-46B6-941F-17B1EB3A9076@yesql.se
2023-03-25 22:49:33 +01:00
Tom Lane 27f5c712b2 Fix CREATE INDEX progress reporting for multi-level partitioning.
The "partitions_total" and "partitions_done" fields were updated
as though the current level of partitioning was the only one.
In multi-level cases, not only could partitions_total change
over the course of the command, but partitions_done could go
backwards or exceed the currently-reported partitions_total.

Fix by setting partitions_total to the total number of direct
and indirect children once at command start, and then just
incrementing partitions_done at appropriate points.  Invent
a new progress monitoring function "pgstat_progress_incr_param"
to simplify doing the latter.  We can avoid adding cost for the
former when doing CREATE INDEX, because ProcessUtility already
enumerates the children and it's pretty easy to pass the count
down to DefineIndex.  In principle the same could be done in
ALTER TABLE, but that's structurally difficult; for now, just
eat the cost of an extra find_all_inheritors scan in that case.
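
The counters can then be watched from the progress view as usual, e.g.:

    SELECT pid, phase, partitions_total, partitions_done
    FROM pg_stat_progress_create_index;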

Ilya Gladyshev and Justin Pryzby

Discussion: https://postgr.es/m/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel@gmail.com
2023-03-25 15:34:03 -04:00
Tom Lane 3c05284d83 Invent GENERIC_PLAN option for EXPLAIN.
This provides a very simple way to see the generic plan for a
parameterized query.  Without this, it's necessary to define
a prepared statement and temporarily change plan_cache_mode,
which is a bit tedious.
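
For example (table and column names are invented):

    EXPLAIN (GENERIC_PLAN)
    SELECT * FROM orders WHERE order_id = $1;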

One thing that's a bit of a hack perhaps is that we disable
execution-time partition pruning when the GENERIC_PLAN option
is given.  That's because the pruning code may attempt to
fetch the value of one of the parameters, which would fail.

Laurenz Albe, reviewed by Julien Rouhaud, Christoph Berg,
Michel Pelletier, Jim Jones, and myself

Discussion: https://postgr.es/m/0a29b954b10b57f0d135fe12aa0909bd41883eb0.camel@cybertec.at
2023-03-24 17:07:22 -04:00
Michael Paquier 36f40ce2dc libpq: Add sslcertmode option to control client certificates
The sslcertmode option controls whether the server is allowed and/or
required to request a certificate from the client.  There are three
modes:
- "allow" is the default and follows the current behavior, where a
configured client certificate is sent if the server requests one
(via one of its default locations or sslcert).  With the current
implementation, this will happen whenever TLS is negotiated.
- "disable" causes the client to refuse to send a client certificate
even if sslcert is configured or if a client certificate is available in
one of its default locations.
- "require" causes the client to fail if a client certificate is never
sent and the server opens a connection anyway.  This doesn't add any
additional security, since there is no guarantee that the server is
validating the certificate correctly, but it may be helpful when
troubleshooting more complicated TLS setups.

sslcertmode=require requires SSL_CTX_set_cert_cb(), available since
OpenSSL 1.0.2.  Note that LibreSSL does not include it.

Using a connection parameter different from require_auth has come up as
the simplest design because certificate authentication does not rely
directly on any of the AUTH_REQ_* codes, and one may want to require a
certificate to be sent in combination with a given authentication method,
like SCRAM-SHA-256.

TAP tests are added in src/test/ssl/, some of them relying on sslinfo to
check if a certificate has been set.  These are compatible across all
the versions of OpenSSL supported on HEAD (currently down to 1.0.1).

Author: Jacob Champion
Reviewed-by: Aleksander Alekseev, Peter Eisentraut, David G. Johnston,
Michael Paquier
Discussion: https://postgr.es/m/9e5a8ccddb8355ea9fa4b75a1e3a9edc88a70cd3.camel@vmware.com
2023-03-24 13:34:26 +09:00
Michael Paquier 8089517ab8 Rename fields in pgstat structures for functions and relations
This commit renames the members of a few pgstat structures related to
functions and relations, by respectively removing their prefix "f_" and
"t_".  The statistics for functions and relations and handled in their
own file, and pgstatfuncs.c associates each field in a structure
variable named based on the object type handled, so no information is
lost with this rename.

This will help with some of the refactoring aimed at pgstatfuncs.c, as
it makes the field names more consistent with the SQL functions
retrieving them.

Author: Bertrand Drouvot
Reviewed-by: Michael Paquier, Melanie Plageman
Discussion: https://postgr.es/m/9142f62a-a422-145c-bde0-b5bc498a4ada@gmail.com
2023-03-24 08:46:29 +09:00
Peter Geoghegan ae4fdde135 Count updates that move row to a new page.
Add pgstat counter to track row updates that result in the successor
version going to a new heap page, leaving behind an original version
whose t_ctid points to the new version.  The current count is shown by
the n_tup_newpage_upd column of each of the pg_stat_*_tables views.

The new n_tup_newpage_upd column complements the existing n_tup_hot_upd
and n_tup_upd columns.  Tables that have high n_tup_newpage_upd values
(relative to n_tup_upd) are good candidates for tuning heap fillfactor.
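
For example, fillfactor-tuning candidates could be spotted with a query
such as:

    SELECT relname, n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd
    FROM pg_stat_user_tables
    ORDER BY n_tup_newpage_upd DESC
    LIMIT 10;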

Corey Huinker, with small tweaks by me.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CADkLM=ded21M9iZ36hHm-vj2rE2d=zcKpUQMds__Xm2pxLfHKA@mail.gmail.com
2023-03-23 11:16:17 -07:00
Thomas Munro 8fba928fd7 Improve the naming of Parallel Hash Join phases.
* Commit 3048898e dropped -ING from PHJ wait event names.  Update the
  corresponding barrier phases names to match.

* Rename the "DONE" phases to "FREE".  That's symmetrical with
  "ALLOCATE", and names the activity that actually happens in that phase
  (as we do for the other phases) rather than a state.  The bug fixed by
  commit 8d578b9b might have been more obvious with this name.

* Rename the batch/bucket growth barriers' "ALLOCATE" phases to
  "REALLOCATE", a better description of what they do.

* Update the high-level comments about phases to mark, with an asterisk,
  the phases that are executed by a single process (mostly memory
  management phases).

No behavior change, as this is just improving internal identifiers.  The
only user-visible sign of this is that a couple of wait events' display
names change from "...Allocate" to "...Reallocate" in pg_stat_activity,
to stay in sync with the internal names.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BMDpwF2Eo2LAvzd%3DpOh81wUTsrwU1uAwR-v6OGBB6%2B7g%40mail.gmail.com
2023-03-23 13:14:25 +13:00
Alexander Korotkov 11470f544e Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by a concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during first
tuple_update()/tuple_delete() call.  Heap table access method saves some
efforts by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2023-03-23 00:26:59 +03:00
Tom Lane b0d8f2d983 Add SHELL_ERROR and SHELL_EXIT_CODE magic variables to psql.
These are set after a \! command or a backtick substitution.
SHELL_ERROR is just "true" for error (nonzero exit status) or "false"
for success, while SHELL_EXIT_CODE records the actual exit status
following standard shell/system(3) conventions.
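
For example, in a psql script (the shell command is arbitrary):

    \! grep -q postgres /etc/passwd
    \echo exit status was :SHELL_EXIT_CODE
    \if :SHELL_ERROR
        \echo the shell command failed
    \endif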

Corey Huinker, reviewed by Maxim Orlov and myself

Discussion: https://postgr.es/m/CADkLM=cWao2x2f+UDw15W1JkVFr_bsxfstw=NGea7r9m4j-7rQ@mail.gmail.com
2023-03-21 13:03:56 -04:00
Thomas Munro 8d578b9b2e Fix race in parallel hash join batch cleanup, take II.
With unlucky timing and parallel_leader_participation=off (not the
default), PHJ could attempt to access per-batch shared state just as it
was being freed.  There was code intended to prevent that by checking
for a cleared pointer, but it was racy.  Fix, by introducing an extra
barrier phase.  The new phase PHJ_BUILD_RUNNING means that it's safe to
access the per-batch state to find a batch to help with, and
PHJ_BUILD_DONE means that it is too late.  The last to detach will free
the array of per-batch state as before, but now it will also atomically
advance the phase, so that late attachers can avoid the hazard.  This
mirrors the way per-batch hash tables are freed (see phases
PHJ_BATCH_PROBING and PHJ_BATCH_DONE).

An earlier attempt to fix this (commit 3b8981b6, later reverted) missed
one special case.  When the inner side is empty (the "empty inner
optimization"), the build barrier would only make it to
PHJ_BUILD_HASHING_INNER phase before workers attempted to detach from
the hashtable.  In that case, fast-forward the build barrier to
PHJ_BUILD_RUNNING before proceeding, so that our later assertions hold
and we can still negotiate who is cleaning up.

Revealed by build farm failures, where BarrierAttach() failed a sanity
check assertion, because the memory had been clobbered by dsa_free().
In non-assert builds, the result could be a segmentation fault.

Back-patch to all supported releases.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: David Geier <geidav.pg@gmail.com>
Tested-by: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz
2023-03-21 14:29:34 +13:00
Tomas Vondra 19d8e2308b Ignore BRIN indexes when checking for HOT updates
When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed by block summarizing indexes without
references to individual tuples that need to be cleaned up.

A new type TU_UpdateIndexes provides a signal to the executor to
determine which indexes to update - no indexes, all indexes, or only the
summarizing indexes.

This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.

This was originally committed as 5753d4ee32, but then got reverted by
e3fcca0d0d because of correctness issues.

Original patch by Josef Simanek, various fixes and improvements by Tomas
Vondra and me.

Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra
Reviewed-by: Tomas Vondra, Alvaro Herrera
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
2023-03-20 11:02:42 +01:00
Tom Lane 75bd846b68 Add functions to do timestamptz arithmetic in a non-default timezone.
Add versions of timestamptz + interval, timestamptz - interval, and
generate_series(timestamptz, ...) in which a timezone can be specified
explicitly instead of defaulting to the TimeZone GUC setting.

The new functions for the first two are named date_add and
date_subtract.  This might seem too generic, but we could use
overloading to add additional variants if that seems useful.
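
For example, across a DST transition:

    SELECT date_add('2021-10-31 00:00:00+02'::timestamptz,
                    '1 day'::interval, 'Europe/Warsaw');
    SELECT date_subtract('2021-11-01 00:00:00+01'::timestamptz,
                         '1 day'::interval, 'Europe/Warsaw');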

Along the way, improve the docs' pretty inadequate explanation
of how timestamptz +- interval works.

Przemysław Sztoch and Gurjeet Singh; cosmetic changes and most of
the docs work by me

Discussion: https://postgr.es/m/01a84551-48dd-1359-bf7e-f6b0203a6bd0@sztoch.pl
2023-03-18 14:12:16 -04:00
Michael Paquier 0e681cf039 Add files related to query jumbling in src/include/nodes/ for meson
This caused "ninja clean" to not remove the two files generated by
gen_node_support.pl for query jumbling:
queryjumblefuncs.funcs.c and queryjumblefuncs.switch.c.

Reported-by: Pavel Stehule
Discussion: https://postgr.es/m/CAFj8pRBFiWVRyGYSPziyFuXJbHirNmfWwzbfTyCf8YOdiwK74w@mail.gmail.com
2023-03-18 18:04:04 +09:00
Tom Lane 3e59e5048d Refactor datetime functions' timezone lookup code to reduce duplication.
We already had five copies of essentially the same logic, and an
upcoming patch introduces yet another use-case.  That's past my
threshold of pain, so introduce a common subroutine.  There's not
that much net code savings, but the chance of typos should go down.

Inspired by a patch from Przemysław Sztoch, but different in detail.

Discussion: https://postgr.es/m/01a84551-48dd-1359-bf7e-f6b0203a6bd0@sztoch.pl
2023-03-17 17:47:19 -04:00
Jeff Davis f413941f41 Fix t_isspace(), etc., when datlocprovider=i and datctype=C.
Check whether the datctype is C to determine whether t_isspace() and
related functions use isspace() or iswspace().

Previously, t_isspace() checked whether the database default collation
was C; which is incorrect when the default collation uses the ICU
provider.

Discussion: https://postgr.es/m/79e4354d9eccfdb00483146a6b9f6295202e7890.camel@j-davis.com
Reviewed-by: Peter Eisentraut
Backpatch-through: 15
2023-03-17 12:08:46 -07:00
Amit Kapila e709596b25 Add macros for ReorderBufferTXN toptxn.
Currently, there are quite a few places in reorderbuffer.c that try to
access the top-transaction for a subtransaction. The new macros make the
code that accesses the top-transaction consistent and easier to follow.

Author: Peter Smith
Reviewed-by: Vignesh C, Sawada Masahiko
Discussion: https://postgr.es/m/CAHut+PuCznOyTqBQwjRUu-ibG-=KHyCv-0FTcWQtZUdR88umfg@mail.gmail.com
2023-03-17 08:29:41 +05:30
Michael Paquier 98ae2c84a4 libpq: Remove code for SCM credential authentication
Support for SCM credential authentication has been removed in the
backend in 9.1, and libpq has kept some code to handle it for
compatibility.

Commit be4585b, which did the cleanup of the backend code, did so
because the code was not really portable originally.  And, as it is
unlikely that this is still used these days, this removes the
remaining code from libpq.  An error will now be raised by libpq if
attempting to connect to a server that returns AUTH_REQ_SCM_CREDS,
instead.

References to SCM credential authentication are removed from the
protocol documentation.  This removes some meson and configure checks.

Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/ZBLH8a4otfqgd6Kn@paquier.xyz
2023-03-17 10:52:26 +09:00
Michael Paquier e731aeac89 Remove PgStat_BackendFunctionEntry
This structure included only PgStat_FunctionCounts, and removing it
facilitates some upcoming refactoring for pgstatfuncs.c to use more
macros rather than mostly-duplicated functions.

Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/11d531fe-52fc-c6ea-7e8e-62f1b6ec626e@gmail.com
2023-03-16 14:22:34 +09:00
Tom Lane 483bdb2afe Support [NO] INDENT option in XMLSERIALIZE().
This adds the ability to pretty-print XML documents ... according to
libxml's somewhat idiosyncratic notions of what's pretty, anyway.
One notable divergence from a strict reading of the spec is that
libxml is willing to collapse empty nodes "<node></node>" to just
"<node/>", whereas SQL and the underlying XML spec say that this
option should only result in whitespace tweaks.  Nonetheless,
it seems close enough to justify using the SQL-standard syntax.
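
For example:

    SELECT xmlserialize(DOCUMENT '<foo><bar>42</bar></foo>'::xml
                        AS text INDENT);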

Jim Jones, reviewed by Peter Smith and myself

Discussion: https://postgr.es/m/2f5df461-dad8-6d7d-4568-08e10608a69b@uni-muenster.de
2023-03-15 16:59:09 -04:00
Andrew Dunstan 419a8dd814 Add a hook for modifying the ldapbind password
The hook can be installed by a shared preload library.

A similar mechanism could be used for RADIUS passwords, for example, and
the type name auth_password_hook_typ has been chosen with that in mind.

John Naylor and Andrew Dunstan

Discussion: https://postgr.es/m/469b06ed-69de-ba59-c13a-91d2372e52a9@dunslane.net
2023-03-15 16:37:28 -04:00
Amit Kapila 89e46da5e5 Allow the use of indexes other than PK and REPLICA IDENTITY on the subscriber.
Using REPLICA IDENTITY FULL on the publisher can lead to a full table scan
per tuple change on the subscription when REPLICA IDENTITY or PK index is
not available. This makes REPLICA IDENTITY FULL impractical to use apart
from some small number of use cases.

This patch allows using indexes other than PRIMARY KEY or REPLICA
IDENTITY on the subscriber during apply of update/delete. The index that
can be used must be a btree index, not a partial index, and it must have
at least one column reference (i.e. cannot consist of only expressions).
We can lift these restrictions in the future. There is no smart
mechanism to pick the index. If there is more than one index that
satisfies these requirements, we just pick the first one. We discussed
using some of the optimizer's low-level APIs for this but ruled it out
as that can be a maintenance burden in the long run.

This patch improves the performance in the vast majority of cases and the
improvement is proportional to the amount of data in the table. However,
there could be some regression in a small number of cases where the indexes
have a lot of duplicate and dead rows. It was discussed that those are
mostly impractical cases but we can provide a table or subscription level
option to disable this feature if required.

Author: Onder Kalaci, Amit Kapila
Reviewed-by: Peter Smith, Shi yu, Hou Zhijie, Vignesh C, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/CACawEhVLqmAAyPXdHEPv1ssU2c=dqOniiGz7G73HfyS7+nGV4w@mail.gmail.com
2023-03-15 08:49:04 +05:30
Dean Rasheed d5d574146d Add support for the error functions erf() and erfc().
Expose the standard error functions as SQL-callable functions. These
are expected to be useful to people working with normal distributions,
and we use them here to test the distribution from random_normal().
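
For example:

    SELECT erf(1.0), erfc(1.0);
    -- roughly 0.8427 and 0.1573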

Since these functions are defined in the POSIX and C99 standards, they
should in theory be available on all supported platforms. If that
turns out not to be the case, more work will be needed.

On all platforms tested so far, using extra_float_digits = -1 in the
regression tests is sufficient to allow for variations between
implementations. However, past experience has shown that there are
almost certainly going to be additional unexpected portability issues,
so these tests may well need further adjustments, based on the
buildfarm results.

Dean Rasheed, reviewed by Nathan Bossart and Thomas Munro.

Discussion: https://postgr.es/m/CAEZATCXv5fi7+Vu-POiyai+ucF95+YMcCMafxV+eZuN1B-=MkQ@mail.gmail.com
2023-03-14 09:17:36 +00:00
Michael Paquier 3a465cc678 libpq: Add support for require_auth to control authorized auth methods
The new connection parameter require_auth allows a libpq client to
define a list of comma-separated acceptable authentication types for use
with the server.  There is no negotiation: if the server does not
present one of the allowed authentication requests, the connection
attempt made by the client fails.

The following keywords can be defined in the list:
- password, for AUTH_REQ_PASSWORD.
- md5, for AUTH_REQ_MD5.
- gss, for AUTH_REQ_GSS[_CONT].
- sspi, for AUTH_REQ_SSPI and AUTH_REQ_GSS_CONT.
- scram-sha-256, for AUTH_REQ_SASL[_CONT|_FIN].
- creds, for AUTH_REQ_SCM_CREDS (perhaps this should be removed entirely
now).
- none, to control unauthenticated connections.

All the methods that can be defined in the list can be negated, like
"!password", in which case the server must NOT use the listed
authentication type.  The special method "none" allows/disallows the use
of unauthenticated connections (but it does not govern transport-level
authentication via TLS or GSSAPI).
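
Two illustrative connection-string settings (the annotations in
parentheses are explanatory, not part of the syntax):

    require_auth=scram-sha-256      (accept only SCRAM authentication)
    require_auth=!password,!md5     (accept anything except plaintext or MD5 passwords)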

Internally, the patch logic is tied to check_expected_areq(), that was
used for channel_binding, ensuring that an incoming request is
compatible with conn->require_auth.  It also introduces a new flag,
conn->client_finished_auth, which is set by various authentication
routines when the client side of the handshake is finished.  This
signals to check_expected_areq() that an AUTH_REQ_OK from the server is
expected, and allows the client to complain if the server bypasses
authentication entirely, with for example the reception of a too-early
AUTH_REQ_OK message.

Regression tests are added in authentication TAP tests for all the
keywords supported (except "creds", because it is around only for
compatibility reasons).  A new TAP script has been added for SSPI, as
there was no script dedicated to it yet.  It relies on SSPI being the
default authentication method on Windows, as set by pg_regress.

Author: Jacob Champion
Reviewed-by: Peter Eisentraut, David G. Johnston, Michael Paquier
Discussion: https://postgr.es/m/9e5a8ccddb8355ea9fa4b75a1e3a9edc88a70cd3.camel@vmware.com
2023-03-14 14:00:05 +09:00
Andrew Dunstan 9f8377f7a2 Add a DEFAULT option to COPY FROM
This allows specifying a string which, if an input field matches it,
causes the column's default value to be inserted. The advantage of this
is that the default can be inserted in some rows and not in others, for
which non-default data is available.
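
For illustration (table, file path, and marker are invented):

    CREATE TABLE t (a int, b int DEFAULT 42);
    COPY t FROM '/tmp/t.dat' WITH (DEFAULT '\D');
    -- any input field spelled \D is replaced by that column's default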

The file_fdw extension is also modified to allow use of this
option.

Israel Barth Rubio

Discussion: https://postgr.es/m/CAO_rXXAcqesk6DsvioOZ5zmeEmpUN5ktZf-9=9yu+DTr0Xr8Uw@mail.gmail.com
2023-03-13 10:01:56 -04:00
Dean Rasheed 9321c79c86 Fix concurrent update issues with MERGE.
If MERGE attempts an UPDATE or DELETE on a table with BEFORE ROW
triggers, or a cross-partition UPDATE (with or without triggers), and
a concurrent UPDATE or DELETE happens, the merge code would fail.

In some cases this would lead to a crash, while in others it would
cause the wrong merge action to be executed, or no action at all. The
immediate cause of the crash was the trigger code calling
ExecGetUpdateNewTuple() as part of the EPQ mechanism, which fails
because during a merge ri_projectNew is NULL, since merge has its own
per-action projection information, which ExecGetUpdateNewTuple() knows
nothing about.

Fix by arranging for the trigger code to exit early, returning the
TM_Result and TM_FailureData information, if a concurrent modification
is detected, allowing the merge code to do the necessary EPQ handling
in its own way. Similarly, prevent the cross-partition update code
from doing any EPQ processing for a merge, allowing the merge code to
work out what it needs to do.

This leads to a number of simplifications in nodeModifyTable.c. Most
notably, the ModifyTableContext->GetUpdateNewTuple() callback is no
longer needed, and mergeGetUpdateNewTuple() can be deleted, since
there is no longer any requirement for get-update-new-tuple during a
merge. Similarly, ModifyTableContext->cpUpdateRetrySlot is no longer
needed. Thus ExecGetUpdateNewTuple() and the retry_slot handling of
ExecCrossPartitionUpdate() can be restored to how they were in v14,
before the merge code was added, and ExecMergeMatched() no longer
needs any special-case handling for cross-partition updates.

While at it, tidy up ExecUpdateEpilogue() a bit, making it handle
recheckIndexes locally, rather than passing it in as a parameter,
ensuring that it is freed properly. This dates back to when it was
split off from ExecUpdate() to support merge.

Per bug #17809 from Alexander Lakhin, and follow-up investigation of
bug #17792, also from Alexander Lakhin.

Back-patch to v15, where MERGE was introduced, taking care to preserve
backwards-compatibility of the trigger API in v15 for any extensions
that might use it.

Discussion:
  https://postgr.es/m/17809-9e6650bef133f0fe%40postgresql.org
  https://postgr.es/m/17792-0f89452029662c36%40postgresql.org
2023-03-13 10:22:22 +00:00
Peter Eisentraut d72900bded Improve support for UNICODE collation on older ICU
The recently added standard collation UNICODE (0d21d4b9bc) doesn't
give consistent results on some build farm members with old ICU
versions.  Apparently, the ICU locale specification 'und' (language
tag style) misbehaves on some older ICU versions.  Replacing it with
'' (ICU locale ID style) fixes it at least on some OS versions.  Let's
see what the build farm says.
2023-03-13 09:08:58 +01:00
Peter Eisentraut 0d21d4b9bc Add standard collation UNICODE
This adds a new predefined collation named UNICODE, which sorts by the
default Unicode collation algorithm specifications, per SQL standard.

This only works if ICU support is built.
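
For example (table and column names are invented):

    SELECT title FROM books ORDER BY title COLLATE "unicode";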

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/flat/1293e382-2093-a2bf-a397-c04e8f83d3c2@enterprisedb.com
2023-03-10 13:35:43 +01:00
Michael Paquier 6ad5793a49 Include headers of archive/ in installation
These new headers have been recently added in 35739b8, but they were not
installed.  Sravan has provided the patch for configure/make, while I
have fixed the meson part.

Author: Sravan Kumar, Michael Paquier
Discussion: https://postgr.es/m/CA+=NbjguiQy-MbVqfQ-jQ=2Fcmx3Zs36OkKb-vjt28jMTG0OOg@mail.gmail.com
2023-03-10 20:08:10 +09:00
Peter Eisentraut 30a53b7929 Allow tailoring of ICU locales with custom rules
This exposes the ICU facility to add custom collation rules to a
standard collation.

New options are added to CREATE COLLATION, CREATE DATABASE, createdb,
and initdb to set the rules.
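
For example (collation name invented; the rule string follows ICU
tailoring syntax):

    CREATE COLLATION custom_rules (provider = icu, locale = 'und',
                                   rules = '&V << w <<< W');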

Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Discussion: https://www.postgresql.org/message-id/flat/821c71a4-6ef0-d366-9acf-bb8e367f739f@enterprisedb.com
2023-03-08 16:56:37 +01:00
Peter Eisentraut 822e8e2951 Update comment
There was apparently an attempt here to list all the object types that
ACL_USAGE applies to, but it wasn't complete.  So instead of trying to
keep up, put in a more timeless comment.
2023-03-08 14:22:06 +01:00
Michael Paquier a4e003338d Refine query jumbling handling for CallStmt
Previously, all the nodes of CallStmt were included in the jumbling,
causing a duplicate in the computation as the transformed state of the
CALL query was included as well as the parsed state (transformed
FuncCall with all the input arguments and potential output arguments).

Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Y+MRdEq9W9XVa2AB@paquier.xyz
2023-03-08 14:38:35 +09:00
Michael Paquier d69cd3a2e2 Ignore IntoClause.viewQuery in query jumbling
IntoClause.viewQuery is a copy of the parsed-but-not-rewritten SELECT
clause copied to IntoClause when transforming CreateTableAsStmt for a
materialized view.  Including a second copy of the SELECT Query into the
query jumbling was leading to an incorrect numbering of the Const node
locations, as these would be counted twice instead of once.

This becomes visible once the query normalization is applied to CREATE
MATERIALIZED VIEW in pg_stat_statements in the shape of a query string
using only odd numbers for the normalized constants (regression tests
added in pg_stat_statements as of de2aca2 would show the difference).
Including the original Query from CreateTableAsStmt is enough for the
query jumbling.

Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Y+MRdEq9W9XVa2AB@paquier.xyz
2023-03-08 11:41:52 +09:00
Michael Paquier e20b1ea157 Make get_extension_schema() available
This routine is able to retrieve the OID of the schema used with an
extension (pg_extension.extnamespace), or InvalidOid if this information
is not available.  plpgsql_check embeds a copy of this code when
performing checks on functions, as one out-of-core example.

Author: Pavel Stehule
Reviewed-by: Julien Rouhaud
Discussion: https://postgr.es/m/CAFj8pRD+9x55hjDoi285jCcjPc8uuY_D+FLn5RpXggdz+4O2sQ@mail.gmail.com
2023-03-07 14:18:20 +09:00
Tom Lane 7fee7871b4 Fix some more cases of missed GENERATED-column updates.
If UPDATE is forced to retry after an EvalPlanQual check, it neglected
to repeat GENERATED-column computations, even though those might well
have changed since we're dealing with a different tuple than before.
Fixing this is mostly a matter of looping back a bit further when
we retry.  In v15 and HEAD that's most easily done by altering the API
of ExecUpdateAct so that it includes computing GENERATED expressions.

Also, if an UPDATE in a partitioned table turns into a cross-partition
INSERT operation, we failed to recompute GENERATED columns.  That's a
bug since 8bf6ec3ba allowed partitions to have different generation
expressions; although it seems to have no ill effects before that.
Fixing this is messier because we can now have situations where the same
query needs both the UPDATE-aligned set of GENERATED columns and the
INSERT-aligned set, and it's unclear which set will be generated first
(else we could hack things by forcing the INSERT-aligned set to be
generated, which is indeed how fe9e658f4 made it work for MERGE).
The best fix seems to be to build and store separate sets of expressions
for the INSERT and UPDATE cases.  That would create ABI issues in the
back branches, but so far it seems we can leave this alone in the back
branches.

Per bug #17823 from Hisahiro Kauchi.  The first part of this affects all
branches back to v12 where GENERATED columns were added.

Discussion: https://postgr.es/m/17823-b64909cf7d63de84@postgresql.org
2023-03-06 18:31:27 -05:00
Tom Lane b803b7d132 Fill EState.es_rteperminfos more systematically.
While testing a fix for bug #17823, I discovered that EvalPlanQualStart
failed to copy es_rteperminfos from the parent EState, resulting in
failure if anything in EPQ execution wanted to consult that information.

This led me to conclude that commit a61b1f748 had been too haphazard
about where to fill es_rteperminfos, and that we need to be sure that
that happens exactly where es_range_table gets filled.  So I changed the
signature of ExecInitRangeTable to help ensure that this new requirement
doesn't get missed.  (Indeed, pgoutput.c was also failing to fill it.
Maybe we don't ever need it there, but I wouldn't bet on that.)

No test case yet; one will arrive with the fix for #17823.
But that needs to be back-patched, while this fix is HEAD-only.

Discussion: https://postgr.es/m/17823-b64909cf7d63de84@postgresql.org
2023-03-06 13:10:57 -05:00
Michael Paquier 4211fbd841 Add PROCESS_MAIN to VACUUM
Disabling this option is useful to run VACUUM (with or without FULL) on
only the toast table of a relation, bypassing the main relation.  This
option is enabled by default.

Running VACUUM directly on a toast table was already possible without
this feature, by using the non-deterministic name of a toast relation
(as of pg_toast.pg_toast_N, where N would be the OID of the parent
relation) in the VACUUM command, and it required a scan of pg_class to
know the name of the toast table.  So this feature is basically a
shortcut to be able to run VACUUM or VACUUM FULL on a toast relation,
using only the name of the parent relation.
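
For example (table name invented):

    VACUUM (PROCESS_MAIN FALSE) big_table;        -- processes only big_table's toast table
    VACUUM (FULL, PROCESS_MAIN FALSE) big_table;  -- same, but with VACUUM FULL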

A new switch called --no-process-main is added to vacuumdb, to work as
an equivalent of PROCESS_MAIN.

Regression tests are added to cover VACUUM and VACUUM FULL, looking at
pg_stat_all_tables.vacuum_count to see how many vacuums have run on
each table, main or toast.

Author: Nathan Bossart
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/20221230000028.GA435655@nathanxps13
2023-03-06 16:41:05 +09:00
Michael Paquier ce340e530d Revise pg_pwrite_zeros()
The following changes are made to pg_pwrite_zeros(), the API able to
write a series of zeros using vectored I/O:
- Add an "offset" parameter, to write the given size starting at this
position (the 'p' of "pwrite" seems to mean position, though POSIX does
not outline that directly), hence the name of the routine would be
incorrect if it were not able to handle offsets.
- Avoid memset() of "zbuffer" on every call.
- Avoid initialization of the whole IOV array if not needed.
- Group the trailing write() call with the main write() call,
simplifying the function logic.

Author: Andres Freund
Reviewed-by: Michael Paquier, Bharath Rupireddy
Discussion: https://postgr.es/m/20230215005525.mrrlmqrxzjzhaipl@awork3.anarazel.de
2023-03-06 13:21:33 +09:00
Tom Lane 6949b921d5 Avoid failure when altering state of partitioned foreign-key triggers.
Beginning in v15, if you apply ALTER TABLE ENABLE/DISABLE TRIGGER to
a partitioned table, it also affects the partitions' cloned versions
of the affected trigger(s).  The initial implementation of this
located the clones by name, but that fails on foreign-key triggers
which have names incorporating their own OIDs.  We can fix that, and
also make the behavior more bulletproof in the face of user-initiated
trigger renames, by identifying the cloned triggers by tgparentid.

Following the lead of earlier commits in this area, I took care not
to break ABI in the v15 branch, even though I rather doubt there
are any external callers of EnableDisableTrigger.

While here, update the documentation, which was not touched when
the semantics were changed.

Per bug #17817 from Alan Hodgson.  Back-patch to v15; older versions
do not have this behavior.

Discussion: https://postgr.es/m/17817-31dfb7c2100d9f3d@postgresql.org
2023-03-04 13:32:35 -05:00
Robert Haas ebd551f586 Update some incorrect comments about xlog records.
The comments claim that certain pieces of data are part of the main
WAL record data when in reality they are part of the data for
block 0. Repair.

Bertrand Drouvot, reviewed by Amit Kapila. Originally reported by me.

Discussion: http://postgr.es/m/80db7836-4415-d54a-64c3-66b88b1430e7@gmail.com
2023-03-03 12:52:04 -05:00
Thomas Munro 1da569ca1f Don't leak descriptors into subprograms.
Open long-lived data and WAL file descriptors with O_CLOEXEC.  This flag
was introduced by SUSv4 (POSIX.1-2008), and by now all of our target
Unix systems have it.  Our open() implementation for Windows already had
that behavior, so provide a dummy O_CLOEXEC flag on that platform.

For now, callers of open() and the "thin" wrappers in fd.c that deal in
raw descriptors need to pass in O_CLOEXEC explicitly if desired.  This
commit does that for WAL files, and automatically for everything
accessed via VFDs including SMgrRelation and BufFile.  (With more
discussion we might decide to turn it on automatically for the thin
open()-wrappers too to avoid risk of missing places that need it, but
these are typically used for short-lived descriptors where we don't
expect to fork/exec, and it's remotely possible that extensions could be
using these APIs and passing descriptors to subprograms deliberately, so
that hasn't been done here.)

Do the same for sockets and the postmaster pipe with FD_CLOEXEC.  (Later
commits might use modern interfaces to remove these extra fcntl() calls
and more where possible, but we'll need them as a fallback for a couple
of systems, so do it that way in this initial commit.)

With this change, subprograms executed for archiving, copying etc will
no longer have access to the server's descriptors, other than the ones
that we decide to pass down.

Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
2023-03-03 10:43:33 +13:00
Tom Lane 00b41463c2 Require empty Bitmapsets to be represented as NULL.
When I designed the Bitmapset module, I set things up so that an empty
Bitmapset could be represented either by a NULL pointer, or by an
allocated object all of whose bits are zero.  I've recently come to
the conclusion that that was a bad idea and we should instead have a
convention like the longstanding invariant for Lists, whereby an empty
list is represented by NIL and nothing else.

To do this, we need to fix bms_intersect, bms_difference, and a couple
of other functions to check for having produced an empty result; but
then we can replace bms_is_empty(a) by a simple "a == NULL" test.

This is very likely a (marginal) win performance-wise, because we
call bms_is_empty many more times than those other functions put
together.  However, the real reason to do it is that we have various
places that have hand-implemented a rule about "this Bitmapset
variable must be exactly NULL if empty", so that they can use
checks-for-null in place of bms_is_empty calls in particularly hot
code paths.  That is a really fragile, mistake-prone way to do things,
and I'm surprised that we've seldom been bitten by it.  It's not well
documented at all which variables have this property, so you can't
readily tell which code might be violating those conventions.  By
making the convention universal, we can eliminate a subtle source of
bugs.

Patch by me; thanks to Nathan Bossart and Richard Guo for review.

Discussion: https://postgr.es/m/1159933.1677621588@sss.pgh.pa.us
2023-03-02 11:47:26 -05:00
Tom Lane 141225b251 Mop up some undue familiarity with the innards of Bitmapsets.
nodeAppend.c used non-nullness of appendstate->as_valid_subplans as
a state flag to indicate whether it'd done ExecFindMatchingSubPlans
(or some sufficient approximation to that).  This was pretty
questionable even in the beginning, since it wouldn't really work
right if there are no valid subplans.  It got more questionable
after commit 27e1f1456 added logic that could reduce as_valid_subplans
to an empty set: at that point we were depending on unspecified
behavior of bms_del_members, namely that it'd not return an empty
set as NULL.  It's about to start doing that, which breaks this
logic entirely.  Hence, add a separate boolean flag to signal
whether as_valid_subplans has been computed.

Also fix a previously-cosmetic bug in nodeAgg.c, wherein it ignored
the return value of bms_del_member instead of updating its pointer.

Patch by me; thanks to Nathan Bossart and Richard Guo for review.

Discussion: https://postgr.es/m/1159933.1677621588@sss.pgh.pa.us
2023-03-02 11:37:37 -05:00
Tom Lane 462bb7f128 Remove bms_first_member().
This function has been semi-deprecated ever since we invented
bms_next_member().  Its habit of scribbling on the input bitmapset
isn't great, plus for sufficiently large bitmapsets it would take
O(N^2) time to complete a loop.  Now we have the additional problem
that reducing the input to empty while leaving it still accessible
would violate a planned invariant.  So let's just get rid of it,
after updating the few extant callers to use bms_next_member().

Patch by me; thanks to Nathan Bossart and Richard Guo for review.

Discussion: https://postgr.es/m/1159933.1677621588@sss.pgh.pa.us
2023-03-02 11:34:29 -05:00
Daniel Gustafsson 7ab1bc2939 Fix outdated references to guc.c
Commit 0a20ff54f split out the GUC variables from guc.c into a new file
guc_tables.c. This updates comments referencing guc.c regarding variables
which are now in guc_tables.c.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6B50C70C-8C1F-4F9A-A7C0-EEAFCC032406@yesql.se
2023-03-02 13:49:39 +01:00
Michael Paquier b8da37b3ad Rework pg_input_error_message(), now renamed pg_input_error_info()
pg_input_error_info() is now a SQL function able to return a row with
more than just the error message generated for incorrect data type
inputs (when these are able to handle soft failures), exposing more
of the contents of ErrorData, namely:
- The error message (same as before).
- The error detail, if set.
- The error hint, if set.
- SQL error code.
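
For example:

    SELECT message, sql_error_code
    FROM pg_input_error_info('1234.567', 'numeric(7,4)');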

All the regression tests that relied on pg_input_error_message() are
updated to reflect the effects of the rename.

Per discussion with Tom Lane and Andrew Dunstan.

Author: Nathan Bossart
Discussion: https://postgr.es/m/139a68e1-bd1f-a9a7-b5fe-0be9845c6311@dunslane.net
2023-02-28 08:04:13 +09:00
Tom Lane 728560db7d Suppress compiler warnings in new pgstats code.
Some clang versions whine about comparing an enum variable to
a value outside the range of the enum, on the grounds that the
result must be constant.  In the cases we fix here, the loops
will terminate only if the enum variable can in fact hold a
value one beyond its declared range.  While that's very likely
to always be true for these enum types, it still seems like a
poor coding practice to assume it; so use "int" loop variables
instead to silence the warnings.  (This matches what we've done
in other places, for example loops over the range of ForkNumber.)

While at it, let's drop the XXX_FIRST macros for these enums and just
write zeroes for the loop start values.  The apparent flexibility
seems rather illusory given that iterating up to one-less-than-
the-number-of-values is only correct for a zero-based range.

Melanie Plageman

Discussion: https://postgr.es/m/20520.1677435600@sss.pgh.pa.us
2023-02-27 17:21:31 -05:00
Peter Eisentraut b9f0e54bc9 Update types in smgr API
Change data buffer to void *, from char *, and add const where
appropriate.  This makes it match the File API (see also
2d4f1ba6cf) and stdio.

Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
2023-02-27 07:47:46 +01:00
Amit Kapila a6cd1fc692 Change xl_hash_vacuum_one_page.ntuples from int to uint16.
This will create two bytes of padding space in xl_hash_vacuum_one_page,
which can be used for future patches.  This also makes the datatype of
xl_hash_vacuum_one_page.ntuples the same as gistxlogDelete.ntodelete,
which is advisable as both are used for the same purpose.

Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/b0e20c40-cb7a-fc1c-c607-2a78dac5021e@gmail.com
2023-02-27 08:32:45 +05:30
Tom Lane 87f3667ec0 Fix MULTIEXPR_SUBLINK with partitioned target tables, yet again.
We already tried to fix this in commits 3f7323cbb et al (and follow-on
fixes), but now it emerges that there are still unfixed cases;
moreover, these cases affect all branches, not only pre-v14.  I thought
we had eliminated all cases of making multiple clones of an UPDATE's
target list when we nuked inheritance_planner.  But it turns out we
still do that in some partitioned-UPDATE cases, notably including
INSERT ... ON CONFLICT UPDATE, because ExecInitPartitionInfo thinks
it's okay to clone and modify the parent's targetlist.

This fix is based on a suggestion from Andres Freund: let's stop
abusing the ParamExecData.execPlan mechanism, which was only ever
meant to handle initplans, and instead solve the execution timing
problem by having the expression compiler move MULTIEXPR_SUBLINK steps
to the front of their expression step lists.  This is feasible because
(a) all branches still in support compile the entire targetlist of
an UPDATE into a single ExprState, and (b) we know that all
MULTIEXPR_SUBLINKs do need to be evaluated --- none could be buried
inside a CASE, for example.  There is a minor semantics change
concerning the order of execution of the MULTIEXPR's subquery versus
other parts of the parent targetlist, but that seems like something
we can get away with.  By doing that, we no longer need to worry
about whether different clones of a MULTIEXPR_SUBLINK share output
Params; their usage of that data structure won't overlap.

Per bug #17800 from Alexander Lakhin.  Back-patch to all supported
branches.  In v13 and earlier, we can revert 3f7323cbb and follow-on
fixes; however, I chose to keep the SubPlan.subLinkId field added
in ccbb54c72.  We don't need that anymore in the core code, but it's
cheap enough to fill, and removing a plan node field in a minor
release seems like it'd be asking for trouble.

Andres Freund and Tom Lane

Discussion: https://postgr.es/m/17800-ff90866b3906c964@postgresql.org
2023-02-25 14:44:14 -05:00
Jeff Davis 6974a8f768 Refactor to introduce pg_locale_deterministic().
Avoids the need for callers to test for NULL, and also avoids the need
to access the pg_locale_t structure directly.
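
A rough before/after sketch of a caller (names invented, and the exact
signature of the new helper is assumed from this description):

    /* before: caller had to handle NULL (default collation) itself */
    if (mylocale && !mylocale->deterministic)
        handle_nondeterministic_collation();

    /* after: the helper hides both the NULL test and the struct access */
    if (!pg_locale_deterministic(mylocale))
        handle_nondeterministic_collation();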

Reviewed-by: Peter Eisentraut, Peter Geoghegan
Discussion: https://postgr.es/m/a581136455c940d7bd0ff482d3a2bd51af25a94f.camel%40j-davis.com
2023-02-23 11:17:41 -08:00
Jeff Davis d87d548cd0 Refactor to add pg_strcoll(), pg_strxfrm(), and variants.
Offers a generally better separation of responsibilities for collation
code. Also, a step towards multi-lib ICU, which should be based on a
clean separation of the routines required for collation providers.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().
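
As a sketch of the intended call pattern (the argument order is my
reading of the description above, not a quote of pg_locale.h):

    /* NUL-terminated C strings */
    cmp = pg_strcoll(str1, str2, mylocale);

    /* strings with explicit lengths, e.g. non-terminated text data */
    cmp = pg_strncoll(VARDATA_ANY(t1), VARSIZE_ANY_EXHDR(t1),
                      VARDATA_ANY(t2), VARSIZE_ANY_EXHDR(t2),
                      mylocale);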

Reviewed-by: Peter Eisentraut, Peter Geoghegan
Discussion: https://postgr.es/m/a581136455c940d7bd0ff482d3a2bd51af25a94f.camel%40j-davis.com
2023-02-23 10:55:20 -08:00
Tomas Vondra e9960732a9 Introduce a generic pg_dump compression API
Switch pg_dump to use the Compression API, implemented by bf9aa490db.

The CompressFileHandle replaces the cfp* family of functions with a
struct of callbacks for accessing (compressed) files.  This allows adding
new compression methods simply by introducing a new struct instance with
an appropriate implementation of the callbacks.

Archives compressed using custom compression methods store an identifier
of the compression algorithm in their header instead of the compression
level. The header version is bumped.

Author: Georgios Kokolatos
Reviewed-by: Michael Paquier, Rachel Heaton, Justin Pryzby, Tomas Vondra
Discussion: https://postgr.es/m/faUNEOpts9vunEaLnmxmG-DldLSg_ql137OC3JYDmgrOMHm1RvvWY2IdBkv_CRxm5spCCb_OmKNk2T03TMm0fBEWveFF9wA1WizPuAgB7Ss%3D%40protonmail.com
2023-02-23 18:33:40 +01:00
Tom Lane 739f1d6218 Fix mis-handling of outer join quals generated by EquivalenceClasses.
It's possible, in admittedly-rather-contrived cases, for an eclass
to generate a derived "join" qual that constrains the post-outer-join
value(s) of some RHS variable(s) without mentioning the LHS at all.
While the mechanisms were set up to work for this, we fell foul of
the "get_common_eclass_indexes" filter installed by commit 3373c7155:
it could decide that such an eclass wasn't relevant to the join, so
that the required qual clause wouldn't get emitted there or anywhere
else.

To fix, apply get_common_eclass_indexes only at inner joins, where
its rule is still valid.  At an outer join, fall back to examining all
eclasses that mention either input (or the OJ relid, though it should
be impossible for an eclass to mention that without mentioning either
input).  Perhaps we can improve on that later, but the cost/benefit of
adding more complexity to skip some irrelevant eclasses is dubious.

To allow cheaply distinguishing outer from inner joins, pass the
ojrelid to generate_join_implied_equalities as a separate argument.
This also allows cleaning up some sloppiness that had crept into
the definition of its join_relids argument, and it allows accurate
calculation of nominal_join_relids for a child outer join.  (The
latter oversight seems not to have been a live bug, but it certainly
could have caused problems in future.)

Also fix what might be a live bug in check_index_predicates: it was
being sloppy about what it passed to generate_join_implied_equalities.

Per report from Richard Guo.

Discussion: https://postgr.es/m/CAMbWs4-DsTBfOvXuw64GdFss2=M5cwtEhY=0DCS7t2gT7P6hSA@mail.gmail.com
2023-02-23 11:05:58 -05:00
Andres Freund 78be04e4c6 Add static assertion ensuring sizeof(ExprEvalStep) <= 64 bytes
This was previously only documented in a comment. Given the size of the
struct, it's not hard to miss that comment. As evidenced by the commits
leading up to fe3caa1439, 67b26703b4.
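
The assertion itself is presumably just a file-scope static assertion
along these lines (exact wording and placement not shown here):

    StaticAssertDecl(sizeof(ExprEvalStep) <= 64,
                     "size of ExprEvalStep exceeds 64 bytes");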

It's possible, but not likely, that we might have to weaken these assertions
on a less commonly used architecture.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/295606.1677101684@sss.pgh.pa.us
2023-02-22 14:30:39 -08:00
John Naylor 83a611a259 Remove newly added asserts from pg_bitutils.h
These were valuable during development, but are unlikely to tell us
anything going forward. This reverts 204b0cbec and adjusts the content
of 677319746 to more closely match the more-readable original style.

Per review from Tom Lane

Discussion: https://www.postgresql.org/message-id/3567481.1676906261%40sss.pgh.pa.us
2023-02-22 17:22:43 +07:00
Peter Eisentraut 2ddab010c2 Implement ANY_VALUE aggregate
SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an
implementation-dependent (i.e. non-deterministic) value from the
aggregated rows.

Author: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/5cff866c-10a8-d2df-32cb-e9072e6b04a2@postgresfriends.org
2023-02-22 09:33:07 +01:00
John Naylor 6773197464 Add MSVC support for pg_leftmost_one_pos32() and friends
To allow testing for general support for fast bitscan intrinsics,
add symbols HAVE_BITSCAN_REVERSE and HAVE_BITSCAN_FORWARD.

Also do related cleanup in AllocSetFreeIndex(): Previously, we
tested for HAVE__BUILTIN_CLZ and copied the relevant internals of
pg_leftmost_one_pos32(), with a special fallback that does less
work than the general fallback for that function. Now that we have
a more general test, we just call pg_leftmost_one_pos32() directly
for platforms with intrinsic support. On gcc at least, there is no
difference in the binary for non-assert builds.
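
For reference, the computation in AllocSetFreeIndex() now boils down to
something like this (a simplified sketch from memory, not the exact
code):

    /*
     * Compute the index of the smallest power-of-2 chunk size that can
     * hold "size" bytes.  pg_leftmost_one_pos32() returns the 0-based
     * position of the most significant set bit and must not be called
     * with a zero argument.
     */
    if (size > (1 << ALLOC_MINBITS))
        idx = pg_leftmost_one_pos32((uint32) size - 1) - ALLOC_MINBITS + 1;
    else
        idx = 0;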

Discussion: https://www.postgresql.org/message-id/CAFBsxsEPc%2BFnX_0vmmQ5DHv60sk4rL_RZJ%2BMD6ei%3D76L0kFMvA%40mail.gmail.com
2023-02-20 14:55:32 +07:00
John Naylor 204b0cbecb Add assert checking to pg_leftmost_one_pos32() and friends
Discussion: https://www.postgresql.org/message-id/CAFBsxsEPc%2BFnX_0vmmQ5DHv60sk4rL_RZJ%2BMD6ei%3D76L0kFMvA%40mail.gmail.com
2023-02-20 14:16:34 +07:00
David Rowley 2cb82e2acf Speedup and increase usability of set proc title functions
The setting of the process title could be seen on profiles of very
fast-to-execute queries.  In many locations where we call
set_ps_display() we pass along a string constant, the length of which is
known during compilation.  Here we effectively rename set_ps_display() to
set_ps_display_with_len() and then add a static inline function named
set_ps_display() which calls strlen() on the given string.  This allows
the compiler to optimize away the strlen() call when dealing with
call sites passing a string constant.  We can then also use memcpy()
instead of strlcpy() to copy the string into the destination buffer.
That's significantly faster than strlcpy's byte-at-a-time way of
copying.

Here we also take measures to improve some code which was adjusting the
process title to add a " waiting" suffix to it.  Call sites which require
this can now just call set_ps_display_suffix() to add or adjust the suffix
and call set_ps_display_remove_suffix() to remove it again.
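
At a call site the intended usage is roughly the following (whether the
suffix function adds the separating space itself is a detail I have not
verified):

    set_ps_display("SELECT");           /* strlen() of the constant folds away */

    set_ps_display_suffix("waiting");   /* title becomes "... SELECT waiting" */
    /* ... sleep on the lock ... */
    set_ps_display_remove_suffix();     /* back to "... SELECT" */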

Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAApHDvocBvvk-0gWNA2Gohe+sv9fMcv+fK_G+siBKJrgDG4O7g@mail.gmail.com
2023-02-20 16:18:27 +13:00
Michael Paquier 35739b87dc Redesign archive modules
A new callback named startup_cb, called shortly after a module is
loaded, is added.  This makes it possible to initialize any additional
state data required by a module.  This initial state data can be saved
in an ArchiveModuleState, which is now passed down to all the callbacks
that can be defined in a module.  With this design, it is possible to
have per-module state, opening the door to the support of more than one
archive module.

The initialization of the callbacks is changed so that
_PG_archive_module_init() no longer receives as input an
ArchiveModuleCallbacks that a module has to fill in with callback
definitions.  Instead, a module now needs to return a const
ArchiveModuleCallbacks.

All the structure and callback definitions of archive modules are moved
into their own header, named archive_module.h, from pgarch.h.
Command-based archiving follows the same line, with a new set of files
named shell_archive.{c,h}.
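
Under the new design a minimal module would look roughly like this (a
sketch only; the header path, the private_data field and the callback
signatures beyond what is described above are from my reading of the
existing API and may not match exactly):

    #include "postgres.h"
    #include "archive/archive_module.h"

    static void my_startup(ArchiveModuleState *state);
    static bool my_archive_file(ArchiveModuleState *state,
                                const char *file, const char *path);

    static const ArchiveModuleCallbacks my_callbacks = {
        .startup_cb = my_startup,
        .archive_file_cb = my_archive_file,
    };

    /* callbacks are now returned as a const struct, not filled in */
    const ArchiveModuleCallbacks *
    _PG_archive_module_init(void)
    {
        return &my_callbacks;
    }

    static void
    my_startup(ArchiveModuleState *state)
    {
        /* initialize per-module state, kept for the later callbacks */
        state->private_data = palloc0(sizeof(int));
    }

    static bool
    my_archive_file(ArchiveModuleState *state, const char *file,
                    const char *path)
    {
        /* copy "path" (the WAL segment) somewhere durable, then report OK */
        return true;
    }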

There are a few more items under discussion to improve the design of
archive modules, like the fact that basic_archive calls sigsetjmp() by
itself to define its own error handling flow.  These will be adjusted
later; the changes done here already cover a good portion of what has
been discussed.

Any modules created for v15 will need to be adjusted to this new
design.

Author: Nathan Bossart
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20230130194810.6fztfgbn32e7qarj@awork3.anarazel.de
2023-02-17 14:26:42 +09:00
Thomas Munro d2ea2d310d Remove obsolete platforms from ps_status.c.
Time to remove various code, comments and configure/meson probes
relating to ancient BSD, SunOS, GNU/Hurd, IRIX, NeXT and Unixware.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKGJMNGUAqf27WbckYFrM-Mavy0RKJvocfJU%3DJ2XcAZyv%2Bw%40mail.gmail.com
2023-02-17 15:18:18 +13:00
Amit Kapila fce003cfde Add a new wait state and use it when sending data in the apply worker.
d9d7fe68d3 made use of an existing wait event when sending data from the
apply worker, but we should have invented a new wait event since this is a
new place to wait.

This patch corrects the mistake by using a new wait event
"LogicalApplySendData".

Author: Hou Zhijie
Reviewed-by: Peter Smith
Discussion: https://postgr.es/m/CA+TgmobWzbr9H3yN3dLVckviEZKemPwd+XyCFKEgyZQZhgP66Q@mail.gmail.com
2023-02-16 07:46:31 +05:30
David Rowley 5352ca22e0 Rename force_parallel_mode to debug_parallel_query
force_parallel_mode is meant to be used to allow us to exercise the
parallel query infrastructure to ensure that it's working as we expect.
It seems some users think this GUC is for forcing the query planner into
picking a parallel plan regardless of the costs.  A quick look at the
documentation would have made them realize that they were wrong, but the
GUC is likely too conveniently named which, evidently, seems to often
result in users expecting that it forces the planner into usefully
parallelizing queries.

Here we rename the GUC to something which casual users are less likely to
mistakenly think is what they need to make their query run more quickly.

For now, the old name can still be used.  We'll revisit if the old name
mapping can be removed once the buildfarm configs are all updated.

Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvrsOi92_uA7PEaHZMH-S4Xv+MGhQWA+GrP8b1kjpS1HjQ@mail.gmail.com
2023-02-15 21:21:59 +13:00
Michael Paquier 9244c11afe Fix handling of SCRAM-SHA-256's channel binding with RSA-PSS certificates
OpenSSL 1.1.1 and newer versions have added support for RSA-PSS
certificates, which requires the use of a specific routine in OpenSSL to
determine which hash function to use when computing the channel binding
data in SCRAM-SHA-256.  X509_get_signature_nid(), which is the original
routine the channel binding code has relied on, is not able to determine
which hash algorithm to use for such certificates.  However,
X509_get_signature_info(), new to OpenSSL 1.1.1, is able to do it.  This
commit switches the channel binding logic to rely on
X509_get_signature_info() over X509_get_signature_nid(), which is the
choice when building with 1.1.1 or newer.
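
The shape of the resulting code is roughly the following (the configure
symbol name is an assumption; both OpenSSL routines do exist with these
signatures):

    int         algo_nid = NID_undef;

    #ifdef HAVE_X509_GET_SIGNATURE_INFO
        /* OpenSSL 1.1.1 and newer: also works for RSA-PSS certificates */
        if (!X509_get_signature_info(peer_cert, &algo_nid, NULL, NULL, NULL))
            algo_nid = NID_undef;   /* could not determine the digest */
    #else
        /* older OpenSSL: fails to resolve a digest for RSA-PSS certificates */
        algo_nid = X509_get_signature_nid(peer_cert);
    #endif

    /* algo_nid then drives the hash used for the channel binding data */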

The error could have been triggered on the client or the server, hence
libpq and the backend need to have their related code paths patched.
Note that attempting to load an RSA-PSS certificate with OpenSSL 1.1.0
or older leads to a failure due to an unsupported algorithm.

The discovery of relying on X509_get_signature_info() comes from Jacob,
the tests have been written by Heikki (with a few tweaks from me), and I
have bundled the whole together and added the bits needed for MSVC and
meson.

This issue has existed for as long as channel binding has, so backpatch
all the way down.  Some tests are added in 15~, triggered when compiling
with OpenSSL 1.1.1 or newer, where the certificate and key files can
easily be generated for RSA-PSS.

Reported-by: Gunnar "Nick" Bluth
Author: Jacob Champion, Heikki Linnakangas
Discussion: https://postgr.es/m/17760-b6c61e752ec07060@postgresql.org
Backpatch-through: 11
2023-02-15 10:12:16 +09:00
Peter Eisentraut 3b12e68a5c Change argument type of pq_sendbytes from char * to void *
This is a follow-up to 1f605b82ba.  It
allows getting rid of further casts at call sites.
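
The practical effect at call sites is simply dropping the casts, e.g.
(a contrived fragment):

    /* before: every caller had to cast its payload to char * */
    pq_sendbytes(&buf, (char *) &itemptr, sizeof(itemptr));

    /* after: const void * accepts any pointer type without a cast */
    pq_sendbytes(&buf, &itemptr, sizeof(itemptr));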

Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/783a4edb-84f9-6df2-7470-2ef5ccc6607a@enterprisedb.com
2023-02-14 13:32:19 +01:00
Tom Lane e9a20e451f When removing a relation from the query, drop its RelOptInfo.
In commit b78f6264e I opined that it was "too risky" to delete a
relation's RelOptInfo from the planner's data structures when we have
realized that we don't need to join to it; so instead we just marked
it as a dead relation.  In hindsight that judgment seems flawed: any
subsequent access to such a dead relation is arguably a bug in
itself, so leaving the RelOptInfo present just helps to mask bugs.
Let's delete it instead, allowing removal of the whole notion of a
"dead relation".  So far as the regression tests can find, this
requires no other code changes, except for one Assert in equivclass.c
that was very dubiously not complaining about access to a dead rel.

Discussion: https://postgr.es/m/229905.1676062220@sss.pgh.pa.us
2023-02-13 13:35:38 -05:00
Peter Eisentraut bd944884e9 Consolidate ItemPointer to Datum conversion functions
Instead of defining the same set of macros several times, define it
once in an appropriate header file.  In passing, convert to inline
functions.
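
The consolidated definitions are presumably along these lines (a sketch;
the const qualifier placement is a guess):

    static inline ItemPointer
    DatumGetItemPointer(Datum X)
    {
        return (ItemPointer) DatumGetPointer(X);
    }

    static inline Datum
    ItemPointerGetDatum(const ItemPointerData *X)
    {
        return PointerGetDatum(X);
    }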

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/844dd4c5-e5a1-3df1-bfaf-d1e1c2a16e45%40enterprisedb.com
2023-02-13 09:57:15 +01:00
Michael Paquier 2a507f6fd8 Mark more nodes with attribute no_query_jumble
This commit removes most of the Plan and Path nodes from the query
jumbling; these should never be included because we ignore them in Query
nodes.  This is facilitated by making no_query_jumble an inherited
attribute, like no_copy, no_equal and no_read, so that it applies when
the supertype of a node is marked with it.

RawStmt is not used in parsed queries, so it can be removed from the
query jumbling.  A couple of nodes defined in pathnodes.h, plannodes.h
and primnodes.h with NodeTag as supertype need to be marked
individually.

Forcing the execution of the query jumbling code with compute_query_id =
auto while pg_stat_statements is loaded brings the code coverage of
queryjumblefuncs.funcs.c to 95.6%.

The core code does not yet include a way to force the execution of the
query jumbling except through pg_stat_statements, so the numbers I am
mentioning above will not be reflected in the default coverage report
with just what is done in this commit.

Reported-by: Tom Lane
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3344827.1675809127@sss.pgh.pa.us
2023-02-13 09:07:33 +09:00
Andres Freund a9c70b46db Add pg_stat_io view, providing more detailed IO statistics
Builds on 28e626bde0 and f30d62c2fc. See the former for motivation.

Each row of the view shows IO operations for a particular combination of
backend type, IO target object and IO context (e.g. a client backend's
operations on permanent relations in shared buffers), and each column in the
view is the total number of IO operations done (e.g. writes). So a cell in
the view would be, for example, the number of blocks of relation data written
from shared buffers by client backends since the last stats reset.

In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "reads",
"writes", and "extends" columns for a given row.

Rows for combinations of IO operation, backend type, target object and context
that never occur are omitted entirely. For example, the checkpointer will never
operate on temporary relations.

Similarly, if an IO operation never occurs for such a combination, the IO
operation's cell will be null, to distinguish it from 0 observed IO
operations. For example, the bgwriter should not perform reads.

Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend). For now, these have been kept for
backwards compatibility.

Bumps catversion.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
2023-02-11 09:52:15 -08:00
Michael Paquier 9e8b694d81 Fix typo in parsenodes.h
Introduced in a61b1f7 when RTEPermissionInfo got added.  Issue spotted
while reviewing the area for a different patch.
2023-02-10 15:37:41 +09:00
Andres Freund f30d62c2fc pgstat: Track more detailed relation IO statistics
Commit 28e626bde0 introduced the infrastructure for tracking more detailed IO
statistics. This commit adds the actual collection of the new IO statistics
for relations and temporary relations. See aforementioned commit for goals and
high-level design.

The changes in this commit are fairly straightforward. The bulk of the change
is passing sufficient information to the callsites of pgstat_count_io_op().
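
The calls added by this commit are of this general shape (the argument
order below is an assumption, not a quote of the header):

    /* e.g. in bufmgr.c, after reading a block into a shared buffer */
    pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_READ);

    /* e.g. in md.c, after syncing a segment file */
    pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);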

A somewhat unsightly detail is that it currently is hard to find a better
place to count fsyncs than in md.c, whereas the other pgstat_count_io_op()
calls are in bufmgr.c/localbuf.c. As the number of fsyncs is tied to md.c
implementation details, it's not obvious there is a better answer.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
2023-02-09 22:22:26 -08:00
Michael Paquier ef7002dbe0 Fix various typos in code and tests
Most of these are recent, and the documentation portions are new as of
v16, so there is no need for a backpatch.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20230208155644.GM1653@telsasoft.com
2023-02-09 14:43:53 +09:00
Andres Freund 28e626bde0 pgstat: Infrastructure for more detailed IO statistics
This commit adds the infrastructure for more detailed IO statistics. The calls
to actually count IOs, a system view to access the new statistics,
documentation and tests will be added in subsequent commits, to make review
easier.

While we already had some IO statistics, e.g. in pg_stat_bgwriter and
pg_stat_database, they did not provide sufficient detail to understand what
the main sources of IO are, or whether configuration changes could avoid
IO. E.g., pg_stat_bgwriter.buffers_backend does contain the number of buffers
written out by a backend, but as that includes extending relations (always
done by backends) and writes triggered by the use of buffer access strategies,
it cannot easily be used to tune background writer or checkpointer. Similarly,
pg_stat_database.blks_read cannot easily be used to tune shared_buffers /
compute a cache hit ratio, as the use of buffer access strategies will often
prevent a large fraction of the read blocks from ending up in shared_buffers.

The new IO statistics count IO operations (evict, extend, fsync, read, reuse,
and write), and are aggregated for each combination of backend type (backend,
autovacuum worker, bgwriter, etc), target object of the IO (relations, temp
relations) and context of the IO (normal, vacuum, bulkread, bulkwrite).

What is tracked in this series of patches is sufficient to perform the
aforementioned analyses. Further details, e.g. tracking the number of buffer
hits, would make that even easier, but were left out for now, to keep the scope
of the already large patchset manageable.

Bumps PGSTAT_FILE_FORMAT_ID.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
2023-02-08 20:53:42 -08:00
David Rowley 9ed50ab349 Remove stray duplicated comment in heapam.h
This is just the same as what's written under the rs_numblocks field.

Reported-by: Melanie Plageman
Discussion: https://postgr.es/m/20230207204127.7vs6krqjqn5farr7@liskov
2023-02-08 16:03:26 +13:00
Amit Kapila 8c58624df4 Fix the logical replication timeout during large DDLs.
DDLs like REFRESH MATERIALIZED VIEW that generate lots of temporary
data due to rewrite rules may not be processed by output plugins (for
example pgoutput). So, we won't send keep-alive messages for a long time
while processing such commands, and that can lead the subscriber side to
time out. We previously fixed a similar case for large transactions in
commit f95d53eded, where the output plugin filters all or most of the
changes, but missed handling the DDLs.

We decided not to backpatch this, as it adds a new callback to an
existing exposed structure; moreover, users can increase
wal_sender_timeout and wal_receiver_timeout to avoid this problem.

Author: Wang wei, Hou Zhijie
Reviewed-by: Peter Smith, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/OS3PR01MB6275478E5D29E4A563302D3D9E2B9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAA5-nLARN7-3SLU_QUxfy510pmrYK6JJb=bk3hcgemAM_pAv+w@mail.gmail.com
2023-02-08 07:58:25 +05:30
Michael Paquier 9ba37b2cb6 Include values of A_Const nodes in query jumbling
Like the implementation for node copy, write and read, this node
requires a custom implementation so that the query jumbling is able to
consider the correct value assigned to it, depending on its type (int,
float, bool, string, bitstring).

Based on a dump of pg_stat_statements from the regression database, this
would confuse the query jumbling of the following queries:
- SET.
- COPY TO with SELECT queries.
- START TRANSACTION with different isolation levels.
- ALTER TABLE with default expressions.
- CREATE TABLE with partition bounds.

Note that there may be a long-term argument for tracking the location of
such nodes so that query strings holding them could be normalized,
but this is left as a separate discussion.

Oversight in 3db72eb.

Discussion: https://postgr.es/m/Y9+HuYslMAP6yyPb@paquier.xyz
2023-02-07 09:03:54 +09:00
Robert Haas 8a2f783cc4 Disable STARTUP_PROGRESS_TIMEOUT in standby mode.
In standby mode, we don't actually report progress of recovery,
but up until now, startup_progress_timeout_handler() nevertheless
got called every log_startup_progress_interval seconds. That's
an unnecessary expense, so avoid it.

Report by Thomas Munro. Patch by Bharath Rupireddy, reviewed by
Simon Riggs, Thomas Munro, and me. Back-patch to v15, where
the problem was introduced.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
2023-02-06 10:51:08 -05:00
Michael Paquier 2f6e15ac93 Revert refactoring of restore command code to shell_restore.c
This reverts commits 24c35ec and 57169ad.  PreRestoreCommand() and
PostRestoreCommand() need to be put closer to the system() call that runs
a restore_command, as they enable in_restore_command for the startup
process, which in turn triggers an immediate proc_exit() in the SIGTERM
handler.  Perhaps we could get rid of this behavior entirely, but 24c35ec
has made the window where the flag is enabled much larger than it was,
and any Postgres-like actions (palloc, etc.) taken by code paths while
the flag is enabled could lead to more severe issues in the shutdown
processing.

Note that curculio has shown that there are many more problems in this
area, actually unrelated to this change, hence those issues had better
be addressed first.  Keeping the code of HEAD in line with the stable
branches should make that a bit easier.

Per discussion with Andres Freund and Nathan Bossart.

Discussion: https://postgr.es/m/Y979NR3U5VnWrTwB@paquier.xyz
2023-02-06 08:28:42 +09:00
David Rowley f9bc34fcb6 Further refactor of heapgettup and heapgettup_pagemode
Backward and forward scans share much of the same page acquisition code.
Here we consolidate that code to reduce some duplication.

Additionally, add a new rs_coffset field to HeapScanDescData to track the
offset of the current tuple.  The new field fits nicely into the padding
between a bool and BlockNumber field and saves having to look at the last
returned tuple to figure out which offset we should be looking at for the
current tuple.

Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-03 11:48:39 +13:00
Thomas Munro cdf6518ef0 Retire PG_SETMASK() macro.
In the 90s we needed to deal with computers that still had the
pre-standard signal masking APIs.  That hasn't been relevant for a very
long time on Unix systems, and c94ae9d8 got rid of a remaining
dependency in our Windows porting code.  PG_SETMASK didn't expose
save/restore functionality, so we'd already started using sigprocmask()
directly in places, creating the visual distraction of having two ways
to spell it.  It's not part of the API that extensions are expected to
be using (but if they are, the change will be trivial).  It seems like a
good time to drop the old macro and just call the standard POSIX
function.
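
The mechanical change for any extension still using the macro is simply:

    /* before */
    PG_SETMASK(&BlockSig);
    /* ... critical section ... */
    PG_SETMASK(&UnBlockSig);

    /* after: plain POSIX, which also offers save/restore via the 3rd arg */
    sigprocmask(SIG_SETMASK, &BlockSig, NULL);
    /* ... critical section ... */
    sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);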

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BKfQgrhHP2DLTohX1WwubaCBHmTzGnAEDPZ-Gug-Xskg%40mail.gmail.com
2023-02-03 11:29:46 +13:00
David Rowley e9aaf06328 Remove dead NoMovementScanDirection code
Here we remove some dead code from heapgettup() and heapgettup_pagemode()
which was trying to support NoMovementScanDirection scans.  This code can
never be reached as standard_ExecutorRun() never calls ExecutePlan with
NoMovementScanDirection.

Additionally, plans which were scanning an unordered index would use
NoMovementScanDirection rather than ForwardScanDirection.  There was no
real need for this, so here we adjust this so we use ForwardScanDirection
for unordered index scans.  A comment in pathnodes.h claimed that
NoMovementScanDirection was used for PathKey reasons, but if that was
true, it no longer is, per code in build_index_paths().

This does change the non-text format of the EXPLAIN output so that
unordered index scans now have a "Forward" scan direction rather than
"NoMovement".  The text format of EXPLAIN has not changed.

Author: Melanie Plageman
Reviewed-by: Tom Lane, David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
2023-02-01 10:52:41 +13:00
Michael Paquier 3db72ebcbe Generate code for query jumbling through gen_node_support.pl
This commit changes the query jumbling code in queryjumblefuncs.c to be
generated automatically based on the information of the nodes in the
headers of src/include/nodes/ by using gen_node_support.pl.  This
approach offers many advantages:
- Support for query jumbling for all the utility statements, based on the
state of their parsed Nodes and not only their query string.  This will
greatly ease the switch to normalize the information of some DDLs, like
SET or CALL for example (this is left unchanged and should be part of a
separate discussion).  With this feature, the number of entries stored
for utilities in pg_stat_statements is reduced (for example now
"CHECKPOINT" and "checkpoint" mean the same thing with the same query
ID).
- Documentation of query jumbling directly in the structure definition
of the nodes.  Since this code was introduced in pg_stat_statements
and then moved to core, the reasons behind the choices of what should be
included in the jumble are rather sparsely documented.  Note that some
explanation is added for the most relevant parts, as a start.
- Overall code reduction and more consistency with the other parts
generating read, write and copy depending on the nodes.

The query jumbling is controlled by a set of new node attributes,
documented in nodes/nodes.h (see the sketch after this list):
- custom_query_jumble, to mark a Node as having a custom
implementation.
- no_query_jumble, to ignore entirely a Node.
- query_jumble_ignore, to ignore a field in a Node.
- query_jumble_location, to mark a location in a Node, for
normalization.  This can apply only to int fields, with "location" in
their name (only Const as of this commit).
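
As a hand-written illustration (a hypothetical node, not a quote of the
real headers), marking a node and its fields looks like this:

    typedef struct MyStmt
    {
        pg_node_attr(custom_query_jumble)       /* custom jumble function */

        NodeTag     type;
        char       *name pg_node_attr(query_jumble_ignore);   /* not jumbled */
        int         location pg_node_attr(query_jumble_location); /* normalized */
    } MyStmt;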

There should be no compatibility impact on pg_stat_statements, as the
new code applies the jumbling to the same fields for each node (its
regression tests needed no modification, for one).

Some benchmarking of the query jumbling between HEAD and this commit for
SELECT and DMLs has shown that this new code does not cause a
performance regression, with computation times close for both methods.
For utility queries, the new method is slower than the previous method
of calculating a hash of the query string, though we are talking about
extra ns-level overhead based on what I measured, which is unnoticeable
even for OLTP workloads as a query ID is calculated once per query at
post-parse analysis.

Author: Michael Paquier
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/Y5BHOUhX3zTH/ig6@paquier.xyz
2023-01-31 15:24:05 +09:00
Tom Lane 3bef56e116 Invent "join domains" to replace the below_outer_join hack.
EquivalenceClasses are now understood as applying within a "join
domain", which is a set of inner-joined relations (possibly underneath
an outer join).  We no longer need to treat an EC from below an outer
join as a second-class citizen.

I have hopes of eventually being able to treat outer-join clauses via
EquivalenceClasses, by means of only applying deductions within the
EC's join domain.  There are still problems in the way of that, though,
so for now the reconsider_outer_join_clause logic is still here.

I haven't been able to get rid of RestrictInfo.is_pushed_down either,
but I wonder if that could be recast using JoinDomains.

I had to hack one test case in postgres_fdw.sql to make it still test
what it was meant to, because postgres_fdw is inconsistent about
how it deals with quals containing non-shippable expressions; see
https://postgr.es/m/1691374.1671659838@sss.pgh.pa.us.  That should
be improved, but I don't think it's within the scope of this patch
series.

Patch by me; thanks to Richard Guo for review.

Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 13:50:25 -05:00
Tom Lane b448f1c8d8 Do assorted mop-up in the planner.
Remove RestrictInfo.nullable_relids, along with a good deal of
infrastructure that calculated it.  One use-case for it was in
join_clause_is_movable_to, but we can now replace that usage with
a check to see if the clause's relids include any outer join
that can null the target relation.  The other use-case was in
join_clause_is_movable_into, but that test can just be dropped
entirely now that the clause's relids include outer joins.
Furthermore, join_clause_is_movable_into should now be
accurate enough that it will accept anything returned by
generate_join_implied_equalities, so we can restore the Assert
that was diked out in commit 95f4e59c3.

Remove the outerjoin_delayed mechanism.  We needed this before to
prevent quals from getting evaluated below outer joins that should
null some of their vars.  Now that we consider varnullingrels while
placing quals, that's taken care of automatically, so throw the
whole thing away.

Teach remove_useless_result_rtes to also remove useless FromExprs.
Having done that, the delay_upper_joins flag serves no purpose any
more and we can remove it, largely reverting 11086f2f2.

Use constant TRUE for "dummy" clauses when throwing back outer joins.
This improves on a hack I introduced in commit 6a6522529.  If we
have a left-join clause l.x = r.y, and a WHERE clause l.x = constant,
we generate r.y = constant and then don't really have a need for the
join clause.  But we must throw the join clause back anyway after
marking it redundant, so that the join search heuristics won't think
this is a clauseless join and avoid it.  That was a kluge introduced
under time pressure, and after looking at it I thought of a better
way: let's just introduce constant-TRUE "join clauses" instead,
and get rid of them at the end.  This improves the generated plans for
such cases by not having to test a redundant join clause.  We can also
get rid of the ugly hack used to mark such clauses as redundant for
selectivity estimation.

Patch by me; thanks to Richard Guo for review.

Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 13:44:36 -05:00
Tom Lane 2489d76c49 Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees.  This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.

To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle.  PlaceHolderVars
receive similar decoration for the same reason.
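
Concretely, the decoration is a new bitmapset field on Var, shown here in
abbreviated form (surrounding fields elided; see primnodes.h for the real
definition):

    typedef struct Var
    {
        Expr        xpr;
        int         varno;      /* index of this var's rel in rangetable */
        AttrNumber  varattno;   /* attribute number of this var */
        ...
        /* RT indexes of outer joins that can null this Var's value */
        Bitmapset  *varnullingrels;
        ...
    } Var;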

In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos.  This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.

This change affects FDWs that want to plan foreign joins.  They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.

Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup.  (The README additions
and comments mention some stuff that will appear in the follow-up.)

Patch by me; thanks to Richard Guo for review.

Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 13:16:20 -05:00
Amit Kapila 1e8b61735c Rename GUC logical_decoding_mode to logical_replication_mode.
Rename the developer option 'logical_decoding_mode' to the more flexible
name 'logical_replication_mode' because doing so will make it easier to
extend this option in the future to help test other areas of logical
replication.

Currently, it is used on the publisher side to allow streaming or
serializing each change in logical decoding. In the upcoming patch, we are
planning to use it on the subscriber. On the subscriber, it will allow
serializing the changes to a file and notifying the parallel apply workers
to read and apply them at the end of the transaction.

We discussed exposing this parameter as a subscription option but
it did not seem advisable since it is primarily used for testing/debugging
and there is no other such parameter. We also discussed having separate
GUCs for publisher and subscriber but for current testing/debugging
requirements, one GUC is sufficient.

Author: Hou Zhijie
Reviewed-by: Peter Smith, Kuroda Hayato, Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoAy2c=Mx=FTCs+EwUsf2kQL5MmU3N18X84k0EmCXntK4g@mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-30 08:02:08 +05:30