Commit Graph

282 Commits

Author SHA1 Message Date
Heikki Linnakangas b8bff07daa Make ResourceOwners more easily extensible.
Instead of having a separate array/hash for each resource kind, use a
single array and hash to hold all kinds of resources. This makes it
possible to introduce new resource "kinds" without having to modify
the ResourceOwnerData struct. In particular, this makes it possible
for extensions to register custom resource kinds.

The old approach was to have a small array of resources of each kind,
and if it fills up, switch to a hash table. The new approach also uses
an array and a hash, but now the array and the hash are used at the
same time. The array is used to hold the recently added resources, and
when it fills up, they are moved to the hash. This keeps the access to
recent entries fast, even when there are a lot of long-held resources.

All the resource-specific ResourceOwnerEnlarge*(),
ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have
been replaced with three generic functions that take resource kind as
argument. For convenience, we still define resource-specific wrapper
macros around the generic functions with the old names, but they are
now defined in the source files that use those resource kinds.

The release callback no longer needs to call ResourceOwnerForget on
the resource being released. ResourceOwnerRelease unregisters the
resource from the owner before calling the callback. That needed some
changes in bufmgr.c and some other files, where releasing the
resources previously always called ResourceOwnerForget.

Each resource kind specifies a release priority, and
ResourceOwnerReleaseAll releases the resources in priority order. To
make that possible, we have to restrict what you can do between
phases. After calling ResourceOwnerRelease(), you are no longer
allowed to remember any more resources in it or to forget any
previously remembered resources by calling ResourceOwnerForget.  There
was one case where that was done previously. At subtransaction commit,
AtEOSubXact_Inval() would handle the invalidation messages and call
RelationFlushRelation(), which temporarily increased the reference
count on the relation being flushed. We now switch to the parent
subtransaction's resource owner before calling AtEOSubXact_Inval(), so
that there is a valid ResourceOwner to temporarily hold that relcache
reference.

Other end-of-xact routines make similar calls to AtEOXact_Inval()
between release phases, but I didn't see any regression test failures
from those, so I'm not sure if they could reach a codepath that needs
remembering extra resources.

There were two exceptions to how the resource leak WARNINGs on commit
were printed previously: llvmjit silently released the context without
printing the warning, and a leaked buffer io triggered a PANIC. Now
everything prints a WARNING, including those cases.

Add tests in src/test/modules/test_resowner.

Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
Reviewed-by: Peter Eisentraut, Andres Freund
Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
2023-11-08 13:30:50 +02:00
Amit Kapila 29d0a77fa6 Migrate logical slots to the new node during an upgrade.
While reading information from the old cluster, a list of logical
slots is fetched. At the later part of upgrading, pg_upgrade revisits the
list and restores slots by executing pg_create_logical_replication_slot()
on the new cluster. Migration of logical replication slots is only
supported when the old cluster is version 17.0 or later.

If the old node has invalid slots or slots with unconsumed WAL records,
the pg_upgrade fails. These checks are needed to prevent data loss.

The significant advantage of this commit is that it makes it easy to
continue logical replication even after upgrading the publisher node.
Previously, pg_upgrade allowed copying publications to a new node. With
this patch, adjusting the connection string to the new publisher will
cause the apply worker on the subscriber to connect to the new publisher
automatically. This enables seamless continuation of logical replication,
even after an upgrade.

Author: Hayato Kuroda, Hou Zhijie
Reviewed-by: Peter Smith, Bharath Rupireddy, Dilip Kumar, Vignesh C, Shlok Kyal
Discussion: http://postgr.es/m/TYAPR01MB58664C81887B3AF2EB6B16E3F5939@TYAPR01MB5866.jpnprd01.prod.outlook.com
Discussion: http://postgr.es/m/CAA4eK1+t7xYcfa0rEQw839=b2MzsfvYDPz3xbD+ZqOdP3zpKYg@mail.gmail.com
2023-10-26 07:06:55 +05:30
Alexander Korotkov d3d55ce571 Remove useless self-joins
The Self Join Elimination (SJE) feature removes an inner join of a plain table
to itself in the query tree if is proved that the join can be replaced with
a scan without impacting the query result.  Self join and inner relation are
replaced with the outer in query, equivalence classes, and planner info
structures. Also, inner restrictlist moves to the outer one with removing
duplicated clauses. Thus, this optimization reduces the length of the range
table list (this especially makes sense for partitioned relations), reduces
the number of restriction clauses === selectivity estimations, and potentially
can improve total planner prediction for the query.

The SJE proof is based on innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
 1. At most one inner row matches the join clause.
 2. Each matched inner row must be (physically) the same row as the outer one.

In this patch we use the next approach to identify a self-join:
 1. Collect all merge-joinable join quals which look like a.x = b.x
 2. Add to the list above the baseretrictinfo of the inner table.
 3. Check innerrel_is_unique() for the qual list.  If it returns false, skip
    this pair of joining tables.
 4. Check uniqueness, proved by the baserestrictinfo clauses. To prove
    the possibility of self-join elimination inner and outer clauses must have
    an exact match.

The relation replacement procedure is not trivial and it is partly combined
with the one, used to remove useless left joins.  Tests, covering this feature,
were added to join.sql.  Some regression tests changed due to self-join removal
logic.

Discussion: https://postgr.es/m/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru
Author: Andrey Lepikhov, Alexander Kuzmenkov
Reviewed-by: Tom Lane, Robert Haas, Andres Freund, Simon Riggs, Jonathan S. Katz
Reviewed-by: David Rowley, Thomas Munro, Konstantin Knizhnik, Heikki Linnakangas
Reviewed-by: Hywel Carver, Laurenz Albe, Ronan Dunklau, vignesh C, Zhihong Yu
Reviewed-by: Greg Stark, Jaime Casanova, Michał Kłeczek, Alena Rybakina
Reviewed-by: Alexander Korotkov
2023-10-25 12:59:16 +03:00
Robert Haas afd12774ae During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.
This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream. There are other possible applications as
well; see the discussion links below.

Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.

While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.

Bump XLOG_PAGE_MAGIC.

Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de
Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
2023-10-19 14:47:29 -04:00
Nathan Bossart 8d140c5822 Improve the naming in wal_sync_method code.
* sync_method is renamed to wal_sync_method.

* sync_method_options[] is renamed to wal_sync_method_options[].

* assign_xlog_sync_method() is renamed to assign_wal_sync_method().

* The names of the available synchronization methods are now
  prefixed with "WAL_SYNC_METHOD_" and have been moved into a
  WalSyncMethod enum.

* PLATFORM_DEFAULT_SYNC_METHOD is renamed to
  PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is
  renamed to DEFAULT_WAL_SYNC_METHOD.

These more descriptive names help distinguish the code for
wal_sync_method from the code for DataDirSyncMethod (e.g., the
recovery_init_sync_method configuration parameter and the
--sync-method option provided by several frontend utilities).  This
change also prevents name collisions between the aforementioned
sets of code.  Since this only improves the naming of internal
identifiers, there should be no behavior change.

Author: Maxim Orlov
Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
2023-10-13 15:16:45 -05:00
Robert Haas 4e9fc3a976 Return data from heap_page_prune via a struct.
Previously, one of the values in the struct was returned as the return
value, and another was returned via an output parameter. In
preparation for returning more stuff, consolidate both values into a
struct returned via an output parameter.

Melanie Plageman, reviewed by Andres Freund and by me.

Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
2023-09-28 10:36:34 -04:00
Nathan Bossart 5af0263afd Make binaryheap available to frontend code.
There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
2023-09-18 12:18:33 -07:00
Thomas Munro 3acd0599bd Fix exception safety bug in typcache.c.
If an out-of-memory error was thrown at an unfortunate time,
ensure_record_cache_typmod_slot_exists() could leak memory and leave
behind a global state that produced an infinite loop on the next call.

Fix by merging RecordCacheArray and RecordIdentifierArray into a single
array.  With only one allocation or re-allocation, there is no
intermediate state.

Back-patch to all supported releases.

Reported-by: "James Pang (chaolpan)" <chaolpan@cisco.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/PH0PR11MB519113E738814BDDA702EDADD6EFA%40PH0PR11MB5191.namprd11.prod.outlook.com
2023-09-13 14:58:22 +12:00
Nathan Bossart 3ed1956719 Make enum for sync methods available to frontend code.
This commit renames RecoveryInitSyncMethod to DataDirSyncMethod and
moves it to common/file_utils.h.  This is preparatory work for a
follow-up commit that will allow specifying the synchronization
method in frontend utilities such as pg_upgrade and pg_basebackup.

Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/ZN2ZB4afQ2JbR9TA%40paquier.xyz
2023-09-06 16:26:39 -07:00
Thomas Munro f691f5b80a Remove the "snapshot too old" feature.
Remove the old_snapshot_threshold setting and mechanism for producing
the error "snapshot too old", originally added by commit 848ef42b.
Unfortunately it had a number of known problems in terms of correctness
and performance, mostly reported by Andres in the course of his work on
snapshot scalability.  We agreed to remove it, after a long period
without an active plan to fix it.

This is certainly a desirable feature, and someone might propose a new
or improved implementation in the future.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com
Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de
Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
2023-09-05 19:53:43 +12:00
Thomas Munro 7114791158 ExtendBufferedWhat -> BufferManagerRelation.
Commit 31966b15 invented a way for functions dealing with relation
extension to accept a Relation in online code and an SMgrRelation in
recovery code.  It seems highly likely that future bufmgr.c interfaces
will face the same problem, and need to do something similar.
Generalize the names so that each interface doesn't have to re-invent
the wheel.

Back-patch to 16.  Since extension AM authors might start using the
constructor macros once 16 ships, we agreed to do the rename in 16
rather than waiting for 17.

Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
2023-08-23 12:31:23 +12:00
Andrew Dunstan a684581085 Cache by-reference missing values in a long lived context
Attribute missing values might be needed past the lifetime of the tuple
descriptors from which they are extracted. To avoid possibly using
pointers for by-reference values which might thus be left dangling, we
cache a datumCopy'd version of the datum in the TopMemoryContext. Since
we first search for the value this only needs to be done once per
session for any such value.

Original complaint from Tom Lane, idea for mitigation by Andrew Dunstan,
tweaked by Tom Lane.

Backpatch to version 11 where missing values were introduced.

Discussion: https://postgr.es/m/1306569.1687978174@sss.pgh.pa.us
2023-08-22 15:17:05 -04:00
Michael Paquier af720b4c50 Change custom wait events to use dynamic shared hash tables
Currently, the names of the custom wait event must be registered for
each backend, requiring all these to link to the shared memory area of
an extension, even if these are not loaded with
shared_preload_libraries.

This patch relaxes the constraints related to this infrastructure by
storing the wait events and their names in two dynamic hash tables in
shared memory.  This has the advantage to simplify the registration of
custom wait events to a single routine call that returns an event ID
ready for consumption:
uint32 WaitEventExtensionNew(const char *wait_event_name);

The caller of this routine can then cache locally the ID returned, to be
used for pgstat_report_wait_start(), WaitLatch() or a similar routine.

The implementation uses two hash tables: one with a key based on the
event name to avoid duplicates and a second using the event ID as key
for event lookups, like on pg_stat_activity.  These tables can hold a
minimum of 16 entries, and a maximum of 128 entries, which should be plenty
enough.

The code changes done in worker_spi show how things are simplified (most
of the code removed in this commit comes from there):
- worker_spi_init() is gone.
- No more shared memory hooks required (size requested and
initialization).
- The custom wait event ID is cached in the process that needs to set
it, with one single call to WaitEventExtensionNew() to retrieve it.

Per suggestion from Andres Freund.

Author: Masahiro Ikeda, with a few tweaks from me.
Discussion: https://postgr.es/m/20230801032349.aaiuvhtrcvvcwzcx@awork3.anarazel.de
2023-08-14 14:47:27 +09:00
Amit Kapila 2a8b40e368 Simplify determining logical replication worker types.
We deduce a LogicalRepWorker's type from the values of several different
fields ('relid' and 'leader_pid') whenever logic needs to know it.

In fact, the logical replication worker type is already known at the time
of launching the LogicalRepWorker and it never changes for the lifetime of
that process. Instead of deducing the type, it is simpler to just store it
one time, and access it directly thereafter.

Author: Peter Smith
Reviewed-by: Amit Kapila, Bharath Rupireddy
Discussion: http://postgr.es/m/CAHut+PttPSuP0yoZ=9zLDXKqTJ=d0bhxwKaEaNcaym1XqcvDEg@mail.gmail.com
2023-08-14 08:38:03 +05:30
Noah Misch d3a38318ac Rename OverrideSearchPath to SearchPathMatcher.
The previous commit removed the "override" APIs.  Surviving APIs facilitate
plancache.c to snapshot search_path and test whether the current value equals
a remembered snapshot.

Aleksander Alekseev.  Reported by Alexander Lakhin and Noah Misch.

Discussion: https://postgr.es/m/8ffb4650-52c4-6a81-38fc-8f99be981130@gmail.com
2023-07-31 17:04:47 -07:00
Noah Misch 7c5c4e1c03 Remove PushOverrideSearchPath() and PopOverrideSearchPath().
Since commit 681d9e4621, they have no in-tree
calls.  Any new calls would introduce security vulnerabilities like the one
fixed in that commit.

Alexander Lakhin, reviewed by Aleksander Alekseev.

Discussion: https://postgr.es/m/8ffb4650-52c4-6a81-38fc-8f99be981130@gmail.com
2023-07-31 17:04:47 -07:00
Michael Paquier c9af054653 Support custom wait events for wait event type "Extension"
Two backend routines are added to allow extension to allocate and define
custom wait events, all of these being allocated in the type
"Extension":
* WaitEventExtensionNew(), that allocates a wait event ID computed from
a counter in shared memory.
* WaitEventExtensionRegisterName(), to associate a custom string to the
wait event ID allocated.

Note that this includes an example of how to use this new facility in
worker_spi with tests in TAP for various scenarios, and some
documentation about how to use them.

Any code in the tree that currently uses WAIT_EVENT_EXTENSION could
switch to this new facility to define custom wait events.  This is left
as work for future patches.

Author: Masahiro Ikeda
Reviewed-by: Andres Freund, Michael Paquier, Tristan Partin, Bharath
Rupireddy
Discussion: https://postgr.es/m/b9f5411acda0cf15c8fbb767702ff43e@oss.nttdata.com
2023-07-31 17:09:24 +09:00
Amit Langote 03734a7fed Add more SQL/JSON constructor functions
This Patch introduces three SQL standard JSON functions:

JSON()
JSON_SCALAR()
JSON_SERIALIZE()

JSON() produces json values from text, bytea, json or jsonb values,
and has facilitites for handling duplicate keys.

JSON_SCALAR() produces a json value from any scalar sql value,
including json and jsonb.

JSON_SERIALIZE() produces text or bytea from input which containis
or represents json or jsonb;

For the most part these functions don't add any significant new
capabilities, but they will be of use to users wanting standard
compliant JSON handling.

Catversion bumped as this changes ruleutils.c.

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby, Álvaro Herrera,
Peter Eisentraut

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
2023-07-26 17:08:33 +09:00
Amit Langote 3c152a27b0 Unify JSON categorize type API and export for external use
This essentially removes the JsonbTypeCategory enum and
jsonb_categorize_type() and integrates any jsonb-specific logic that
was in jsonb_categorize_type() into json_categorize_type(), now
moved to jsonfuncs.c.  The remaining JsonTypeCategory enum and
json_categorize_type() cover the needs of the callers in both json.c
and jsonb.c.  json_categorize_type() has grown a new parameter named
is_jsonb for callers to engage the jsonb-specific behavior of
json_categorize_type().

One notable change in the now exported API of json_categorize_type()
is that it now always returns *outfuncoid even though a caller may
have no need currently to see one.

This is in preparation of later commits to implement additional
SQL/JSON functions.

Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
2023-07-20 16:19:56 +09:00
Michael Paquier 47556a0013 Fix pg_recvlogical upon signal termination
When pg_recvlogical needs to abort on a signal like SIGINT/SIGTERM, it
is expected to exit cleanly as the code documents.  However, the code
forgot to clean up the state of the connection before leaving.  This
would cause the tool to emit messages like "unexpected termination of
replication stream" error, which is meant for really unexpected
termination or a crash.

The code is refactored to apply the same termination abort operations for
signals, end LSN and keepalive cases, registering a "reason" for the
termination with a message printed under --verbose adapted to the reason
used.

This is arguably a bug, but this has been this way since the tool exists
and the signal termination can now become slower depending on the change
being decoded when the signal is received.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Andres Freund, Kyotaro Horiguchi, Cary Huang, Michael
Paquier
Discussion: https://postgr.es/m/20221019213953.htdtzikf4f45ywil@awork3.anarazel.de
2023-07-20 10:22:46 +09:00
Michael Paquier 2aeaf80e57 Refactor some code related to wait events "BufferPin" and "Extension"
The following changes are done:
- Addition of WaitEventBufferPin and WaitEventExtension, that hold a
list of wait events related to each category.
- Addition of two functions that encapsulate the list of wait events for
each category.
- Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an
underscore, requiring a specific rule in the automation script).

These changes make a bit easier the automatic generation of all the code
and documentation related to wait events, as all the wait event
categories are now controlled by consistent structures and functions.

Author: Bertrand Drouvot
Discussion: https://postgr.es/m/c6f35117-4b20-4c78-1df5-d3056010dcf5@gmail.com
Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com
2023-07-03 11:01:02 +09:00
Tom Lane 0245f8db36 Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical.  We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop).  We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up.  Going
forward, that should make for fewer random-seeming changes to existing
code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-19 17:24:48 -04:00
Michael Paquier d8c3106bb6 Add back SQLValueFunction for SQL keywords
This is equivalent to a revert of f193883 and fb32748, with the addition
that the declaration of the SQLValueFunction node needs to gain a couple
of node_attr for query jumbling.  The performance impact of removing the
function call inlining is proving to be too huge for some workloads
where these are used.  A worst-case test case of involving only simple
SELECT queries with a SQL keyword is proving to lead to a reduction of
10% in TPS via pgbench and prepared queries on a high-end machine.

None of the tests I ran back for this set of changes saw such a huge
gap, but Alexander Lakhin and Andres Freund have found that this can be
noticeable.  Keeping the older performance would mean to do more
inlining in the executor when using COERCE_SQL_SYNTAX for a function
expression, similarly to what SQLValueFunction does.  This requires more
redesign work and there is little time until 16beta1 is released, so for
now reverting the change is the best way forward, bringing back the
previous performance.

Bump catalog version.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/b32bed1b-0746-9b20-1472-4bdc9ca66d52@gmail.com
2023-05-17 10:19:17 +09:00
Andres Freund 15f8203a59 Replace replication slot's invalidated_at LSN with an enum
This is mainly useful because the upcoming logical-decoding-on-standby feature
adds further reasons for invalidating slots, and we don't want to end up with
multiple invalidated_* fields, or check different attributes.

Eventually we should consider not resetting restart_lsn when invalidating a
slot due to max_slot_wal_keep_size. But that's a user visible change, so left
for later.

Increases SLOT_VERSION, due to the changed field (with a different alignment,
no less).

Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
2023-04-07 21:47:25 -07:00
Thomas Munro faeedbcefd Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a
later commit, we need the addresses of user space buffers to be well
aligned.  The exact requirements vary by OS and file system (typically
sectors and/or memory pages).  The address alignment size is set to
4096, which is enough for currently known systems: it matches modern
sectors and common memory page size.  There is no standard governing
O_DIRECT's requirements so we might eventually have to reconsider this
with more information from the field or future systems.

Aligning I/O buffers on memory pages is also known to improve regular
buffered I/O performance.

Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are now allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler.  (3) The buffer
pool is also aligned in shared memory.

WAL buffers were already aligned on XLOG_BLCKSZ.  It's possible for
XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus
for O_DIRECT WAL writes to fail to be well aligned, but that's a
pre-existing condition and will be addressed by a later commit.

BufFiles are not yet addressed (there's no current plan to use O_DIRECT
for those, but they could potentially get some incidental speedup even
in plain buffered I/O operations through better alignment).

If we can't align stack objects suitably using the compiler extensions
we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to
0.  This avoids the need to consider systems that have O_DIRECT but
can't align stack objects the way we want; such systems could in theory
be supported with more work but we don't currently know of any such
machines, so it's easier to pretend there is no O_DIRECT support
instead.  That's an existing and tested class of system.

Add assertions that all buffers passed into smgrread(), smgrwrite() and
smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack
alignment tricks may be unavailable) or the block size has been set too
small to allow arrays of buffers to be all aligned.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
2023-04-08 16:34:50 +12:00
Andres Freund ac8d53dae5 Track IO times in pg_stat_io
a9c70b46db and 8aaa04b32S added counting of IO operations to a new view,
pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to
pg_stat_io as well.

This combines the tracking for pgBufferUsage with the tracking for pg_stat_io
into a new function pgstat_count_io_op_time(). This should make it a bit
easier to avoid the somewhat costly instr_time conversion done for
pgBufferUsage.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
2023-04-07 17:04:56 -07:00
Andres Freund 31966b151e bufmgr: Introduce infrastructure for faster relation extension
The primary bottlenecks for relation extension are:

1) The extension lock is held while acquiring a victim buffer for the new
   page. Acquiring a victim buffer can require writing out the old page
   contents including possibly needing to flush WAL.

2) When extending via ReadBuffer() et al, we write a zero page during the
   extension, and then later write out the actual page contents. This can
   nearly double the write rate.

3) The existing bulk relation extension infrastructure in hio.c just amortized
   the cost of acquiring the relation extension lock, but none of the other
   costs.

Unfortunately 1) cannot currently be addressed in a central manner as the
callers to ReadBuffer() need to acquire the extension lock. To address that,
this this commit moves the responsibility for acquiring the extension lock
into bufmgr.c functions. That allows to acquire the relation extension lock
for just the required time. This will also allow us to improve relation
extension further, without changing callers.

The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (largely works, except for CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2), by using the
recently introduced smgrzeroextend(), which extends the relation, without
dirtying the kernel page cache for all the extended pages.

To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.

There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.

To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.

Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
2023-04-05 16:21:09 -07:00
Tomas Vondra 84adc8e20f pg_dump: Add support for zstd compression
Allow pg_dump to use the zstd compression, in addition to gzip/lz4. Bulk
of the new compression method is implemented in compress_zstd.{c,h},
covering the pg_dump compression APIs. The rest of the patch adds test
and makes various places aware of the new compression method.

The zstd library (which this patch relies on) supports multithreaded
compression since version 1.5. We however disallow that feature for now,
as it might interfere with parallel backups on platforms that rely on
threads (e.g. Windows). This can be improved / relaxed in the future.

This also fixes a minor issue in InitDiscoverCompressFileHandle(), which
was not updated to check if the file already has the .lz4 extension.

Adding zstd compression was originally proposed in 2020 (see the second
thread), but then was reworked to use the new compression API introduced
in e9960732a9. I've considered both threads when compiling the list of
reviewers.

Author: Justin Pryzby
Reviewed-by: Tomas Vondra, Jacob Champion, Andreas Karlsson
Discussion: https://postgr.es/m/20230224191840.GD1653@telsasoft.com
Discussion: https://postgr.es/m/20201221194924.GI30237@telsasoft.com
2023-04-05 21:39:33 +02:00
Alexander Korotkov 2b65bf046d Revert 11470f544e
Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
2023-04-03 16:54:31 +03:00
Andres Freund ca7b3c4c00 pg_stat_wal: Accumulate time as instr_time instead of microseconds
In instr_time.h it is stated that:

* When summing multiple measurements, it's recommended to leave the
* running sum in instr_time form (ie, use INSTR_TIME_ADD or
* INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end.

The reason for that is that converting to microseconds is not cheap, and can
loose precision.  Therefore this commit changes 'PendingWalStats' to use
'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time'
and 'wal_sync_time'.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com
2023-03-30 14:23:14 -07:00
Daniel Gustafsson 7f5b19817e Support connection load balancing in libpq
This adds support for load balancing connections with libpq using a
connection parameter: load_balance_hosts=<string>. When setting the
param to random, hosts and addresses will be connected to in random
order. This then results in load balancing across these addresses and
hosts when multiple clients or frequent connection setups are used.

The randomization employed performs two levels of shuffling:

  1. The given hosts are randomly shuffled, before resolving them
     one-by-one.
  2. Once a host its addresses get resolved, the returned addresses
     are shuffled, before trying to connect to them one-by-one.

Author: Jelte Fennema <postgres@jeltef.nl>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Reviewed-by: Andrey Borodin <amborodin86@gmail.com>
Discussion: https://postgr.es/m/PR3PR83MB04768E2FF04818EEB2179949F7A69@PR3PR83MB0476.EURPRD83.prod.outlook.
2023-03-29 21:53:38 +02:00
Daniel Gustafsson 44d85ba5a3 Copy and store addrinfo in libpq-owned private memory
This refactors libpq to copy addrinfos returned by getaddrinfo to
memory owned by libpq such that future improvements can alter for
example the order of entries.

As a nice side effect of this refactor the mechanism for iteration
over addresses in PQconnectPoll is now identical to its iteration
over hosts.

Author: Jelte Fennema <postgres@jeltef.nl>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Reviewed-by: Andrey Borodin <amborodin86@gmail.com>
Discussion: https://postgr.es/m/PR3PR83MB04768E2FF04818EEB2179949F7A69@PR3PR83MB0476.EURPRD83.prod.outlook.com
2023-03-29 21:41:27 +02:00
Alvaro Herrera 7081ac46ac
SQL/JSON: add standard JSON constructor functions
This commit introduces the SQL/JSON standard-conforming constructors for
JSON types:

JSON_ARRAY()
JSON_ARRAYAGG()
JSON_OBJECT()
JSON_OBJECTAGG()

Most of the functionality was already present in PostgreSQL-specific
functions, but these include some new functionality such as the ability
to skip or include NULL values, and to allow duplicate keys or throw
error when they are found, as well as the standard specified syntax to
specify output type and format.

Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/CAF4Au4w2x-5LTnN_bxky-mq4=WOqsGsxSpENCzHRAzSnEd8+WQ@mail.gmail.com
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
2023-03-29 12:11:36 +02:00
Amit Kapila 062a844424 Avoid syncing data twice for the 'publish_via_partition_root' option.
When there are multiple publications for a subscription and one of those
publishes via the parent table by using publish_via_partition_root and the
other one directly publishes the child table, we end up copying the same
data twice during initial synchronization. The reason for this was that we
get both the parent and child tables from the publisher and try to copy
the data for both of them.

This patch extends the function pg_get_publication_tables() to take a
publication list as its input parameter. This allows us to exclude a
partition table whose ancestor is published by the same publication list.

This problem does exist in back-branches but we decide to fix it there in
a separate commit if required. The fix for back-branches requires quite
complicated changes to fetch the required table information from the
publisher as we can't update the function pg_get_publication_tables() in
back-branches. We are not sure whether we want to deviate and complicate
the code in back-branches for this problem as there are no field reports
yet.

Author: Wang wei
Reviewed-by: Peter Smith, Jacob Champion, Kuroda Hayato, Vignesh C, Osumi Takamichi, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB57167F45D481F78CDC5986F794B99@OS0PR01MB5716.jpnprd01.prod.outlook.com
2023-03-29 10:46:58 +05:30
Alexander Korotkov 11470f544e Allow locking updated tuples in tuple_update() and tuple_delete()
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during first
tuple_update()/tuple_delete() call.  Heap table access method saves some
efforts by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
2023-03-23 00:26:59 +03:00
Michael Paquier e731aeac89 Remove PgStat_BackendFunctionEntry
This structure included only PgStat_FunctionCounts, and removing it
facilitates some upcoming refactoring for pgstatfuncs.c to use more
macros rather that mostly-duplicated functions.

Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/11d531fe-52fc-c6ea-7e8e-62f1b6ec626e@gmail.com
2023-03-16 14:22:34 +09:00
Tomas Vondra 0da243fed0 Add LZ4 compression to pg_dump
Expand pg_dump's compression streaming and file APIs to support the lz4
algorithm. The newly added compress_lz4.{c,h} files cover all the
functionality of the aforementioned APIs. Minor changes were necessary
in various pg_backup_* files, where code for the 'lz4' file suffix has
been added, as well as pg_dump's compression option parsing.

Author: Georgios Kokolatos
Reviewed-by: Michael Paquier, Rachel Heaton, Justin Pryzby, Shi Yu, Tomas Vondra
Discussion: https://postgr.es/m/faUNEOpts9vunEaLnmxmG-DldLSg_ql137OC3JYDmgrOMHm1RvvWY2IdBkv_CRxm5spCCb_OmKNk2T03TMm0fBEWveFF9wA1WizPuAgB7Ss%3D%40protonmail.com
2023-02-23 21:19:26 +01:00
Tomas Vondra e9960732a9 Introduce a generic pg_dump compression API
Switch pg_dump to use the Compression API, implemented by bf9aa490db.

The CompressFileHandle replaces the cfp* family of functions with a
struct of callbacks for accessing (compressed) files. This allows adding
new compression methods simply by introducing a new struct instance with
appropriate implementation of the callbacks.

Archives compressed using custom compression methods store an identifier
of the compression algorithm in their header instead of the compression
level. The header version is bumped.

Author: Georgios Kokolatos
Reviewed-by: Michael Paquier, Rachel Heaton, Justin Pryzby, Tomas Vondra
Discussion: https://postgr.es/m/faUNEOpts9vunEaLnmxmG-DldLSg_ql137OC3JYDmgrOMHm1RvvWY2IdBkv_CRxm5spCCb_OmKNk2T03TMm0fBEWveFF9wA1WizPuAgB7Ss%3D%40protonmail.com
2023-02-23 18:33:40 +01:00
Michael Paquier 35739b87dc Redesign archive modules
A new callback named startup_cb, called shortly after a module is
loaded, is added.  This makes possible the initialization of any
additional state data required by a module.  This initial state data can
be saved in a ArchiveModuleState, that is now passed down to all the
callbacks that can be defined in a module.  With this design, it is
possible to have a per-module state, aimed at opening the door to the
support of more than one archive module.

The initialization of the callbacks is changed so as
_PG_archive_module_init() does not anymore give in input a
ArchiveModuleCallbacks that a module has to fill in with callback
definitions.  Instead, a module now needs to return a const
ArchiveModuleCallbacks.

All the structure and callback definitions of archive modules are moved
into their own header, named archive_module.h, from pgarch.h.
Command-based archiving follows the same line, with a new set of files
named shell_archive.{c,h}.

There are a few more items that are under discussion to improve the
design of archive modules, like the fact that basic_archive calls
sigsetjmp() by itself to define its own error handling flow.  These will
be adjusted later, the changes done here cover already a good portion
of what has been discussed.

Any modules created for v15 will need to be adjusted to this new
design.

Author: Nathan Bossart
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20230130194810.6fztfgbn32e7qarj@awork3.anarazel.de
2023-02-17 14:26:42 +09:00
Andres Freund a9c70b46db Add pg_stat_io view, providing more detailed IO statistics
Builds on 28e626bde0 and f30d62c2fc. See the former for motivation.

Rows of the view show IO operations for a particular backend type, IO target
object, IO context combination (e.g. a client backend's operations on
permanent relations in shared buffers) and each column in the view is the
total number of IO Operations done (e.g. writes). So a cell in the view would
be, for example, the number of blocks of relation data written from shared
buffers by client backends since the last stats reset.

In anticipation of tracking WAL IO and non-block-oriented IO (such as
temporary file IO), the "op_bytes" column specifies the unit of the "reads",
"writes", and "extends" columns for a given row.

Rows for combinations of IO operation, backend type, target object and context
that never occur, are ommitted entirely. For example, checkpointer will never
operate on temporary relations.

Similarly, if an IO operation never occurs for such a combination, the IO
operation's cell will be null, to distinguish from 0 observed IO
operations. For example, bgwriter should not perform reads.

Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend). For now, these have been kept for
backwards compatibility.

Bumps catversion.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Samay Sharma <smilingsamay@gmail.com>
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
2023-02-11 09:52:15 -08:00
Andres Freund 28e626bde0 pgstat: Infrastructure for more detailed IO statistics
This commit adds the infrastructure for more detailed IO statistics. The calls
to actually count IOs, a system view to access the new statistics,
documentation and tests will be added in subsequent commits, to make review
easier.

While we already had some IO statistics, e.g. in pg_stat_bgwriter and
pg_stat_database, they did not provide sufficient detail to understand what
the main sources of IO are, or whether configuration changes could avoid
IO. E.g., pg_stat_bgwriter.buffers_backend does contain the number of buffers
written out by a backend, but as that includes extending relations (always
done by backends) and writes triggered by the use of buffer access strategies,
it cannot easily be used to tune background writer or checkpointer. Similarly,
pg_stat_database.blks_read cannot easily be used to tune shared_buffers /
compute a cache hit ratio, as the use of buffer access strategies will often
prevent a large fraction of the read blocks to end up in shared_buffers.

The new IO statistics count IO operations (evict, extend, fsync, read, reuse,
and write), and are aggregated for each combination of backend type (backend,
autovacuum worker, bgwriter, etc), target object of the IO (relations, temp
relations) and context of the IO (normal, vacuum, bulkread, bulkwrite).

What is tracked in this series of patches, is sufficient to perform the
aforementioned analyses. Further details, e.g. tracking the number of buffer
hits, would make that even easier, but was left out for now, to keep the scope
of the already large patchset manageable.

Bumps PGSTAT_FILE_FORMAT_ID.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
2023-02-08 20:53:42 -08:00
Amit Kapila 8c58624df4 Fix the logical replication timeout during large DDLs.
The DDLs like Refresh Materialized views that generate lots of temporary
data due to rewrite rules may not be processed by output plugins (for
example pgoutput). So, we won't send keep-alive messages for a long time
while processing such commands and that can lead the subscriber side to
timeout. We have previously fixed a similar case for large transactions in
commit f95d53eded where the output plugin filters all or most of the
changes but missed to handle the DDLs.

We decided not to backpatch this as this adds a new callback in the
existing exposed structure and moreover, users can increase the
wal_sender_timeout and wal_receiver_timeout to avoid this problem.

Author: Wang wei, Hou Zhijie
Reviewed-by: Peter Smith, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/OS3PR01MB6275478E5D29E4A563302D3D9E2B9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAA5-nLARN7-3SLU_QUxfy510pmrYK6JJb=bk3hcgemAM_pAv+w@mail.gmail.com
2023-02-08 07:58:25 +05:30
Amit Kapila 1e8b61735c Rename GUC logical_decoding_mode to logical_replication_mode.
Rename the developer option 'logical_decoding_mode' to the more flexible
name 'logical_replication_mode' because doing so will make it easier to
extend this option in the future to help test other areas of logical
replication.

Currently, it is used on the publisher side to allow streaming or
serializing each change in logical decoding. In the upcoming patch, we are
planning to use it on the subscriber. On the subscriber, it will allow
serializing the changes to file and notifies the parallel apply workers to
read and apply them at the end of the transaction.

We discussed exposing this parameter as a subscription option but
it did not seem advisable since it is primarily used for testing/debugging
and there is no other such parameter. We also discussed having separate
GUCs for publisher and subscriber but for current testing/debugging
requirements, one GUC is sufficient.

Author: Hou Zhijie
Reviewed-by: Peter Smith, Kuroda Hayato, Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoAy2c=Mx=FTCs+EwUsf2kQL5MmU3N18X84k0EmCXntK4g@mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-30 08:02:08 +05:30
David Rowley 009dbdea02 Run pgindent on heapam.c
An upcoming patch by Melanie Plageman does some refactoring work in this
area.  Run pgindent on that file now before making any changes so that
it's easier to maintain/evolve each of the individual patches doing the
refactor work.  Additionally, add a few new required typedefs to the list
to make it easier to do future pgindent runs on this file during the
refactor work.

Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
2023-01-23 23:08:38 +13:00
Andres Freund d137cb52cb Remove SHM_QUEUE
Prior patches got rid of all the uses of SHM_QUEUE. ilist.h style lists are
more widely used and have an easier to use interface. As there are no users
left, remove SHM_QUEUE.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version)
Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de
Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
2023-01-19 18:55:51 -08:00
Andres Freund 5764f611e1 Use dlist/dclist instead of PROC_QUEUE / SHM_QUEUE for heavyweight locks
Part of a series to remove SHM_QUEUE. ilist.h style lists are more widely used
and have an easier to use interface.

As PROC_QUEUE is now unused, remove it.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version)
Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de
Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
2023-01-18 11:41:14 -08:00
Amit Kapila 216a784829 Perform apply of large transactions by parallel workers.
Currently, for large transactions, the publisher sends the data in
multiple streams (changes divided into chunks depending upon
logical_decoding_work_mem), and then on the subscriber-side, the apply
worker writes the changes into temporary files and once it receives the
commit, it reads from those files and applies the entire transaction. To
improve the performance of such transactions, we can instead allow them to
be applied via parallel workers.

In this approach, we assign a new parallel apply worker (if available) as
soon as the xact's first stream is received and the leader apply worker
will send changes to this new worker via shared memory. The parallel apply
worker will directly apply the change instead of writing it to temporary
files. However, if the leader apply worker times out while attempting to
send a message to the parallel apply worker, it will switch to
"partial serialize" mode -  in this mode, the leader serializes all
remaining changes to a file and notifies the parallel apply workers to
read and apply them at the end of the transaction. We use a non-blocking
way to send the messages from the leader apply worker to the parallel
apply to avoid deadlocks. We keep this parallel apply assigned till the
transaction commit is received and also wait for the worker to finish at
commit. This preserves commit ordering and avoid writing to and reading
from files in most cases. We still need to spill if there is no worker
available.

This patch also extends the SUBSCRIPTION 'streaming' parameter so that the
user can control whether to apply the streaming transaction in a parallel
apply worker or spill the change to disk. The user can set the streaming
parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means
the streaming will be applied via a parallel apply worker, if available.
The parameter value 'on' means the streaming transaction will be spilled
to disk. The default value is 'off' (same as current behaviour).

In addition, the patch extends the logical replication STREAM_ABORT
message so that abort_lsn and abort_time can also be sent which can be
used to update the replication origin in parallel apply worker when the
streaming transaction is aborted. Because this message extension is needed
to support parallel streaming, parallel streaming is not supported for
publications on servers < PG16.

Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko
Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik
Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
2023-01-09 07:52:45 +05:30
Amit Kapila 5de94a041e Add 'logical_decoding_mode' GUC.
This enables streaming or serializing changes immediately in logical
decoding. This parameter is intended to be used to test logical decoding
and replication of large transactions for which otherwise we need to
generate the changes till logical_decoding_work_mem is reached.

This helps in reducing the timing of existing tests related to logical
replication of in-progress transactions and will help in writing tests for
for the upcoming feature for parallelly applying large in-progress
transactions.

Author: Shi yu
Reviewed-by: Sawada Masahiko, Shveta Mallik, Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi
Discussion: https://postgr.es/m/OSZPR01MB63104E7449DBE41932DB19F1FD1B9@OSZPR01MB6310.jpnprd01.prod.outlook.com
2022-12-26 08:58:16 +05:30
Tom Lane d9f7f5d32f Create infrastructure for "soft" error reporting.
Postgres' standard mechanism for reporting errors (ereport() or elog())
is used for all sorts of error conditions.  This means that throwing
an exception via ereport(ERROR) requires an expensive transaction or
subtransaction abort and cleanup, since the exception catcher dare not
make many assumptions about what has gone wrong.  There are situations
where we would rather have a lighter-weight mechanism for dealing
with errors that are known to be safe to recover from without a full
transaction cleanup.  This commit creates infrastructure to let us
adapt existing error-reporting code for that purpose.  See the
included documentation changes for details.  Follow-on commits will
provide test code and usage examples.

The near-term plan is to convert most if not all datatype input
functions to report invalid input "softly".  This will enable
implementing some SQL/JSON features cleanly and without the cost
of subtransactions, and it will also allow creating COPY options
to deal with bad input without cancelling the whole COPY.

This patch is mostly by me, but it owes very substantial debt to
earlier work by Nikita Glukhov, Andrew Dunstan, and Amul Sul.
Thanks also to Andres Freund for review.

Discussion: https://postgr.es/m/3bbbb0df-7382-bf87-9737-340ba096e034@postgrespro.ru
2022-12-09 09:58:38 -05:00
Alvaro Herrera a61b1f7482
Rework query relation permission checking
Currently, information about the permissions to be checked on relations
mentioned in a query is stored in their range table entries.  So the
executor must scan the entire range table looking for relations that
need to have permissions checked.  This can make the permission checking
part of the executor initialization needlessly expensive when many
inheritance children are present in the range range.  While the
permissions need not be checked on the individual child relations, the
executor still must visit every range table entry to filter them out.

This commit moves the permission checking information out of the range
table entries into a new plan node called RTEPermissionInfo.  Every
top-level (inheritance "root") RTE_RELATION entry in the range table
gets one and a list of those is maintained alongside the range table.
This new list is initialized by the parser when initializing the range
table.  The rewriter can add more entries to it as rules/views are
expanded.  Finally, the planner combines the lists of the individual
subqueries into one flat list that is passed to the executor for
checking.

To make it quick to find the RTEPermissionInfo entry belonging to a
given relation, RangeTblEntry gets a new Index field 'perminfoindex'
that stores the corresponding RTEPermissionInfo's index in the query's
list of the latter.

ExecutorCheckPerms_hook has gained another List * argument; the
signature is now:
typedef bool (*ExecutorCheckPerms_hook_type) (List *rangeTable,
					      List *rtePermInfos,
					      bool ereport_on_violation);
The first argument is no longer used by any in-core uses of the hook,
but we leave it in place because there may be other implementations that
do.  Implementations should likely scan the rtePermInfos list to
determine which operations to allow or deny.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqGjJDmUhDSfv-U2qhKJjt9ST7Xh9JXC_irsAQ1TAUsJYg@mail.gmail.com
2022-12-06 16:09:24 +01:00