Commit Graph

1285 Commits

Author SHA1 Message Date
Amit Kapila 2ce353fc19 Tighten the concurrent abort check during decoding.
During decoding of an in-progress or prepared transaction, we detect
concurrent abort with an error code ERRCODE_TRANSACTION_ROLLBACK. That is
not sufficient because a callback can decide to throw that error code
at other times as well.

Reported-by: Tom Lane
Author: Amit Kapila
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/CAA4eK1KCjPRS4aZHB48QMM4J8XOC1+TD8jo-4Yu84E+MjwqVhA@mail.gmail.com
2021-05-06 08:26:42 +05:30
Amit Kapila 205f466282 Fix the computation of slot stats for 'total_bytes'.
Previously, we were using the size of all the changes present in
ReorderBuffer to compute total_bytes after decoding a transaction and that
can lead to counting some of the transactions' changes more than once. Fix
it by using the size of the changes decoded for a transaction to compute
'total_bytes'.

Author: Sawada Masahiko
Reviewed-by: Vignesh C, Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
2021-05-03 07:22:08 +05:30
Amit Kapila ee4ba01dbb Fix the bugs in selecting the transaction for streaming.
There were two problems:
a. We were always selecting the next available txn instead of selecting it
when it is larger than the previous transaction.
b. We were selecting the transactions which haven't made any changes to
the database (base snapshot is not set). Later it was hitting an Assert
because we don't decode such transactions and the changes in txn remain as
it is. It is better not to choose such transactions for streaming in the
first place.

Reported-by: Haiying Tang
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB61133B94E63177040F7ECDA1FB429@OS0PR01MB6113.jpnprd01.prod.outlook.com
2021-04-30 10:49:52 +05:30
Tom Lane 9626325da5 Add heuristic incoming-message-size limits in the server.
We had a report of confusing server behavior caused by a client bug
that sent junk to the server: the server thought the junk was a
very long message length and waited patiently for data that would
never come.  We can reduce the risk of that by being less trusting
about message lengths.

For a long time, libpq has had a heuristic rule that it wouldn't
believe large message size words, except for a small number of
message types that are expected to be (potentially) long.  This
provides some defense against loss of message-boundary sync and
other corrupted-data cases.  The server does something similar,
except that up to now it only limited the lengths of messages
received during the connection authentication phase.  Let's
do the same as in libpq and put restrictions on the allowed
length of all messages, while distinguishing between message
types that are expected to be long and those that aren't.

I used a limit of 10000 bytes for non-long messages.  (libpq's
corresponding limit is 30000 bytes, but given the asymmetry of
the FE/BE protocol, there's no good reason why the numbers should
be the same.)  Experimentation suggests that this is at least a
factor of 10, maybe a factor of 100, more than we really need;
but plenty of daylight seems desirable to avoid false positives.
In any case we can adjust the limit based on beta-test results.

For long messages, set a limit of MaxAllocSize - 1, which is the
most that we can absorb into the StringInfo buffer that the message
is collected in.  This just serves to make sure that a bogus message
size is reported as such, rather than as a confusing gripe about
not being able to enlarge a string buffer.

While at it, make sure that non-mainline code paths (such as
COPY FROM STDIN) are as paranoid as SocketBackend is, and validate
the message type code before believing the message length.
This provides an additional guard against getting stuck on corrupted
input.

Discussion: https://postgr.es/m/2003757.1619373089@sss.pgh.pa.us
2021-04-28 15:50:46 -04:00
Fujii Masao 8e9ea08bae Don't pass "ONLY" options specified in TRUNCATE to foreign data wrapper.
Commit 8ff1c94649 allowed TRUNCATE command to truncate foreign tables.
Previously the information about "ONLY" options specified in TRUNCATE
command were passed to the foreign data wrapper. Then postgres_fdw
constructed the TRUNCATE command to issue the remote server and
included "ONLY" options in it based on the passed information.

On the other hand, "ONLY" options specified in SELECT, UPDATE or DELETE
have no effect when accessing or modifying the remote table, i.e.,
are not passed to the foreign data wrapper. So it's inconsistent to
make only TRUNCATE command pass the "ONLY" options to the foreign data
wrapper. Therefore this commit changes the TRUNCATE command so that
it doesn't pass the "ONLY" options to the foreign data wrapper,
for the consistency with other statements. Also this commit changes
postgres_fdw so that it always doesn't include "ONLY" options in
the TRUNCATE command that it constructs.

Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Kyotaro Horiguchi, Justin Pryzby, Zhihong Yu
Discussion: https://postgr.es/m/551ed8c1-f531-818b-664a-2cecdab99cd8@oss.nttdata.com
2021-04-27 14:41:27 +09:00
Amit Kapila 3fa17d3771 Use HTAB for replication slot statistics.
Previously, we used to use the array of size max_replication_slots to
store stats for replication slots. But that had two problems in the cases
where a message for dropping a slot gets lost: 1) the stats for the new
slot are not recorded if the array is full and 2) writing beyond the end
of the array if the user reduces the max_replication_slots.

This commit uses HTAB for replication slot statistics, resolving both
problems. Now, pgstat_vacuum_stat() search for all the dead replication
slots in stats hashtable and tell the collector to remove them. To avoid
showing the stats for the already-dropped slots, pg_stat_replication_slots
view searches slot stats by the slot name taken from pg_replication_slots.

Also, we send a message for creating a slot at slot creation, initializing
the stats. This reduces the possibility that the stats are accumulated
into the old slot stats when a message for dropping a slot gets lost.

Reported-by: Andres Freund
Author: Sawada Masahiko, test case by Vignesh C
Reviewed-by: Amit Kapila, Vignesh C, Dilip Kumar
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
2021-04-27 09:09:11 +05:30
Amit Kapila e7eea52b2d Fix Logical Replication of Truncate in synchronous commit mode.
The Truncate operation acquires an exclusive lock on the target relation
and indexes. It then waits for logical replication of the operation to
finish at commit. Now because we are acquiring the shared lock on the
target index to get index attributes in pgoutput while sending the
changes for the Truncate operation, it leads to a deadlock.

Actually, we don't need to acquire a lock on the target index as we build
the cache entry using a historic snapshot and all the later changes are
absorbed while decoding WAL. So, we wrote a special purpose function for
logical replication to get a bitmap of replica identity attribute numbers
where we get that information without locking the target index.

We decided not to backpatch this as there doesn't seem to be any field
complaint about this issue since it was introduced in commit 5dfd1e5a in
v11.

Reported-by: Haiying Tang
Author: Takamichi Osumi, test case by Li Japin
Reviewed-by: Amit Kapila, Ajin Cherian
Discussion: https://postgr.es/m/OS0PR01MB6113C2499C7DC70EE55ADB82FB759@OS0PR01MB6113.jpnprd01.prod.outlook.com
2021-04-27 08:28:26 +05:30
Amit Kapila f25a4584c6 Avoid sending prepare multiple times while decoding.
We send the prepare for the concurrently aborted xacts so that later when
rollback prepared is decoded and sent, the downstream should be able to
rollback such a xact. For 'streaming' case (when we send changes for
in-progress transactions), we were sending prepare twice when concurrent
abort was detected.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/f82133c6-6055-b400-7922-97dae9f2b50b@enterprisedb.com
2021-04-26 11:27:44 +05:30
Amit Kapila 6d2e87a077 Fix typo in reorderbuffer.c.
Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+PtvzuYY0zu=dVRK_WVz5WGos1+otZWgEWqjha1ncoSRag@mail.gmail.com
2021-04-26 08:42:46 +05:30
Michael Paquier f3b141c482 Fix relation leak for subscribers firing triggers in logical replication
Creating a trigger on a relation to which an apply operation is
triggered would cause a relation leak once the change gets committed,
as the executor would miss that the relation needs to be closed
beforehand.  This issue got introduced with the refactoring done in
1375422c, where it becomes necessary to track relations within
es_opened_result_relations to make sure that they are closed.

We have discussed using ExecInitResultRelation() coupled with
ExecCloseResultRelations() for the relations in need of tracking by the
apply operations in the subscribers, which would simplify greatly the
opening and closing of indexes, but this requires a larger rework and
reorganization of the worker code, particularly for the tuple routing
part.  And that's not really welcome post feature freeze.  So, for now,
settle down to the same solution as TRUNCATE which is to fill in
es_opened_result_relations with the relation opened, to make sure that
ExecGetTriggerResultRel() finds them and that they get closed.

The code is lightly refactored so as a relation is not registered three
times for each DML code path, making the whole a bit easier to follow.

Reported-by: Tang Haiying, Shi Yu, Hou Zhijie
Author: Amit Langote, Masahiko Sawada, Hou Zhijie
Reviewed-by: Amit Kapila, Michael Paquier
Discussion: https://postgr.es/m/OS0PR01MB611383FA0FE92EB9DE21946AFB769@OS0PR01MB6113.jpnprd01.prod.outlook.com
2021-04-22 12:48:54 +09:00
Peter Eisentraut 640b91c3ed Use correct format placeholder for pids
Should be signed, not unsigned.
2021-04-19 10:43:18 +02:00
Peter Eisentraut f59b58e2a1 Use correct format placeholder for block numbers
Should be %u rather than %d.
2021-04-17 09:40:50 +02:00
Amit Kapila f5fc2f5b23 Add information of total data processed to replication slot stats.
This adds the statistics about total transactions count and total
transaction data logically sent to the decoding output plugin from
ReorderBuffer. Users can query the pg_stat_replication_slots view to check
these stats.

Suggested-by: Andres Freund
Author: Vignesh C and Amit Kapila
Reviewed-by: Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
2021-04-16 07:34:43 +05:30
Amit Kapila cca57c1d9b Use NameData datatype for slotname in stats.
This will make it consistent with the other usage of slotname in the code.
In the passing, change pgstat_report_replslot signature to use a structure
rather than multiple parameters.

Reported-by: Andres Freund
Author: Vignesh C
Reviewed-by: Sawada Masahiko, Amit Kapila
Discussion: https://postgr.es/m/20210319185247.ldebgpdaxsowiflw@alap3.anarazel.de
2021-04-14 08:55:03 +05:30
Fujii Masao 8ff1c94649 Allow TRUNCATE command to truncate foreign tables.
This commit introduces new foreign data wrapper API for TRUNCATE.
It extends TRUNCATE command so that it accepts foreign tables as
the targets to truncate and invokes that API. Also it extends postgres_fdw
so that it can issue TRUNCATE command to foreign servers, by adding
new routine for that TRUNCATE API.

The information about options specified in TRUNCATE command, e.g.,
ONLY, CACADE, etc is passed to FDW via API. The list of foreign tables to
truncate is also passed to FDW. FDW truncates the foreign data sources
that the passed foreign tables specify, based on those information.
For example, postgres_fdw constructs TRUNCATE command using them
and issues it to the foreign server.

For performance, TRUNCATE command invokes the FDW routine for
TRUNCATE once per foreign server that foreign tables to truncate belong to.

Author: Kazutaka Onishi, Kohei KaiGai, slightly modified by Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier, Zhihong Yu, Alvaro Herrera, Stephen Frost, Ashutosh Bapat, Amit Langote, Daniel Gustafsson, Ibrar Ahmed, Fujii Masao
Discussion: https://postgr.es/m/CAOP8fzb_gkReLput7OvOK+8NHgw-RKqNv59vem7=524krQTcWA@mail.gmail.com
Discussion: https://postgr.es/m/CAJuF6cMWDDqU-vn_knZgma+2GMaout68YUgn1uyDnexRhqqM5Q@mail.gmail.com
2021-04-08 20:56:08 +09:00
Thomas Munro f003d9f872 Add circular WAL decoding buffer.
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros.

 * An alternative new interface XLogNextRecord() is added that returns
   pointers to DecodedXLogRecord structs that can be examined directly.

 * XLogReadAhead() provides a second cursor that lets you see
   further ahead, as long as data is available and there is enough space
   in the decoding buffer.  This returns DecodedXLogRecord pointers to the
   caller, but also adds them to a queue of records that will later be
   consumed by XLogNextRecord()/XLogReadRecord().

The buffer's size is controlled with wal_decode_buffer_size.  The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro 323cbe7c7d Remove read_page callback from XLogReader.
Previously, the XLogReader module would fetch new input data using a
callback function.  Redesign the interface so that it tells the caller
to insert more data with a special return value instead.  This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.

As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
2021-04-08 23:20:42 +12:00
Peter Eisentraut 0b5e824528 Message improvement
The previous wording contained a superfluous comma.  Adjust phrasing
for grammatical correctness and clarity.
2021-04-07 07:42:44 +02:00
Tom Lane c5b7ba4e67 Postpone some stuff out of ExecInitModifyTable.
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:

* Don't open result relations' indexes until needed.

* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.

This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable.  Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.

Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node.  Instead
use the hash added to ModifyTable itself by commit 86dc90056.

To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos.  We formerly set it
only in some, not terribly well-defined, cases.  This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too).  It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.

Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.

Amit Langote, reviewed at different times by Heikki Linnakangas and
myself

Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
2021-04-06 15:57:11 -04:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Amit Kapila ac4645c015 Allow pgoutput to send logical decoding messages.
The output plugin accepts a new parameter (messages) that controls if
logical decoding messages are written into the replication stream. It is
useful for those clients that use pgoutput as an output plugin and needs
to process messages that were written by pg_logical_emit_message().

Although logical streaming replication protocol supports logical
decoding messages now, logical replication does not use this feature yet.

Author: David Pirotte, Euler Taveira
Reviewed-by: Euler Taveira, Andres Freund, Ashutosh Bapat, Amit Kapila
Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
2021-04-06 08:40:47 +05:30
Amit Kapila 531737ddad Refactor function parse_output_parameters.
Instead of using multiple parameters in parse_ouput_parameters function
signature, use the struct PGOutputData that encapsulates all pgoutput
options. It will be useful for future work where we need to add other
options in pgoutput.

Author: Euler Taveira
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CADK3HHJ-+9SO7KuRLH=9Wa1rAo60Yreq1GFNkH_kd0=CdaWM+A@mail.gmail.com
2021-04-06 08:26:31 +05:30
Andres Freund 1d9c5d0ce2 Do not rely on pgstat.h to indirectly include storage/ headers.
An upcoming patch might remove the (now indirect) proc.h
include (which in turn includes other headers), and it's cleaner for
the modified files to include their dependencies directly anyway...

Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
2021-04-02 20:02:47 -07:00
Stephen Frost c9c41c7a33 Rename Default Roles to Predefined Roles
The term 'default roles' wasn't quite apt as these roles aren't able to
be modified or removed after installation, so rename them to be
'Predefined Roles' instead, adding an entry into the newly added
Obsolete Appendix to help users of current releases find the new
documentation.

Bruce Momjian and Stephen Frost

Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org
and https://www.postgresql.org/message-id/20201120211304.GG16415@tamriel.snowman.net
2021-04-01 15:32:06 -04:00
Amit Kapila 4778826532 Ensure to send a prepare after we detect concurrent abort during decoding.
It is possible that while decoding a prepared transaction, it gets aborted
concurrently via a ROLLBACK PREPARED command. In that case, we were
skipping all the changes and directly sending Rollback Prepared when we
find the same in WAL. However, the downstream has no idea of the GID of
such a transaction. So, ensure to send prepare even when a concurrent
abort is detected.

Author: Ajin Cherian
Reviewed-by: Markus Wanner, Amit Kapila
Discussion: https://postgr.es/m/f82133c6-6055-b400-7922-97dae9f2b50b@enterprisedb.com
2021-04-01 07:57:34 +05:30
Tom Lane 6197db5340 Improve style of some replication-related error messages.
Put the remote end's error message into the primary error string,
instead of relegating it to errdetail().  Although this could end up
being awkward if the remote sends us a really long error message,
it seems more in keeping with our message style guidelines, and more
helpful in situations where the errdetail could get dropped.

Peter Smith

Discussion: https://postgr.es/m/CAHut+Ps-Qv2yQceCwobQDP0aJOkfDzRFrOaR6+2Op2K=WHGeWg@mail.gmail.com
2021-03-31 15:25:53 -04:00
Amit Kapila f64ea6dc5c Add a xid argument to the filter_prepare callback for output plugins.
Along with gid, this provides a different way to identify the transaction.
The users that use xid in some way to prepare the transactions can use it
to filter prepare transactions. The later commands COMMIT PREPARED or
ROLLBACK PREPARED carries both identifiers, providing an output plugin the
choice of what to use.

Author: Markus Wanner
Reviewed-by: Vignesh C, Amit Kapila
Discussion: https://postgr.es/m/ee280000-7355-c4dc-e47b-2436e7be959c@enterprisedb.com
2021-03-30 10:34:43 +05:30
Michael Paquier a1999a01bb Sanitize the term "combo CID" in code comments
Combo CIDs were referred in the code comments using different terms
across various places of the code, so unify a bit the term used with
what is currently in use in some of the READMEs.

Author: "Hou, Zhijie"
Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
2021-03-25 16:08:03 +09:00
Peter Eisentraut 1509c6fc29 Improve an error message
Make it the same as another nearby message.
2021-03-24 08:02:06 +01:00
Fujii Masao 84007043fc Rename wait event WalrcvExit to WalReceiverExit.
Commit de829ddf23 added wait event WalrcvExit. But its name is not
consistent with other wait events like WalReceiverMain or
WalReceiverWaitStart, etc. So this commit renames WalrcvExit to
WalReceiverExit.

Author: Fujii Masao
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/cced9995-8fa2-7b22-9d91-3f22a2b8c23c@oss.nttdata.com
2021-03-24 10:37:54 +09:00
Amit Kapila 4b82ed6eca Fix dangling pointer reference in stream_cleanup_files.
We can't access the entry after it is removed from dynahash.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Ps-pL++f6CJwPx2+vUqXuew=Xt-9Bi-6kCyxn+Fwi2M7w@mail.gmail.com
2021-03-23 09:43:33 +05:30
Robert Haas bbe0a81db6 Allow configurable LZ4 TOAST compression.
There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option in up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.

In general, TOAST compression means compression of individual column
values, not the whole tuple, and those values can either be compressed
inline within the tuple or compressed and then stored externally in
the TOAST table, so those properties also apply to this feature.

Prior to this commit, a TOAST pointer has two unused bits as part of
the va_extsize field, and a compessed datum has two unused bits as
part of the va_rawsize field. These bits are unused because the length
of a varlena is limited to 1GB; we now use them to indicate the
compression type that was used. This means we only have bit space for
2 more built-in compresison types, but we could work around that
problem, if necessary, by introducing a new vartag_external value for
any further types we end up wanting to add. Hopefully, it won't be
too important to offer a wide selection of algorithms here, since
each one we add not only takes more coding but also adds a build
dependency for every packager. Nevertheless, it seems worth doing
at least this much, because LZ4 gets better compression than PGLZ
with less CPU usage.

It's possible for LZ4-compressed datums to leak into composite type
values stored on disk, just as it is for PGLZ. It's also possible for
LZ4-compressed attributes to be copied into a different table via SQL
commands such as CREATE TABLE AS or INSERT .. SELECT.  It would be
expensive to force such values to be decompressed, so PostgreSQL has
never done so. For the same reasons, we also don't force recompression
of already-compressed values even if the target table prefers a
different compression method than was used for the source data.  These
architectural decisions are perhaps arguable but revisiting them is
well beyond the scope of what seemed possible to do as part of this
project.  However, it's relatively cheap to recompress as part of
VACUUM FULL or CLUSTER, so this commit adjusts those commands to do
so, if the configured compression method of the table happens not to
match what was used for some column value stored therein.

Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those were patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allow a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored.  More recently, Justin Przyby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.

Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
2021-03-19 15:10:38 -04:00
Tom Lane 9bacdf9f53 Don't leak malloc'd error string in libpqrcv_check_conninfo().
We leaked the error report from PQconninfoParse, when there was
one.  It seems unlikely that real usage patterns would repeat
the failure often enough to create serious bloat, but let's
back-patch anyway to keep the code similar in all branches.

Found via valgrind testing.
Back-patch to v10 where this code was added.

Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
2021-03-18 22:22:47 -04:00
Tom Lane 8620a7f6db Code review for server's handling of "tablespace map" files.
While looking at Robert Foggia's report, I noticed a passel of
other issues in the same area:

* The scheme for backslash-quoting newlines in pathnames is just
wrong; it will misbehave if the last ordinary character in a pathname
is a backslash.  I'm not sure why we're bothering to allow newlines
in tablespace paths, but if we're going to do it we should do it
without introducing other problems.  Hence, backslashes themselves
have to be backslashed too.

* The author hadn't read the sscanf man page very carefully, because
this code would drop any leading whitespace from the path.  (I doubt
that a tablespace path with leading whitespace could happen in
practice; but if we're bothering to allow newlines in the path, it
sure seems like leading whitespace is little less implausible.)  Using
sscanf for the task of finding the first space is overkill anyway.

* While I'm not 100% sure what the rationale for escaping both \r and
\n is, if the idea is to allow Windows newlines in the file then this
code failed, because it'd throw an error if it saw \r followed by \n.

* There's no cross-check for an incomplete final line in the map file,
which would be a likely apparent symptom of the improper-escaping
bug.

On the generation end, aside from the escaping issue we have:

* If needtblspcmapfile is true then do_pg_start_backup will pass back
escaped strings in tablespaceinfo->path values, which no caller wants
or is prepared to deal with.  I'm not sure if there's a live bug from
that, but it looks like there might be (given the dubious assumption
that anyone actually has newlines in their tablespace paths).

* It's not being very paranoid about the possibility of random stuff
in the pg_tblspc directory.  IMO we should ignore anything without an
OID-like name.

The escaping rule change doesn't seem back-patchable: it'll require
doubling of backslashes in the tablespace_map file, which is basically
a basebackup format change.  The odds of that causing trouble are
considerably more than the odds of the existing bug causing trouble.
The rest of this seems somewhat unlikely to cause problems too,
so no back-patch.
2021-03-17 16:18:46 -04:00
Amit Kapila 6b67d72b60 Fix race condition in drop subscription's handling of tablesync slots.
Commit ce0fdbfe97 made tablesync slots permanent and allow Drop
Subscription to drop such slots. However, it is possible that before
tablesync worker could get the acknowledgment of slot creation, drop
subscription stops it and that can lead to a dangling slot on the
publisher. Prevent cancel/die interrupts while creating a slot in the
tablesync worker.

Reported-by: Thomas Munro as per buildfarm
Author: Amit Kapila
Reviewed-by: Vignesh C, Takamichi Osumi
Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com
2021-03-17 08:15:12 +05:30
Alvaro Herrera acb7e4eb6b
Implement pipeline mode in libpq
Pipeline mode in libpq lets an application avoid the Sync messages in
the FE/BE protocol that are implicit in the old libpq API after each
query.  The application can then insert Sync at its leisure with a new
libpq function PQpipelineSync.  This can lead to substantial reductions
in query latency.

Co-authored-by: Craig Ringer <craig.ringer@enterprisedb.com>
Co-authored-by: Matthieu Garrigues <matthieu.garrigues@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aya Iwata <iwata.aya@jp.fujitsu.com>
Reviewed-by: Daniel Vérité <daniel@manitou-mail.org>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Kirk Jamison <k.jamison@fujitsu.com>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Nikhil Sontakke <nikhils@2ndquadrant.com>
Reviewed-by: Vaishnavi Prabakaran <VaishnaviP@fast.au.fujitsu.com>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>

Discussion: https://postgr.es/m/CAMsr+YFUjJytRyV4J-16bEoiZyH=4nj+sQ7JP9ajwz=B4dMMZw@mail.gmail.com
Discussion: https://postgr.es/m/CAJkzx4T5E-2cQe3dtv2R78dYFvz+in8PY7A8MArvLhs_pg75gg@mail.gmail.com
2021-03-15 18:13:42 -03:00
Thomas Munro de829ddf23 Add condition variable for walreceiver shutdown.
Use this new CV to wait for walreceiver shutdown without a sleep/poll
loop, while also benefiting from standard postmaster death handling.

Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
2021-03-12 19:45:42 +13:00
Peter Eisentraut 1657b37d7c Small debug message tweak
This makes the wording of the delete case match the update case.
2021-03-10 08:16:38 +01:00
Amit Kapila 19890a064e Add option to enable two_phase commits via pg_create_logical_replication_slot.
Commit 0aa8a01d04 extends the output plugin API to allow decoding of
prepared xacts and allowed the user to enable/disable the two-phase option
via pg_logical_slot_get_changes(). This can lead to a problem such that
the first time when it gets changes via pg_logical_slot_get_changes()
without two_phase option enabled it will not get the prepared even though
prepare is after consistent snapshot. Now next time during getting changes,
if the two_phase option is enabled it can skip prepare because by that
time start decoding point has been moved. So the user will only get commit
prepared.

Allow to enable/disable this option at the create slot time and default
will be false. It will break the existing slots which is fine in a major
release.

Author: Ajin Cherian
Reviewed-by: Amit Kapila and Vignesh C
Discussion: https://postgr.es/m/d0f60d60-133d-bf8d-bd70-47784d8fabf3@enterprisedb.com
2021-03-03 07:34:11 +05:30
Amit Kapila 8bdb1332eb Avoid repeated decoding of prepared transactions after a restart.
In commit a271a1b50e, we allowed decoding at prepare time and the prepare
was decoded again if there is a restart after decoding it. It was done
that way because we can't distinguish between the cases where we have not
decoded the prepare because it was prior to consistent snapshot or we have
decoded it earlier but restarted. To distinguish between these two cases,
we have introduced an initial_consistent_point at the slot level which is
an LSN at which we found a consistent point at the time of slot creation.
This is also the point where we have exported a snapshot for the initial
copy. So, prepare transaction prior to this point are sent along with
commit prepared.

This commit bumps SNAPBUILD_VERSION because of change in SnapBuild. It
will break existing slots which is fine in a major release.

Author: Ajin Cherian, based on idea by Andres Freund
Reviewed-by: Amit Kapila and Vignesh C
Discussion: https://postgr.es/m/d0f60d60-133d-bf8d-bd70-47784d8fabf3@enterprisedb.com
2021-03-01 09:11:18 +05:30
Thomas Munro 6230912f23 Use FeBeWaitSet for walsender.c.
This avoids the need to set up and tear down a fresh WaitEventSet every
time we need need to wait.  We have to add an explicit exit on
postmaster exit (FeBeWaitSet isn't set up to do that automatically), so
move the code to do that into a new function to avoid repetition.

Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
2021-03-01 16:19:38 +13:00
Amit Kapila b4e3dc7fd4 Update the docs and comments for decoding of prepared xacts.
Commit a271a1b50e introduced decoding at prepare time in ReorderBuffer.
This can lead to deadlock for out-of-core logical replication solutions
that uses this feature to build distributed 2PC in case such transactions
lock [user] catalog tables exclusively. They need to inform users to not
have locks on catalog tables (via explicit LOCK command) in such
transactions.

Reported-by: Andres Freund
Discussion: https://postgr.es/m/20210222222847.tpnb6eg3yiykzpky@alap3.anarazel.de
2021-03-01 08:14:33 +05:30
Peter Eisentraut 6f6f284c7e Simplify printing of LSNs
Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs.
Convert all applicable code to use it.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
2021-02-23 10:27:02 +01:00
Amit Kapila ade89ba5f4 Fix an oversight in ReorderBufferFinishPrepared.
We don't have anything to decode in a transaction if ReorderBufferTXN
doesn't exist by the time we decode the commit prepared. So don't create a
new ReorderBufferTXN here. This is an oversight in commit a271a1b5.

Reported-by: Markus Wanner
Discussion: https://postgr.es/m/dbec82e2-dbd7-95a2-c6b6-e488cbbdf853@bluegap.ch
2021-02-23 09:47:41 +05:30
Fujii Masao 614b7f18b3 Fix "invalid spinlock number: 0" error in pg_stat_wal_receiver.
Commit 2c8dd05d6c added the atomic variable writtenUpto into
walreceiver's shared memory information. It's initialized only
when walreceiver started up but could be read via pg_stat_wal_receiver
view anytime, i.e., even before it's initialized. In the server built
with --disable-atomics and --disable-spinlocks, this uninitialized
atomic variable read could cause "invalid spinlock number: 0" error.

This commit changed writtenUpto so that it's initialized at
the postmaster startup, to avoid the uninitialized variable read
via pg_stat_wal_receiver and fix the error.

Also this commit moved the read of writtenUpto after the release
of spinlock protecting walreceiver's shared variables. This is
necessary to prevent new spinlock from being taken by atomic
variable read while holding another spinlock, and to shorten
the spinlock duration. This change leads writtenUpto not to be
consistent with the other walreceiver's shared variables protected
by a spinlock. But this is OK because writtenUpto should not be
used for data integrity checks.

Back-patch to v13 where commit 2c8dd05d6c introduced the bug.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/7ef8708c-5b6b-edd3-2cf2-7783f1c7c175@oss.nttdata.com
2021-02-18 23:28:15 +09:00
Peter Eisentraut 0e392fcc0d Use errmsg_internal for debug messages
An inconsistent set of debug-level messages was not using
errmsg_internal(), thus uselessly exposing the messages to translation
work.  Fix those.
2021-02-17 11:33:25 +01:00
Amit Kapila f672df5fdd Remove the unnecessary PrepareWrite in pgoutput.
This issue exists from the inception of this code (PG-10) but got exposed
by the recent commit ce0fdbfe97 where we are using origins in tablesync
workers. The problem was that we were sometimes sending the prepare_write
('w') message but then the actual message was not being sent and on the
subscriber side, we always expect a message after prepare_write message
which led to this bug.

I refrained from backpatching this because there is no way in the core
code to hit this prior to commit ce0fdbfe97 and we haven't received any
complaints so far.

Reported-by: Erik Rijkers
Author: Amit Kapila and Vignesh C
Tested-by: Erik Rijkers
Discussion: https://postgr.es/m/1295168140.139428.1613133237154@webmailclassic.xs4all.nl
2021-02-16 07:26:50 +05:30
Andres Freund a975ff4980 Remove backwards compat ugliness in snapbuild.c.
In 955a684e04 we fixed a bug in initial snapshot creation. In the
course of which several members of struct SnapBuild were obsoleted. As
SnapBuild is serialized to disk we couldn't change the memory layout.

Unfortunately I subsequently forgot about removing the backward compat
gunk, but luckily Heikki just reminded me.

This commit bumps SNAPBUILD_VERSION, therefore breaking existing
slots (which is fine in a major release).

Author: Andres Freund
Reminded-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c94be044-818f-15e3-1ad3-7a7ae2dfed0a@iki.fi
2021-02-15 16:57:47 -08:00
Amit Kapila d9b0767bec Fix the warnings introduced in commit ce0fdbfe97.
Author: Amit Kapila
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/1610789.1613170207@sss.pgh.pa.us
2021-02-15 07:28:02 +05:30
Michael Paquier b83dcf7928 Add result size as argument of pg_cryptohash_final() for overflow checks
With its current design, a careless use of pg_cryptohash_final() could
would result in an out-of-bound write in memory as the size of the
destination buffer to store the result digest is not known to the
cryptohash internals, without the caller knowing about that.  This
commit adds a new argument to pg_cryptohash_final() to allow such sanity
checks, and implements such defenses.

The internals of SCRAM for HMAC could be tightened a bit more, but as
everything is based on SCRAM_KEY_LEN with uses particular to this code
there is no need to complicate its interface more than necessary, and
this comes back to the refactoring of HMAC in core.  Except that, this
minimizes the uses of the existing DIGEST_LENGTH variables, relying
instead on sizeof() for the result sizes.  In ossp-uuid, this also makes
the code more defensive, as it already relied on dce_uuid_t being at
least the size of a MD5 digest.

This is in philosophy similar to cfc40d3 for base64.c and aef8948 for
hex.c.

Reported-by: Ranier Vilela
Author: Michael Paquier, Ranier Vilela
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAEudQAoqEGmcff3J4sTSV-R_16Monuz-UpJFbf_dnVH=APr02Q@mail.gmail.com
2021-02-15 10:18:34 +09:00
Amit Kapila ce0fdbfe97 Allow multiple xacts during table sync in logical replication.
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronize the
position in the stream with the main apply worker.

There are multiple downsides of this approach: (a) We have to perform the
entire copy operation again if there is any error (network breakdown,
error in the database operation, etc.) while we synchronize the WAL
position between tablesync worker and apply worker; this will be onerous
especially for large copies, (b) Using a single transaction in the
synchronization-phase (where we can receive WAL from multiple
transactions) will have the risk of exceeding the CID limit, (c) The slot
will hold the WAL till the entire sync is complete because we never commit
till the end.

This patch solves all the above downsides by allowing multiple
transactions during the tablesync phase. The initial copy is done in a
single transaction and after that, we commit each transaction as we
receive. To allow recovery after any error or crash, we use a permanent
slot and origin to track the progress. The slot and origin will be removed
once we finish the synchronization of the table. We also remove slot and
origin of tablesync workers if the user performs DROP SUBSCRIPTION .. or
ALTER SUBSCRIPTION .. REFERESH and some of the table syncs are still not
finished.

The commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and
ALTER SUBSCRIPTION ... SET PUBLICATION ... with refresh option as true
cannot be executed inside a transaction block because they can now drop
the slots for which we have no provision to rollback.

This will also open up the path for logical replication of 2PC
transactions on the subscriber side. Previously, we can't do that because
of the requirement of maintaining a single transaction in tablesync
workers.

Bump catalog version due to change of state in the catalog
(pg_subscription_rel).

Author: Peter Smith, Amit Kapila, and Takamichi Osumi
Reviewed-by: Ajin Cherian, Petr Jelinek, Hou Zhijie and Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com
2021-02-12 07:41:51 +05:30
Amit Kapila cd142e032e Make pg_replication_origin_drop safe against concurrent drops.
Currently, we get the origin id from the name and then drop the origin by
taking ExclusiveLock on ReplicationOriginRelationId. So, two concurrent
sessions can get the id from the name at the same time and then when they
try to drop the origin, one of the sessions will get the either
"tuple concurrently deleted" or "cache lookup failed for replication
origin ..".

To prevent this race condition we do the entire operation under lock. This
obviates the need for replorigin_drop() API and we have removed it so if
any extension authors are using it they need to instead use
replorigin_drop_by_name. See it's usage in pg_replication_origin_drop().

Author: Peter Smith
Reviewed-by: Amit Kapila, Euler Taveira, Petr Jelinek, and Alvaro
Herrera
Discussion: https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com
2021-02-10 07:17:09 +05:30
Michael Paquier fc749bc704 Clarify comment in tablesync.c
Author: Peter Smith
Reviewed-by: Amit Kapila, Michael Paquier, Euler Taveira
Discussion: https://postgr.es/m/CAHut+Pt9_T6pWar0FLtPsygNmme8HPWPdGUyZ_8mE1Yvjdf0ZA@mail.gmail.com
2021-02-04 16:02:31 +09:00
Michael Paquier 4ad31bb2ef Remove unused column atttypmod from initial tablesync query
The initial tablesync done by logical replication used a query to fetch
the information of a relation's columns that included atttypmod, but it
was left unused.  This was added by 7c4f524.

Author: Euler Taveira
Reviewed-by: Önder Kalacı, Amit Langote, Japin Li
Discussion: https://postgr.es/m/CAHE3wggb715X+mK_DitLXF25B=jE6xyNCH4YOwM860JR7HarGQ@mail.gmail.com
2021-02-02 13:59:23 +09:00
Andres Freund 55ef8555f0 Fix two typos in snapbuild.c.
Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c94be044-818f-15e3-1ad3-7a7ae2dfed0a@iki.fi
2021-01-25 12:15:10 -08:00
Robert Haas d18e75664a Remove CheckpointLock.
Up until now, we've held this lock when performing a checkpoint or
restartpoint, but commit 076a055acf back
in 2004 and commit 7e48b77b1c from 2009,
taken together, have removed all need for this. In the present code,
there's only ever one process entitled to attempt a checkpoint: either
the checkpointer, during normal operation, or the postmaster, during
single-user operation. So, we don't need the lock.

One possible concern in making this change is that it means that
a substantial amount of code where HOLD_INTERRUPTS() was previously
in effect due to the preceding LWLockAcquire() will now be
running without that. This could mean that ProcessInterrupts()
gets called in places from which it didn't before. However, this
seems unlikely to do very much, because the checkpointer doesn't
have any signal mapped to die(), so it's not clear how,
for example, ProcDiePending = true could happen in the first
place. Similarly with ClientConnectionLost and recovery conflicts.

Also, if there are any such problems, we might want to fix them
rather than reverting this, since running lots of code with
interrupt handling suspended is generally bad.

Patch by me, per an inquiry by Amul Sul. Review by Tom Lane
and Michael Paquier.

Discussion: http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
2021-01-25 12:34:38 -05:00
Amit Kapila 40ab64c1ec Fix ALTER PUBLICATION...DROP TABLE behavior.
Commit 69bd60672 fixed the initialization of streamed transactions for
RelationSyncEntry. It forgot to initialize the publication actions while
invalidating the RelationSyncEntry due to which even though the relation
is dropped from a particular publication we still publish its changes. Fix
it by initializing pubactions when entry got invalidated.

Author: Japin Li and Bharath Rupireddy
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CALj2ACV+0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw@mail.gmail.com
2021-01-25 07:39:29 +05:30
Amit Kapila ed43677e20 pgindent worker.c.
This is a leftover from commit 0926e96c49. Changing this separately
because this file is being modified for upcoming patch logical replication
of 2PC.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Ps+EgG8KzcmAyAgBUi_vuTps6o9ZA8DG6SdnO0-YuOhPQ@mail.gmail.com
2021-01-19 08:10:13 +05:30
Tom Lane 3fd80c728d Narrow the scope of a local variable.
This is better style and more symmetrical with the other if-branch.
This likely should have been included in 9de77b545 (which created
the opportunity), but it was overlooked.

Japin Li

Discussion: https://postgr.es/m/MEYP282MB16699FA4A7CD57EB250E871FB6A40@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
2021-01-18 15:55:01 -05:00
Fujii Masao fef5b47f6b Ensure that a standby is able to follow a primary on a newer timeline.
Commit 709d003fbd refactored WAL-reading code, but accidentally caused
WalSndSegmentOpen() to fail to follow a timeline switch while reading from
a historic timeline. This issue caused a standby to fail to follow a primary
on a newer timeline when WAL archiving is enabled.

If there is a timeline switch within the segment, WalSndSegmentOpen() should
read from the WAL segment belonging to the new timeline. But previously
since it failed to follow a timeline switch, it tried to read the WAL segment
with old timeline. When WAL archiving is enabled, that WAL segment with
old timeline doesn't exist because it's renamed to .partial. This leads
a primary to have tried to read non-existent WAL segment, and which caused
replication to faill with the error "ERROR:  requested WAL segment ... has
 already been removed".

This commit fixes WalSndSegmentOpen() so that it's able to follow a timeline
switch, to ensure that a standby is able to follow a primary on a newer
timeline even when WAL archiving is enabled.

This commit also adds the regression test to check whether a standby is
able to follow a primary on a newer timeline when WAL archiving is enabled.

Back-patch to v13 where the bug was introduced.

Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi, tweaked by Fujii Masao
Reviewed-by:  Alvaro Herrera, Fujii Masao
Discussion: https://postgr.es/m/20201209.174314.282492377848029776.horikyota.ntt@gmail.com
2021-01-14 12:27:11 +09:00
Michael Paquier aef8948f38 Rework refactoring of hex and encoding routines
This commit addresses some issues with c3826f83 that moved the hex
decoding routine to src/common/:
- The decoding function lacked overflow checks, so when used for
security-related features it was an open door to out-of-bound writes if
not carefully used that could remain undetected.  Like the base64
routines already in src/common/ used by SCRAM, this routine is reworked
to check for overflows by having the size of the destination buffer
passed as argument, with overflows checked before doing any writes.
- The encoding routine was missing.  This is moved to src/common/ and
it gains the same overflow checks as the decoding part.

On failure, the hex routines of src/common/ issue an error as per the
discussion done to make them usable by frontend tools, but not by shared
libraries.  Note that this is why ECPG is left out of this commit, and
it still includes a duplicated logic doing hex encoding and decoding.

While on it, this commit uses better variable names for the source and
destination buffers in the existing escape and base64 routines in
encode.c and it makes them more robust to overflow detection.  The
previous core code issued a FATAL after doing out-of-bound writes if
going through the SQL functions, which would be enough to detect
problems when working on changes that impacted this area of the
code.  Instead, an error is issued before doing an out-of-bound write.
The hex routines were being directly called for bytea conversions and
backup manifests without such sanity checks.  The current calls happen
to not have any problems, but careless uses of such APIs could easily
lead to CVE-class bugs.

Author: Bruce Momjian, Michael Paquier
Reviewed-by: Sehrope Sarkuni
Discussion: https://postgr.es/m/20201231003557.GB22199@momjian.us
2021-01-14 11:13:24 +09:00
Peter Geoghegan 9dc718bdf2 Pass down "logically unchanged index" hint.
Add an executor aminsert() hint mechanism that informs index AMs that
the incoming index tuple (the tuple that accompanies the hint) is not
being inserted by execution of an SQL statement that logically modifies
any of the index's key columns.

The hint is received by indexes when an UPDATE takes place that does not
apply an optimization like heapam's HOT (though only for indexes where
all key columns are logically unchanged).  Any index tuple that receives
the hint on insert is expected to be a duplicate of at least one
existing older version that is needed for the same logical row.  Related
versions will typically be stored on the same index page, at least
within index AMs that apply the hint.

Recognizing the difference between MVCC version churn duplicates and
true logical row duplicates at the index AM level can help with cleanup
of garbage index tuples.  Cleanup can intelligently target tuples that
are likely to be garbage, without wasting too many cycles on less
promising tuples/pages (index pages with little or no version churn).

This is infrastructure for an upcoming commit that will teach nbtree to
perform bottom-up index deletion.  No index AM actually applies the hint
just yet.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFASTdzipdkLTNQQ@mail.gmail.com
2021-01-13 08:11:00 -08:00
Amit Kapila ee1b38f659 Fix memory leak in SnapBuildSerialize.
The memory for the snapshot was leaked while serializing it to disk during
logical decoding. This memory will be freed only once walsender stops
streaming the changes. This can lead to a huge memory increase when master
logs Standby Snapshot too frequently say when the user is trying to create
many replication slots.

Reported-by: funnyxj.fxj@alibaba-inc.com
Diagnosed-by: funnyxj.fxj@alibaba-inc.com
Author: Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/033ab54c-6393-42ee-8ec9-2b399b5d8cde.funnyxj.fxj@alibaba-inc.com
2021-01-13 08:19:50 +05:30
Amit Kapila 044aa9e70e Fix relation descriptor leak.
We missed closing the relation descriptor while sending changes via the
root of partitioned relations during logical replication.

Author: Amit Langote and Mark Zhao
Reviewed-by: Amit Kapila and Ashutosh Bapat
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/tencent_41FEA657C206F19AB4F406BE9252A0F69C06@qq.com
Discussion: https://postgr.es/m/tencent_6E296D2F7D70AFC90D83353B69187C3AA507@qq.com
2021-01-12 08:19:39 +05:30
Amit Kapila e02e840ff7 Fix typos in decode.c and logical.c.
Per report by Ajin Cherian in email:
https://postgr.es/m/CAFPTHDYnRKDvzgDxoMn_CKqXA-D0MtrbyJvfvjBsO4G=UHDXkg@mail.gmail.com
2021-01-06 08:56:19 +05:30
Amit Kapila cd357c7629 Fix typo in origin.c.
Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+PsReyuvww_Fn1NN_Vsv0wBP1bnzuhzRFr_2=y1nNZrG7w@mail.gmail.com
2021-01-05 08:05:08 +05:30
Amit Kapila 9da2224ea2 Fix typo in reorderbuffer.c.
Author: Zhijie Hou
Reviewed-by: Sawada Masahiko
Discussion: https://postgr.es/m/ba88bb58aaf14284abca16aec04bf279@G08CNEXMBPEKD05.g08.fujitsu.local
2021-01-05 07:56:40 +05:30
Amit Kapila a271a1b50e Allow decoding at prepare time in ReorderBuffer.
This patch allows PREPARE-time decoding of two-phase transactions (if the
output plugin supports this capability), in which case the transactions
are replayed at PREPARE and then committed later when COMMIT PREPARED
arrives.

Now that we decode the changes before the commit, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We detect such failures with a special sqlerrcode
ERRCODE_TRANSACTION_ROLLBACK introduced by commit 7259736a6e and stop
decoding the remaining changes. Then we rollback the changes when rollback
prepared is encountered.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, Arseny Sher, and Dilip Kumar
Tested-by: Takamichi Osumi
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
2021-01-04 08:34:50 +05:30
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Amit Kapila 0aa8a01d04 Extend the output plugin API to allow decoding of prepared xacts.
This adds six methods to the output plugin API, adding support for
streaming changes of two-phase transactions at prepare time.

* begin_prepare
* filter_prepare
* prepare
* commit_prepared
* rollback_prepared
* stream_prepare

Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction is not yet committed and maybe
aborted later.

Until now two-phase transactions were translated into regular transactions
on the subscriber, and the GID was not forwarded to it. None of the
two-phase commands were communicated to the subscriber.

This patch provides the infrastructure for logical decoding plugins to be
informed of two-phase commands Like PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED commands with the corresponding GID.

This also extends the 'test_decoding' plugin, implementing these new
methods.

This commit simply adds these new APIs and the upcoming patch to "allow
the decoding at prepare time in ReorderBuffer" will use these APIs.

Author: Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Peter Smith, Sawada Masahiko, and Dilip Kumar
Discussion:
https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
https://postgr.es/m/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3+Q@mail.gmail.com
2020-12-30 16:17:26 +05:30
Bruce Momjian 3187ef7c46 Revert "Add key management system" (978f869b99) & later commits
The patch needs test cases, reorganization, and cfbot testing.
Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive)
and 08db7c63f3..ccbe34139b.

Reported-by: Tom Lane, Michael Paquier

Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
2020-12-27 21:37:42 -05:00
Bruce Momjian 978f869b99 Add key management system
This adds a key management system that stores (currently) two data
encryption keys of length 128, 192, or 256 bits.  The data keys are
AES256 encrypted using a key encryption key, and validated via GCM
cipher mode.  A command to obtain the key encryption key must be
specified at initdb time, and will be run at every database server
start.  New parameters allow a file descriptor open to the terminal to
be passed.  pg_upgrade support has also been added.

Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com
Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us

Author: Masahiko Sawada, me, Stephen Frost
2020-12-25 10:19:44 -05:00
Amit Kapila 20659fd8e5 Update comment atop of ReorderBufferQueueMessage().
The comments atop of this function describes behaviour in case of a
transactional WAL message only, but it accepts both transactional and
non-transactional WAL messages. Update the comments to describe
behaviour in case of non-transactional WAL message as well.

Ashutosh Bapat, rephrased by Amit Kapila
Discussion: https://postgr.es/m/CAGEoWWTTzNzHOi8bj0wfAo1siGi-YEh6wqH1oaz4DrkTJ6HbTQ@mail.gmail.com
2020-12-19 10:08:46 +05:30
Tom Lane b3817f5f77 Improve hash_create()'s API for some added robustness.
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).

Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes.  This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)

Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize.  Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.

Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one.  This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not.  We might as well save a few
cycles by standardizing on "not".

Also improve the documentation for hash_create().

In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.

Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
2020-12-15 11:38:53 -05:00
Jeff Davis a58db3aa10 Revert "Cannot use WL_SOCKET_WRITEABLE without WL_SOCKET_READABLE."
This reverts commit 3a9e64aa0d.

Commit 4bad60e3 fixed the root of the problem that 3a9e64aa worked
around.

This enables proper pipelining of commands after terminating
replication, eliminating an undocumented limitation.

Discussion: https://postgr.es/m/3d57bc29-4459-578b-79cb-7641baf53c57%40iki.fi
Backpatch-through: 9.5
2020-12-14 23:47:30 -08:00
Noah Misch a1b8aa1e4e Use HASH_BLOBS for xidhash.
This caused BufFile errors on buildfarm member sungazer, and SIGSEGV was
possible.  Conditions for reaching those symptoms were more frequent on
big-endian systems.

Discussion: https://postgr.es/m/20201129214441.GA691200@rfd.leadboat.com
2020-12-12 21:38:36 -08:00
Noah Misch 73aae4522b Correct behavior descriptions in comments, and correct a test name. 2020-12-12 20:12:25 -08:00
Peter Eisentraut eb93f3a0b6 Convert elog(LOG) calls to ereport() where appropriate
User-visible log messages should go through ereport(), so they are
subject to translation.  Many remaining elog(LOG) calls are really
debugging calls.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
2020-12-04 14:25:23 +01:00
Amit Kapila 8ae4ef4fb0 Remove incorrect assertion in reorderbuffer.c.
We start recording changes in ReorderBufferTXN even before we reach
SNAPBUILD_CONSISTENT state so that if the commit is encountered after
reaching that we should be able to send the changes of the entire transaction.
Now, while recording changes if the reorder buffer memory has exceeded
logical_decoding_work_mem then we can start streaming if it is allowed and
we haven't yet streamed that data. However, we must not allow streaming to
start unless the snapshot has reached SNAPBUILD_CONSISTENT state.

In passing, improve the comments atop ReorderBufferResetTXN to mention the
case when we need to continue streaming after getting an error.

Author: Amit Kapila
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/CAA4eK1KoOH0byboyYY40NBcC7Fe812trwTa+WY3jQF7WQWZbQg@mail.gmail.com
2020-12-04 13:54:50 +05:30
Michael Paquier 4f48a6fbe2 Change SHA2 implementation based on OpenSSL to use EVP digest routines
The use of low-level hash routines is not recommended by upstream
OpenSSL since 2000, and pgcrypto already switched to EVP as of 5ff4a67.
This takes advantage of the refactoring done in 87ae969 that has
introduced the allocation and free routines for cryptographic hashes.

Since 1.1.0, OpenSSL does not publish the contents of the cryptohash
contexts, forcing any consumers to rely on OpenSSL for all allocations.
Hence, the resource owner callback mechanism gains a new set of routines
to track and free cryptohash contexts when using OpenSSL, preventing any
risks of leaks in the backend.  Nothing is needed in the frontend thanks
to the refactoring of 87ae969, and the resowner knowledge is isolated
into cryptohash_openssl.c.

Note that this also fixes a failure with SCRAM authentication when using
FIPS in OpenSSL, but as there have been few complaints about this
problem and as this causes an ABI breakage, no backpatch is done.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson, Heikki Linnakangas
Discussion: https://postgr.es/m/20200924025314.GE7405@paquier.xyz
Discussion: https://postgr.es/m/20180911030250.GA27115@paquier.xyz
2020-12-04 10:49:23 +09:00
Michael Paquier 87ae9691d2 Move SHA2 routines to a new generic API layer for crypto hashes
Two new routines to allocate a hash context and to free it are created,
as these become necessary for the goal behind this refactoring: switch
the all cryptohash implementations for OpenSSL to use EVP (for FIPS and
also because upstream does not recommend the use of low-level cryptohash
functions for 20 years).  Note that OpenSSL hides the internals of
cryptohash contexts since 1.1.0, so it is necessary to leave the
allocation to OpenSSL itself, explaining the need for those two new
routines.  This part is going to require more work to properly track
hash contexts with resource owners, but this not introduced here.
Still, this refactoring makes the move possible.

This reduces the number of routines for all SHA2 implementations from
twelve (SHA{224,256,386,512} with init, update and final calls) to five
(create, free, init, update and final calls) by incorporating the hash
type directly into the hash context data.

The new cryptohash routines are moved to a new file, called cryptohash.c
for the fallback implementations, with SHA2 specifics becoming a part
internal to src/common/.  OpenSSL specifics are part of
cryptohash_openssl.c.  This infrastructure is usable for more hash
types, like MD5 or HMAC.

Any code paths using the internal SHA2 routines are adapted to report
correctly errors, which are most of the changes of this commit.  The
zones mostly impacted are checksum manifests, libpq and SCRAM.

Note that e21cbb4 was a first attempt to switch SHA2 to EVP, but it
lacked the refactoring needed for libpq, as done here.

This patch has been tested on Linux and Windows, with and without
OpenSSL, and down to 1.0.1, the oldest version supported on HEAD.

Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20200924025314.GE7405@paquier.xyz
2020-12-02 10:37:20 +09:00
Amit Kapila 0926e96c49 Fix replication of in-progress transactions in tablesync worker.
Tablesync worker runs under a single transaction but in streaming mode, we
were committing the transaction on stream_stop, stream_abort, and
stream_commit. We need to avoid committing the transaction in a streaming
mode in tablesync worker.

In passing move the call to process_syncing_tables in
apply_handle_stream_commit after clean up of stream files. This will
allow clean up of files to happen before the exit of tablesync worker
which would otherwise be handled by one of the proc exit routines.

Author: Dilip Kumar
Reviewed-by: Amit Kapila and Peter Smith
Tested-by: Peter Smith
Discussion: https://postgr.es/m/CAHut+Pt4PyKQCwqzQ=EFF=bpKKJD7XKt_S23F6L20ayQNxg77A@mail.gmail.com
2020-11-27 07:43:34 +05:30
Alvaro Herrera dcfff74fb1
Restore lock level to update statusFlags
Reverts 27838981be (some comments are kept).  Per discussion, it does
not seem safe to relax the lock level used for this; in order for it to
be safe, there would have to be memory barriers between the point we set
the flag and the point we set the trasaction Xid, which perhaps would
not be so bad; but there would also have to be barriers at the readers'
side, which from a performance perspective might be bad.

Now maybe this analysis is wrong and it *is* safe for some reason, but
proof of that is not trivial.

Discussion: https://postgr.es/m/20201118190928.vnztes7c2sldu43a@alap3.anarazel.de
2020-11-26 12:30:48 -03:00
Amit Kapila f3a8f73ec2 Use Enums for logical replication message types at more places.
Commit 644f0d7cc9 added logical replication message type enums to use
instead of character literals but some char substitutions were overlooked.

Author: Peter Smith
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAHut+PsTG=Vrv8hgrvOnAvCNR21jhqMdPk2n0a1uJPoW0p+UfQ@mail.gmail.com
2020-11-26 09:21:14 +05:30
Tom Lane 789b938bf2 Centralize logic for skipping useless ereport/elog calls.
While ereport() and elog() themselves are quite cheap when the
error message level is too low to be printed, some places need to do
substantial work before they can call those macros at all.  To allow
optimizing away such setup work when nothing is to be printed, make
elog.c export a new function message_level_is_interesting(elevel)
that reports whether ereport/elog will do anything.  Make use of that
in various places that had ad-hoc direct tests of log_min_messages etc.
Also teach ProcSleep to use it to avoid some work.  (There may well
be other places that could usefully use this; I didn't search hard.)

Within elog.c, refactor a little bit to avoid having duplicate copies
of the policy-setting logic.  When that code was written, we weren't
relying on the availability of inline functions; so it had some
duplications in the name of efficiency, which I got rid of.

Alvaro Herrera and Tom Lane

Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
2020-11-23 19:10:46 -05:00
Heikki Linnakangas c532d15ddd Split copy.c into four files.
Copy.c has grown really large. Split it into more manageable parts:

- copy.c now contains only a few functions that are common to COPY FROM
  and COPY TO.

- copyto.c contains code for COPY TO.

- copyfrom.c contains code for initializing COPY FROM, and inserting the
  tuples to the correct table.

- copyfromparse.c contains code for reading from the client/file/program,
  and parsing the input text/CSV/binary format into tuples.

All of these parts are fairly complicated, and fairly independent of each
other. There is a patch being discussed to implement parallel COPY FROM,
which will add a lot of new code to the COPY FROM path, and another patch
which would allow INSERTs to use the same multi-insert machinery as COPY
FROM, both of which will require refactoring that code. With those two
patches, there's going to be a lot of code churn in copy.c anyway, so now
seems like a good time to do this refactoring.

The CopyStateData struct is also split. All the formatting options, like
FORMAT, QUOTE, ESCAPE, are put in a new CopyFormatOption struct, which
is used by both COPY FROM and TO. Other state data are kept in separate
CopyFromStateData and CopyToStateData structs.

Reviewed-by: Soumyadeep Chakraborty, Erik Rijkers, Vignesh C, Andres Freund
Discussion: https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi
2020-11-23 10:50:50 +02:00
Alvaro Herrera 27838981be
Relax lock level for setting PGPROC->statusFlags
We don't actually need a lock to set PGPROC->statusFlags itself; what we
do need is a shared lock on either XidGenLock or ProcArrayLock in order to
ensure MyProc->pgxactoff keeps still while we modify the mirror array in
ProcGlobal->statusFlags.  Some places were using an exclusive lock for
that, which is excessive.  Relax those to use shared lock only.

procarray.c has a couple of places with somewhat brittle assumptions
about PGPROC changes: ProcArrayEndTransaction uses only shared lock, so
it's permissible to change MyProc only.  On the other hand,
ProcArrayEndTransactionInternal also changes other procs, so it must
hold exclusive lock.  Add asserts to ensure those assumptions continue
to hold.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20201117155501.GA13805@alvherre.pgsql
2020-11-18 13:24:22 -03:00
Amit Kapila 9653f24ad8 Fix 'skip-empty-xacts' option in test_decoding for streaming mode.
In streaming mode, the transaction can be decoded in multiple streams and
those streams can be interleaved with streams of other transactions. So,
we can't remember the transaction's write status in the logical decoding
context because that might get changed due to some other transactions and
lead to wrong answers for 'skip-empty-xacts' option. We decided to keep
each transaction's write status in the ReorderBufferTxn to avoid
interleaved streams changing the status of some unrelated transactions.

Diagnosed-by: Amit Kapila
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1LR7=XNM_TLmpZMFuV8ZQpoxkem--NZJYf8YXmesbvwLA@mail.gmail.com
2020-11-17 12:14:53 +05:30
Alvaro Herrera cd9c1b3e19
Rename PGPROC->vacuumFlags to statusFlags
With more flags associated to a PGPROC entry that are not related to
vacuum (currently existing or planned), the name "statusFlags" describes
its purpose better.

(The same is done to the mirroring PROC_HDR->vacuumFlags.)

No functional changes in this commit.

This was suggested first by Hari Babu Kommi in [1] and then by Michael
Paquier at [2].

[1] https://postgr.es/m/CAJrrPGcsDC-oy1AhqH0JkXYa0Z2AgbuXzHPpByLoBGMxfOZMEQ@mail.gmail.com
[2] https://postgr.es/m/20200820060929.GB3730@paquier.xyz

Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20201116182446.qcg3o6szo2zookyr@localhost
2020-11-16 19:42:55 -03:00
Bruce Momjian 66a8f09048 change wire protocol data type for history file content
This was marked as BYTEA, but is more like TEXT, which is how we already
pass the history timeline file name.  Internally, we don't do any
encoding or bytea escape handling, but TEXT seems closest.  This should
cause no behavioral change.

Reported-by: Brar Piening

Discussion: https://postgr.es/m/6a1b9cd9-17e3-df67-be55-86102af6bdf5@gmx.de

Backpatch-through: master
2020-11-12 14:08:59 -05:00
Fujii Masao 1a2ae7c50f Use standard SIGHUP and SIGTERM handlers in walreceiver.
Commit 1e53fe0e70 changed background processes so that they use
standard SIGHUP handler. Like that, this commit makes walreceiver
use standard SIGHUP and SIGTERM handlers, to simplify the code.

As the side effect of this commit, walreceiver can wake up and process
the configuration files promptly when receiving SIGHUP. Because the
standard SIGHUP handler sets the latch. On the other hand, previously
there could be a time lag between the receipt of SIGHUP and
the process of configuration files since the dedicated handler didn't
set the latch.

Author: Bharath Rupireddy, tweaked by Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Fujii Masao
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
2020-11-12 13:25:23 +09:00
Michael Paquier 03f9cd93ea Remove useless SHA256 initialization when not using backup manifests
Attempting to take a base backup with Postgres linking to a build of
OpenSSL with FIPS enabled currently fails with or even without a backup
manifest requested because of this mandatory SHA256 initialization used
for the manifest file itself.  However, there is no need to do this
initialization at all if backup manifests are not needed because there
is no data to append to the manifest.

Note that being able to use backup manifests with OpenSSL+FIPS requires
a switch of the SHA2 implementation to use EVP, which would cause an ABI
breakage so this cannot be backpatched to 13 as it has been already
released, but at least avoiding this SHA256 initialization gives users
the possibility to take a base backup even when specifying --no-manifest
with pg_basebackup.

Author: Michael Paquier
Discussion: https://postgr.es/m/20201110020014.GE1887@paquier.xyz
Backpatch-through: 13
2020-11-12 10:56:33 +09:00
Tom Lane ec29427ce2 Fix and simplify some usages of TimestampDifference().
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format.  This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.

Two of these call sites were in fact wrong:

* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.

* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended.  Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units.  This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.

There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds.  It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.

Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.

Alexey Kondratov and Tom Lane

Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
2020-11-10 22:51:54 -05:00
Peter Eisentraut bdc4edbea6 Move catalog index declarations
Move the system catalog index declarations from catalog/indexing.h to
the respective parent tables' catalog/pg_*.h files.  The original
reason for having it split was that the old genbki system produced the
output in the order of the catalog files it read, so all the indexing
stuff needed to come separately.  But this is no longer the case, and
keeping it together makes more sense.

Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/c7cc82d6-f976-75d6-2e3e-b03d2cab26bb@2ndquadrant.com
2020-11-07 12:26:24 +01:00
Amit Kapila 644f0d7cc9 Use Enum for top level logical replication message types.
Logical replication protocol uses a single byte character to identify a
message type in logical replication protocol. The code uses string
literals for the same. Use Enum so that

1. All the string literals used can be found at a single place. This
makes it easy to add more types without the risk of conflicts.

2. It's easy to locate the code handling a given message type.

3. When used with switch statements, it is easy to identify the missing
cases using -Wswitch.

Author: Ashutosh Bapat
Reviewed-by: Kyotaro Horiguchi, Andres Freund, Peter Smith and Amit Kapila
Discussion: https://postgr.es/m/CAExHW5uPzQ7L0oAd_ENyvaiYMOPgkrAoJpE+ZY5-obdcVT6NPg@mail.gmail.com
2020-11-02 08:18:18 +05:30
Amit Kapila 8e90ec5580 Track statistics for streaming of changes from ReorderBuffer.
This adds the statistics about transactions streamed to the decoding
output plugin from ReorderBuffer. Users can query the
pg_stat_replication_slots view to check these stats and call
pg_stat_reset_replication_slot to reset the stats of a particular slot.
Users can pass NULL in pg_stat_reset_replication_slot to reset stats of
all the slots.

Commit 9868167500 has added the basic infrastructure to capture the stats
of slot and this commit extends the statistics collector to track
additional information about slots.

Bump the catversion as we have added new columns in the catalog entry.

Author: Ajin Cherian and Amit Kapila
Reviewed-by: Sawada Masahiko and Dilip Kumar
Discussion: https://postgr.es/m/CAA4eK1+chpEomLzgSoky-D31qev19AmECNiEAietPQUGEFhtVA@mail.gmail.com
2020-10-29 09:11:51 +05:30
Tom Lane ad77039fad Calculate extraUpdatedCols in query rewriter, not parser.
It's unsafe to do this at parse time because addition of generated
columns to a table would not invalidate stored rules containing
UPDATEs on the table ... but there might now be dependent generated
columns that were not there when the rule was made.  This also fixes
an oversight that rewriteTargetView failed to update extraUpdatedCols
when transforming an UPDATE on an updatable view.  (Since the new
calculation is downstream of that, rewriteTargetView doesn't actually
need to do anything; but before, there was a demonstrable bug there.)

In v13 and HEAD, this leads to easily-visible bugs because (since
commit c6679e4fc) we won't recalculate generated columns that aren't
listed in extraUpdatedCols.  In v12 this bitmap is mostly just used
for trigger-firing decisions, so you'd only notice a problem if a
trigger cared whether a generated column had been updated.

I'd complained about this back in May, but then forgot about it
until bug #16671 from Michael Paul Killian revived the issue.

Back-patch to v12 where this field was introduced.  If existing
stored rules contain any extraUpdatedCols values, they'll be
ignored because the rewriter will overwrite them, so the bug will
be fixed even for existing rules.  (But note that if someone were
to update to 13.1 or 12.5, store some rules with UPDATEs on tables
having generated columns, and then downgrade to a prior minor version,
they might observe issues similar to what this patch fixes.  That
seems unlikely enough to not be worth going to a lot of effort to fix.)

Discussion: https://postgr.es/m/10206.1588964727@sss.pgh.pa.us
Discussion: https://postgr.es/m/16671-2fa55851859fb166@postgresql.org
2020-10-28 13:47:02 -04:00
Heikki Linnakangas fb5883da86 Remove PartitionRoutingInfo struct.
The extra indirection neeeded to access its members via its enclosing
ResultRelInfo seems pointless. Move all the fields from
PartitionRoutingInfo to ResultRelInfo.

Author: Amit Langote
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqFViT47Zbr_ASBejiK7iDG8%3DQ1swQ-tjM6caRPQ67pT%3Dw%40mail.gmail.com
2020-10-19 14:42:55 +03:00
Alvaro Herrera b05fe7b442 Review logical replication tablesync code
Most importantly, remove optimization in LogicalRepSyncTableStart that
skips the normal walrcv_startstreaming/endstreaming dance.  The
optimization is not critically important for production uses anyway,
since it only fires in cases with no activity, and saves an
uninteresting amount of work even then.  Critically, it obscures bugs by
hiding the interesting code path from test cases.

Also: in GetSubscriptionRelState, remove pointless relation open; access
pg_subscription_rel->srsubstate with GETSTRUCT as is typical rather than
SysCacheGetAttr; remove unused 'missing_ok' argument.
In wait_for_relation_state_change, use explicit catalog snapshot
invalidation rather than obscurely (and expensively) through
GetLatestSnapshot.
In various places: sprinkle comments more liberally and rewrite a number
of them.  Other cosmetic code improvements.

No backpatch, since no bug is being fixed here.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Petr Jelínek <petr.jelinek@2ndquadrant.com>
Discussion: https://postgr.es/m/20201010190637.GA5774@alvherre.pgsql
2020-10-15 11:35:51 -03:00
David Rowley 110d81728a Fixup some appendStringInfo and appendPQExpBuffer calls
A number of places were using appendStringInfo() when they could have been
using appendStringInfoString() instead.  While there's no functionality
change there, it's just more efficient to use appendStringInfoString()
when no formatting is required.  Likewise for some
appendStringInfoString() calls which were just appending a single char.
We can just use appendStringInfoChar() for that.

Additionally, many places were using appendPQExpBuffer() when they could
have used appendPQExpBufferStr(). Change those too.

Patch by Zhijie Hou, but further searching by me found significantly more
places that deserved the same treatment.

Author: Zhijie Hou, David Rowley
Discussion: https://postgr.es/m/cb172cf4361e4c7ba7167429070979d4@G08CNEXMBPEKD05.g08.fujitsu.local
2020-10-15 20:35:17 +13:00
Amit Kapila d7eb52d718 Execute invalidation messages for each XLOG_XACT_INVALIDATIONS message
during logical decoding.

Prior to commit c55040ccd0 we have no way of knowing the invalidations
before commit. So, while decoding we use to execute all the invalidations
at each command end as we had no way of knowing which invalidations
happened before that command. Due to this, transactions involving large
amounts of DDLs use to take more time and also lead to high CPU usage. But
now we know specific invalidations at each command end so we execute only
required invalidations.

It has been observed that decoding of a transaction containing truncation
of a table with 1000 partitions would be finished in 1s whereas before
this patch it used to take 4-5 minutes.

Author: Dilip Kumar
Reviewed-by: Amit Kapila and Keisuke Kuroda
Discussion: https://postgr.es/m/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK+NqhAWJyw8psQ@mail.gmail.com
2020-10-15 08:17:51 +05:30
Alvaro Herrera 4e9821b6fa Restore replication protocol's duplicate command tags
I removed the duplicate command tags for START_REPLICATION inadvertently
in commit 07082b08cc, but the replication protocol requires them.  The
fact that the replication protocol was broken was not noticed because
all our test cases use an optimized code path that exits early, failing
to verify that the behavior is correct for non-optimized cases.  Put
them back.

Also document this protocol quirk.

Add a test case that shows the failure.  It might still succeed even
without the patch when run on a fast enough server, but it suffices to
show the bug in enough cases that it would be noticed in buildfarm.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Henry Hinze <henry.hinze@gmail.com>
Reviewed-by: Petr Jelínek <petr.jelinek@2ndquadrant.com>
Discussion: https://postgr.es/m/16643-eaadeb2a1a58d28c@postgresql.org
2020-10-14 20:12:26 -03:00
Heikki Linnakangas a04daa97a4 Remove es_result_relation_info from EState.
Maintaining 'es_result_relation_info' correctly at all times has become
cumbersome, especially with partitioning where each partition gets its
own result relation info. Having to set and reset it across arbitrary
operations has caused bugs in the past.

This changes all the places that used 'es_result_relation_info', to
receive the currently active ResultRelInfo via function parameters
instead.

Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
2020-10-14 11:41:40 +03:00
Heikki Linnakangas 1375422c78 Create ResultRelInfos later in InitPlan, index them by RT index.
Instead of allocating all the ResultRelInfos upfront in one big array,
allocate them in ExecInitModifyTable(). es_result_relations is now an
array of ResultRelInfo pointers, rather than an array of structs, and it
is indexed by the RT index.

This simplifies things: we get rid of the separate concept of a "result
rel index", and don't need to set it in setrefs.c anymore. This also
allows follow-up optimizations (not included in this commit yet) to skip
initializing ResultRelInfos for target relations that were not needed at
runtime, and removal of the es_result_relation_info pointer.

The EState arrays of regular result rels and root result rels are merged
into one array. Similarly, the resultRelations and rootResultRelations
lists in PlannedStmt are merged into one. It's not actually clear to me
why they were kept separate in the first place, but now that the
es_result_relations array is indexed by RT index, it certainly seems
pointless.

The PlannedStmt->resultRelations list is now only needed for
ExecRelationIsTargetRelation(). One visible effect of this change is that
ExecRelationIsTargetRelation() will now return 'true' also for the
partition root, if a partitioned table is updated. That seems like a good
thing, although the function isn't used in core code, and I don't see any
reason for an FDW to call it on a partition root.

Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
2020-10-13 12:57:02 +03:00
Amit Kapila f13f2e4841 Fix typos in logical.c and reorderbuffer.c.
Reviewed-by: Sawada  Masahiko
Discussion: https://postgr.es/m/CAA4eK1K6zTpuqf_d7wXCBjo_EF0_B6Fz3Ecp71Vq18t=wG-nzg@mail.gmail.com
2020-10-09 08:16:43 +05:30
Amit Kapila 9868167500 Track statistics for spilling of changes from ReorderBuffer.
This adds the statistics about transactions spilled to disk from
ReorderBuffer. Users can query the pg_stat_replication_slots view to check
these stats and call pg_stat_reset_replication_slot to reset the stats of
a particular slot. Users can pass NULL in pg_stat_reset_replication_slot
to reset stats of all the slots.

This commit extends the statistics collector to track this information
about slots.

Author: Sawada Masahiko and Amit Kapila
Reviewed-by: Amit Kapila and Dilip Kumar
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
2020-10-08 09:09:08 +05:30
Amit Kapila f07707099c Display the names of missing columns in error during logical replication.
In logical replication when a subscriber is missing some columns, it
currently emits an error message that says "some" columns are missing, but
it doesn't specify the missing column names. Change that to display
missing column names which makes an error to be more informative to the
user.

We have decided not to backpatch this commit as this is a minor usability
improvement and no user has reported this.

Reported-by: Bharath Rupireddy
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi and Amit Kapila
Discussion: https://postgr.es/m/CALj2ACVkW-EXH_4pmBK8tNeHRz5ksUC4WddGactuCjPiBch-cg@mail.gmail.com
2020-10-07 08:14:19 +05:30
Fujii Masao fd26f78231 Archive timeline history files in standby if archive_mode is set to "always".
Previously the standby server didn't archive timeline history files
streamed from the primary even when archive_mode is set to "always",
while it archives the streamed WAL files. This could cause the PITR to
fail because there was no required timeline history file in the archive.
The cause of this issue was that walreceiver didn't mark those files as
ready for archiving.

This commit makes walreceiver mark those streamed timeline history
files as ready for archiving if archive_mode=always. Then the archiver
process archives the marked timeline history files.

Back-patch to all supported versions.

Reported-by: Grigory Smolkin
Author: Grigory Smolkin, Fujii Masao
Reviewed-by: David Zhang, Anastasia Lubennikova
Discussion: https://postgr.es/m/54b059d4-2b48-13a4-6f43-95a087c92367@postgrespro.ru
2020-09-29 16:21:46 +09:00
Amit Kapila 079d0cacf4 Fix the logical replication from HEAD to lower versions.
Commit 464824323e changed the logical replication protocol to allow the
streaming of in-progress transactions and used the new version of protocol
irrespective of the server version. Use the appropriate version of the
protocol based on the server version.

Reported-by: Ashutosh Sharma
Author: Dilip Kumar
Reviewed-by: Ashutosh Sharma and Amit Kapila
Discussion: https://postgr.es/m/CAE9k0P=9OpXcNrcU5Gsvd5MZ8GFpiN833vNHzX6Uc=8+h1ft1Q@mail.gmail.com
2020-09-26 10:13:51 +05:30
Amit Kapila 24fb35e111 Update file header comments for logical/relation.c.
Author: Amit Langote
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CA+HiwqE20oZoix13JyCeALpTf_SmjarZWtBFe5sND6zz+iupAw@mail.gmail.com
2020-09-18 10:14:30 +05:30
Tom Lane babef40c9a Teach walsender to update its process title for replication commands.
Because the code path taken for SQL commands executed in a walsender
will update the process title, we pretty much have to update the
title for replication commands as well.  Otherwise, the title shows
"idle" for the rest of a logical walsender's lifetime once it's
executed any SQL command.

Playing with this, I confirm that a walsender now typically spends
most of its life reporting
	walsender postgres [local] START_REPLICATION
Considering this in isolation, it might be better to have it say
	walsender postgres [local] sending replication data
However, consistency with the other cases seems to be a stronger
argument.

In passing, remove duplicative pgstat_report_activity call.

Discussion: https://postgr.es/m/880181.1600026471@sss.pgh.pa.us
2020-09-16 21:06:50 -04:00
Alvaro Herrera 07082b08cc
Fix bogus completion tag usage in walsender
Since commit fd5942c18f (2012, 9.3-era), walsender has been sending
completion tags for certain replication commands twice -- and they're
not even consistent.  Apparently neither libpq nor JDBC have a problem
with it, but it's not kosher.  Fix by remove the EndCommand() call in
the common code path for them all, and inserting specific calls to
EndReplicationCommand() specifically in those places where it's needed.

EndReplicationCommand() is a new simple function to send the completion
tag for replication commands.  Do this instead of sending a generic
SELECT completion tag for them all, which was also pretty bogus (if
innocuous).  While at it, change StartReplication() to use
EndReplicationCommand() instead of pg_puttextmessage().

In commit 2f9661311b, I failed to realize that replication commands
are not close-enough kin of regular SQL commands, so the
DROP_REPLICATION_SLOT tag I added is undeserved and a type pun.  Take it
out.

Backpatch to 13, where the latter commit appeared.  The duplicate tag
has been sent since 9.3, but since nothing is broken, it doesn't seem
worth fixing.

Per complaints from Tom Lane.

Discussion: https://postgr.es/m/1347966.1600195735@sss.pgh.pa.us
2020-09-16 21:16:25 -03:00
Tom Lane 44fc6e259b Centralize setup of SIGQUIT handling for postmaster child processes.
We decided that the policy established in commit 7634bd4f6 for
the bgwriter, checkpointer, walwriter, and walreceiver processes,
namely that they should accept SIGQUIT at all times, really ought
to apply uniformly to all postmaster children.  Therefore, get
rid of the duplicative and inconsistent per-process code for
establishing that signal handler and removing SIGQUIT from BlockSig.
Instead, make InitPostmasterChild do it.

The handler set up by InitPostmasterChild is SignalHandlerForCrashExit,
which just summarily does _exit(2).  In interactive backends, we
almost immediately replace that with quickdie, since we would prefer
to try to tell the client that we're dying.  However, this patch is
changing the behavior of autovacuum (both launcher and workers), as
well as walsenders.  Those processes formerly also used quickdie,
but AFAICS that was just mindless copy-and-paste: they don't have
any interactive client that's likely to benefit from being told this.

The stats collector continues to be an outlier, in that it thinks
SIGQUIT means normal exit.  That should probably be changed for
consistency, but there's another patch set where that's being
dealt with, so I didn't do so here.

Discussion: https://postgr.es/m/644875.1599933441@sss.pgh.pa.us
2020-09-16 16:04:36 -04:00
Tom Lane 2000b6c10a Don't fetch partition check expression during InitResultRelInfo.
Since there is only one place that actually needs the partition check
expression, namely ExecPartitionCheck, it's better to fetch it from
the relcache there.  In this way we will never fetch it at all if
the query never has use for it, and we still fetch it just once when
we do need it.

The reason for taking an interest in this is that if the relcache
doesn't already have the check expression cached, fetching it
requires obtaining AccessShareLock on the partition root.  That
means that operations that look like they should only touch the
partition itself will also take a lock on the root.  In particular
we observed that TRUNCATE on a partition may take a lock on the
partition's root, contributing to a deadlock situation in parallel
pg_restore.

As written, this patch does have a small cost, which is that we
are microscopically reducing efficiency for the case where a partition
has an empty check expression.  ExecPartitionCheck will be called,
and will go through the motions of setting up and checking an empty
qual, where before it would not have been called at all.  We could
avoid that by adding a separate boolean flag to track whether there
is a partition expression to test.  However, this case only arises
for a default partition with no siblings, which surely is not an
interesting case in practice.  Hence adding complexity for it
does not seem like a good trade-off.

Amit Langote, per a suggestion by me

Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
2020-09-16 14:28:18 -04:00
Tom Lane 3d65b0593c Fix bogus cache-invalidation logic in logical replication worker.
The code recorded cache invalidation events by zeroing the "localreloid"
field of affected cache entries.  However, it's possible for an inval
event to occur even while we have the entry open and locked.  So an
ill-timed inval could result in "cache lookup failed for relation 0"
errors, if the worker's code tried to use the cleared field.  We can
fix that by creating a separate bool field to record whether the entry
needs to be revalidated.  (In the back branches, cram the bool into
what had been padding space, to avoid an ABI break in the somewhat
unlikely event that any extension is looking at this struct.)

Also, rearrange the logic in logicalrep_rel_open so that it
does the right thing in cases where table_open would fail.
We should retry the lookup by name in that case, but we didn't.

The real-world impact of this is probably small.  In the first place,
the error conditions are very low probability, and in the second place,
the worker would just exit and get restarted.  We only noticed because
in a CLOBBER_CACHE_ALWAYS build, the failure can occur repeatedly,
preventing the worker from making progress.  Nonetheless, it's clearly
a bug, and it impedes a useful type of testing; so back-patch to v10
where this code was introduced.

Discussion: https://postgr.es/m/1032727.1600096803@sss.pgh.pa.us
2020-09-16 12:07:31 -04:00
Amit Kapila 69bd60672a Fix initialization of RelationSyncEntry for streaming transactions.
In commit 464824323e, for each RelationSyncEntry we maintained the list
of xids (streamed_txns) for which we have already sent the schema. This
helps us to track when to send the schema to the downstream node for
replication of streaming transactions. Before this list got initialized,
we were processing invalidation messages which access this list and led
to an assertion failure.

In passing, clean up the nearby code:

* Initialize the list of xids with NIL instead of NULL which is our usual
coding practice.
* Remove the MemoryContext switch for creating a RelationSyncEntry in dynahash.

Diagnosed-by: Amit Kapila and Tom Lane
Author: Amit Kapila
Reviewed-by: Tom Lane and Dilip Kumar
Discussion: https://postgr.es/m/904373.1600033123@sss.pgh.pa.us
2020-09-16 07:45:44 +05:30
Tom Lane f560209c6e Make walsenders show their replication commands in pg_stat_activity.
A walsender process that has executed a SQL command left the text of
that command in pg_stat_activity.query indefinitely, which is quite
confusing if it's in RUNNING state but not doing that query.  An easy
and useful fix is to treat replication commands as if they were SQL
queries, and show them in pg_stat_activity according to the same rules
as for regular queries.  While we're at it, it seems also sensible to
set debug_query_string, allowing error logging and debugging to see
the replication command.

While here, clean up assorted silliness in exec_replication_command:

* The SQLCmd path failed to restore CurrentMemoryContext to the caller's
value, and failed to delete the temp context created in this routine.
It's only through great good fortune that these oversights did not
result in long-term memory leaks or other problems.  It seems cleaner
to code SQLCmd as a separate early-exit path, so do it like that.

* Remove useless duplicate call of SnapBuildClearExportedSnapshot().

* replication_scanner_finish() was never called.

None of those things are significant enough to merit a backpatch,
so this is for HEAD only.

Discussion: https://postgr.es/m/880181.1600026471@sss.pgh.pa.us
2020-09-14 12:35:00 -04:00
Peter Eisentraut 3e0242b24c Message fixes and style improvements 2020-09-14 06:42:30 +02:00
Amit Kapila ddd5f6d260 Remove unused function declaration in logicalproto.h.
In the passing, fix a typo in pgoutput.c.

Reported-by: Tomas Vondra
Author: Tomas Vondra
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/20200909084353.pncuclpbwlr7vylh@development
2020-09-12 07:47:53 +05:30
Tom Lane 7634bd4f6d Accept SIGQUIT during error recovery in auxiliary processes.
The bgwriter, checkpointer, walwriter, and walreceiver processes
claimed to allow SIGQUIT "at all times".  In reality SIGQUIT
would get re-blocked during error recovery, because we didn't
update the actual signal mask immediately, so sigsetjmp() would
save and reinstate a mask that includes SIGQUIT.

This appears to be simply a coding oversight.  There's never a
good reason to hold off SIGQUIT in these processes, because it's
going to just call _exit(2) which should be safe enough, especially
since the postmaster is going to tear down shared memory afterwards.
Hence, stick in PG_SETMASK() calls to install the modified BlockSig
mask immediately.

Also try to improve the comments around sigsetjmp blocks.  Most of
them were just referencing postgres.c, which is misleading because
actually postgres.c manages the signals differently.

No back-patch, since there's no evidence that this is causing any
problems in the field.

Discussion: https://postgr.es/m/CALDaNm1d1hHPZUg3xU4XjtWBOLCrA+-2cJcLpw-cePZ=GgDVfA@mail.gmail.com
2020-09-11 16:01:36 -04:00
Alvaro Herrera 9f1cf97bb5
Print WAL logical message contents in pg_waldump
This helps debuggability when looking at WAL streams containing logical
messages.

Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAExHW5sWx49rKmXbg5H1Xc1t+nRv9PaYKQmgw82HPt6vWDVmDg@mail.gmail.com
2020-09-10 19:37:02 -03:00
Michael Paquier aad546bd0a doc: Fix some grammar and inconsistencies
Some comments are fixed while on it.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20200818171702.GK17022@telsasoft.com
Backpatch-through: 9.6
2020-09-10 15:50:19 +09:00
Tom Lane f3e1e66196 Minor fixes in docs and error messages.
Alexander Lakhin

Discussion: https://postgr.es/m/ce7debdd-c943-d7a7-9b41-687107b27831@gmail.com
2020-09-09 11:53:39 -04:00
Tom Lane c8746f999e Fix over-eager ping'ing in logical replication receiver.
Commit 3f60f690f only partially fixed the broken-status-tracking
issue in LogicalRepApplyLoop: we need ping_sent to have the same
lifetime as last_recv_timestamp.  The effects are much less serious
than what that commit fixed, though.  AFAICS this would just lead to
extra ping requests being sent, once per second until the sender
responds.  Still, it's a bug, so backpatch to v10 as before.

Discussion: https://postgr.es/m/959627.1599248476@sss.pgh.pa.us
2020-09-04 20:33:36 -04:00
Tom Lane 9a851039aa Remove still more useless assignments.
Fix some more things scan-build pointed to as dead stores.  In some of
these cases, rearranging the code a little leads to more readable
code IMO.  It's all cosmetic, though.

Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com
2020-09-04 20:33:36 -04:00
Amit Kapila ac15b499f7 Fix inline marking introduced in commit 464824323e.
Forgot to add inline marking in changes_filename() declaration. In the passing, add
inline marking for a similar function subxact_filename().

Reported-By: Nathan Bossart
Discussion: https://postgr.es/m/E98FBE8F-B878-480D-A728-A60C6EED3047@amazon.com
2020-09-04 11:25:16 +05:30
Amit Kapila 464824323e Add support for streaming to built-in logical replication.
To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, so identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.

We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open so we
don't have where to send the data anyway.

Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-09-03 07:54:07 +05:30
Fujii Masao be9788e998 Avoid unnecessary acquisition of SyncRepLock in transaction commit time.
In SyncRepWaitForLSN() routine called in transaction commit time,
SyncRepLock is necessary to atomically both check the shared
sync_standbys_defined flag and operate the sync replication wait-queue.
On the other hand, when the flag is false, the lock is not necessary
because the wait-queue is not touched. But due to the changes by
commit 48c9f49265, previously the lock was taken whatever the flag was.
This could cause unnecessary performance overhead in every transaction
commit time. Therefore this commit avoids that unnecessary aquisition
of SyncRepLock.

Author: Fujii Masao
Reviewed-by: Asim Praveen, Masahiko Sawada,
Discussion: https://postgr.es/m/20200406050332.nsscfqjzk2d57zyx@alap3.anarazel.de
2020-09-02 10:55:55 +09:00
Andres Freund 623a9ba79b snapshot scalability: cache snapshots using a xact completion counter.
Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:

This commit introduces xactCompletionCount, which tracks the number of
top-level transactions with xids (i.e. which may have modified the database)
that completed in some form since the start of the server.

We can avoid rebuilding the snapshot's contents whenever the current
xactCompletionCount is the same as it was when the snapshot was
originally built.  Currently this check happens while holding
ProcArrayLock. While it's likely possible to perform the check without
acquiring ProcArrayLock, it seems better to do that separately /
later, some careful analysis is required. Even with the lock this is a
significant win on its own.

On a smaller two socket machine this gains another ~1.03x, on a larger
machine the effect is roughly double (earlier patch version tested
though).  If we were able to safely avoid the lock there'd be another
significant gain on top of that.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-17 21:08:30 -07:00
Andres Freund 5788e258bb snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags.
Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-14 15:33:35 -07:00
Andres Freund 1f51c17c68 snapshot scalability: Move PGXACT->xmin back to PGPROC.
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.

After the changes in dc7420c2c9, this is a very straight-forward change.

For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';.  The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 16:25:21 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Peter Eisentraut 1784f278a6 Replace remaining StrNCpy() by strlcpy()
They are equivalent, except that StrNCpy() zero-fills the entire
destination buffer instead of providing just one trailing zero.  For
all but a tiny number of callers, that's just overhead rather than
being desirable.

Remove StrNCpy() as it is now unused.

In some cases, namestrcpy() is the more appropriate function to use.
While we're here, simplify the API of namestrcpy(): Remove the return
value, don't check for NULL input.  Nothing was using that anyway.
Also, remove a few unused name-related functions.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
2020-08-10 23:20:37 +02:00
Noah Misch 11da97024a Empty search_path in logical replication apply worker and walsender.
This is like CVE-2018-1058 commit
582edc369c.  Today, a malicious user of a
publisher or subscriber database can invoke arbitrary SQL functions
under an identity running replication, often a superuser.  This fix may
cause "does not exist" or "no schema has been selected to create in"
errors in a replication process.  After upgrading, consider watching
server logs for these errors.  Objects accruing schema qualification in
the wake of the earlier commit are unlikely to need further correction.
Back-patch to v10, which introduced logical replication.

Security: CVE-2020-14349
2020-08-10 09:22:54 -07:00
Tom Lane 1c164ef3d2 Remove useless Assert.
Testing that an unsigned variable is >= 0 is pretty pointless,
as noted by Coverity and numerous buildfarm members.

In passing, add comment about new uses of "volatile" --- Coverity
doesn't much like that either, but it seems probably necessary.
2020-08-09 11:32:31 -04:00
Alvaro Herrera 470687b4a5
walsnd: Don't set waiting_for_ping_response spuriously
Ashutosh Bapat noticed that when logical walsender needs to wait for
WAL, and it realizes that it must send a keepalive message to
walreceiver to update the sent-LSN, which *does not* request a reply
from walreceiver, it wrongly sets the flag that it's going to wait for
that reply.  That means that any future would-be sender of feedback
messages ends up not sending a feedback message, because they all
believe that a reply is expected.

With built-in logical replication there's not much harm in this, because
WalReceiverMain will send a ping-back every wal_receiver_timeout/2
anyway; but with other logical replication systems (e.g. pglogical) it
can cause significant pain.

This problem was introduced in commit 41d5f8ad73, where the
request-reply flag was changed from true to false to WalSndKeepalive,
without at the same time removing the line that sets
waiting_for_ping_response.

Just removing that line would be a sufficient fix, but it seems better
to shift the responsibility of setting the flag to WalSndKeepalive
itself instead of requiring caller to do it; this is clearly less
error-prone.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Backpatch: 9.5 and up
Discussion: https://postgr.es/m/20200806225558.GA22401@alvherre.pgsql
2020-08-08 12:31:55 -04:00
Peter Eisentraut a13421c96c Add some const decorations 2020-08-08 07:31:52 +02:00
Amit Kapila 7259736a6e Implement streaming mode in ReorderBuffer.
Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, sometimes if we have incomplete toast or speculative insert we
spill to the disk because we can't generate the complete tuple and stream.
And, as soon as we get the complete tuple we stream the transaction
including the serialized changes.

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).

We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such a sqlerrcode aborts the decoding of the current
transaction and continue with the decoding of other transactions.

We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to.  The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).

We also provide a new option via SQL APIs to fetch the changes being
streamed.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-08-08 07:47:06 +05:30
Amit Kapila 45fdc9738b Extend the logical decoding output plugin API with stream methods.
This adds seven methods to the output plugin API, adding support for
streaming changes of large in-progress transactions.

* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/start_stop are used to demarcate a chunk of changes
streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to "allow
the streaming mode in ReorderBuffer" will use these APIs.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-07-28 08:09:44 +05:30
Amit Kapila c55040ccd0 WAL Log invalidations at command end with wal_level=logical.
When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.

This patch is required to allow the streaming of in-progress transactions
in logical decoding.  The actual work to allow streaming will be committed
as a separate patch.

We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does not need
to be changed.

The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.

The invalidations are decoded and accumulated in top-transaction, and then
executed during replay.  This obviates the need to decode the
invalidations as part of a commit record.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_XACT_INVALIDATIONS.

Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-07-23 08:34:48 +05:30
Tom Lane 670c0a1d47 Weaken type-OID-matching checks in array_recv and record_recv.
Rather than always insisting on an exact match of the type OID in the
data to the element type or column type we expect, complain only when
both OIDs fall within the manually-assigned range.  This acknowledges
the reality that user-defined types don't have stable OIDs, while
still preserving some of the mistake-detection value of the old test.

(It's not entirely clear whether to error if one OID is manually
assigned and the other isn't.  But perhaps that case could arise in
cross-version cases where a former extension type has been imported
into core, so I let it pass.)

This change allows us to remove the prohibition on binary transfer
of user-defined arrays and composites in the recently-landed support
for binary logical replication (commit 9de77b545).  We can just
unconditionally drop that check, since if the client has asked for
binary transfer it must be >= v14 and must have this change.

Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
2020-07-21 15:19:46 -04:00
Tom Lane d5daae47db Fix construction of updated-columns bitmap in logical replication.
Commit b9c130a1f failed to apply the publisher-to-subscriber column
mapping while checking which columns were updated.  Perhaps less
significantly, it didn't exclude dropped columns either.  This could
result in an incorrect updated-columns bitmap and thus wrong decisions
about whether to fire column-specific triggers on the subscriber while
applying updates.  In HEAD (since commit 9de77b545), it could also
result in accesses off the end of the colstatus array, as detected by
buildfarm member skink.  Fix the logic, and adjust 003_constraints.pl
so that the problem is exposed in unpatched code.

In HEAD, also add some assertions to check that we don't access off
the ends of these newly variable-sized arrays.

Back-patch to v10, as b9c130a1f was.

Discussion: https://postgr.es/m/CAH2-Wz=79hKQ4++c5A060RYbjTHgiYTHz=fw6mptCtgghH2gJA@mail.gmail.com
2020-07-20 13:40:16 -04:00
Fujii Masao c3fe108c02 Rename wal_keep_segments to wal_keep_size.
max_slot_wal_keep_size that was added in v13 and wal_keep_segments are
the GUC parameters to specify how much WAL files to retain for
the standby servers. While max_slot_wal_keep_size accepts the number of
bytes of WAL files, wal_keep_segments accepts the number of WAL files.
This difference of setting units between those similar parameters could
be confusing to users.

To alleviate this situation, this commit renames wal_keep_segments to
wal_keep_size, and make users specify the WAL size in it instead of
the number of WAL files.

There was also the idea to rename max_slot_wal_keep_size to
max_slot_wal_keep_segments, in the discussion. But we have been moving
away from measuring in segments, for example, checkpoint_segments was
replaced by max_wal_size. So we concluded to rename wal_keep_segments
to wal_keep_size.

Back-patch to v13 where max_slot_wal_keep_size was added.

Author: Fujii Masao
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
2020-07-20 13:30:18 +09:00
Amit Kapila 0bead9af48 Immediately WAL-log subtransaction and top-level XID association.
The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We can not remove the existing XLOG_XACT_ASSIGNMENT WAL as that is
required for avoiding overflow in the hot standby snapshot.

Bump XLOG_PAGE_MAGIC, since this introduces XLR_BLOCK_ID_TOPLEVEL_XID.

Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
2020-07-20 08:48:26 +05:30
Tom Lane 9de77b5453 Allow logical replication to transfer data in binary format.
This patch adds a "binary" option to CREATE/ALTER SUBSCRIPTION.
When that's set, the publisher will send data using the data type's
typsend function if any, rather than typoutput.  This is generally
faster, if slightly less robust.

As committed, we won't try to transfer user-defined array or composite
types in binary, for fear that type OIDs won't match at the subscriber.
This might be changed later, but it seems like fit material for a
follow-on patch.

Dave Cramer, reviewed by Daniel Gustafsson, Petr Jelinek, and others;
adjusted some by me

Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
2020-07-18 12:44:51 -04:00
Amit Kapila f41fbee7e7 Adjust minor comment in reorderbuffer.c.
Author: Dave Cramer
Reviewed-by: David G. Johnston
Discussion: https://postgr.es/m/CADK3HHL8do4Fp1bsymgNasx375njV3AR7zY3UgYwzbL_Dx-n2Q@mail.gmail.com
2020-07-18 09:57:23 +05:30
Amit Kapila df7c5cb16e Fix comments in reorderbuffer.c.
Author: Dave Cramer
Reviewed-by: David G. Johnston
Discussion: https://postgr.es/m/CADK3HHL8do4Fp1bsymgNasx375njV3AR7zY3UgYwzbL_Dx-n2Q@mail.gmail.com
2020-07-18 09:47:38 +05:30
Amit Kapila 01160a3de3 Fix signal handler setup for SIGHUP in the apply launcher process.
Commit 1e53fe0e70 has unified the usage of the config-file reload flag by
using the same signal handler function for the SIGHUP signal at many places
in the code.  By mistake, it used the wrong SIGNAL in apply launcher
process for the SIGHUP signal handler function.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CALj2ACVzHCRnS20bOiEHaLtP5PVBENZQn4khdsSJQgOv_GM-LA@mail.gmail.com
2020-07-17 09:02:44 +05:30
Amit Kapila d973747281 Revert "Track statistics for spilling of changes from ReorderBuffer".
The stats with this commit was available only for WALSenders, however,
users might want to see for backends doing logical decoding via SQL API.
Then, users might want to reset and access these stats across server
restart which was not possible with the current patch.

List of commits reverted:

caa3c4242c   Don't call elog() while holding spinlock.
e641b2a995   Doc: Update the documentation for spilled transaction
statistics.
5883f5fe27   Fix unportable printf format introduced in commit 9290ad198.
9290ad198b   Track statistics for spilling of changes from ReorderBuffer.

Additionaly, remove the release notes entry for this feature.

Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
2020-07-13 08:53:23 +05:30
Andres Freund 5e7bbb5286 code: replace 'master' with 'primary' where appropriate.
Also changed "in the primary" to "on the primary", and added a few
"the" before "primary".

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 12:57:23 -07:00
Alvaro Herrera a8aaa0c786
Morph pg_replication_slots.min_safe_lsn to safe_wal_size
The previous definition of the column was almost universally disliked,
so provide this updated definition which is more useful for monitoring
purposes: a large positive value is good, while zero or a negative value
means danger.  This should be operationally more convenient.

Backpatch to 13, where the new column to pg_replication_slots (and the
feature it represents) were added.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
2020-07-07 13:08:00 -04:00
Amit Kapila 04c7f4144f Remove extra whitespace in comments atop ReorderBufferCheckMemoryLimit.
Backpatch-through: 13, where it was introduced
2020-07-06 08:49:09 +05:30
Michael Paquier 641dd167a3 Move description of libpqwalreceiver hooks out of the replication's README
src/backend/replication/README includes since 32bc08b a basic
description of the WAL receiver hooks available in walreceiver.h for a
module like libpqwalreceiver, but the README has never been updated to
reflect changes done to the hooks, so the contents of the README have
rotten with the time.  This commit moves the description from the README
to walreceiver.h, where it will be hard to miss that a description
update or addition is needed depending on the modifications done to the
hooks.

Each hook now includes a description of what it does in walreceiver.h,
and the replication's README mentions walreceiver.h.

Thanks also to Amit Kapila for the discussion.

Author: Michael Paquier
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/20200502024606.GA471944@paquier.xyz
2020-07-02 13:57:03 +09:00
Alvaro Herrera 4ae08cd5fd
Persist slot invalidation correctly
We failed to save slot to disk after invalidating it, so the state was
lost in case of server restart or crash.  Fix by marking it dirty and
flushing.

Also, if the slot is known invalidated we don't need to reason about the
LSN at all -- it's known invalidated.  Only test the LSN if the slot is
known not invalidated.

Author: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/17a69cfe-f1c1-a416-ee25-ae15427c69eb@oss.nttdata.com
2020-06-26 20:41:29 -04:00
Fujii Masao a82ba066ea Remove erroneous assertion from pg_copy_logical_replication_slot().
If restart_lsn of logical replication slot gets behind more than
max_slot_wal_keep_size from the current LSN, the logical replication slot
would be invalidated and its restart_lsn is reset to an invalid LSN.
If this logical replication slot with an invalid restart_lsn was specified as
the source slot in pg_copy_logical_replication_slot(), the function caused
the assertion failure unexpectedly.

This assertion was added because restart_lsn should not be invalid before.
But in v13, it can be invalid thanks to max_slot_wal_keep_size. So since this
assertion is no longer useful, this commit removes it.

This commit also changes the errcode in the error message that
pg_copy_logical_replication_slot() emits when the slot with an invalid
restart_lsn is specified, to more appropriate one.

Back-patch to v13 where max_slot_wal_keep_size was added and
the assertion was no longer valid.

Author: Fujii Masao
Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/f91de4fb-a7ab-b90e-8132-74796e049d51@oss.nttdata.com
2020-06-25 11:13:13 +09:00
Alvaro Herrera b8fd4e02c6
Adjust max_slot_wal_keep_size behavior per review
In pg_replication_slot, change output from normal/reserved/lost to
reserved/extended/unreserved/ lost, which better expresses the possible
states particularly near the time where segments are no longer safe but
checkpoint has not run yet.

Under the new definition, reserved means the slot is consuming WAL
that's still under the normal WAL size constraints; extended means it's
consuming WAL that's being protected by wal_keep_segments or the slot
itself, whose size is below max_slot_wal_keep_size; unreserved means the
WAL is no longer safe, but checkpoint has not yet removed those files.
Such as slot is in imminent danger, but can still continue for a little
while and may catch up to the reserved WAL space.

Also, there were some bugs in the calculations used to report the
status; fixed those.

Backpatch to 13.

Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
2020-06-24 14:23:39 -04:00
Alvaro Herrera 0188bb8253
Save slot's restart_lsn when invalidated due to size
We put it aside as invalidated_at, which let us show "lost" in
pg_replication slot.  Prior to this change, the state value was reported
as NULL.

Backpatch to 13.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200617.101707.1735599255100002667.horikyota.ntt@gmail.com
Discussion: https://postgr.es/m/20200407.120905.1507671100168805403.horikyota.ntt@gmail.com
2020-06-24 14:15:17 -04:00
Fujii Masao f9e9704f09 Fix issues in invalidation of obsolete replication slots.
This commit fixes the following issues.

1. There is the case where the slot is dropped while trying to invalidate it.
    InvalidateObsoleteReplicationSlots() did not handle this case, and
    which could cause checkpoint to fail.

2. InvalidateObsoleteReplicationSlots() could emit the same log message
    multiple times unnecessary. It should be logged only once.

3. When marking the slot as used, we always searched the target slot from
    all the replication slots even if we already found it. This could cause
    useless waste of cycles.

Back-patch to v13 where these issues were added as a part of
max_slot_wal_keep_size code.

Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://postgr.es/m/66c05b67-3396-042c-1b41-bfa6c3ddcf82@oss.nttdata.com
2020-06-19 17:15:52 +09:00
Michael Paquier b48df818dc Fix oldest xmin and LSN computation across repslots after advancing
Advancing a replication slot did not recompute the oldest xmin and LSN
values across replication slots, preventing resource removal like
segments not recycled at checkpoint time.  The original commit that
introduced the slot advancing in 9c7d06d never did the update of those
oldest values, and b0afdca removed this code.

This commit adds a TAP test to check segment recycling with advancing
for physical slots, enforcing an extra segment switch before advancing
to check if the segment gets correctly recycled after a checkpoint.

Reported-by: Andres Freund
Reviewed-by: Alexey Kondratov, Kyptaro Horiguchi
Discussion: https://postgr.es/m/20200609171904.kpltxxvjzislidks@alap3.anarazel.de
Backpatch-through: 11
2020-06-18 16:34:59 +09:00
Robert Haas 2fd2effc50 Improve server code to read files as part of a base backup.
Don't use fread(), since that doesn't necessarily set errno. We could
use read() instead, but it's even better to use pg_pread(), which
allows us to avoid some extra calls to seek to the desired location in
the file.

Also, advertise a wait event while reading from a file, as we do for
most other places where we're reading data from files.

Patch by me, reviewed by Hamid Akhtar.

Discussion: http://postgr.es/m/CA+TgmobBw-3573vMosGj06r72ajHsYeKtksT_oTxH8XvTL7DxA@mail.gmail.com
2020-06-17 11:39:17 -04:00
Robert Haas 453e0e3f0e Minor code cleanup for perform_base_backup().
Merge two calls to sendDir() that are exactly the same except for
the fifth argument. Adjust comments to match.

Also, don't bother checking whether tblspc_map_file is NULL. We
initialize it in all cases, so it can't be.

Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
2020-06-17 11:05:42 -04:00
Robert Haas 1fa092913d Don't export basebackup.c's sendTablespace().
Commit 72d422a522 made xlog.c call
sendTablespace() with the 'sizeonly' argument set to true, which
required basebackup.c to export sendTablespace(). However, that's
kind of ugly, so instead defer the call to sendTablespace() until
basebackup.c regains control. That way, it can still be a static
function.

Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
2020-06-17 10:57:34 -04:00
Thomas Munro 7897e3bb90 Fix buffile.c error handling.
Convert buffile.c error handling to use ereport.  This fixes cases where
I/O errors were indistinguishable from EOF or not reported.  Also remove
"%m" from error messages where errno would be bogus.  While we're
modifying those strings, add block numbers and short read byte counts
where appropriate.

Back-patch to all supported releases.

Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
2020-06-16 16:59:07 +12:00
Robert Haas 2961c9711c Assorted cleanup of tar-related code.
Introduce TAR_BLOCK_SIZE and replace many instances of 512 with
the new constant. Introduce function tarPaddingBytesRequired
and use it to replace numerous repetitions of (x + 511) & ~511.

Add preprocessor guards against multiple inclusion to pgtar.h.

Reformat the prototype for tarCreateHeader so it doesn't extend
beyond 80 characters.

Discussion: http://postgr.es/m/CA+TgmobWbfReO9-XFk8urR1K4wTNwqoHx_v56t7=T8KaiEoKNw@mail.gmail.com
2020-06-15 15:28:49 -04:00
Michael Paquier cc072641d4 Replace superuser check by ACLs for replication origin functions
This patch removes the hardcoded check for superuser privileges when
executing replication origin functions.  Instead, execution is revoked
from public, meaning that those functions can be executed by a superuser
and that access to them can be granted.

Author: Martín Marqués
Reviewed-by: Kyotaro Horiguchi, Michael Paquier, Masahiko Sawada
Discussion: https:/postgr.es/m/CAPdiE1xJMZOKQL3dgHMUrPqysZkgwzSMXETfKkHYnBAB7-0VRQ@mail.gmail.com
2020-06-14 12:40:37 +09:00
Michael Paquier aaf8c99050 Fix typos and some format mistakes in comments
Author: Justin Pryzby
Discussion: https://postgr.es/m/20200612023709.GC14879@telsasoft.com
2020-06-12 21:05:10 +09:00
Amit Kapila c5c000b103 Fix ReorderBuffer memory overflow check.
Commit cec2edfa78 introduced logical_decoding_work_mem to limit
ReorderBuffer memory usage. We spill the changes once the memory occupied
by changes exceeds logical_decoding_work_mem.  There was an assumption
in the code that by evicting the largest (sub)transaction we will come
under the memory limit as the selected transaction will be at least as
large as the most recent change (which caused us to go over the memory
limit).  However, that is not true because a user can reduce the
logical_decoding_work_mem to a smaller value before the most recent
change.

We fix it by allowing to evict the transactions until we reach under the
memory limit.

Reported-by: Fujii Masao
Author: Amit Kapila
Reviewed-by: Fujii Masao
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/2b7ba291-22e0-a187-d167-9e5309a3458d@oss.nttdata.com
2020-06-10 10:20:10 +05:30
Michael Paquier 879ad9f90e Fix crash in WAL sender when starting physical replication
Since database connections can be used with WAL senders in 9.4, it is
possible to use physical replication.  This commit fixes a crash when
starting physical replication with a WAL sender using a database
connection, caused by the refactoring done in 850196b.

There have been discussions about forbidding the use of physical
replication in a database connection, but this is left for later,
taking care only of the crash new to 13.

While on it, add a test to check for a failure when attempting logical
replication if the WAL sender does not have a database connection.  This
part is extracted from a larger patch by Kyotaro Horiguchi.

Reported-by: Vladimir Sitnikov
Author: Michael Paquier, Kyotaro Horiguchi
Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera
Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com
Backpatch-through: 13
2020-06-08 10:12:24 +09:00
Peter Eisentraut 0fd2a79a63 Spelling adjustments 2020-06-07 15:06:51 +02:00
Michael Paquier c1669fd581 Fix instance of elog() called while holding a spinlock
This broke the project rule to not call any complex code while a
spinlock is held.  Issue introduced by b89e151.

Discussion: https://postgr.es/m/20200602.161518.1399689010416646074.horikyota.ntt@gmail.com
Backpatch-through: 9.5
2020-06-04 10:17:49 +09:00
Tom Lane f88bd3139f Don't call palloc() while holding a spinlock, either.
Fix some more violations of the "only straight-line code inside a
spinlock" rule.  These are hazardous not only because they risk
holding the lock for an excessively long time, but because it's
possible for palloc to throw elog(ERROR), leaving a stuck spinlock
behind.

copy_replication_slot() had two separate places that did pallocs
while holding a spinlock.  We can make the code simpler and safer
by copying the whole ReplicationSlot struct into a local variable
while holding the spinlock, and then referencing that copy.
(While that's arguably more cycles than we really need to spend
holding the lock, the struct isn't all that big, and this way seems
far more maintainable than copying fields piecemeal.  Anyway this
is surely much cheaper than a palloc.)  That bug goes back to v12.

InvalidateObsoleteReplicationSlots() not only did a palloc while
holding a spinlock, but for extra sloppiness then leaked the memory
--- probably for the lifetime of the checkpointer process, though
I didn't try to verify that.  Fortunately that silliness is new
in HEAD.

pg_get_replication_slots() had a cosmetic violation of the rule,
in that it only assumed it's safe to call namecpy() while holding
a spinlock.  Still, that's a hazard waiting to bite somebody, and
there were some other cosmetic coding-rule violations in the same
function, so clean it up.  I back-patched this as far as v10; the
code exists before that but it looks different, and this didn't
seem important enough to adapt the patch further back.

Discussion: https://postgr.es/m/20200602.161518.1399689010416646074.horikyota.ntt@gmail.com
2020-06-03 12:36:23 -04:00
Fujii Masao caa3c4242c Don't call elog() while holding spinlock.
Previously UpdateSpillStats() called elog(DEBUG2) while holding
the spinlock even though the local variables that the elog() accesses
don't need to be protected by the lock. Since spinlocks are intended
for very short-term locks, they should not be used when calling
elog(DEBUG2). So this commit moves that elog() out of spinlock period.

Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila and Fujii Masao
Discussion: https://postgr.es/m/20200602.161518.1399689010416646074.horikyota.ntt@gmail.com
2020-06-02 19:21:04 +09:00
Peter Eisentraut 574925bfd0 Remove unnecessary cast
Probably copied from nearby calls where it is necessary.  But this one
also casts away constness, so it was doubly annoying.
2020-05-22 10:36:49 +02:00
Amit Kapila 7e041b0c1d Fix comment in slot.c.
Reported-by: Sawada Masahiko
Author: Sawada Masahiko
Reviewed-by: Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CA+fd4k4Ws7M7YQ8PqSym5WB1y75dZeBTd1sZJUQdfe0KJQ-iSA@mail.gmail.com
2020-05-18 07:53:26 +05:30
Michael Paquier 2c8dd05d6c Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info
d140f2f3 has renamed receivedUpto to flushedUpto, and has added
writtenUpto to the WAL receiver's shared memory information, but
pg_stat_wal_receiver was not consistent with that.  This commit renames
received_lsn to flushed_lsn, and adds a new column called written_lsn.

Bump catalog version.

Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20200515090817.GA212736@paquier.xyz
2020-05-17 09:22:07 +09:00
Alvaro Herrera 1d3743023e
Fix walsender error cleanup code
In commit 850196b610 I (Álvaro) failed to handle the case of walsender
shutting down on an error before setting up its 'xlogreader' pointer;
the error handling code dereferences the pointer, causing a crash.
Fix by testing the pointer before trying to dereference it.

Kyotaro authored the code fix; I adopted Nathan's test case to be used
by the TAP tests and added the necessary PostgresNode change.

Reported-by: Nathan Bossart <bossartn@amazon.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/C04FC24E-903D-4423-B312-6910E4D846E5@amazon.com
2020-05-15 20:00:52 -04:00
Tom Lane 8048404939 Fix bogus initialization of replication origin shared memory state.
The previous coding zeroed out offsetof(ReplicationStateCtl, states)
more bytes than it was entitled to, as a consequence of starting the
zeroing from the wrong pointer (or, if you prefer, using the wrong
calculation of how much to zero).

It's unsurprising that this has not caused any reported problems,
since it can be expected that the newly-allocated block is at the end
of what we've used in shared memory, and we always make the shmem
block substantially bigger than minimally necessary.  Nonetheless,
this is wrong and it could bite us someday; plus it's a dangerous
model for somebody to copy.

This dates back to the introduction of this code (commit 5aa235042),
so back-patch to all supported branches.
2020-05-15 19:05:39 -04:00
Tom Lane 36ac359d36 Rename assorted LWLock tranches.
Choose names that fit into the conventions for wait event names
(particularly, that multi-word names are in the style MultiWordName)
and hopefully convey more information to non-hacker users than the
previous names did.

Also rename SerializablePredicateLockListLock to
SerializablePredicateListLock; the old name was long enough to cause
table formatting problems, plus the double occurrence of "Lock" seems
confusing/error-prone.

Also change a couple of particularly opaque LWLock field names.

Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
2020-05-15 18:11:07 -04:00
Tom Lane 5da14938f7 Rename SLRU structures and associated LWLocks.
Originally, the names assigned to SLRUs had no purpose other than
being shmem lookup keys, so not a lot of thought went into them.
As of v13, though, we're exposing them in the pg_stat_slru view and
the pg_stat_reset_slru function, so it seems advisable to take a bit
more care.  Rename them to names based on the associated on-disk
storage directories (which fortunately we *did* think about, to some
extent; since those are also visible to DBAs, consistency seems like
a good thing).  Also rename the associated LWLocks, since those names
are likewise user-exposed now as wait event names.

For the most part I only touched symbols used in the respective modules'
SimpleLruInit() calls, not the names of other related objects.  This
renaming could have been taken further, and maybe someday we will do so.
But for now it seems undesirable to change the names of any globally
visible functions or structs, so some inconsistency is unavoidable.

(But I *did* terminate "oldserxid" with prejudice, as I found that
name both unreadable and not descriptive of the SLRU's contents.)

Table 27.12 needs re-alphabetization now, but I'll leave that till
after the other LWLock renamings I have in mind.

Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
2020-05-15 14:28:25 -04:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Tom Lane 29c3e2dd5a Collect built-in LWLock tranche names statically, not dynamically.
There is little point in using the LWLockRegisterTranche mechanism for
built-in tranche names.  It wastes cycles, it creates opportunities for
bugs (since failing to register a tranche name is a very hard-to-detect
problem), and the lack of any centralized list of names encourages
sloppy nonconformity in name choices.  Moreover, since we have a
centralized list of the tranches anyway in enum BuiltinTrancheIds, we're
certainly not buying any flexibility in return for these disadvantages.

Hence, nuke all the backend-internal LWLockRegisterTranche calls,
and instead provide a const array of the builtin tranche names.

(I have in mind to change a bunch of these names shortly, but this
patch is just about getting them into one place.)

Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
2020-05-14 11:10:31 -04:00
Alvaro Herrera 17cc133f01
Dial back -Wimplicit-fallthrough to level 3
The additional pain from level 4 is excessive for the gain.

Also revert all the source annotation changes to their original
wordings, to avoid back-patching pain.

Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us
2020-05-13 15:31:14 -04:00
Alvaro Herrera 850196b610
Adjust walsender usage of xlogreader, simplify APIs
* Have both physical and logical walsender share a 'xlogreader' state
  struct for tracking state.  This replaces the existing globals sendSeg
  and sendCxt.

* Change WALRead not to receive XLogReaderState->seg and ->segcxt as
  separate arguments anymore; just use the ones from 'state'.  This is
  made possible by the above change.

* have the XLogReader segment_open contract require the callbacks to
  install the file descriptor in the state struct themselves instead of
  returning it.  xlogreader was already ignoring any possible failed
  return from the callbacks, relying solely on them never returning.

  (This point is not altogether excellent, as it means the callbacks
  have to know more of XLogReaderState; but to really improve on that
  we would have to pass back error info from the callbacks to
  xlogreader.  And the complexity would not be saved but instead just
  transferred to the callers of WALRead, which would have to learn how
  to throw errors from the open_segment callback in addition of, as
  currently, from pg_pread.)

* segment_open no longer receives the 'segcxt' as a separate argument,
  since it's part of the XLogReaderState argument.

Per comments from Kyotaro Horiguchi.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
2020-05-13 12:17:08 -04:00
Alvaro Herrera 3e9744465d
Add -Wimplicit-fallthrough to CFLAGS and CXXFLAGS
Use it at level 4, a bit more restrictive than the default level, and
tweak our commanding comments to FALLTHROUGH.

(However, leave zic.c alone, since it's external code; to avoid the
warnings that would appear there, change CFLAGS for that file in the
Makefile.)

Author: Julien Rouhaud <rjuju123@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol
Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org
2020-05-12 16:07:30 -04:00
Alvaro Herrera a8be5364ac
Fix obsolete references to "XLogRead"
The one in xlogreader.h was pointed out by Antonin Houska; I (Álvaro) noticed the
others by grepping.

Author: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/28250.1589186654@antos
2020-05-11 12:46:41 -04:00
Alvaro Herrera b060dbe000
Rework XLogReader callback system
Code review for 0dc8ead463, prompted by a bug closed by 91c40548d5.

XLogReader's system for opening and closing segments had gotten too
complicated, with callbacks being passed at both the XLogReaderAllocate
level (read_page) as well as at the WALRead level (segment_open).  This
was confusing and hard to follow, so restructure things so that these
callbacks are passed together at XLogReaderAllocate time, and add
another callback to the set (segment_close) to make it a coherent whole.
Also, ensure XLogReaderState is an argument to all the callbacks, so
that they can grab at the ->private data if necessary.

Document the whole arrangement more clearly.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
2020-05-08 15:40:11 -04:00
Alvaro Herrera d0abe78d84
Check slot->restart_lsn validity in a few more places
Lack of these checks could cause visible misbehavior, including
assertion failures.  This was missed in commit c655077639, whereby
restart_lsn becomes invalid when the size limit is exceeded.

Also reword some existing error messages, and add errdetail(), so that
the reported errors all match in spirit.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200408.093710.447591748588426656.horikyota.ntt@gmail.com
2020-04-28 20:39:04 -04:00
Noah Misch f246ea3b2a In caught-up logical walsender, sleep only in WalSndWaitForWal().
Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr.  When the latest physical LSN yields no logical replication
messages (a common case), that keepalive elicits a reply.  Processing
the reply updates pg_stat_replication.replay_lsn.  WalSndLoop() lacks
that; when WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed.  This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.

Reviewed by Fujii Masao and Michael Paquier.

Discussion: https://postgr.es/m/20200418070142.GA1075445@rfd.leadboat.com
2020-04-25 10:18:12 -07:00
Noah Misch 72a3dc321d Revert "When WalSndCaughtUp, sleep only in WalSndWaitForWal()."
This reverts commit 4216858122.  It caused
idle physical walsenders to busy-wait, as reported by Fujii Masao.

Discussion: https://postgr.es/m/20200417054146.GA1061007@rfd.leadboat.com
2020-04-25 10:17:26 -07:00
Robert Haas 3989dbdf12 Rename exposed identifiers to say "backup manifest".
Function names declared "extern" now use BackupManifest in the name
rather than just Manifest, and data types use backup_manifest
rather than just manifest.

Per note from Michael Paquier.

Discussion: http://postgr.es/m/20200418125713.GG350229@paquier.xyz
2020-04-23 08:44:06 -04:00
Robert Haas 079ac29d4d Move the server's backup manifest code to a separate file.
basebackup.c is already a pretty big and complicated file, so it
makes more sense to keep the backup manifest support routines
in a separate file, for clarity and ease of maintenance.

Discussion: http://postgr.es/m/CA+TgmoavRak5OdP76P8eJExDYhPEKWjMb0sxW7dF01dWFgE=uA@mail.gmail.com
2020-04-20 14:38:15 -04:00
Tom Lane f332241a60 Fix race conditions in synchronous standby management.
We have repeatedly seen the buildfarm reach the Assert(false) in
SyncRepGetSyncStandbysPriority.  This apparently is due to failing to
consider the possibility that the sync_standby_priority values in
shared memory might be inconsistent; but they will be whenever only
some of the walsenders have updated their values after a change in
the synchronous_standby_names setting.  That function is vastly too
complex for what it does, anyway, so rewriting it seems better than
trying to apply a band-aid fix.

Furthermore, the API of SyncRepGetSyncStandbys is broken by design:
it returns a list of WalSnd array indexes, but there is nothing
guaranteeing that the contents of the WalSnd array remain stable.
Thus, if some walsender exits and then a new walsender process
takes over that WalSnd array slot, a caller might make use of
WAL position data that it should not, potentially leading to
incorrect decisions about whether to release transactions that
are waiting for synchronous commit.

To fix, replace SyncRepGetSyncStandbys with a new function
SyncRepGetCandidateStandbys that copies all the required data
from shared memory while holding the relevant mutexes.  If the
associated walsender process then exits, this data is still safe to
make release decisions with, since we know that that much WAL *was*
sent to a valid standby server.  This incidentally means that we no
longer need to treat sync_standby_priority as protected by the
SyncRepLock rather than the per-walsender mutex.

SyncRepGetSyncStandbys is no longer used by the core code, so remove
it entirely in HEAD.  However, it seems possible that external code is
relying on that function, so do not remove it from the back branches.
Instead, just remove the known-incorrect Assert.  When the bug occurs,
the function will return a too-short list, which callers should treat
as meaning there are not enough sync standbys, which seems like a
reasonably safe fallback until the inconsistent state is resolved.
Moreover it's bug-compatible with what has been happening in non-assert
builds.  We cannot do anything about the walsender-replacement race
condition without an API/ABI break.

The bogus assertion exists back to 9.6, but 9.6 is sufficiently
different from the later branches that the patch doesn't apply at all.
I chose to just remove the bogus assertion in 9.6, feeling that the
probability of a bad outcome from the walsender-replacement race
condition is too low to justify rewriting the whole patch for 9.6.

Discussion: https://postgr.es/m/21519.1585272409@sss.pgh.pa.us
2020-04-18 14:02:44 -04:00
Michael Paquier 8128b0c152 Fix collection of typos and grammar mistakes in the tree, volume 2
This fixes some comments and documentation new as of Postgres 13, and is
a follow-up of the work done in dd0f37e.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com
2020-04-14 14:45:43 +09:00
Noah Misch 4216858122 When WalSndCaughtUp, sleep only in WalSndWaitForWal().
Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr.  That is important in logical replication.  When the latest
physical LSN yields no logical replication messages (a common case),
that keepalive elicits a reply, and processing the reply updates
pg_stat_replication.replay_lsn.  WalSndLoop() lacks that; when
WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed.  This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.

Discussion: https://postgr.es/m/20200406063649.GA3738151@rfd.leadboat.com
2020-04-11 10:30:00 -07:00
Peter Eisentraut 12fb189bfe Fix RELCACHE_FORCE_RELEASE issue
Introduced by 83fd4532a7.  To fix, the
tuple descriptors need to be copied into the current memory context.

Discussion: https://www.postgresql.org/message-id/04d78603-edae-9243-9dde-fe3037176a7d@2ndquadrant.com
2020-04-11 15:07:25 +02:00
Michael Paquier dd0f37ecce Fix collection of typos and grammar mistakes in the tree
This fixes some comments and documentation new as of Postgres 13.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com
2020-04-10 11:18:39 +09:00
Fujii Masao 1ec50a81ec Exclude backup_manifest file that existed in database, from BASE_BACKUP.
If there is already a backup_manifest file in the database cluster,
it belongs to the past backup that was used to start this server.
It is not correct for the backup being taken now. So this commit
changes pg_basebackup so that it always skips such backup_manifest
file. The backup_manifest file for the current backup will be injected
separately if users want it.

Author: Fujii Masao
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/78f76a3d-1a28-a97d-0394-5c96985dd1c0@oss.nttdata.com
2020-04-09 22:37:11 +09:00
Thomas Munro d140f2f3e2 Rationalize GetWalRcv{Write,Flush}RecPtr().
GetWalRcvWriteRecPtr() previously reported the latest *flushed*
location.  Adopt the conventional terminology used elsewhere in the tree
by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related
variables that used the term "received".

Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest
*written* value.  This will allow later patches to use the value for
non-data-integrity purposes, without having to wait for the flush
pointer to advance.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2020-04-08 23:45:09 +12:00
Peter Eisentraut 83fd4532a7 Allow publishing partition changes via ancestors
To control whether partition changes are replicated using their own
identity and schema or an ancestor's, add a new parameter that can be
set per publication named 'publish_via_partition_root'.

This allows replicating a partitioned table into a different partition
structure on the subscriber.

Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
2020-04-08 11:19:23 +02:00
Alvaro Herrera c655077639
Allow users to limit storage reserved by replication slots
Replication slots are useful to retain data that may be needed by a
replication system.  But experience has shown that allowing them to
retain excessive data can lead to the primary failing because of running
out of space.  This new feature allows the user to configure a maximum
amount of space to be reserved using the new option
max_slot_wal_keep_size.  Slots that overrun that space are invalidated
at checkpoint time, enabling the storage to be released.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
2020-04-07 18:35:00 -04:00
Peter Eisentraut f1ac27bfda Add logical replication support to replicate into partitioned tables
Mainly, this adds support code in logical/worker.c for applying
replicated operations whose target is a partitioned table to its
relevant partitions.

Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
2020-04-06 15:15:52 +02:00
Peter Eisentraut a9d9bdd3ad Save errno across LWLockRelease() calls
Fixup for "Drop slot's LWLock before returning from SaveSlotToPath()"

Reported-by: Michael Paquier <michael@paquier.xyz>
2020-04-05 10:02:00 +02:00
Robert Haas 3e0d80fd8d Fix resource management bug with replication=database.
Commit 0d8c9c1210 allowed BASE_BACKUP to
acquire a ResourceOwner without a transaction so that the backup
manifest functionality could use a BufFile, but it overlooked the fact
that when a walsender is used with replication=database, it might have
a transaction in progress, because in that mode, SQL and replication
commands can be mixed.  Try to fix things up so that the two cleanup
mechanisms don't conflict.

Per buildfarm member serinus, which triggered the problem when
CREATE_REPLICATION_SLOT failed from inside a transaction.  It passed
on the subsequent run, so evidently the failure doesn't happen every
time.
2020-04-03 22:28:37 -04:00
Robert Haas db1531cae0 Be more careful about time_t vs. pg_time_t in basebackup.c.
lapwing is complaining that about a call to pg_gmtime, saying that
it "expected 'const pg_time_t *' but argument is of type 'time_t *'".
I at first thought that the problem had someting to do with const,
but Thomas Munro suggested that it might be just because time_t
and pg_time_t are different identifers. lapwing is i686 rather than
x86_64, and pg_time_t is always int64, so that seems like a good
guess.

There is other code that just casts time_t to pg_time_t without
any conversion function, so try that approach here.

Introduced in commit 0d8c9c1210.
2020-04-03 20:18:47 -04:00
Robert Haas 0d8c9c1210 Generate backup manifests for base backups, and validate them.
A manifest is a JSON document which includes (1) the file name, size,
last modification time, and an optional checksum for each file backed
up, (2) timelines and LSNs for whatever WAL will need to be replayed
to make the backup consistent, and (3) a checksum for the manifest
itself. By default, we use CRC-32C when checksumming data files,
because we are trying to detect corruption and user error, not foil an
adversary. However, pg_basebackup and the server-side BASE_BACKUP
command now have options to select a different algorithm, so users
wanting a cryptographic hash function can select SHA-224, SHA-256,
SHA-384, or SHA-512. Users not wanting file checksums at all can
disable them, or disable generating of the backup manifest altogether.
Using a cryptographic hash function in place of CRC-32C consumes
significantly more CPU cycles, which may slow down backups in some
cases.

A new tool called pg_validatebackup can validate a backup against the
manifest. If no checksums are present, it can still check that the
right files exist and that they have the expected sizes. If checksums
are present, it can also verify that each file has the expected
checksum. Additionally, it calls pg_waldump to verify that the
expected WAL files are present and parseable. Only plain format
backups can be validated directly, but tar format backups can be
validated after extracting them.

Robert Haas, with help, ideas, review, and testing from David Steele,
Stephen Frost, Andrew Dunstan, Rushabh Lathia, Suraj Kharage, Tushar
Ahuja, Rajkumar Raghuwanshi, Mark Dilger, Davinder Singh, Jeevan
Chalke, Amit Kapila, Andres Freund, and Noah Misch.

Discussion: http://postgr.es/m/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=F7heO2zna5T6qg@mail.gmail.com
2020-04-03 15:05:59 -04:00
Peter Eisentraut d8653f4687 Refactor code to look up local replication tuple
This unifies some duplicate code.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA+HiwqFjYE5anArxvkjr37AQMd52L-LZtz9Ld2QrLQ3YfcYhTw@mail.gmail.com
2020-04-01 15:34:41 +02:00
Michael Paquier 616ae3d2b0 Move routine definitions of xlogarchive.c to a new header file
The definitions of the routines defined in xlogarchive.c have been part
of xlog_internal.h which is included by several frontend tools, but all
those routines are only called by the backend.  More cleanup could be
done within xlog_internal.h, but that's already a nice cut.

This will help a follow-up patch for pg_rewind where handling of
restore_command is added for frontends.

Author: Alexey Kondratov, Michael Paquier
Reviewed-by: Álvaro Herrera, Alexander Korotkov
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
2020-03-31 15:33:04 +09:00
Alvaro Herrera 1e6148032e
Allow walreceiver configuration to change on reload
The parameters primary_conninfo, primary_slot_name and
wal_receiver_create_temp_slot can now be changed with a simple "reload"
signal, no longer requiring a server restart.  This is achieved by
signalling the walreceiver process to terminate and having it start
again with the new values.

Thanks to Andres Freund, Kyotaro Horiguchi, Fujii Masao for discussion.

Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/19513901543181143@sas1-19a94364928d.qloud-c.yandex.net
2020-03-27 19:51:37 -03:00
Alvaro Herrera 092c6936de
Set wal_receiver_create_temp_slot PGC_POSTMASTER
Commit 3297308278 gave walreceiver the ability to create and use a
temporary replication slot, and made it controllable by a GUC (enabled
by default) that can be changed with SIGHUP.  That's useful but has two
problems: one, it's possible to cause the origin server to fill its disk
if the slot doesn't advance in time; and also there's a disconnect
between state passed down via the startup process and GUCs that
walreceiver reads directly.

We handle the first problem by setting the option to disabled by
default.  If the user enables it, its on their head to make sure that
disk doesn't fill up.

We handle the second problem by passing the flag via startup rather than
having walreceiver acquire it directly, and making it PGC_POSTMASTER
(which ensures a walreceiver always has the fresh value).  A future
commit can relax this (to PGC_SIGHUP again) by having the startup
process signal walreceiver to shutdown whenever the value changes.

Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200122055510.GH174860@paquier.xyz
2020-03-27 16:20:33 -03:00
Peter Eisentraut 49bf81536e Drop slot's LWLock before returning from SaveSlotToPath()
When SaveSlotToPath() is called with elevel=LOG, the early exits didn't
release the slot's io_in_progress_lock.

This could result in a walsender being stuck on the lock forever.  A
possible way to get into this situation is if the offending code paths
are triggered in a low disk space situation.

Author: Pavan Deolasee <pavan.deolasee@2ndquadrant.com>
Reported-by: Craig Ringer <craig@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/56a138c5-de61-f553-7e8f-6789296de785%402ndquadrant.com
2020-03-26 13:29:20 +01:00
Andres Freund 112b006fe7 logical decoding: Remove TODO about unnecessary optimization.
Measurements show, and intuition agrees, that there's currently no
known cases where adding a fastpath to avoid allocating / ordering a
heap for a single transaction is worthwhile.

Author: Dilip Kumar
Discussion: https://postgr.es/m/CAFiTN-sp701wvzvnLQJGk7JDqrFM8f--97-ihbwkU8qvn=p8nw@mail.gmail.com
2020-03-24 12:15:03 -07:00
Peter Eisentraut cef27ae01a Fix compiler warning
A variable was unused in non-assert builds.  Simplify the code to
avoid the issue.

Reported-by: Erik Rijkers <er@xs4all.nl>
2020-03-24 16:02:01 +01:00
Peter Eisentraut 97ee604d9b Some refactoring of logical/worker.c
This moves the main operations of apply_handle_{insert|update|delete},
that of inserting, updating, deleting a tuple into/from a given
relation, into corresponding
apply_handle_{insert|update|delete}_internal functions.  This allows
performing those operations on relations that are not directly the
targets of replication, which is something a later patch will use for
targeting partitioned tables.

Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
2020-03-24 15:00:54 +01:00
Fujii Masao 67e0adfb3f Report NULL as total backup size if it's not estimated.
Previously 0 was reported in pg_stat_progress_basebackup.total_backup
if the total backup size was not estimated. Per discussion, our consensus
is that NULL is better choise as the value in total_backup in that case.
So this commit makes pg_stat_progress_basebackup view report NULL
in total_backup column if the estimation is disabled.

Bump catversion.

Author: Fujii Masao
Reviewed-by: Amit Langote, Magnus Hagander, Alvaro Herrera
Discussion: https://postgr.es/m/CABUevExnhOD89zBDuPvfAAh243RzNpwCPEWNLtMYpKHMB8gbAQ@mail.gmail.com
2020-03-24 10:43:41 +09:00
Peter Eisentraut c314c147c0 Prepare to support non-tables in publications
This by itself doesn't change any functionality but prepares the way
for having relations other than base tables in publications.

Make arrangements for COPY handling the initial table sync.  For
non-tables we have to use COPY (SELECT ...) instead of directly
copying from the table, but then we have to take care to omit
generated columns from the column list.

Also, remove a hardcoded reference to relkind = 'r' and rely on the
publisher to send only what it can actually publish, which will be
correct even in future cross-version scenarios.

Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
2020-03-19 08:25:07 +01:00
Alvaro Herrera 5d0c2d5eba
Remove logical_read_local_xlog_page
It devolved into a content-less wrapper over read_local_xlog_page, with
nothing to add, plus it's easily confused with walsender's
logical_read_xlog_page.  There doesn't seem to be any reason for it to
stay.

src/include/replication/logicalfuncs.h becomes empty, so remove it too.
The prototypes it initially had were absorbed by generated fmgrprotos.h.

Discussion: https://postgr.es/m/20191115214102.GA15616@alvherre.pgsql
2020-03-17 18:18:01 -03:00
Alvaro Herrera bcd1c36300
Fix consistency issues with replication slot copy
Commit 9f06d79ef831's replication slot copying failed to
properly reserve the WAL that the slot is expecting to see
during DecodingContextFindStartpoint (to set the confirmed_flush
LSN), so concurrent activity could remove that WAL and cause the
copy process to error out.  But it doesn't actually *need* that
WAL anyway: instead of running decode to find confirmed_flush, it
can be copied from the source slot. Fix this by rearranging things
to avoid DecodingContextFindStartpoint() (leaving the target slot's
confirmed_flush_lsn to invalid), and set that up afterwards by copying
from the target slot's value.

Also ensure the source slot's confirmed_flush_lsn is valid.

Reported-by: Arseny Sher
Author: Masahiko Sawada, Arseny Sher
Discussion: https://postgr.es/m/871rr3ohbo.fsf@ars-thinkpad
2020-03-17 16:13:18 -03:00
Peter Eisentraut bf68b79e50 Refactor ps_status.c API
The init_ps_display() arguments were mostly lies by now, so to match
typical usage, just use one argument and let the caller assemble it
from multiple sources if necessary.  The only user of the additional
arguments is BackendInitialize(), which was already doing string
assembly on the caller side anyway.

Remove the second argument of set_ps_display() ("force") and just
handle that in init_ps_display() internally.

BackendInitialize() also used to set the initial status as
"authentication", but that was very far from where authentication
actually happened.  So now it's set to "initializing" and then
"authentication" just before the actual call to
ClientAuthentication().

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
2020-03-11 16:38:31 +01:00
Peter Eisentraut 3c173a53a8 Remove utils/acl.h from catalog/objectaddress.h
The need for this was removed by
8b9e9644dc.

A number of files now need to include utils/acl.h or
parser/parse_node.h explicitly where they previously got it indirectly
somehow.

Since parser/parse_node.h already includes nodes/parsenodes.h, the
latter is then removed where the former was added.  Also, remove
nodes/pg_list.h from objectaddress.h, since that's included via
nodes/parsenodes.h.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/7601e258-26b2-8481-36d0-dc9dca6f28f1%402ndquadrant.com
2020-03-10 10:27:00 +01:00
Peter Eisentraut 17b9e7f9fe Support adding partitioned tables to publication
When a partitioned table is added to a publication, changes of all of
its partitions (current or future) are published via that publication.

This change only affects which tables a publication considers as its
members.  The receiving side still sees the data coming from the
individual leaf partitions.  So existing restrictions that partition
hierarchies can only be replicated one-to-one are not changed by this.

Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
2020-03-10 09:09:32 +01:00
Tom Lane 3ed2005ff5 Introduce macros for typalign and typstorage constants.
Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code.  But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage.  It's never
too late to make it better though, so let's do that.

The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.

I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references.  But that's not fatal --- there's no expectation that
we'd actually change any of these values.  We can clean up stragglers
over time.

Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
2020-03-04 10:34:25 -05:00
Fujii Masao e65497df8f Report progress of streaming base backup.
This commit adds pg_stat_progress_basebackup view that reports
the progress while an application like pg_basebackup is taking
a base backup. This uses the progress reporting infrastructure
added by c16dc1aca5, adding support for streaming base backup.

Bump catversion.

Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Amit Langote, Sergei Kornilov
Discussion: https://postgr.es/m/9ed8b801-8215-1f3d-62d7-65bff53f6e94@oss.nttdata.com
2020-03-03 12:03:43 +09:00
Alvaro Herrera 2f9661311b
Represent command completion tags as structs
The backend was using strings to represent command tags and doing string
comparisons in multiple places, but that's slow and unhelpful.  Create a
new command list with a supporting structure to use instead; this is
stored in a tag-list-file that can be tailored to specific purposes with
a caller-definable C macro, similar to what we do for WAL resource
managers.  The first first such uses are a new CommandTag enum and a
CommandTagBehavior struct.

Replace numerous occurrences of char *completionTag with a
QueryCompletion struct so that the code no longer stores information
about completed queries in a cstring.  Only at the last moment, in
EndCommand(), does this get converted to a string.

EventTriggerCacheItem no longer holds an array of palloc’d tag strings
in sorted order, but rather just a Bitmapset over the CommandTags.

Author: Mark Dilger, with unsolicited help from Álvaro Herrera
Reviewed-by: John Naylor, Tom Lane
Discussion: https://postgr.es/m/981A9DB4-3F0C-4DA5-88AD-CB9CFF4D6CAD@enterprisedb.com
2020-03-02 18:19:51 -03:00
Michael Paquier 12c5cad76f Handle logical decoding in multi-insert for catalog tuples
The code path for multi-insert decoding is not stressed yet for
catalogs (a future patch may introduce this capability), so no
back-patch is needed.

Author: Daniel Gustafsson
Discussion: https://postgr.es/m/9690D72F-5C4F-4016-9572-6D16684E1D87@yesql.se
2020-03-02 10:00:37 +09:00
Michael Paquier bf883b211e Add prefix checks in exclude lists for pg_rewind, pg_checksums and base backups
An instance of PostgreSQL crashing with a bad timing could leave behind
temporary pg_internal.init files, potentially causing failures when
verifying checksums.  As the same exclusion lists are used between
pg_rewind, pg_checksums and basebackup.c, all those tools are extended
with prefix checks to keep everything in sync, with dedicated checks
added for pg_internal.init.

Backpatch down to 11, where pg_checksums (pg_verify_checksums in 11) and
checksum verification for base backups have been introduced.

Reported-by: Michael Banck
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/62031974fd8e941dd8351fbc8c7eff60d59c5338.camel@credativ.de
Backpatch-through: 11
2020-02-24 18:13:25 +09:00
Amit Kapila e3ff789acf Stop demanding that top xact must be seen before subxact in decoding.
Manifested as

ERROR:  subtransaction logged without previous top-level txn record

this check forbids legit behaviours like
 - First xl_xact_assignment record is beyond reading, i.e. earlier
   restart_lsn.
 - After restart_lsn there is some change of a subxact.
 - After that, there is second xl_xact_assignment (for another subxact)
   revealing the relationship between top and first subxact.

Such a transaction won't be streamed anyway because we hadn't seen it in
full.  Saying for sure whether xact of some record encountered after
the snapshot was deserialized can be streamed or not requires to know
whether it wrote something before deserialization point --if yes, it
hasn't been seen in full and can't be decoded. Snapshot doesn't have such
info, so there is no easy way to relax the check.

Reported-by: Hsu, John
Diagnosed-by: Arseny Sher
Author: Arseny Sher, Amit Kapila
Reviewed-by: Amit Kapila, Dilip Kumar
Backpatch-through: 9.5
Discussion: https://postgr.es/m/AB5978B2-1772-4FEE-A245-74C91704ECB0@amazon.com
2020-02-19 08:15:49 +05:30
Peter Eisentraut ad3ae64770 Fill in extraUpdatedCols in logical replication
The extraUpdatedCols field of the target RTE records which generated
columns are affected by an update.  This is used in a variety of
places, including per-column triggers and foreign data wrappers.  When
an update was initiated by a logical replication subscription, this
field was not filled in, so such an update would not affect generated
columns in a way that is consistent with normal updates.  To fix,
factor out some code from analyze.c to fill in extraUpdatedCols in the
logical replication worker as well.

Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b05e781a-fa16-6b52-6738-761181204567@2ndquadrant.com
2020-02-17 15:20:57 +01:00
Thomas Munro 701a51fd4e Use pg_pwrite() in more places.
This removes some lseek() system calls.

Author: Thomas Munro
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com
2020-02-11 17:50:22 +13:00
Alvaro Herrera c9d2977519 Clean up newlines following left parentheses
We used to strategically place newlines after some function call left
parentheses to make pgindent move the argument list a few chars to the
left, so that the whole line would fit under 80 chars.  However,
pgindent no longer does that, so the newlines just made the code
vertically longer for no reason.  Remove those newlines, and reflow some
of those lines for some extra naturality.

Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
2020-01-30 13:42:14 -03:00
Alvaro Herrera 4e89c79a52 Remove excess parens in ereport() calls
Cosmetic cleanup, not worth backpatching.

Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
Reviewed-by: Tom Lane, Michael Paquier
2020-01-30 13:32:04 -03:00
Michael Paquier b0afdcad21 Fix slot data persistency when advancing physical replication slots
Advancing a physical replication slot with pg_replication_slot_advance()
did not mark the slot as dirty if any advancing was done, preventing the
follow-up checkpoint to flush the slot data to disk.  This caused the
advancing to be lost even on clean restarts.  This does not happen for
logical slots as any advancing marked the slot as dirty.  Per
discussion, the original feature has been implemented so as in the event
of a crash the slot may move backwards to a past LSN.  This property is
kept and more documentation is added about that.

This commit adds some new TAP tests to check the persistency of physical
and logical slots after advancing across clean restarts.

Author: Alexey Kondratov, Michael Paquier
Reviewed-by: Andres Freund, Kyotaro Horiguchi, Craig Ringer
Discussion: https://postgr.es/m/059cc53a-8b14-653a-a24d-5f867503b0ee@postgrespro.ru
Backpatch-through: 11
2020-01-30 11:14:02 +09:00
Heikki Linnakangas 38a957316d Refactor XLogReadRecord(), adding XLogBeginRead() function.
The signature of XLogReadRecord() required the caller to pass the starting
WAL position as argument, or InvalidXLogRecPtr to continue reading at the
end of previous record. That's slightly awkward to the callers, as most
of them don't want to randomly jump around in the WAL stream, but start
reading at one position and then read everything from that point onwards.
Remove the 'RecPtr' argument and add a new function XLogBeginRead() to
specify the starting position instead. That's more convenient for the
callers. Also, xlogreader holds state that is reset when you change the
starting position, so having a separate function for doing that feels like
a more natural fit.

This changes XLogFindNextRecord() function so that it doesn't reset the
xlogreader's state to what it was before the call anymore. Instead, it
positions the xlogreader to the found record, like XLogBeginRead().

Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/5382a7a3-debe-be31-c860-cb810c08f366%40iki.fi
2020-01-26 11:39:00 +02:00
Alvaro Herrera 15cac3a523 Set ReorderBufferTXN->final_lsn more eagerly
... specifically, set it incrementally as each individual change is
spilled down to disk.  This way, it is set correctly when the
transaction disappears without trace, ie. without leaving an XACT_ABORT
wal record.  (This happens when the server crashes midway through a
transaction.)

Failing to have final_lsn prevents ReorderBufferRestoreCleanup() from
working, since it needs the final_lsn in order to know the endpoint of
its iteration through spilled files.

Commit df9f682c7b already tried to fix the problem, but it didn't set
the final_lsn in all cases.  Revert that, since it's no longer needed.

Author: Vignesh C
Reviewed-by: Amit Kapila, Dilip Kumar
Discussion: https://postgr.es/m/CALDaNm2CLk+K9JDwjYST0sPbGg5AQdvhUt0jbKyX_HdAE0jk3A@mail.gmail.com
2020-01-17 18:00:39 -03:00
Peter Eisentraut fe233366f2 Fix compiler warning about format on Windows
On 64-bit Windows, pid_t is long long int, so a %d format isn't
enough.
2020-01-14 23:59:18 +01:00
Peter Eisentraut 3297308278 walreceiver uses a temporary replication slot by default
If no permanent replication slot is configured using
primary_slot_name, the walreceiver now creates and uses a temporary
replication slot.  A new setting wal_receiver_create_temp_slot can be
used to disable this behavior, for example, if the remote instance is
out of replication slots.

Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/CA%2Bfd4k4dM0iEPLxyVyme2RAFsn8SUgrNtBJOu81YqTY4V%2BnqZA%40mail.gmail.com
2020-01-14 14:40:41 +01:00
Peter Eisentraut ee4ac46c8e Expose PQbackendPID() through walreceiver API
This will be used by a subsequent patch.

Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/CA%2Bfd4k4dM0iEPLxyVyme2RAFsn8SUgrNtBJOu81YqTY4V%2BnqZA%40mail.gmail.com
2020-01-14 14:40:41 +01:00
Peter Eisentraut 259bbe1778 Fix base backup with database OIDs larger than INT32_MAX
The use of pg_atoi() for parsing a string into an Oid fails for values
larger than INT32_MAX, since OIDs are unsigned.  Instead, use
atooid().  While this has less error checking, the contents of the
data directory are expected to be trustworthy, so we don't need to go
out of our way to do full error checking.

Discussion: https://www.postgresql.org/message-id/flat/dea47fc8-6c89-a2b1-07e3-754ff1ab094b%402ndquadrant.com
2020-01-13 13:41:12 +01:00
Michael Paquier 1088729e84 Remove incorrect assertion for INSERT in logical replication's publisher
On the publisher, it was assumed that an INSERT change cannot happen for
a relation with no replica identity.  However this is true only for a
change that needs references to old rows, aka UPDATE or DELETE, so
trying to use logical replication with a relation that has no replica
identity led to an assertion failure in the publisher when issuing an
INSERT.  This commit removes the incorrect assertion, and adds more
regression tests to provide coverage for relations without replica
identity.

Reported-by: Neha Sharma
Author: Dilip Kumar, Michael Paquier
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CANiYTQsL1Hb8_Km08qd32svrqNumXLJeoGo014O7VZymgOhZEA@mail.gmail.com
Backpatch-through: 10
2020-01-12 22:43:45 +09:00
Peter Eisentraut c67a55da4e Make lsn argument of walrcv_create_slot() optional
Some callers are not using it, so it's wasteful to have to specify it.

Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/CA+fd4k4BcYrYucNfTnK-CQX3+jsG+PRPEhHAUSo-W4P0Lec57A@mail.gmail.com
2020-01-11 09:07:14 +01:00
Alvaro Herrera a7b6ab5db1 Clean up representation of flags in struct ReorderBufferTXN
This simplifies addition of further flags.

Author: Nikhil Sontakke
Discussion: https://postgr.es/m/CAMGcDxeViP+R-OL7QhzUV9eKCVjURobuY1Zijik4Ay_Ddwo4Cg@mail.gmail.com
2020-01-10 17:46:57 -03:00
Tom Lane e369f37086 Reduce the number of GetFlushRecPtr() calls done by walsenders.
Since the WAL flush position only moves forward, it's safe to cache
its previous value within each walsender process, and update from
shared memory only once we've caught up to the previously-seen value.
When there are many active walsenders, this makes for a very significant
reduction in the amount of contention on the XLogCtl->info_lck spinlock.

This patch also adjusts the logic so that we update our idea of the
flush position after processing a WAL record, rather than beforehand.
This may cause us to realize we're not caught up when the preceding
coding would've thought that we were, but that seems all to the good;
it may avoid a useless sleep-and-wakeup cycle.

Back-patch to v12.  The contention problem exists in prior branches,
but it's much less severe (due to inefficiencies elsewhere) so there
seems no need to take any risk of back-patching further.

Pierre Ducroquet, reviewed by Julien Rouhaud

Discussion: https://postgr.es/m/2931018.Vxl9zapr77@pierred-pdoc
2020-01-06 16:42:20 -05:00
Peter Eisentraut b9c130a1fd Have logical replication subscriber fire column triggers
The logical replication apply worker did not fire per-column update
triggers because the updatedCols bitmap in the RTE was not populated.
This fixes that.

Reviewed-by: Euler Taveira <euler@timbira.com.br>
Discussion: https://www.postgresql.org/message-id/flat/21673e2d-597c-6afe-637e-e8b10425b240%402ndquadrant.com
2020-01-06 08:40:00 +01:00
Tom Lane 5815696bc6 Make parser rely more heavily on the ParseNamespaceItem data structure.
When I added the ParseNamespaceItem data structure (in commit 5ebaaa494),
it wasn't very tightly integrated into the parser's APIs.  In the wake of
adding p_rtindex to that struct (commit b541e9acc), there is a good reason
to make more use of it: by passing around ParseNamespaceItem pointers
instead of bare RTE pointers, we can get rid of various messy methods for
passing back or deducing the rangetable index of an RTE during parsing.
Hence, refactor the addRangeTableEntryXXX functions to build and return
a ParseNamespaceItem struct, not just the RTE proper; and replace
addRTEtoQuery with addNSItemToQuery, which is passed a ParseNamespaceItem
rather than building one internally.

Also, add per-column data (a ParseNamespaceColumn array) to each
ParseNamespaceItem.  These arrays are built during addRangeTableEntryXXX,
where we have column type data at hand so that it's nearly free to fill
the data structure.  Later, when we need to build Vars referencing RTEs,
we can use the ParseNamespaceColumn info to avoid the rather expensive
operations done in get_rte_attribute_type() or expandRTE().
get_rte_attribute_type() is indeed dead code now, so I've removed it.
This makes for a useful improvement in parse analysis speed, around 20%
in one moderately-complex test query.

The ParseNamespaceColumn structs also include Var identity information
(varno/varattno).  That info isn't actually being used in this patch,
except that p_varno == 0 is a handy test for a dropped column.
A follow-on patch will make more use of it.

Discussion: https://postgr.es/m/2461.1577764221@sss.pgh.pa.us
2020-01-02 11:29:01 -05:00
Amit Kapila d207038053 Fix running out of file descriptors for spill files.
Currently while decoding changes, if the number of changes exceeds a
certain threshold, we spill those to disk.  And this happens for each
(sub)transaction.  Now, while reading all these files, we don't close them
until we read all the files.  While reading these files, if the number of
such files exceeds the maximum number of file descriptors, the operation
errors out.

Use PathNameOpenFile interface to open these files as that internally has
the mechanism to release kernel FDs as needed to get us under the
max_safe_fds limit.

Reported-by: Amit Khandekar
Author: Amit Khandekar
Reviewed-by: Amit Kapila
Backpatch-through: 9.4
Discussion: https://postgr.es/m/CAJ3gD9c-sECEn79zXw4yBnBdOttacoE-6gAyP0oy60nfs_sabQ@mail.gmail.com
2020-01-02 11:41:04 +05:30
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Michael Paquier 7854e07f25 Revert "Rename files and headers related to index AM"
This follows multiple complains from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug more before actually
happening, if it happens.

Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
2019-12-27 08:09:00 +09:00
Michael Paquier 044b319cd7 Fix some comments related to logical repslot advancing
confirmed_flush is part of a replication slot's information, but not
confirmed_lsn.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20191226.175919.17237335658671970.horikyota.ntt@gmail.com
Backpatch-through: 11
2019-12-26 22:26:09 +09:00
Michael Paquier 8ce3aa9b59 Rename files and headers related to index AM
The following renaming is done so as source files related to index
access methods are more consistent with table access methods (the
original names used for index AMs ware too generic, and could be
confused as including features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c.  Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.

This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.

Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
2019-12-25 10:23:39 +09:00
Alvaro Herrera c4dcd9144b Avoid splitting C string literals with \-newline
Using \ is unnecessary and ugly, so remove that.  While at it, stitch
the literals back into a single line: we've long discouraged splitting
error message literals even when they go past the 80 chars line limit,
to improve greppability.

Leave contrib/tablefunc alone.

Discussion: https://postgr.es/m/20191223195156.GA12271@alvherre.pgsql
2019-12-24 12:44:12 -03:00
Robert Haas 16a4e4aecd Extend the ProcSignal mechanism to support barriers.
A new function EmitProcSignalBarrier() can be used to emit a global
barrier which all backends that participate in the ProcSignal
mechanism must absorb, and a new function WaitForProcSignalBarrier()
can be used to wait until all relevant backends have in fact
absorbed the barrier.

This can be used to coordinate global state changes, such as turning
checksums on while the system is running.

There's no real client of this mechanism yet, although two are
proposed, but an enum has to have at least one element, so this
includes a placeholder type (PROCSIGNAL_BARRIER_PLACEHOLDER) which
should be replaced by the first real client of this mechanism to
get committed.

Andres Freund and Robert Haas, reviewed by Daniel Gustafsson and,
in earlier versions, by Magnus Hagander.

Discussion: http://postgr.es/m/CA+TgmoZwDk=BguVDVa+qdA6SBKef=PKbaKDQALTC_9qoz1mJqg@mail.gmail.com
2019-12-19 14:56:20 -05:00