Commit Graph

897 Commits

Author SHA1 Message Date
Thomas Munro 389869af59 Harden dsm_impl.c against unexpected EEXIST.
Previously, we trusted the OS not to report EEXIST unless we'd passed in
IPC_CREAT | IPC_EXCL or O_CREAT | O_EXCL, as appropriate.  Solaris's
shm_open() can in fact do that, causing us to crash because we didn't
ereport and then we blithely assumed the mapping was successful.

Let's treat EEXIST just like any other error, unless we're actually
trying to create a new segment.  This applies to shm_open(), where this
behavior has been seen, and also to the equivalent operations for our
sysv and mmap modes just on principle.

Based on the underlying reason for the error, namely contention on a
lock file managed by Solaris librt for each distinct name, this problem
is only likely to happen on 15 and later, because the new shared memory
stats system produces shm_open() calls for the same path from
potentially large numbers of backends concurrently during
authentication.  Earlier releases only shared memory segments between a
small number of parallel workers under one Gather node.  You could
probably hit it if you tried hard enough though, and we should have been
more defensive in the first place.  Therefore, back-patch to all
supported releases.

Per build farm animal margay.  This isn't the end of the story, though,
it just changes random crashes into random "File exists" errors; more
work needed for a green build farm.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKqKrCV5xKWfh9rnm%3Do%3DDwZLTLtnsj_XpUi9g5%3DV%2B9oyg%40mail.gmail.com
2022-07-01 14:17:54 +12:00
Heikki Linnakangas adf6d5dfb2 Fix visibility check when XID is committed in CLOG but not in procarray.
TransactionIdIsInProgress had a fast path to return 'false' if the
single-item CLOG cache said that the transaction was known to be
committed. However, that was wrong, because a transaction is first
marked as committed in the CLOG but doesn't become visible to others
until it has removed its XID from the proc array. That could lead to an
error:

    ERROR:  t_xmin is uncommitted in tuple to be updated

or for an UPDATE to go ahead without blocking, before the previous
UPDATE on the same row was made visible.

The window is usually very short, but synchronous replication makes it
much wider, because the wait for synchronous replica happens in that
window.

Another thing that makes it hard to hit is that it's hard to get such
a commit-in-progress transaction into the single item CLOG cache.
Normally, if you call TransactionIdIsInProgress on such a transaction,
it determines that the XID is in progress without checking the CLOG
and without populating the cache. One way to prime the cache is to
explicitly call pg_xact_status() on the XID. Another way is to use a
lot of subtransactions, so that the subxid cache in the proc array is
overflown, making TransactionIdIsInProgress rely on pg_subtrans and
CLOG checks.

This has been broken ever since it was introduced in 2008, but the race
condition is very hard to hit, especially without synchronous
replication. There were a couple of reports of the error starting from
summer 2021, but no one was able to find the root cause then.

TransactionIdIsKnownCompleted() is now unused. In 'master', remove it,
but I left it in place in backbranches in case it's used by extensions.

Also change pg_xact_status() to check TransactionIdIsInProgress().
Previously, it only checked the CLOG, and returned "committed" before
the transaction was actually made visible to other queries. Note that
this also means that you cannot use pg_xact_status() to reproduce the
bug anymore, even if the code wasn't fixed.

Report and analysis by Konstantin Knizhnik. Patch by Simon Riggs, with
the pg_xact_status() change added by me.

Author: Simon Riggs
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/flat/4da7913d-398c-e2ad-d777-f752cf7f0bbb%40garret.ru
2022-06-27 08:21:08 +03:00
Thomas Munro 3ab4fc5dcf Don't trust signalfd() on illumos.
Since commit 6a2a70a02, we've used signalfd() to receive latch wakeups
when building with WAIT_USE_EPOLL (default for Linux and illumos), and
our traditional self-pipe when falling back to WAIT_USE_POLL (default
for other Unixes with neither epoll() nor kqueue()).

Unexplained hangs and kernel panics have been reported on illumos
systems, apparently linked to this use of signalfd(), leading illumos
users and build farm members to have to define WAIT_USE_POLL explicitly
as a work-around.  A bug report exists at
https://www.illumos.org/issues/13700 but no fix is available yet.

Let's provide a way for illumos users to go back to self-pipes with
epoll(), like releases before 14, and choose that by default.  No change
for Linux users.  To help with development/debugging, macros
WAIT_USE_{EPOLL,POLL} and WAIT_USE_{SIGNALFD,SELF_PIPE} can be defined
explicitly to override the defaults.

Back-patch to 14, where we started using signalfd().

Reported-by: Japin Li <japinli@hotmail.com>
Reported-by: Olaf Bohlen <olbohlen@eenfach.de> (off-list)
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/MEYP282MB1669C8D88F0997354C2313C1B6CA9%40MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
2022-06-26 10:55:21 +12:00
Tom Lane 7ab5b4eb48 Be more careful about GucSource for internally-driven GUC settings.
The original advice for hard-wired SetConfigOption calls was to use
PGC_S_OVERRIDE, particularly for PGC_INTERNAL GUCs.  However,
that's really overkill for PGC_INTERNAL GUCs, since there is no
possibility that we need to override a user-provided setting.
Instead use PGC_S_DYNAMIC_DEFAULT in most places, so that the
value will appear with source = 'default' in pg_settings and thereby
not be shown by psql's new \dconfig command.  The one exception is
that when changing in_hot_standby in a hot-standby session, we still
use PGC_S_OVERRIDE, because people felt that seeing that in \dconfig
would be a good thing.

Similarly use PGC_S_DYNAMIC_DEFAULT for the auto-tune value of
wal_buffers (if possible, that is if wal_buffers wasn't explicitly
set to -1), and for the typical 2MB value of max_stack_depth.

In combination these changes remove four not-very-interesting
entries from the typical output of \dconfig, all of which people
fingered as "why is that showing up?" in the discussion thread.

Discussion: https://postgr.es/m/3118455.1649267333@sss.pgh.pa.us
2022-06-08 13:26:18 -04:00
Alvaro Herrera e28bb88519
Revert changes to CONCURRENTLY that "sped up" Xmin advance
This reverts commit d9d076222f "VACUUM: ignore indexing operations
with CONCURRENTLY".

These changes caused indexes created with the CONCURRENTLY option to
miss heap tuples that were HOT-updated and HOT-pruned during the index
creation.  Before these changes, HOT pruning would have been prevented
by the Xmin of the transaction creating the index, but because this
change was precisely to allow the Xmin to move forward ignoring that
backend, now other backends scanning the table can prune them.  This is
not a problem for VACUUM (which requires a lock that conflicts with a
CREATE INDEX CONCURRENTLY operation), but HOT-prune can definitely
occur.  In other words, Xmin advancement was sped up, but at the cost of
corrupting the resulting index.

Regrettably, this means that the new feature in PG14 that RIC/CIC on
very large tables no longer force VACUUM to retain very old tuples goes
away.  We might try to implement it again in a later release, but for
now the risk of indexes missing tuples is too high and there's no easy
fix.

Backpatch to 14, where this change appeared.

Reported-by: Peter Slavov <pet.slavov@gmail.com>
Diagnosys-by: Andrey Borodin <x4mmm@yandex-team.ru>
Diagnosys-by: Michael Paquier <michael@paquier.xyz>
Diagnosys-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org
2022-05-31 21:24:59 +02:00
Robert Haas f5bfba5413 shm_mq_sendv: Fix flushing bug when receiver not yet attached.
With the old logic, when the reciever had not yet attached, we would
never call shm_mq_inc_bytes_written(), even if force_flush = true
was specified. That could result in a situation where data that the
sender believes it has sent is never received.

Along the way, remove a useless function prototype for a nonexistent
function from shm_mq.h.

Commit 46846433a0 introduced these
problems.

Pavan Deolasee, with a few changes by me.

Discussion: https://postgr.es/m/CABOikdPkwtLLCTnzzmpSMXo3QZa2yXq0J7Q61ssdLFAJYrOVvQ@mail.gmail.com
2022-05-31 08:46:54 -04:00
Thomas Munro 12e28aac8e Add debugging help in OwnLatch().
Build farm animal gharial recently failed a few times in a parallel
worker's call to OwnLatch() with "ERROR:  latch already owned".  Let's
turn that into a PANIC and show the PID of the owner, to try to learn
more.

Discussion: https://postgr.es/m/CA%2BhUKGJ_0RGcr7oUNzcHdn7zHqHSB_wLSd3JyS2YC_DYB%2B-V%3Dg%40mail.gmail.com
2022-05-31 12:06:11 +12:00
Alvaro Herrera 8d061acd12
Repurpose PROC_COPYABLE_FLAGS as PROC_XMIN_FLAGS
This is a slight, convenient semantics change from what commit
0f0cfb4940 ("Fix parallel operations that prevent oldest xmin from
advancing") introduced that lets us simplify the coding in the one place
where it is used.

Backpatch to 13.  This is related to commit 6fea65508a ("Tighten
ComputeXidHorizons' handling of walsenders") rewriting the code site
where this is used, which has not yet been backpatched, but it may well
be in the future.

Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/202204191637.eldwa2exvguw@alvherre.pgsql
2022-05-19 16:20:32 +02:00
Alvaro Herrera c4f113e8fe
Clean up newlines following left parentheses
Like commit c9d2977519.
2022-05-13 23:52:35 +02:00
Robert Haas 4f2400cb3f Add a new shmem_request_hook hook.
Currently, preloaded libraries are expected to request additional
shared memory and LWLocks in _PG_init().  However, it is not unusal
for such requests to depend on MaxBackends, which won't be
initialized at that time.  Such requests could also depend on GUCs
that other modules might change.  This introduces a new hook where
modules can safely use MaxBackends and GUCs to request additional
shared memory and LWLocks.

Furthermore, this change restricts requests for shared memory and
LWLocks to this hook.  Previously, libraries could make requests
until the size of the main shared memory segment was calculated.
Unlike before, we no longer silently ignore requests received at
invalid times.  Instead, we FATAL if someone tries to request
additional shared memory or LWLocks outside of the hook.

Nathan Bossart and Julien Rouhaud

Discussion: https://postgr.es/m/20220412210112.GA2065815%40nathanxps13
Discussion: https://postgr.es/m/Yn2jE/lmDhKtkUdr@paquier.xyz
2022-05-13 09:31:06 -04:00
Tom Lane 23e7b38bfe Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.
I manually fixed a couple of comments that pgindent uglified.
2022-05-12 15:17:30 -04:00
Thomas Munro 0d3431497d Add logging for excessive ProcSignalBarrier waits.
To enable diagnosis of systems that are not processing ProcSignalBarrier
requests promptly, add a LOG message every 5 seconds if we seem to be
wedged.  Although you could already see this state as a wait event in
pg_stat_activity, the log message also shows the PID of the process that
is preventing progress.

Also add DEBUG1 logging around the whole wait loop.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BTgmoYJ03r5359gQutRGP9BtigYCg3_UskcmnVjBf-QO3-0pQ%40mail.gmail.com
2022-05-11 18:03:03 +12:00
Andres Freund 8f1537d10e Fix possibility of self-deadlock in ResolveRecoveryConflictWithBufferPin().
The tests added in 9f8a050f68 failed nearly reliably on FreeBSD in CI, and
occasionally on the buildfarm. That turns out to be caused not by a bug in the
test, but by a longstanding bug in recovery conflict handling.

The standby timeout handler, used by ResolveRecoveryConflictWithBufferPin(),
executed SendRecoveryConflictWithBufferPin() inside a signal handler. A bad
idea, because the deadlock timeout handler (or a spurious latch set) could
have interrupted ProcWaitForSignal(). If unlucky that could cause a
self-deadlock on ProcArrayLock, if the deadlock check is in
SendRecoveryConflictWithBufferPin()->CancelDBBackends().

To fix, set a flag in StandbyTimeoutHandler(), and check the flag in
ResolveRecoveryConflictWithBufferPin().

Subsequently the recovery conflict tests will be backpatched.

Discussion: https://postgr.es/m/20220413002626.udl7lll7f3o7nre7@alap3.anarazel.de
Backpatch: 10-
2022-05-02 18:25:00 -07:00
Etsuro Fujita d89f97e83e Fix typo in comment. 2022-05-02 16:45:00 +09:00
Tom Lane 6fea65508a Tighten ComputeXidHorizons' handling of walsenders.
ComputeXidHorizons (nee GetOldestXmin) thought that it could identify
walsenders by checking for proc->databaseId == 0.  Perhaps that was
safe when the code was written, but it's been wrong at least since
autovacuum was invented.  Background processes that aren't connected
to any particular database, such as the autovacuum launcher and
logical replication launcher, look like that too.

This imprecision is harmful because when such a process advertises an
xmin, the result is to hold back dead-tuple cleanup in all databases,
though it'd be sufficient to hold it back in shared catalogs (which
are the only relations such a process can access).  Aside from being
generally inefficient, this has recently been seen to cause regression
test failures in the buildfarm, as a consequence of the logical
replication launcher's startup transaction preventing VACUUM from
marking pages of a user table as all-visible.

We only want that global hold-back effect for the case where a
walsender is advertising a hot standby feedback xmin.  Therefore,
invent a new PGPROC flag that says that a process' xmin should be
considered globally, and check that instead of using the incorrect
databaseId == 0 test.  Currently only a walsender sets that flag,
and only if it is not connected to any particular database.  (This is
for bug-compatibility with the undocumented behavior of the existing
code, namely that feedback sent by a client who has connected to a
particular database would not be applied globally.  I'm not sure this
is a great definition; however, such a client is capable of issuing
plain SQL commands, and I don't think we want xmins advertised for
such commands to be applied globally.  Perhaps this could do with
refinement later.)

While at it, I rewrote the comment in ComputeXidHorizons, and
re-ordered the commented-upon if-tests, to make them match up
for intelligibility's sake.

This is arguably a back-patchable bug fix, but given the lack of
complaints I think it prudent to let it age awhile in HEAD first.

Discussion: https://postgr.es/m/1346227.1649887693@sss.pgh.pa.us
2022-04-15 17:50:05 -04:00
Alvaro Herrera 24d2b2680a
Remove extraneous blank lines before block-closing braces
These are useless and distracting.  We wouldn't have written the code
with them to begin with, so there's no reason to keep them.

Author: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com
Discussion: https://postgr.es/m/attachment/133167/0016-Extraneous-blank-lines.patch
2022-04-13 19:16:02 +02:00
Robert Haas 7fc0e7de9f Revert the addition of GetMaxBackends() and related stuff.
This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018.
There is no longer agreement that introducing this function
was the right way to address the problem. The consensus now
seems to favor trying to make a correct value for MaxBackends
available to mdules executing their _PG_init() functions.

Nathan Bossart

Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
2022-04-12 14:45:23 -04:00
David Rowley b0e5f02ddc Fix various typos and spelling mistakes in code comments
Author: Justin Pryzby
Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com
2022-04-11 20:49:41 +12:00
Robert Haas f37015a161 Rename delayChkpt to delayChkptFlags.
Before commit 412ad7a556, delayChkpt
was a Boolean. Now it's an integer. Extensions using it need to be
appropriately updated, so let's rename the field to make sure that
a hard compilation failure occurs.

Replacing delayChkpt with delayChkptFlags made a few comments extend
past 80 characters, so I reflowed them and changed some wording very
slightly.

The back-branches will need a different change to restore compatibility
with existing minor releases; this is just for master.

Per suggestion from Tom Lane.

Discussion: http://postgr.es/m/a7880f4d-1d74-582a-ada7-dad168d046d1@enterprisedb.com
2022-04-08 11:44:17 -04:00
Thomas Munro 5dc0418fab Prefetch data referenced by the WAL, take II.
Introduce a new GUC recovery_prefetch.  When enabled, look ahead in the
WAL and try to initiate asynchronous reading of referenced data blocks
that are not yet cached in our buffer pool.  For now, this is done with
posix_fadvise(), which has several caveats.  Since not all OSes have
that system call, "try" is provided so that it can be enabled where
available.  Better mechanisms for asynchronous I/O are possible in later
work.

Set to "try" for now for test coverage.  Default setting to be finalized
before release.

The GUC wal_decode_buffer_size limits the distance we can look ahead in
bytes of decoded data.

The existing GUC maintenance_io_concurrency is used to limit the number
of concurrent I/Os allowed, based on pessimistic heuristics used to
infer that I/Os have begun and completed.  We'll also not look more than
maintenance_io_concurrency * 4 block references ahead.

Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2022-04-07 19:42:14 +12:00
Andres Freund 5891c7a8ed pgstat: store statistics in shared memory.
Previously the statistics collector received statistics updates via UDP and
shared statistics data by writing them out to temporary files regularly. These
files can reach tens of megabytes and are written out up to twice a
second. This has repeatedly prevented us from adding additional useful
statistics.

Now statistics are stored in shared memory. Statistics for variable-numbered
objects are stored in a dshash hashtable (backed by dynamic shared
memory). Fixed-numbered stats are stored in plain shared memory.

The header for pgstat.c contains an overview of the architecture.

The stats collector is not needed anymore, remove it.

By utilizing the transactional statistics drop infrastructure introduced in a
prior commit statistics entries cannot "leak" anymore. Previously leaked
statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On
systems with many small relations pgstat_vacuum_stat() could be quite
expensive.

Now that replicas drop statistics entries for dropped objects, it is not
necessary anymore to reset stats when starting from a cleanly shut down
replica.

Subsequent commits will perform some further code cleanup, adapt docs and add
tests.

Bumps PGSTAT_FILE_FORMAT_ID.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@2ndquadrant.com> (in a much earlier version)
Reviewed-By: Arthur Zakirov <a.zakirov@postgrespro.ru> (in a much earlier version)
Reviewed-By: Antonin Houska <ah@cybertec.at> (in a much earlier version)
Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
Discussion: https://postgr.es/m/20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de
2022-04-06 21:29:46 -07:00
Andres Freund 46a2d2499a dsm: allow use in single user mode.
It might seem pointless to allow use of dsm in single user mode, but otherwise
subsystems might need dedicated single user mode code paths.

Besides changing the assert, all that's needed is to make some windows code
assuming the presence of postmaster conditional.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGL9hY_VY=+oUK+Gc1iSRx-Ls5qeYJ6q=dQVZnT3R63Taw@mail.gmail.com
2022-04-06 12:40:04 -07:00
Robert Haas edea649afb Explain why the startup process can't cause a shortage of sinval slots.
Bharath Rupireddy, reviewed by Fujii Masao and Yura Sokolov.
Lightly edited by me.

Discussion: http://postgr.es/m/CALj2ACU=3_frMkDp9UUeuZoAMjaK1y0Z_q5RFNbGvwi8NM==AA@mail.gmail.com
2022-03-29 09:24:24 -04:00
Michael Paquier ad8759bea0 Fix typos in standby.c
xl_running_xacts exists, not xl_xact_running_xacts.

Author: Hou Zhijie
Discussion: https://postgr.es/m/OS0PR01MB57160D8B01097FFB5C175065941A9@OS0PR01MB5716.jpnprd01.prod.outlook.com
2022-03-25 14:11:18 +09:00
Robert Haas 412ad7a556 Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.
If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
2022-03-24 14:52:28 -04:00
Michael Paquier 9e98583898 Create routine able to set single-call SRFs for Materialize mode
Set-returning functions that use the Materialize mode, creating a
tuplestore to include all the tuples returned in a set rather than doing
so in multiple calls, use roughly the same set of steps to prepare
ReturnSetInfo for this job:
- Check if ReturnSetInfo supports returning a tuplestore and if the
materialize mode is enabled.
- Create a tuplestore for all the tuples part of the returned set in the
per-query memory context, stored in ReturnSetInfo->setResult.
- Build a tuple descriptor mostly from get_call_result_type(), then
stored in ReturnSetInfo->setDesc.  Note that there are some cases where
the SRF's tuple descriptor has to be the one specified by the function
caller.

This refactoring is done so as there are (well, should be) no behavior
changes in any of the in-core functions refactored, and the centralized
function that checks and sets up the function's ReturnSetInfo can be
controlled with a set of bits32 options.  Two of them prove to be
necessary now:
- SRF_SINGLE_USE_EXPECTED to use expectedDesc as tuple descriptor, as
expected by the function's caller.
- SRF_SINGLE_BLESS to validate the tuple descriptor for the SRF.

The same initialization pattern is simplified in 28 places per my
count as of src/backend/, shaving up to ~900 lines of code.  These
mostly come from the removal of the per-query initializations and the
sanity checks now grouped in a single location.  There are more
locations that could be simplified in contrib/, that are left for a
follow-up cleanup.

fcc2817, 07daca5 and d61a361 have prepared the areas of the code related
to this change, to ease this refactoring.

Author: Melanie Plageman, Michael Paquier
Reviewed-by: Álvaro Herrera, Justin Pryzby
Discussion: https://postgr.es/m/CAAKRu_azyd1Z3W_r7Ou4sorTjRCs+PxeHw1CWJeXKofkE6TuZg@mail.gmail.com
2022-03-07 10:26:29 +09:00
Tom Lane 46ab07ffda Clean up assorted failures under clang's -fsanitize=undefined checks.
Most of these are cases where we could call memcpy() or other libc
functions with a NULL pointer and a zero count, which is forbidden
by POSIX even though every production version of libc allows it.
We've fixed such things before in a piecemeal way, but apparently
never made an effort to try to get them all.  I don't claim that
this patch does so either, but it gets every failure I observe in
check-world, using clang 12.0.1 on current RHEL8.

numeric.c has a different issue that the sanitizer doesn't like:
"ln(-1.0)" will compute log10(0) and then try to assign the
resulting -Inf to an integer variable.  We don't actually use the
result in such a case, so there's no live bug.

Back-patch to all supported branches, with the idea that we might
start running a buildfarm member that tests this case.  This includes
back-patching c1132aae3 (Check the size in COPY_POINTER_FIELD),
which previously silenced some of these issues in copyfuncs.c.

Discussion: https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com
2022-03-03 18:13:24 -05:00
Michael Paquier d61a361d1a Remove all traces of tuplestore_donestoring() in the C code
This routine is a no-op since dd04e95 from 2003, with a macro kept
around for compatibility purposes.  This has led to the same code
patterns being copy-pasted around for no effect, sometimes in confusing
ways like in pg_logical_slot_get_changes_guts() from logical.c where the
code was actually incorrect.

This issue has been discussed on two different threads recently, so
rather than living with this legacy, remove any uses of this routine in
the C code to simplify things.  The compatibility macro is kept to avoid
breaking any out-of-core modules that depend on it.

Reported-by: Tatsuhito Kasahara, Justin Pryzby
Author: Tatsuhito Kasahara
Discussion: https://postgr.es/m/20211217200419.GQ17618@telsasoft.com
Discussion: https://postgr.es/m/CAP0=ZVJeeYfAeRfmzqAF2Lumdiv4S4FewyBnZd4DPTrsSQKJKw@mail.gmail.com
2022-02-17 09:52:02 +09:00
Heikki Linnakangas 70e81861fa Split xlog.c into xlog.c and xlogrecovery.c.
This moves the functions related to performing WAL recovery into the new
xlogrecovery.c source file, leaving xlog.c responsible for maintaining
the WAL buffers, coordinating the startup and switch from recovery to
normal operations, and other miscellaneous stuff that have always been in
xlog.c.

Reviewed-by: Andres Freund, Kyotaro Horiguchi, Robert Haas
Discussion: https://www.postgresql.org/message-id/a31f27b4-a31d-f976-6217-2b03be646ffa%40iki.fi
2022-02-16 09:30:38 +02:00
Thomas Munro 50e570a59e Add WL_SOCKET_CLOSED for socket shutdown events.
Provide a way for WaitEventSet to report that the remote peer has shut
down its socket, independently of whether there is any buffered data
remaining to be read.  This works only on systems where the kernel
exposes that information, namely:

* WAIT_USE_POLL builds using POLLRDHUP, if available
* WAIT_USE_EPOLL builds using EPOLLRDHUP
* WAIT_USE_KQUEUE builds using EV_EOF

Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Maksim Milyutin <milyutinma@gmail.com>
Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
2022-02-14 16:52:23 +13:00
Thomas Munro 4eb2176318 Fix DROP {DATABASE,TABLESPACE} on Windows.
Previously, it was possible for DROP DATABASE, DROP TABLESPACE and ALTER
DATABASE SET TABLESPACE to fail because other backends still had file
handles open for dropped tables.  Windows won't allow a directory
containing unlinked-but-still-open files to be unlinked.  Tackle this
problem by forcing all backends to close all smgr fds.  No change for
Unix systems, which don't suffer from the problem, but the new code path
can be tested by Unix-based developers by defining
USE_BARRIER_SMGRRELEASE explicitly.

It's possible that PROCSIGNAL_BARRIER_SMGRRELEASE will have more
bug-fixing applications soon (under discussion).  Note that this is the
first user of the ProcSignalBarrier mechanism from commit 16a4e4aec.  It
could in principle be back-patched as far as 14, but since field
complaints are rare and ProcSignalBarrier hasn't been battle-tested,
that seems like a bad idea.  Fix in master only, where these failures
have started to show up in automated testing due to new tests.

Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com
2022-02-12 10:21:23 +13:00
Tom Lane c5f5b4dd4b Test honestly for <sys/signalfd.h>.
Commit 6a2a70a02 supposed that any platform having <sys/epoll.h>
would also have <sys/signalfd.h>.  It turns out there are still a
few people using platforms where that's not so, so we'd better make
a separate configure probe for it.  But since it took this long to
notice, I'm content with the decision to not have a separate code
path for epoll-only machines; we'll just fall back to using poll()
for these stragglers.

Per gripe from Gabriela Serventi.  Back-patch to v14 where this
code came in.

Discussion: https://postgr.es/m/CAHOHWE-JjJDfcYuLAAEO7Jk07atFAU47z8TzHzg71gbC0aMy=g@mail.gmail.com
2022-02-09 14:24:54 -05:00
Robert Haas aa64f23b02 Remove MaxBackends variable in favor of GetMaxBackends() function.
Previously, it was really easy to write code that accessed MaxBackends
before we'd actually initialized it, especially when coding up an
extension. To make this less error-prune, introduce a new function
GetMaxBackends() which should be used to obtain the correct value.
This will ERROR if called too early. Demote the global variable to
a file-level static, so that nobody can peak at it directly.

Nathan Bossart. Idea by Andres Freund. Review by Greg Sabino Mullane,
by Michael Paquier (who had doubts about the approach), and by me.

Discussion: http://postgr.es/m/20210802224204.bckcikl45uezv5e4@alap3.anarazel.de
2022-02-08 15:53:19 -05:00
Tomas Vondra f192e1bdf3 Fix ordering of XIDs in ProcArrayApplyRecoveryInfo
Commit 8431e296ea reworked ProcArrayApplyRecoveryInfo to sort XIDs
before adding them to KnownAssignedXids. But the XIDs are sorted using
xidComparator, which compares the XIDs simply as uint32 values, not
logically. KnownAssignedXidsAdd() however expects XIDs in logical order,
and calls TransactionIdFollowsOrEquals() to enforce that. If there are
XIDs for which the two orderings disagree, an error is raised and the
recovery fails/restarts.

Hitting this issue is fairly easy - you just need two transactions, one
started before the 4B limit (e.g. XID 4294967290), the other sometime
after it (e.g. XID 1000). Logically (4294967290 <= 1000) but when
compared using xidComparator we try to add them in the opposite order.
Which makes KnownAssignedXidsAdd() fail with an error like this:

  ERROR: out-of-order XID insertion in KnownAssignedXids

This only happens during replica startup, while processing RUNNING_XACTS
records to build the snapshot. Once we reach STANDBY_SNAPSHOT_READY, we
skip these records. So this does not affect already running replicas,
but if you restart (or create) a replica while there are transactions
with XIDs for which the two orderings disagree, you may hit this.

Long-running transactions and frequent replica restarts increase the
likelihood of hitting this issue. Once the replica gets into this state,
it can't be started (even if the old transactions are terminated).

Fixed by sorting the XIDs logically - this is fine because we're dealing
with normal XIDs (because it's XIDs assigned to backends) and from the
same wraparound epoch (otherwise the backends could not be running at
the same time on the primary node). So there are no problems with the
triangle inequality, which is why xidComparator compares raw values.

Investigation and root cause analysis by Abhijit Menon-Sen. Patch by me.

This issue is present in all releases since 9.4, however releases up to
9.6 are EOL already so backpatch to 10 only.

Reviewed-by: Abhijit Menon-Sen
Reviewed-by: Alvaro Herrera
Backpatch-through: 10
Discussion: https://postgr.es/m/36b8a501-5d73-277c-4972-f58a4dce088a%40enterprisedb.com
2022-01-27 20:13:55 +01:00
Amit Kapila 4c004dd520 Consistently use the function name CreateCheckPoint in code and comments.
Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACVZmKsvDjtd45+9oTcnjUJtC4LF2BYK8TpWT1f=NjJX3w@mail.gmail.com
2022-01-17 07:50:00 +05:30
John Naylor 7fa945b857 Improve warning message in pg_signal_backend()
Previously, invoking pg_terminate_backend() or pg_cancel_backend()
with the postmaster PID produced a "PID XXXX is not a PostgresSQL
server process" warning, which does not make sense. Change to
"backend process" to make the message more exact.

Nathan Bossart, based on an idea from Bharath Rupireddy with
input from Tom Lane and Euler Taveira

Discussion: https://www.postgresql.org/message-id/flat/CALj2ACW7Rr-R7mBcBQiXWPp=JV5chajjTdudLiF5YcpW-BmHhg@mail.gmail.com
2022-01-11 12:56:26 -05:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Tom Lane 3804539e48 Replace random(), pg_erand48(), etc with a better PRNG API and algorithm.
Standardize on xoroshiro128** as our basic PRNG algorithm, eliminating
a bunch of platform dependencies as well as fundamentally-obsolete PRNG
code.  In addition, this API replacement will ease replacing the
algorithm again in future, should that become necessary.

xoroshiro128** is a few percent slower than the drand48 family,
but it can produce full-width 64-bit random values not only 48-bit,
and it should be much more trustworthy.  It's likely to be noticeably
faster than the platform's random(), depending on which platform you
are thinking about; and we can have non-global state vectors easily,
unlike with random().  It is not cryptographically strong, but neither
are the functions it replaces.

Fabien Coelho, reviewed by Dean Rasheed, Aleksander Alekseev, and myself

Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2105241211230.165418@pseudo
2021-11-28 21:33:07 -05:00
Amit Kapila 0f0cfb4940 Fix parallel operations that prevent oldest xmin from advancing.
While determining xid horizons, we skip over backends that are running
Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently
for the purposes of computing Xmin for Vacuum. But we were not setting the
flags corresponding to these operations when they are performed in
parallel which was preventing Xid horizon from advancing.

The optimization related to skipping Create Index Concurrently, or Reindex
Concurrently operations was implemented in PG-14 but the fix is the same
for the Parallel Vacuum as well so back-patched till PG-13.

Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com
2021-11-19 09:04:40 +05:30
Michael Paquier 4cd046c203 Make some comments use the term "ProcSignal" for consistency
The surroundings in procsignal.c prefer using "ProcSignal" rather than
"procsignal".

Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACX99ghPmm1M_O4r4g+YsXFjCn=qF7PeDXntLwMpht_Gdg@mail.gmail.com
2021-11-09 12:56:34 +09:00
Alexander Korotkov 05e6e78c18 Reset lastOverflowedXid on standby when needed
Currently, lastOverflowedXid is never reset.  It's just adjusted on new
transactions known to be overflowed.  But if there are no overflowed
transactions for a long time, snapshots could be mistakenly marked as
suboverflowed due to wraparound.

This commit fixes this issue by resetting lastOverflowedXid when needed
altogether with KnownAssignedXids.

Backpatch to all supported versions.

Reported-by: Stan Hu
Discussion: https://postgr.es/m/CAMBWrQ%3DFp5UAsU_nATY7EMY7NHczG4-DTDU%3DmCvBQZAQ6wa2xQ%40mail.gmail.com
Author: Kyotaro Horiguchi, Alexander Korotkov
Reviewed-by: Stan Hu, Simon Riggs, Nikolay Samokhvalov, Andrey Borodin, Dmitry Dolgov
2021-11-06 19:13:58 +03:00
Tom Lane 6301c3adab Avoid O(N^2) behavior when the standby process releases many locks.
When replaying a transaction that held many exclusive locks on the
primary, a standby server's startup process would expend O(N^2)
effort on manipulating the list of locks.  This code was fine when
written, but commit 1cff1b95a made repetitive list_delete_first()
calls inefficient, as explained in its commit message.  Fix by just
iterating the list normally, and releasing storage only when done.
(This'd be inadequate if we needed to recover from an error occurring
partway through; but we don't.)

Back-patch to v13 where 1cff1b95a came in.

Nathan Bossart

Discussion: https://postgr.es/m/CD2F0E7F-9822-45EC-A411-AE56F14DEA9F@amazon.com
2021-10-31 15:31:29 -04:00
Robert Haas 46846433a0 shm_mq: Update mq_bytes_written less often.
Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size, unless
the caller of shm_mq_send(v) requests a flush at the end of
the message. This reduces the number of calls to SetLatch(),
and also the number of CPU cache misses, considerably, and thus
makes shm_mq significantly faster.

Dilip Kumar, reviewed by Zhihong Yu and Tomas Vondra. Some
minor cosmetic changes by me.

Discussion: http://postgr.es/m/CAFiTN-tVXqn_OG7tHNeSkBbN+iiCZTiQ83uakax43y1sQb2OBA@mail.gmail.com
2021-10-14 16:13:36 -04:00
Daniel Gustafsson 7111e332c5 Fix duplicate words in comments
Remove accidentally duplicated words in code comments.

Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/87bl45t0co.fsf@wibble.ilmari.org
2021-10-04 15:12:57 +02:00
Michael Paquier 43c1c4f65e Introduce GUC shared_memory_size_in_huge_pages
This runtime-computed GUC shows the number of huge pages required
for the server's main shared memory area, taking advantage of the
work done in 0c39c29 and 0bd305e.  This is useful for users to estimate
the amount of huge pages required for a server as it becomes possible to
do an estimation without having to start the server and potentially
allocate a large chunk of shared memory.

The number of huge pages is calculated based on the existing GUC
huge_page_size if set, or by using the system's default by looking at
/proc/meminfo on Linux.  There is nothing new here as this commit reuses
the existing calculation methods, and just exposes this information
directly to the user.  The routine calculating the huge page size is
refactored to limit the number of files with platform-specific flags.

This new GUC's name was the most popular choice based on the discussion
done.  This is only supported on Linux.

I have taken the time to test the change on Linux, Windows and MacOS,
though for the last two ones large pages are not supported.  The first
one calculates correctly the number of pages depending on the existing
GUC huge_page_size or the system's default.

Thanks to Andres Freund, Robert Haas, Kyotaro Horiguchi, Tom Lane,
Justin Pryzby (and anybody forgotten here) for the discussion.

Author: Nathan Bossart
Discussion: https://postgr.es/m/F2772387-CE0F-46BF-B5F1-CC55516EB885@amazon.com
2021-09-21 10:31:58 +09:00
Fujii Masao dc899146db Fix variable shadowing in procarray.c.
ProcArrayGroupClearXid function has a parameter named "proc",
but the same name was used for its local variables. This commit fixes
this variable shadowing, to improve code readability.

Back-patch to all supported versions, to make future back-patching
easy though this patch is classified as refactoring only.

Reported-by: Ranier Vilela
Author: Ranier Vilela, Aleksander Alekseev
https://postgr.es/m/CAEudQAqyoTZC670xWi6w-Oe2_Bk1bfu2JzXz6xRfiOUzm7xbyQ@mail.gmail.com
2021-09-16 13:06:21 +09:00
Fujii Masao 64a62ebeeb Use int instead of size_t in procarray.c.
All size_t variables declared in procarray.c are actually int ones.
Let's use int instead of size_t for those variables. Which would
reduce Wsign-compare compiler warnings.

Back-patch to v14 where commit 941697c3c1 added size_t variables
in procarray.c, to make future back-patching easy though
this patch is classified as refactoring only.

Reported-by: Ranier Vilela
Author: Ranier Vilela, Aleksander Alekseev
https://postgr.es/m/CAEudQAqyoTZC670xWi6w-Oe2_Bk1bfu2JzXz6xRfiOUzm7xbyQ@mail.gmail.com
2021-09-16 12:52:30 +09:00
Michael Paquier aa37a439db Fix compilation warning in ipci.c
A Size had better use %zu when printed.

Oversight in bd17880, per buildfarm member lapwing.
2021-09-08 14:22:52 +09:00
Michael Paquier bd1788051b Introduce GUC shared_memory_size
This runtime-computed GUC shows the size of the server's main shared
memory area, taking into account the amount of shared memory allocated
by extensions as this is calculated after processing
shared_preload_libraries.

Author: Nathan Bossart
Discussion: https://postgr.es/m/F2772387-CE0F-46BF-B5F1-CC55516EB885@amazon.com
2021-09-08 12:02:30 +09:00
Michael Paquier 0bd305ee1d Move the shared memory size calculation to its own function
This change refactors the shared memory size calculation in
CreateSharedMemoryAndSemaphores() to its own function.  This is intended
for use in a future change related to the setup of huge pages and shared
memory with some GUCs, while useful on its own for extensions.

Author: Nathan Bossart
Discussion: https://postgr.es/m/F2772387-CE0F-46BF-B5F1-CC55516EB885@amazon.com
2021-09-06 10:59:20 +09:00