Commit Graph

4354 Commits

Author SHA1 Message Date
Alvaro Herrera ce197e91d0
Close yet another race condition in replication slot test code
Buildfarm shows that this test has a further failure mode when a
checkpoint starts earlier than expected, so we detect a "checkpoint
completed" line that's not the one we want.  Change the config to try
and prevent this.

Per buildfarm

While at it, update one comment that was forgotten in commit
d18e75664a.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20210729.162038.534808353849568395.horikyota.ntt@gmail.com
2021-07-29 17:09:06 -04:00
Fujii Masao a00c138b78 Update minimum recovery point on truncation during WAL replay of abort record.
If a file is truncated, we must update minRecoveryPoint. Once a file is
truncated, there's no going back; it would not be safe to stop recovery
at a point earlier than that anymore.

Commit 7bffc9b7bf changed xact_redo_commit() so that it updates
minRecoveryPoint on truncation, but forgot to change xact_redo_abort().

Back-patch to all supported versions.

Reported-by: mengjuan.cmj@alibaba-inc.com
Author: Fujii Masao
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/b029fce3-4fac-4265-968e-16f36ff4d075.mengjuan.cmj@alibaba-inc.com
2021-07-29 01:31:41 +09:00
Fujii Masao 7fcf2faf9c Make XLOG_FPI_FOR_HINT records honor full_page_writes setting.
Commit 2c03216d83 changed XLOG_FPI_FOR_HINT records so that they always
included full-page images even when full_page_writes was disabled. However,
in this setting, they don't need to do that because hint bit updates don't
need to be protected from torn writes.

Therefore, this commit makes XLOG_FPI_FOR_HINT records honor full_page_writes
setting. That is, XLOG_FPI_FOR_HINT records may include no full-page images
if full_page_writes is disabled, and WAL replay of them does nothing.

Reported-by: Zhang Wenjie
Author: Kyotaro Horiguchi
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/tencent_60F11973A111EED97A8596FFECC4A91ED405@qq.com
2021-07-21 11:19:00 +09:00
Alvaro Herrera ead9e51e82
Advance old-segment horizon properly after slot invalidation
When some slots are invalidated due to the max_slot_wal_keep_size limit,
the old segment horizon should move forward to stay within the limit.
However, in commit c655077639 we forgot to call KeepLogSeg again to
recompute the horizon after invalidating replication slots.  In cases
where other slots remained, the limits would be recomputed eventually
for other reasons, but if all slots were invalidated, the limits would
not move at all afterwards.  Repair.

Backpatch to 13 where the feature was introduced.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Marcin Krupowicz <mk@071.ovh>
Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
2021-07-16 12:07:30 -04:00
Tom Lane a49d081235 Replace explicit PIN entries in pg_depend with an OID range test.
As of v14, pg_depend contains almost 7000 "pin" entries recording
the OIDs of built-in objects.  This is a fair amount of bloat for
every database, and it adds time to pg_depend lookups as well as
initdb.  We can get rid of all of those entries in favor of an OID
range check, i.e. "OIDs below FirstUnpinnedObjectId are pinned".

(template1 and the public schema are exceptions.  Those exceptions
are now wired into IsPinnedObject() instead of initdb's code for
filling pg_depend, but it's the same amount of cruft either way.)

The contents of pg_shdepend are modified likewise.

Discussion: https://postgr.es/m/3737988.1618451008@sss.pgh.pa.us
2021-07-15 11:41:47 -04:00
Amit Kapila a8fd13cab0 Add support for prepared transactions to built-in logical replication.
To add support for streaming transactions at prepare time into the
built-in logical replication, we need to do the following things:

* Modify the output plugin (pgoutput) to implement the new two-phase API
callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle two-phase
transactions by replaying them on prepare.

* Add a new SUBSCRIPTION option "two_phase" to allow users to enable
two-phase transactions. We enable the two_phase once the initial data sync
is over.

We however must explicitly disable replication of two-phase transactions
during replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover, we don't have a replication connection open so we don't know
where to send the data anyway.

The streaming option is not allowed with this new two_phase option. This
can be done as a separate patch.

We don't allow to toggle two_phase option of a subscription because it can
lead to an inconsistent replica. For the same reason, we don't allow to
refresh the publication once the two_phase is enabled for a subscription
unless copy_data option is false.

Author: Peter Smith, Ajin Cherian and Amit Kapila based on previous work by Nikhil Sontakke and Stas Kelvich
Reviewed-by: Amit Kapila, Sawada Masahiko, Vignesh C, Dilip Kumar, Takamichi Osumi, Greg Nancarrow
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/02DA5F5E-CECE-4D9C-8B4B-418077E2C010@postgrespro.ru
Discussion: https://postgr.es/m/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com
2021-07-14 07:33:50 +05:30
Tom Lane f10f0ae420 Replace RelationOpenSmgr() with RelationGetSmgr().
The idea behind this patch is to design out bugs like the one fixed
by commit 9d523119f.  Previously, once one did RelationOpenSmgr(rel),
it was considered okay to access rel->rd_smgr directly for some
not-very-clear interval.  But since that pointer will be cleared by
relcache flushes, we had bugs arising from overreliance on a previous
RelationOpenSmgr call still being effective.

Now, very little code except that in rel.h and relcache.c should ever
touch the rd_smgr field directly.  The normal coding rule is to use
RelationGetSmgr(rel) and not expect the result to be valid for longer
than one smgr function call.  There are a couple of places where using
the function every single time seemed like overkill, but they are now
annotated with large warning comments.

Amul Sul, after an idea of mine.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
2021-07-12 17:01:36 -04:00
Heikki Linnakangas 4c64b51dc5 Remove dead assignment to local variable.
This should have been removed in commit 7e30c186da, which split the loop
into two. Only the first loop uses the 'from' variable; updating it in
the second loop is bogus. It was never read after the first loop, so this
was harmless and surely optimized away by the compiler, but let's be tidy.

Backpatch to all supported versions.

Author: Ranier Vilela
Discussion: https://www.postgresql.org/message-id/CAEudQAoWq%2BAL3BnELHu7gms2GN07k-np6yLbukGaxJ1vY-zeiQ%40mail.gmail.com
2021-07-12 11:13:33 +03:00
Michael Paquier 0f80b47d24 Add forgotten LSN_FORMAT_ARGS() in xlogreader.c
These should have been part of 4035cd5, that introduced LZ4 support for
wal_compression.
2021-07-09 15:27:36 +09:00
Michael Paquier 2aca19f298 Use WaitLatch() instead of pg_usleep() at the end of backups
This concerns pg_stop_backup() and BASE_BACKUP, when waiting for the
WAL segments required for a backup to be archived.  This simplifies a
bit the handling of the wait event used in this code path.

Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Stephen Frost
Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com
2021-07-06 08:10:59 +09:00
Peter Eisentraut 6bd3ec62d9 Use InvalidBucket instead of -1 where appropriate
Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEudQAp%3DZwKjrP4L%2BCzqV7SmWiaQidPPRqj4tqdjDG4KBx5yrg%40mail.gmail.com
2021-07-02 11:59:55 +02:00
Michael Paquier 70685385d7 Use WaitLatch() instead of pg_usleep() at end-of-vacuum truncation
This has the advantage to make a process more responsive when the
postmaster dies, even if the wait time was rather limited as there was
only a 50ms timeout here.  Another advantage of this change is for
monitoring, as we gain a new wait event for the end-of-vacuum
truncation.

Author: Bharath Rupireddy
Reviewed-by: Aleksander Alekseev, Thomas Munro, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com
2021-07-02 12:58:34 +09:00
Michael Paquier 17707c059c Fix incorrect PITR message for transaction ROLLBACK PREPARED
Reaching PITR on such a transaction would cause the generation of a LOG
message mentioning a transaction committed, not aborted.

Oversight in 4f1b890.

Author: Simon Riggs
Discussion: https://postgr.es/m/CANbhV-GJ6KijeCgdOrxqMCQ+C8QiK657EMhCy4csjrPcEUFv_Q@mail.gmail.com
Backpatch-through: 9.6
2021-06-30 11:48:53 +09:00
Michael Paquier 47f514dd9a Fix compilation warning in xloginsert.c
This is reproducible with gcc using at least -O0.  The last checks
validating the compression of a block could not be reached with this
variable not set, but let's be clean.

Oversight in 4035cd5, per buildfarm member lapwing.
2021-06-29 11:57:18 +09:00
Michael Paquier 4035cd5d4e Add support for LZ4 with compression of full-page writes in WAL
The logic is implemented so as there can be a choice in the compression
used when building a WAL record, and an extra per-record bit is used to
track down if a block is compressed with PGLZ, LZ4 or nothing.

wal_compression, the existing parameter, is changed to an enum with
support for the following backward-compatible values:
- "off", the default, to not use compression.
- "pglz" or "on", to compress FPWs with PGLZ.
- "lz4", the new mode, to compress FPWs with LZ4.

Benchmarking has showed that LZ4 outclasses easily PGLZ.  ZSTD would be
also an interesting choice, but going just with LZ4 for now makes the
patch minimalistic as toast compression is already able to use LZ4, so
there is no need to worry about any build-related needs for this
implementation.

Author: Andrey Borodin, Justin Pryzby
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/3037310D-ECB7-4BF1-AF20-01C10BB33A33@yandex-team.ru
2021-06-29 11:17:55 +09:00
Noah Misch cc2c7d65fc Skip WAL recycling and preallocation during archive recovery.
The previous commit addressed the chief consequences of a race condition
between InstallXLogFileSegment() and KeepFileRestoredFromArchive().  Fix
three lesser consequences.  A spurious durable_rename_excl() LOG message
remained possible.  KeepFileRestoredFromArchive() wasted the proceeds of
WAL recycling and preallocation.  Finally, XLogFileInitInternal() could
return a descriptor for a file that KeepFileRestoredFromArchive() had
already unlinked.  That felt like a recipe for future bugs.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
2021-06-28 18:34:56 -07:00
Noah Misch 2b3e4672f7 Don't ERROR on PreallocXlogFiles() race condition.
Before a restartpoint finishes PreallocXlogFiles(), a startup process
KeepFileRestoredFromArchive() call can unlink the preallocated segment.
If a CHECKPOINT sql command had elicited the restartpoint experiencing
the race condition, that sql command failed.  Moreover, the restartpoint
omitted its log_checkpoints message and some inessential resource
reclamation.  Prevent the ERROR by skipping open() of the segment.
Since these consequences are so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
2021-06-28 18:34:56 -07:00
Noah Misch 421484f79c Remove XLogFileInit() ability to unlink a pre-existing file.
Only initdb used it.  initdb refuses to operate on a non-empty directory
and generally does not cope with pre-existing files of other kinds.
Hence, use the opportunity to simplify.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
2021-06-28 18:34:56 -07:00
Noah Misch 85656bc305 In XLogFileInit(), fix *use_existent postcondition to suit callers.
Infrequently, the mismatch caused log_checkpoints messages and
TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
by one.  Since that consequence is so minor, no back-patch.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
2021-06-28 18:34:55 -07:00
Noah Misch c53c6b98d3 Remove XLogFileInit() ability to skip ControlFileLock.
Cold paths, initdb and end-of-recovery, used it.  Don't optimize them.

Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
2021-06-28 18:34:55 -07:00
Andrew Dunstan e1c1c30f63
Pre branch pgindent / pgperltidy run
Along the way make a slight adjustment to
src/include/utils/queryjumble.h to avoid an unused typedef.
2021-06-28 11:05:54 -04:00
Peter Eisentraut c31833779d Message style improvements 2021-06-28 08:36:44 +02:00
Amit Kapila b786304c29 Fix race condition in TransactionGroupUpdateXidStatus().
When we cannot immediately acquire XactSLRULock in exclusive mode at
commit time, we add ourselves to a list of processes that need their XIDs
status update. We do this if the clog page where we need to update the
current transaction status is the same as the group leader's clog page,
otherwise, we allow the caller to clear it by itself. Now, when we can't
add ourselves to any group, we were not clearing the current proc if it
has already become a member of some group which was leading to an
assertion failure when the same proc was assigned to another backend after
the current backend exits.

Reported-by: Alexander Lakhin
Bug: 17072
Author: Amit Kapila
Tested-By: Alexander Lakhin
Backpatch-through: 11, where it was introduced
Discussion: https://postgr.es/m/17072-2f8764857ef2c92a@postgresql.org
2021-06-28 09:29:38 +05:30
Peter Eisentraut a60c4c5c1a Remove redundant variable pageSize in gistinitpage
In gistinitpage, pageSize variable looks redundant, instead just
pass BLCKSZ. This will be consistent with its peers BloomInitPage,
brin_page_init and SpGistInitPage.

Author: Bharath Rupireddy <bharath.rupireddy@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CALj2ACWj=V1k5591eeZK2sOg2FYuBUp6azFO8tMkBtGfXf8PMQ@mail.gmail.com
2021-06-25 07:55:34 +02:00
Peter Geoghegan e8f201ab82 Remove overzealous VACUUM failsafe assertions.
The failsafe can trigger when index processing is already disabled.
This can happen when VACUUM's INDEX_CLEANUP parameter is "off" and the
failsafe happens to trigger.  Remove assertions that assume that index
processing is directly tied to the failsafe.

Oversight in commit c242baa4, which made it possible for the failsafe to
trigger in a two-pass strategy VACUUM that has yet to make its first
call to lazy_vacuum_all_indexes().
2021-06-20 18:14:00 -07:00
Peter Geoghegan 3499df0dee Support disabling index bypassing by VACUUM.
Generalize the INDEX_CLEANUP VACUUM parameter (and the corresponding
reloption): make it into a ternary style boolean parameter.  It now
exposes a third option, "auto".  The "auto" option (which is now the
default) enables the "bypass index vacuuming" optimization added by
commit 1e55e7d1.

"VACUUM (INDEX_CLEANUP TRUE)" is redefined to once again make VACUUM
simply do any required index vacuuming, regardless of how few dead
tuples are encountered during the first scan of the target heap relation
(unless there are exactly zero).  This gives users a way of opting out
of the "bypass index vacuuming" optimization, if for whatever reason
that proves necessary.  It is also expected to be used by PostgreSQL
developers as a testing option from time to time.

"VACUUM (INDEX_CLEANUP FALSE)" does the same thing as it always has: it
forcibly disables both index vacuuming and index cleanup.  It's not
expected to be used much in PostgreSQL 14.  The failsafe mechanism added
by commit 1e55e7d1 addresses the same problem in a simpler way.
INDEX_CLEANUP can now be thought of as a testing and compatibility
option.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-WznrBoCST4_Gxh_G9hA8NzGUbeBGnOUC8FcXcrhqsv6OHQ@mail.gmail.com
2021-06-18 20:04:07 -07:00
Heikki Linnakangas d24c5658a8 Tidy up GetMultiXactIdMembers()'s behavior on error
One of the error paths left *members uninitialized. That's not a live
bug, because most callers don't look at *members when the function
returns -1, but let's be tidy. One caller, in heap_lock_tuple(), does
"if (members != NULL) pfree(members)", but AFAICS it never passes an
invalid 'multi' value so it should not reach that error case.

The callers are also a bit inconsistent in their expectations.
heap_lock_tuple() pfrees the 'members' array if it's not-NULL, others
pfree() it if "nmembers >= 0", and others if "nmembers > 0". That's
not a live bug either, because the function should never return 0, but
add an Assert for that to make it more clear. I left the callers alone
for now.

I also moved the line where we set *nmembers. It wasn't wrong before,
but I like to do that right next to the 'return' statement, to make it
clear that it's always set on return.

Also remove one unreachable return statement after ereport(ERROR), for
brevity and for consistency with the similar if-block right after it.

Author: Greg Nancarrow with the additional changes by me
Backpatch-through: 9.6, all supported versions
2021-06-17 14:50:42 +03:00
Heikki Linnakangas d0303bc8d2 Fix outdated comment that talked about seek position of WAL file.
Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL
file, which doesn't change the seek position (unless we fall back to
the implementation in src/port/pread.c). Update comment accordingly.

Backpatch-through: 12, where we started to use pg_pread()
2021-06-16 12:36:15 +03:00
Peter Geoghegan 958cfbcf2d Remove unneeded field from VACUUM state.
Bugfix commit 5fc89376 effectively made the lock_waiter_detected field
from vacuumlazy.c's global state struct into private state owned by
lazy_truncate_heap().  Finish this off by replacing the struct field
with a local variable.
2021-06-15 08:59:36 -07:00
Michael Paquier dbab0c07e5 Remove forced toast recompression in VACUUM FULL/CLUSTER
The extra checks added by the recompression of toast data introduced in
bbe0a81 is proving to have a performance impact on VACUUM or CLUSTER
even if no recompression is done.  This is more noticeable with more
toastable columns that contain non-NULL values.

Improvements could be done to make those extra checks less expensive,
but that's not material for 14 at this stage, and we are not sure either
if the code path of VACUUM FULL/CLUSTER is adapted for this job.

Per discussion with several people, including Andres Freund, Robert
Haas, Álvaro Herrera, Tom Lane and myself.

Discussion: https://postgr.es/m/20210527003144.xxqppojoiwurc2iz@alap3.anarazel.de
2021-06-14 09:25:50 +09:00
Alvaro Herrera 5cc1cd5028
Report sort phase progress in parallel btree build
We were already reporting it, but only after the parallel workers were
finished, which is visibly much later than what happens in a serial
build.

With this change we report it when the leader starts its own sort phase
when participating in the build (the normal case).  Now this might
happen a little later than when the workers start their sorting phases,
but a) communicating the actual phase start from workers is likely to be
a hassle, and b) the sort phase start is pretty fuzzy anyway, since
sorting per se is actually initiated by tuplesort.c internally earlier
than tuplesort_performsort() is called.

Backpatch to pg12, where the progress reporting code for CREATE INDEX
went in.

Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Greg Nancarrow <gregn4422@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/1128176d-1eee-55d4-37ca-e63644422adb
2021-06-11 19:07:32 -04:00
David Rowley 55ba5973d9 Fix an asssortment of typos in brin_minmax_multi.c and mcv.c
Discussion: https://postgr.es/m/CAApHDvrbyJNOPBws4RUhXghZ7+TBjtdO-rznTsqZECuowNorXg@mail.gmail.com
2021-06-10 20:13:44 +12:00
Robert Haas caba8f0d43 Fix corner case failure of new standby to follow new primary.
This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.

Commit ee994272ca introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.

Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
2021-06-09 16:17:00 -04:00
Tomas Vondra d1f0aa7696 Fix pg_visibility regression failure with CLOBBER_CACHE_ALWAYS
Commit 8e03eb92e9 reverted a bit too much code, reintroducing one of the
issues fixed by 39b66a91bd - a page might have been left partially empty
after relcache invalidation.

Reported-By: Tom Lane
Author: Masahiko Sawada
Discussion: https://postgr.es/m/822752.1623032114@sss.pgh.pa.us
Discussion: https://postgr.es/m/CAD21AoA%3D%3Df2VSw3c-Cp_y%3DWLKHMKc1D6s7g3YWsCOvgaYPpJcg%40mail.gmail.com
2021-06-08 19:33:11 +02:00
David Rowley 8bdb36aab2 Clean up some questionable usages of DatumGet* macros
This tidies up some questionable coding which made use of
DatumGetPointer() for Datums being passed into functions where the
parameter is expected to be a cstring.  We saw no compiler warnings with
the old code as the Pointer type used in DatumGetPointer() happens to
be a char * rather than a void *.  However, that's no excuse and we should
be using the correct macro for the job.

Here we also make use of OutputFunctionCall() rather than using
FunctionCall1() directly to call the type's output function.
OutputFunctionCall() is the standard way to do this.  It casts the
returned value to a cstring for us.

In passing get rid of a duplicate call to strlen().  Most compilers will
likely optimize away the 2nd call, but there may be some that won't.  In
any case, this just aligns the code to some other nearby code that already
does this.

Discussion: https://postgr.es/m/CAApHDvq1D=ehZ8hey8Hz67N+_Zth0GHO5wiVCfv1YcGPMXJq0A@mail.gmail.com
2021-06-04 22:42:17 +12:00
David Rowley 7fc26d11e3 Adjust locations which have an incorrect copyright year
A few patches committed after ca3b37487 mistakenly forgot to make the
copyright year 2021.  Fix these.

Discussion: https://postgr.es/m/CAApHDvqyLmd9P2oBQYJ=DbrV8QwyPRdmXtCTFYPE08h+ip0UJw@mail.gmail.com
2021-06-04 12:19:50 +12:00
David Rowley f736e188ce Standardize usages of appendStringInfo and appendPQExpBuffer
Fix a few places that were using appendStringInfo() when they should have
been using appendStringInfoString().  Also some cases of
appendPQExpBuffer() that would have been better suited to use
appendPQExpBufferChar(), and finally, some places that used
appendPQExpBuffer() when appendPQExpBufferStr() would have suited better.

There are no bugs are being fixed here.  The aim is just to make the code
use the most optimal function for the job.

All the code being changed here is new to PG14.  It makes sense to fix
these before we branch for PG15.  There are a few other places that we
could fix, but those cases are older code so fixing those seems less
worthwhile as it may cause unnecessary back-patching pain in the future.

Author: Hou Zhijie
Discussion: https://postgr.es/m/OS0PR01MB5716732158B1C4142C6FE375943D9@OS0PR01MB5716.jpnprd01.prod.outlook.com
2021-06-03 16:38:03 +12:00
Tomas Vondra 8e03eb92e9 Revert most of 39b66a91bd
Reverts most of commit 39b66a91bd, which was found to cause significant
regression for REFRESH MATERIALIZED VIEW. This means only rows inserted
by heap_multi_insert will benefit from the optimization, implemented in
commit 7db0cd2145.

Reported-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoA%3D%3Df2VSw3c-Cp_y%3DWLKHMKc1D6s7g3YWsCOvgaYPpJcg%40mail.gmail.com
2021-06-03 00:13:59 +02:00
Peter Geoghegan 9afdea9824 Fix VACUUM VERBOSE's LP_DEAD item pages output.
Oversight in commit 5100010e.
2021-05-27 17:09:16 -07:00
Tom Lane e6241d8e03 Rethink definition of pg_attribute.attcompression.
Redefine '\0' (InvalidCompressionMethod) as meaning "if we need to
compress, use the current setting of default_toast_compression".
This allows '\0' to be a suitable default choice regardless of
datatype, greatly simplifying code paths that initialize tupledescs
and the like.  It seems like a more user-friendly approach as well,
because now the default compression choice doesn't migrate into table
definitions, meaning that changing default_toast_compression is
usually sufficient to flip an installation's behavior; one needn't
tediously issue per-column ALTER SET COMPRESSION commands.

Along the way, fix a few minor bugs and documentation issues
with the per-column-compression feature.  Adopt more robust
APIs for SetIndexStorageProperties and GetAttributeCompression.

Bump catversion because typical contents of attcompression will now
be different.  We could get away without doing that, but it seems
better to ensure v14 installations all agree on this.  (We already
forced initdb for beta2, anyway.)

Discussion: https://postgr.es/m/626613.1621787110@sss.pgh.pa.us
2021-05-27 13:24:27 -04:00
Michael Paquier 190fa5a00a Fix typo in heapam.c
Author: Hou Zhijie
Discussion: https://postgr.es/m/OS0PR01MB571612191738540B27A8DE5894249@OS0PR01MB5716.jpnprd01.prod.outlook.com
2021-05-26 19:53:03 +09:00
Michael Paquier fb0f5f0172 Fix memory leak when de-toasting compressed values in VACUUM FULL/CLUSTER
VACUUM FULL and CLUSTER can be used to enforce the use of the existing
compression method of a toastable column if a value currently stored is
compressed with a method that does not match the column's defined
method.  The code in charge of decompressing and recompressing toast
values at rewrite left around the detoasted values, causing an
accumulation of memory allocated in TopTransactionContext.

When processing large relations, this could cause the system to run out
of memory.  The detoasted values are not needed once their tuple is
rewritten, and this commit ensures that the necessary cleanup happens.

Issue introduced by bbe0a81d.  The comments of the area are reordered a
bit while on it.

Reported-by: Andres Freund
Analyzed-by: Andres Freund
Author: Michael Paquier
Reviewed-by: Dilip Kumar
Discussion: https://postgr.es/m/20210521211929.pcehg6f23icwstdb@alap3.anarazel.de
2021-05-25 14:27:18 +09:00
Peter Geoghegan c242baa4a8 Consider triggering VACUUM failsafe during scan.
The wraparound failsafe mechanism added by commit 1e55e7d1 handled the
one-pass strategy case (i.e. the "table has no indexes" case) by adding
a dedicated failsafe check.  This made up for the fact that the usual
one-pass checks inside lazy_vacuum_all_indexes() cannot ever be reached
during a one-pass strategy VACUUM.

This approach failed to account for two-pass VACUUMs that opt out of
index vacuuming up-front.  The INDEX_CLEANUP off case in the only case
that works like that.

Fix this by performing a failsafe check every 4GB during the first scan
of the heap, regardless of the details of the VACUUM.  This eliminates
the special case, and will make the failsafe trigger more reliably.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/20210424002921.pb3t7h6frupdqnkp@alap3.anarazel.de
2021-05-24 17:14:02 -07:00
Tom Lane f5024d8d7b Re-order pg_attribute columns to eliminate some padding space.
Now that attcompression is just a char, there's a lot of wasted
padding space after it.  Move it into the group of char-wide
columns to save a net of 4 bytes per pg_attribute entry.  While
we're at it, swap the order of attstorage and attalign to make for
a more logical grouping of these columns.

Also re-order actions in related code to match the new field ordering.

This patch also fixes one outright bug: equalTupleDescs() failed to
compare attcompression.  That could, for example, cause relcache
reload to fail to adopt a new value following a change.

Michael Paquier and Tom Lane, per a gripe from Andres Freund.

Discussion: https://postgr.es/m/20210517204803.iyk5wwvwgtjcmc5w@alap3.anarazel.de
2021-05-23 12:12:09 -04:00
Tom Lane f21fadafaf Avoid detoasting failure after COMMIT inside a plpgsql FOR loop.
exec_for_query() normally tries to prefetch a few rows at a time
from the query being iterated over, so as to reduce executor
entry/exit overhead.  Unfortunately this is unsafe if we have
COMMIT or ROLLBACK within the loop, because there might be
TOAST references in the data that we prefetched but haven't
yet examined.  Immediately after the COMMIT/ROLLBACK, we have
no snapshots in the session, meaning that VACUUM is at liberty
to remove recently-deleted TOAST rows.

This was originally reported as a case triggering the "no known
snapshots" error in init_toast_snapshot(), but even if you miss
hitting that, you can get "missing toast chunk", as illustrated
by the added isolation test case.

To fix, just disable prefetching in non-atomic contexts.  Maybe
there will be performance complaints prompting us to work harder
later, but it's not clear at the moment that this really costs
much, and I doubt we'd want to back-patch any complicated fix.

In passing, adjust that error message in init_toast_snapshot()
to be a little clearer about the likely cause of the problem.

Patch by me, based on earlier investigation by Konstantin Knizhnik.

Per bug #15990 from Andreas Wicht.  Back-patch to v11 where
intra-procedure COMMIT was added.

Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org
2021-05-20 18:32:37 -04:00
Fujii Masao 167bd48049 Make standby promotion reset the recovery pause state to 'not paused'.
If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But previously in that case
pg_get_wal_replay_pause_state() returned 'paused' wrongly while a promotion
was ongoing.

This commit changes a standby promotion so that it marks the recovery
pause state as 'not paused' when it's triggered, to fix the issue.

Author: Fujii Masao
Reviewed-by: Dilip Kumar, Kyotaro Horiguchi
Discussion: https://postgr.es/m/f706876c-4894-0ba5-6f4d-79803eeea21b@oss.nttdata.com
2021-05-19 13:48:19 +09:00
Peter Geoghegan 8f72bbac3e Harden nbtree deduplication posting split code.
Add a defensive "can't happen" error to code that handles nbtree posting
list splits (promote an existing assertion).  This avoids a segfault in
the event of an insertion of a newitem that is somehow identical to an
existing non-pivot tuple in the index.  An nbtree index should never
have two index tuples with identical TIDs.

This scenario is not particular unlikely in the event of any kind of
corruption that leaves the index in an inconsistent state relative to
the heap relation that is indexed.  There are two known reports of
preventable hard crashes.  Doing nothing seems unacceptable given the
general expectation that nbtree will cope reasonably well with corrupt
data.

Discussion: https://postgr.es/m/CAH2-Wz=Jr_d-dOYEEmwz0-ifojVNWho01eAqewfQXgKfoe114w@mail.gmail.com
Backpatch: 13-, where nbtree deduplication was introduced.
2021-05-14 15:08:02 -07:00
Tom Lane c3c35a733c Prevent infinite insertion loops in spgdoinsert().
Formerly we just relied on operator classes that assert longValuesOK
to eventually shorten the leaf value enough to fit on an index page.
That fails since the introduction of INCLUDE-column support (commit
09c1c6ab4), because the INCLUDE columns might alone take up more
than a page, meaning no amount of leaf-datum compaction will get
the job done.  At least with spgtextproc.c, that leads to an infinite
loop, since spgtextproc.c won't throw an error for not being able
to shorten the leaf datum anymore.

To fix without breaking cases that would otherwise work, add logic
to spgdoinsert() to verify that the leaf tuple size is decreasing
after each "choose" step.  Some opclasses might not decrease the
size on every single cycle, and in any case, alignment roundoff
of the tuple size could obscure small gains.  Therefore, allow
up to 10 cycles without additional savings before throwing an
error.  (Perhaps this number will need adjustment, but it seems
quite generous right now.)

As long as we've developed this logic, let's back-patch it.
The back branches don't have INCLUDE columns to worry about, but
this seems like a good defense against possible bugs in operator
classes.  We already know that an infinite loop here is pretty
unpleasant, so having a defense seems to outweigh the risk of
breaking things.  (Note that spgtextproc.c is actually the only
known opclass with longValuesOK support, so that this is all moot
for known non-core opclasses anyway.)

Per report from Dilip Kumar.

Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com
2021-05-14 15:07:34 -04:00
Tom Lane eb7a6b9229 Fix query-cancel handling in spgdoinsert().
Knowing that a buggy opclass could cause an infinite insertion loop,
spgdoinsert() intended to allow its loop to be interrupted by query
cancel.  However, that never actually worked, because in iterations
after the first, we'd be holding buffer lock(s) which would cause
InterruptHoldoffCount to be positive, preventing servicing of the
interrupt.

To fix, check if an interrupt is pending, and if so fall out of
the insertion loop and service the interrupt after we've released
the buffers.  If it was indeed a query cancel, that's the end of
the matter.  If it was a non-canceling interrupt reason, make use
of the existing provision to retry the whole insertion.  (This isn't
as wasteful as it might seem, since any upper-level index tuples we
already created should be usable in the next attempt.)

While there's no known instance of such a bug in existing release
branches, it still seems like a good idea to back-patch this to
all supported branches, since the behavior is fairly nasty if a
loop does happen --- not only is it uncancelable, but it will
quickly consume memory to the point of an OOM failure.  In any
case, this code is certainly not working as intended.

Per report from Dilip Kumar.

Discussion: https://postgr.es/m/CAFiTN-uxP_soPhVG840tRMQTBmtA_f_Y8N51G7DKYYqDh7XN-A@mail.gmail.com
2021-05-14 13:29:39 -04:00
Peter Geoghegan fbe9b80610 Fix autovacuum log output heap truncation issue.
The percentage of blocks from the table value reported by autovacuum log
output (following commit 5100010ee4) should never exceed 100% because
it describes the state of the table back when lazy_vacuum() was called.
The value could nevertheless exceed 100% in the event of heap relation
truncation.  We failed to compensate for how truncation affects
rel_pages.

Fix the faulty accounting by using the original rel_pages value instead
of the current/final rel_pages value.

Reported-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20210423204306.5osfpkt2ggaedyvy@alap3.anarazel.de
2021-05-13 16:07:17 -07:00
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Peter Eisentraut ec6e70c79f Refactor some error messages for easier translation 2021-05-12 07:42:51 +02:00
Fujii Masao d780d7c088 Change data type of counters in BufferUsage and WalUsage from long to int64.
Previously long was used as the data type for some counters in BufferUsage
and WalUsage. But long is only four byte, e.g., on Windows, and it's entirely
possible to wrap a four byte counter. For example, emitting more than
four billion WAL records in one transaction isn't actually particularly rare.

To avoid the overflows of those counters, this commit changes the data type
of them from long to int64.

Suggested-by: Andres Freund
Author: Masahiro Ikeda
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20201221211650.k7b53tcnadrciqjo@alap3.anarazel.de
Discussion: https://postgr.es/m/af0964ac-7080-1984-dc23-513754987716@oss.nttdata.com
2021-05-12 09:56:34 +09:00
Tom Lane 049e1e2edb Fix mishandling of resjunk columns in ON CONFLICT ... UPDATE tlists.
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns.  That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.

This had escaped notice through a confluence of missing sanity checks,
including

* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype.  While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns.

* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers.  Add assertions there too.

* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist.  That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.

In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns.  This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.

In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot.  (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10.  In 9.6, projections work much differently and
we can't cheaply give them such an option.  The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.

In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table.  Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated.  Fix by preserving
resjunk columns in that routine.  (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)

Per report from Andres Freund.  Back-patch to all supported branches.

Security: CVE-2021-32028
2021-05-10 11:02:29 -04:00
Thomas Munro c2dc19342e Revert recovery prefetching feature.
This set of commits has some bugs with known fixes, but at this late
stage in the release cycle it seems best to revert and resubmit next
time, along with some new automated test coverage for this whole area.

Commits reverted:

dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery."
1d257577: Optionally prefetch referenced data in recovery.
f003d9f8: Add circular WAL decoding buffer.
323cbe7c: Remove read_page callback from XLogReader.

Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the
corresponding section of config.sgml is now reverted.

Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
2021-05-10 16:06:09 +12:00
Peter Geoghegan c9787385db Remove overzealous VACUUM visibility map assertion.
The all_visible_according_to_vm variable's value is inherently prone to
becoming invalidated concurrently, since it is set before we even
acquire a lock on a related heap page buffer.

Oversight in commit 7136bf34, which added the assertion in passing.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Tang <tanghy.fnst@fujitsu.com>
Diagnosed-By:: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDzgc8_MYrA5m1fyydomw_eVKtQiYh7sfDK4KEhdMsf_g@mail.gmail.com
2021-05-06 13:17:39 -07:00
Michael Paquier 4aba61b870 Add some forgotten LSN_FORMAT_ARGS() in xlogreader.c
6f6f284 has introduced a specific macro to make printf()-ing of LSNs
easier.  This takes care of what looks like the remaining code paths
that did not get the call.

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Tom Lane
Discussion: https://postgr.es/m/YIJS9x6K8ruizN7j@paquier.xyz
2021-04-24 09:09:02 +09:00
Michael Paquier 62aa2bb293 Remove use of [U]INT64_FORMAT in some translatable strings
%lld with (long long), or %llu with (unsigned long long) are more
adapted.  This is similar to 3286065.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20210421.200000.1462448394029407895.horikyota.ntt@gmail.com
2021-04-23 13:25:49 +09:00
Peter Eisentraut f0ec598b43 Fix typo 2021-04-21 08:07:37 +02:00
Tom Lane 783be78ca9 Improve WAL record descriptions for SP-GiST records.
While tracking down the bug fixed in the preceding commit, I got quite
annoyed by the low quality of spg_desc's output.  Add missing fields,
try to make the formatting consistent.
2021-04-20 17:01:49 -04:00
Peter Geoghegan 7136bf34f2 Document LP_DEAD accounting issues in VACUUM.
Document VACUUM's soft assumption that any LP_DEAD items encountered
during pruning will become LP_UNUSED items before VACUUM finishes up.
This is integral to the accounting used by VACUUM to generate its final
report on the table to the stats collector.  It also affects how VACUUM
determines which heap pages are truncatable.  In both cases VACUUM is
concerned with the likely contents of the page in the near future, not
the current contents of the page.

This state of affairs created the false impression that VACUUM's dead
tuple accounting had significant difference with similar accounting used
during ANALYZE.  There were and are no substantive differences, at least
when the soft assumption completely works out.  This is far clearer now.

Also document cases where things don't quite work out for VACUUM's dead
tuple accounting.  It's possible that a significant number of LP_DEAD
items will be left behind by VACUUM, and won't be recorded as remaining
dead tuples in VACUUM's statistics collector report.  This behavior
dates back to commit a96c41fe, which taught VACUUM to run without index
and heap vacuuming at the user's request.  The failsafe mechanism added
to VACUUM more recently by commit 1e55e7d1 takes the same approach to
dead tuple accounting.

Reported-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Jmtu18PrsYq3EvvZJGOmZqSO2u3bvKpx9xJa5uhNp=Q@mail.gmail.com
2021-04-19 18:55:31 -07:00
Michael Paquier 7ef8b52cf0 Fix typos and grammar in comments and docs
Author: Justin Pryzby
Discussion: https://postgr.es/m/20210416070310.GG3315@telsasoft.com
2021-04-19 11:32:30 +09:00
Peter Eisentraut f59b58e2a1 Use correct format placeholder for block numbers
Should be %u rather than %d.
2021-04-17 09:40:50 +02:00
Peter Eisentraut 07e5e66742 Improve quoting in some error messages 2021-04-14 09:11:29 +02:00
Peter Geoghegan 60f1f09ff4 Don't truncate heap when VACUUM's failsafe is in effect.
It seems like a good idea to bypass heap truncation when the wraparound
failsafe mechanism (which was added in commit 1e55e7d1) is in effect.

Deliberately don't bypass heap truncation in the INDEX_CLEANUP=off case,
even though it is similar to the failsafe case.  There is already a
separate reloption (and related VACUUM parameter) for that.

Reported-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDWRh6oTN5T8wa+cpZUVpHXET8BJ8Da7WHVHpwkPP6KLg@mail.gmail.com
2021-04-13 12:58:31 -07:00
Tom Lane 34f581c39e Avoid improbable PANIC during heap_update.
heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock.  The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set.  However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this.  The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.

This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock.  Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update.  However, it's
clear that it's heap_update not VACUUM that's at fault here.

What's less clear is whether there is any hazard from these bugs
in released branches.  heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug.  That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.

I chose to back-patch to v12.  Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.

Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 12:17:24 -04:00
Thomas Munro b1df6b696b Fix potential SSI hazard in heap_update().
Commit 6f38d4dac3 failed to heed a warning about the stability of the
value pointed to by "otid".  The caller is allowed to pass in a pointer to
newtup->t_self, which will be updated during the execution of the
function.  Instead, the SSI check should use the value we copy into
oldtup.t_self near the top of the function.

Not a live bug, because newtup->t_self doesn't really get updated until
a bit later, but it was confusing and broke the rule established by the
comment.

Back-patch to 13.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2689164.1618160085%40sss.pgh.pa.us
2021-04-13 13:02:56 +12:00
Fujii Masao 08aa89b326 Remove COMMIT_TS_SETTS record.
Commit 438fc4a39c prevented the WAL replay from writing
COMMIT_TS_SETTS record. By this change there is no code that
generates COMMIT_TS_SETTS record in PostgreSQL core.
Also we can think that there are no extensions using the record
because we've not received so far any complaints about the issue
that commit 438fc4a39c fixed. Therefore this commit removes
COMMIT_TS_SETTS record and its related code. Even without
this record, the timestamp required for commit timestamp feature
can be acquired from the COMMIT record.

Bump WAL page magic.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
2021-04-12 00:04:30 +09:00
Thomas Munro dc88460c24 Doc: Review for "Optionally prefetch referenced data in recovery."
Typos, corrections and language improvements in the docs, and a few in
code comments too.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
2021-04-10 08:21:53 +12:00
Peter Geoghegan 796092fb84 Silence another _bt_check_unique compiler warning.
Per complaint from Tom Lane

Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us
2021-04-08 12:54:31 -07:00
Thomas Munro 1d257577e0 Optionally prefetch referenced data in recovery.
Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro f003d9f872 Add circular WAL decoding buffer.
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros.

 * An alternative new interface XLogNextRecord() is added that returns
   pointers to DecodedXLogRecord structs that can be examined directly.

 * XLogReadAhead() provides a second cursor that lets you see
   further ahead, as long as data is available and there is enough space
   in the decoding buffer.  This returns DecodedXLogRecord pointers to the
   caller, but also adds them to a queue of records that will later be
   consumed by XLogNextRecord()/XLogReadRecord().

The buffer's size is controlled with wal_decode_buffer_size.  The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro 323cbe7c7d Remove read_page callback from XLogReader.
Previously, the XLogReader module would fetch new input data using a
callback function.  Redesign the interface so that it tells the caller
to insert more data with a special return value instead.  This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.

As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
2021-04-08 23:20:42 +12:00
Alvaro Herrera 0827e8af70
autovacuum: handle analyze for partitioned tables
Previously, autovacuum would completely ignore partitioned tables, which
is not good regarding analyze -- failing to analyze those tables means
poor plans may be chosen.  Make autovacuum aware of those tables by
propagating "changes since analyze" counts from the leaf partitions up
the partitioning hierarchy.

This also introduces necessary reloptions support for partitioned tables
(autovacuum_enabled, autovacuum_analyze_scale_factor,
autovacuum_analyze_threshold).  It's unclear how best to document this
aspect.

Author: Yuzuko Hosoya <yuzukohosoya@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
2021-04-08 01:19:36 -04:00
Peter Geoghegan 5100010ee4 Teach VACUUM to bypass unnecessary index vacuuming.
VACUUM has never needed to call ambulkdelete() for each index in cases
where there are precisely zero TIDs in its dead_tuples array by the end
of its first pass over the heap (also its only pass over the heap in
this scenario).  Index vacuuming is simply not required when this
happens.  Index cleanup will still go ahead, but in practice most calls
to amvacuumcleanup() are usually no-ops when there were zero preceding
ambulkdelete() calls.  In short, VACUUM has generally managed to avoid
index scans when there were clearly no index tuples to delete from
indexes.  But cases with _close to_ no index tuples to delete were
another matter -- a round of ambulkdelete() calls took place (one per
index), each of which performed a full index scan.

VACUUM now behaves just as if there were zero index tuples to delete in
cases where there are in fact "virtually zero" such tuples.  That is, it
can now bypass index vacuuming and heap vacuuming as an optimization
(though not index cleanup).  Whether or not VACUUM bypasses indexes is
determined dynamically, based on the just-observed number of heap pages
in the table that have one or more LP_DEAD items (LP_DEAD items in heap
pages have a 1:1 correspondence with index tuples that still need to be
deleted from each index in the worst case).

We only skip index vacuuming when 2% or less of the table's pages have
one or more LP_DEAD items -- bypassing index vacuuming as an
optimization must not noticeably impede setting bits in the visibility
map.  As a further condition, the dead_tuples array (i.e. VACUUM's array
of LP_DEAD item TIDs) must not exceed 32MB at the point that the first
pass over the heap finishes, which is also when the decision to bypass
is made.  (The VACUUM must also have been able to fit all TIDs in its
maintenance_work_mem-bound dead_tuples space, though with a default
maintenance_work_mem setting it can't matter.)

This avoids surprising jumps in the duration and overhead of routine
vacuuming with workloads where successive VACUUM operations consistently
have almost zero dead index tuples.  The number of LP_DEAD items may
well accumulate over multiple VACUUM operations, before finally the
threshold is crossed and VACUUM performs conventional index vacuuming.
Even then, the optimization will have avoided a great deal of largely
unnecessary index vacuuming.

In the future we may teach VACUUM to skip index vacuuming on a per-index
basis, using a much more sophisticated approach.  For now we only
consider the extreme cases, where we can be quite confident that index
vacuuming just isn't worth it using simple heuristics.

Also log information about how many heap pages have one or more LP_DEAD
items when autovacuum logging is enabled.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
2021-04-07 16:14:54 -07:00
Peter Geoghegan 1e55e7d175 Add wraparound failsafe to VACUUM.
Add a failsafe mechanism that is triggered by VACUUM when it notices
that the table's relfrozenxid and/or relminmxid are dangerously far in
the past.  VACUUM checks the age of the table dynamically, at regular
intervals.

When the failsafe triggers, VACUUM takes extraordinary measures to
finish as quickly as possible so that relfrozenxid and/or relminmxid can
be advanced.  VACUUM will stop applying any cost-based delay that may be
in effect.  VACUUM will also bypass any further index vacuuming and heap
vacuuming -- it only completes whatever remaining pruning and freezing
is required.  Bypassing index/heap vacuuming is enabled by commit
8523492d, which made it possible to dynamically trigger the mechanism
already used within VACUUM when it is run with INDEX_CLEANUP off.

It is expected that the failsafe will almost always trigger within an
autovacuum to prevent wraparound, long after the autovacuum began.
However, the failsafe mechanism can trigger in any VACUUM operation.
Even in a non-aggressive VACUUM, where we're likely to not advance
relfrozenxid, it still seems like a good idea to finish off remaining
pruning and freezing.   An aggressive/anti-wraparound VACUUM will be
launched immediately afterwards.  Note that the anti-wraparound VACUUM
that follows will itself trigger the failsafe, usually before it even
begins its first (and only) pass over the heap.

The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and
vacuum_multixact_failsafe_age.  There are no equivalent reloptions,
since that isn't expected to be useful.  The GUCs have rather high
defaults (both default to 1.6 billion), and are expected to generally
only be used to make the failsafe trigger sooner/more frequently.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
2021-04-07 12:37:45 -07:00
Peter Geoghegan 3c3b8a4b26 Truncate line pointer array during VACUUM.
Teach VACUUM to truncate the line pointer array of each heap page when a
contiguous group of LP_UNUSED line pointers appear at the end of the
array -- these unused and unreferenced items are excluded.  This process
occurs during VACUUM's second pass over the heap, right after LP_DEAD
line pointers on the page (those encountered/pruned during the first
pass) are marked LP_UNUSED.

Truncation avoids line pointer bloat with certain workloads,
particularly those involving continual range DELETEs and bulk INSERTs
against the same table.

Also harden heapam code to check for an out-of-range page offset number
in places where we weren't already doing so.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
2021-04-07 08:47:15 -07:00
Tomas Vondra 23607a8156 Don't add non-existent pages to bitmap from BRIN
The code in bringetbitmap() simply added the whole matching page range
to the TID bitmap, as determined by pages_per_range, even if some of the
pages were beyond the end of the heap. The query then might fail with
an error like this:

  ERROR:  could not open file "base/20176/20228.2" (target block
          262144): previous segment is only 131021 blocks

In this case, the relation has 262093 pages (131072 and 131021 pages),
but we're trying to acess block 262144, i.e. first block of the 3rd
segment. At that point _mdfd_getseg() notices the preceding segment is
incomplete, and fails.

Hitting this in practice is rather unlikely, because:

* Most indexes use power-of-two ranges, so segments and page ranges
  align perfectly (segment end is also a page range end).

* The table size has to be just right, with the last segment being
  almost full - less than one page range from full segment, so that the
  last page range actually crosses the segment boundary.

* Prefetch has to be enabled. The regular page access checks that
  pages are not beyond heap end, but prefetch does not. On older
  releases (before 12) the execution stops after hitting the first
  non-existent page, so the prefetch distance has to be sufficient
  to reach the first page in the next segment to trigger the issue.
  Since 12 it's enough to just have prefetch enabled, the prefetch
  distance does not matter.

Fixed by not adding non-existent pages to the TID bitmap. Backpatch
all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that
release is EOL).

Backpatch-through: 9.6
2021-04-07 15:58:36 +02:00
Heikki Linnakangas d92b1cdbab Revert "Add sortsupport for gist_btree opclasses, for faster index builds."
This reverts commit 9f984ba6d2.

It was making the buildfarm unhappy, apparently setting client_min_messages
in a regression test produces different output if log_statement='all'.
Another issue is that I now suspect the bit sortsupport function was in
fact not correct to call byteacmp(). Revert to investigate both of those
issues.
2021-04-07 14:33:21 +03:00
Heikki Linnakangas 9f984ba6d2 Add sortsupport for gist_btree opclasses, for faster index builds.
Commit 16fa9b2b30 introduced a faster way to build GiST indexes, by
sorting all the data. This commit adds the sortsupport functions needed
to make use of that feature for btree_gist.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru
2021-04-07 13:22:05 +03:00
Michael Paquier 4c0239cb7a Remove redundant memset(0) calls for page init of some index AMs
Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the
contents of a page, and this routine fills entirely a page with zeros
for a size of BLCKSZ, including the special space.  Those index AMs have
been using an extra memset() call to fill with zeros the special page
space, or even the whole page, which is not necessary as PageInit()
already does this work, so let's remove them.  GiST was not doing this
extra call, but has commented out a system call that did so since
6236991.

While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care
of that.  This makes the whole page initialization logic more consistent
across all index AMs.

Author: Bharath Rupireddy
Reviewed-by: Vignesh C, Mahendra Singh Thalor
Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com
2021-04-07 14:35:26 +09:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Peter Geoghegan 7ab96cf6b3 Refactor lazy_scan_heap() loop.
Add a lazy_scan_heap() subsidiary function that handles heap pruning and
tuple freezing: lazy_scan_prune().  This is a great deal cleaner.  The
code that remains in lazy_scan_heap()'s per-block loop can now be
thought of as code that either comes before or after the call to
lazy_scan_prune(), which is now the clear focal point.  This division is
enforced by the way in which we now manage state.  lazy_scan_prune()
outputs state (using its own struct) that describes what to do with the
page following pruning and freezing (e.g., visibility map maintenance,
recording free space in the FSM).  It doesn't get passed any special
instructional state from the preamble code, though.

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off
from the logic used by single-heap-pass VACUUMs.  The former case is now
structured as the omission of index and heap vacuuming by a two pass
VACUUM.  The latter case goes back to being used only when the table
happens to have no indexes (just as it was before commit a96c41fe).
This structure is much more natural, since the whole point of
INDEX_CLEANUP=off is to skip the index and heap vacuuming that would
otherwise take place.  The single-heap-pass case doesn't skip any useful
work, though -- it just does heap pruning and heap vacuuming together
when the table happens to have no indexes.

Both of these changes are preparation for an upcoming patch that
generalizes the mechanism used by INDEX_CLEANUP=off.  The later patch
will allow VACUUM to give up on index and heap vacuuming dynamically, as
problems emerge (e.g., with wraparound), so that an affected VACUUM
operation can finish up as soon as possible.

Also fix a very old bug in single-pass VACUUM VERBOSE output.  We were
reporting the number of tuples deleted via pruning as a direct
substitute for reporting the number of LP_DEAD items removed in a
function that deals with the second pass over the heap.  But that
doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered
during pruning, and use that in the report instead.  A single pass
VACUUM will always vacuum away whatever LP_DEAD items a heap page has
immediately after it is pruned, so the total number of LP_DEAD items
encountered during pruning equals the total number vacuumed-away.
(They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay
because skipping index vacuuming is now a totally orthogonal concept to
one-pass VACUUM.)

Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE
output.  This makes the output of VACUUM VERBOSE more consistent with
log_autovacuum's output (because it never showed information about
LP_UNUSED items).  VACUUM VERBOSE reported LP_UNUSED items left behind
by the last VACUUM, and LP_UNUSED items created via pruning HOT chains
during the current VACUUM (it never included LP_UNUSED items left behind
by the current VACUUM's second pass over the heap).  This makes it
useless as an indicator of line pointer bloat, which must have been the
original intention. (Like the first VACUUM VERBOSE issue, this issue was
arguably an oversight in commit 282d2a03, which added the heap-only
tuple optimization.)

Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start
reporting pages_removed instead.  This also makes the output of VACUUM
VERBOSE more consistent with log_autovacuum's output (which does not
show empty_pages, but does show pages_removed).  An empty page isn't
meaningfully different to a page that is almost empty, or a page that is
empty but for only a small number of remaining LP_UNUSED items.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 07:49:39 -07:00
Tom Lane 091e22b2e6 Clean up treatment of missing default and CHECK-constraint records.
Andrew Gierth reported that it's possible to crash the backend if no
pg_attrdef record is found to match an attribute that has atthasdef set.
AttrDefaultFetch warns about this situation, but then leaves behind
a relation tupdesc that has null "adbin" pointer(s), which most places
don't guard against.

We considered promoting the warning to an error, but throwing errors
during relcache load is pretty drastic: it effectively locks one out
of using the relation at all.  What seems better is to leave the
load-time behavior as a warning, but then throw an error in any code
path that wants to use a default and can't find it.  This confines
the error to a subset of INSERT/UPDATE operations on the table, and
in particular will at least allow a pg_dump to succeed.

Also, we should fix AttrDefaultFetch to not leave any null pointers
in the tupdesc, because that just creates an untested bug hazard.

While at it, apply the same philosophy of "warn at load, throw error
only upon use of the known-missing info" to CHECK constraints.
CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch,
but for reasons lost in the mists of time, it was throwing ERROR for
the same cases that AttrDefaultFetch treats as WARNING.  Make the two
functions more nearly alike.

In passing, get rid of potentially-O(N^2) loops in equalTupleDesc
by making AttrDefaultFetch sort the entries after fetching them,
so that equalTupleDesc can assume that entries in two equal tupdescs
must be in matching order.  (CheckConstraintFetch already was sorting
CHECK constraints, but equalTupleDesc hadn't been told about it.)

There's some argument for back-patching this, but with such a small
number of field reports, I'm content to fix it in HEAD.

Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk
2021-04-06 10:34:39 -04:00
Fujii Masao 9de9294b0c Stop archive recovery if WAL generated with wal_level=minimal is found.
Previously if hot standby was enabled, archive recovery exited with
an error when it found WAL generated with wal_level=minimal.
But if hot standby was disabled, it just reported a warning and
continued in that case. Which could lead to data loss or errors
during normal operation. A warning was emitted, but users could
easily miss that and not notice this serious situation until
they encountered the actual errors.

To improve this situation, this commit changes archive recovery
so that it exits with FATAL error when it finds WAL generated with
wal_level=minimal whatever the setting of hot standby. This enables
users to notice the serious situation soon.

The FATAL error is thrown if archive recovery starts from a base
backup taken before wal_level is changed to minimal. When archive
recovery exits with the error, if users have a base backup taken
after setting wal_level to higher than minimal, they can recover
the database by starting archive recovery from that newer backup.
But note that if such backup doesn't exist, there is no easy way to
complete archive recovery, which may make the database server
unstartable and users may lose whole database. The commit adds
the note about this risk into the document.

Even in the case of unstartable database server, previously by just
disabling hot standby users could avoid the error during archive
recovery, forcibly start up the server and salvage data from it.
But note that this commit makes this procedure unavailable at all.

Author: Takamichi Osumi
Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao
Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
2021-04-06 22:56:51 +09:00
Peter Geoghegan f6b8f19a08 Allocate access strategy in parallel VACUUM workers.
Commit 49f49def took entirely the wrong approach to fixing this issue.
Just allocate a local buffer access strategy in each individual worker
instead of trying to propagate state.  This state was never propagated
by parallel VACUUM in the first place.

It looks like the only reason that this worked following commit 40d964ec
was that it involved static global variables, which are initialized to 0
per the C standard.

A more comprehensive fix may be necessary, even on HEAD.  This fix
should at least get the buildfarm green once again.

Thanks once again to Thomas Munro for continued off-list assistance with
the issue.
2021-04-05 17:17:40 -07:00
Tom Lane 09c1c6ab4b Support INCLUDE'd columns in SP-GiST.
Not much to say here: does what it says on the tin.
We steal a previously-always-zero bit from the nextOffset
field of leaf index tuples in order to track whether there
is a nulls bitmap.  Otherwise it works about like included
columns in other index types.

Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova,
and rather heavily editorialized on by me

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-05 18:41:21 -04:00
Peter Geoghegan 49f49defe7 Propagate parallel VACUUM's buffer access strategy.
Parallel VACUUM relied on global variable state from the leader process
being propagated to workers on fork().  Commit b4af70cb removed most
uses of global variables inside vacuumlazy.c, but did not account for
the buffer access strategy state.

To fix, propagate the state through shared memory instead.

Per buildfarm failures on elver, curculio, and morepork.

Many thanks to Thomas Munro for off-list assistance with this issue.
2021-04-05 14:56:56 -07:00
Peter Geoghegan b4af70cb21 Simplify state managed by VACUUM.
Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand.  Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead.  Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output.  We were using double, but that doesn't seem to have any
advantages.  Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap.  These assertions will be added in later
commits.

Finally, adjust the signatures of functions with IndexBulkDeleteResult
pointer arguments in cases where there was ambiguity about whether or
not the argument relates to a single index or all indexes.  Functions
now use the idiom that both ambulkdelete() and amvacuumcleanup() have
always used (where appropriate): accept a mutable IndexBulkDeleteResult
pointer argument, and return a result IndexBulkDeleteResult pointer to
caller.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
2021-04-05 13:21:44 -07:00
Tom Lane dfc843d465 Fix more confusion in SP-GiST.
spg_box_quad_leaf_consistent unconditionally returned the leaf
datum as leafValue, even though in its usage for poly_ops that
value is of completely the wrong type.

In versions before 12, that was harmless because the core code did
nothing with leafValue in non-index-only scans ... but since commit
2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would
unconditionally try to copy the value using the wrong datatype
parameters.  Said copying is a waste of time and space if we're not
going to return the data, but it accidentally failed to fail until
I fixed the datatype confusion in ac9099fc1.

Hence, change spgNewHeapItem to not copy the datum unless we're
actually going to return it later.  This saves cycles and dodges
the question of whether lossy opclasses are returning the right
type.  Also change spg_box_quad_leaf_consistent to not return
data that might be of the wrong type, as insurance against
somebody introducing a similar bug into the core code in future.

It seems like a good idea to back-patch these two changes into
v12 and v13, although I'm afraid to change spgNewHeapItem's
mistaken idea of which datatype to use in those branches.

Per buildfarm results from ac9099fc1.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 17:57:07 -04:00
Tom Lane ac9099fc1d Fix confusion in SP-GiST between attribute type and leaf storage type.
According to the documentation, the attType passed to the opclass
config function (and also relied on by the core code) is the type
of the heap column or expression being indexed.  But what was
actually being passed was the type stored for the index column.
This made no difference for user-defined SP-GiST opclasses,
because we weren't allowing the STORAGE clause of CREATE OPCLASS
to be used, so the two types would be the same.  But it's silly
not to allow that, seeing that the built-in poly_ops opclass
has a different value for opckeytype than opcintype, and that if you
want to do lossy storage then the types must really be different.
(Thus, user-defined opclasses doing lossy storage had to lie about
what type is in the index.)  Hence, remove the restriction, and make
sure that we use the input column type not opckeytype where relevant.

For reasons of backwards compatibility with existing user-defined
opclasses, we can't quite insist that the specified leafType match
the STORAGE clause; instead just add an amvalidate() warning if
they don't match.

Also fix some bugs that would only manifest when trying to return
index entries when attType is different from attLeafType.  It's not
too surprising that these have not been reported, because the only
usual reason for such a difference is to store the leaf value
lossily, rendering index-only scans impossible.

Add a src/test/modules module to exercise cases where attType is
different from attLeafType and yet index-only scan is supported.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 14:28:57 -04:00
Tomas Vondra d9c5b9a9ee Fix bug in brin_minmax_multi_union
When calling sort_expanded_ranges() we need to remember the return
value, because the function sorts and also deduplicates the ranges. So
the number of ranges may decrease. brin_minmax_multi_union failed to do
that, which resulted in crashes due to bogus ranges (equal minval/maxval
but not marked as compacted).

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to
2021-04-04 19:36:12 +02:00
Tomas Vondra 1dad2a5ea3 Fix order of parameters in BRIN minmax-multi calls
The BRIN minmax-multi consistent function incorrectly assumed it can
lookup an operator, and then swap the arguments to get the commutator.
For example <(a,b) would be called as <(b,a) to get >(a,b). This works
when the arguments are of the same type, but with cross-type opclasses
this fails. We can't swap <(float4,float8) arguments, for example.

Fixed by passing arguments in the right order.

Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:25:41 +02:00
Tomas Vondra e1fbe1181c Fix BRIN minmax-multi distance for inet type
The distance calculation ignored the mask, unlike the inet comparator,
which resulted in negative distance in some cases. Fixed by applying the
mask in brin_minmax_multi_distance_inet. I've considered simply calling
inetmi() to calculate the delta, but that does not consider mask either.

Reviewed-by: Zhihong Yu
Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com
2021-04-04 19:23:32 +02:00
Tomas Vondra 7262f2421a Fix BRIN minmax-multi distance for timetz type
The distance calculation ignored the time zone, so the result of (b-a)
might have ended negative even if (b > a). Fixed by considering the time
zone difference.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:22:23 +02:00
Tomas Vondra 2b10e0e3c2 Fix BRIN minmax-multi distance for interval type
The distance calculation for interval type was treating months as having
31 days, which is inconsistent with the interval comparator (using 30
days). Due to this it was possible to get negative distance (b-a) when
(a<b), trigerring an assert.

Fixed by adopting the same logic as interval_cmp_value.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com
2021-04-04 19:19:51 +02:00
Tom Lane 1ebdec8c03 Rethink handling of pass-by-value leaf datums in SP-GiST.
The existing convention in SP-GiST is that any pass-by-value datatype
is stored in Datum representation, i.e. it's of width sizeof(Datum)
even when typlen is less than that.  This is okay, or at least it's
too late to change it, for prefix datums and node-label datums in inner
(upper) tuples.  But it's problematic for leaf datums, because we'd
prefer those to be stored in Postgres' standard on-disk representation
so that we can easily extend leaf tuples to carry additional "included"
columns.

I believe, however, that we can get away with just up and changing that.
This would be an unacceptable on-disk-format break, but there are two
big mitigating factors:

1. It seems quite unlikely that there are any SP-GiST opclasses out
there that use pass-by-value leaf datatypes.  Certainly none of the
ones in core do, nor has codesearch.debian.net heard of any.  Given
what SP-GiST is good for, it's hard to conceive of a use-case where
the leaf-level values would be both small and fixed-width.  (As an
example, if you wanted to index text values with the leaf level being
just a byte, then every text string would have to be represented
with one level of inner tuple per preceding byte, which would be
horrendously space-inefficient and slow to access.  You always want
to use as few inner-tuple levels as possible, leaving as much as
possible in the leaf values.)

2. Even granting that you have such an index, this change only
breaks things on big-endian machines.  On little-endian, the high
order bytes of the Datum format will now just appear to be alignment
padding space.

So, change the code to store pass-by-value leaf datums in their
usual on-disk form.  Inner-tuple datums are not touched.

This is extracted from a larger patch that intends to add support for
"included" columns.  I'm committing it separately for visibility in
our commit logs.

Pavel Borisov and Tom Lane, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-01 17:55:17 -04:00
Noah Misch 0ff8bbdee1 Accept slightly-filled pages for tuples larger than fillfactor.
We always inserted a larger-than-fillfactor tuple into a newly-extended
page, even when existing pages were empty or contained nothing but an
unused line pointer.  This was unnecessary relation extension.  Start
tolerating page usage up to 1/8 the maximum space that could be taken up
by line pointers.  This is somewhat arbitrary, but it should allow more
cases to reuse pages.  This has no effect on tables with fillfactor=100
(the default).

John Naylor and Floris van Nee.  Reviewed by Matthias van de Meent.
Reported by Floris van Nee.

Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
2021-03-30 18:53:44 -07:00
David Rowley af527705ed Adjust design of per-worker parallel seqscan data struct
The design of the data structures which allow storage of the per-worker
memory during parallel seq scans were not ideal. The work done in
56788d215 required an additional data structure to allow workers to
remember the range of pages that had been allocated to them for
processing during a parallel seqscan.  That commit added a void pointer
field to TableScanDescData to allow heapam to store the per-worker
allocation information.  However putting the field there made very little
sense given that we have AM specific structs for that, e.g.
HeapScanDescData.

Here we remove the void pointer field from TableScanDescData and add a
dedicated field for this purpose to HeapScanDescData.

Previously we also allocated memory for this parallel per-worker data for
all scans, regardless if it was a parallel scan or not.  This was just a
wasted allocation for non-parallel scans, so here we make the allocation
conditional on the scan being parallel.

Also, add previously missing pfree() to free the per-worker data in
heap_endscan().

Reported-by: Andres Freund
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de
2021-03-30 10:17:09 +13:00
Tomas Vondra 73b96bad4a Fix alignment in BRIN minmax-multi deserialization
The deserialization failed to ensure correct alignment, as it assumed it
can simply point into the serialized value. The serialization however
ignores alignment and copies just the significant bytes in order to make
the result as small as possible. This caused failures on systems that
are sensitive to mialigned addresses, like sparc, or with address
sanitizer enabled.

Fixed by copying the serialized data to ensure proper alignment. While
at it, fix an issue with serialization on big endian machines, using the
same store_att_byval/fetch_att trick as extended statistics.

Discussion: https://postgr.es/0c8c3304-d3dd-5e29-d5ac-b50589a23c8c%40enterprisedb.com
2021-03-26 16:48:36 +01:00
Tomas Vondra ab596105b5 BRIN minmax-multi indexes
Adds BRIN opclasses similar to the existing minmax, except that instead
of summarizing the page range into a single [min,max] range, the summary
consists of multiple ranges and/or points, allowing gaps. This allows
more efficient handling of data with poor correlation to physical
location within the table and/or outlier values, for which the regular
minmax opclassed tend to work poorly.

It's possible to specify the number of values kept for each page range,
either as a single point or an interval boundary.

  CREATE TABLE t (a int);
  CREATE INDEX ON t
   USING brin (a int4_minmax_multi_ops(values_per_range=16));

When building the summary, the values are combined into intervals with
the goal to minimize the "covering" (sum of interval lengths), using a
support procedure computing distance between two values.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
2021-03-26 13:54:30 +01:00
Tomas Vondra 77b88cd1bb BRIN bloom indexes
Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes
using the new opclasses allow only equality queries (similar to hash
indexes), but that works fine for data like UUID, MAC addresses etc. for
which range queries are not very common. This also means the indexes
work for data that is not well correlated to physical location within
the table, or perhaps even entirely random (which is a common issue with
existing BRIN minmax opclasses).

It's possible to specify opclass parameters with the usual Bloom filter
parameters, i.e. the desired false-positive rate and the expected number
of distinct values per page range.

  CREATE TABLE t (a int);
  CREATE INDEX ON t
   USING brin (a int4_bloom_ops(false_positive_rate = 0.05,
                                n_distinct_per_range = 100));

The opclasses do not operate on the indexed values directly, but compute
a 32-bit hash first, and the Bloom filter is built on the hash value.
Collisions should not be a huge issue though, as the number of distinct
values in a page ranges is usually fairly small.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: Nico Williams <nico@cryptonector.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
2021-03-26 13:35:32 +01:00
Tomas Vondra a681e3c107 Support the old signature of BRIN consistent function
Commit a1c649d889 changed the signature of the BRIN consistent function
by adding a new required parameter.  Treating the parameter as optional,
which would make the change backwards incompatibile, was rejected with
the justification that there are few out-of-core extensions, so it's not
worth adding making the code more complex, and it's better to deal with
that in the extension.

But after further thought, that would be rather problematic, because
pg_upgrade simply dumps catalog contents and the same version of an
extension needs to work on both PostgreSQL versions. Supporting both
variants of the consistent function (with 3 or 4 arguments) makes that
possible.

The signature is not the only thing that changed, as commit 72ccf55cb9
moved handling of IS [NOT] NULL keys from the support procedures. But
this change is backward compatible - handling the keys in exension is
unnecessary, but harmless. The consistent function will do a bit of
unnecessary work, but it should be very cheap.

This also undoes most of the changes to the existing opclasses (minmax
and inclusion), making them use the old signature again. This should
make backpatching simpler.

Catversion bump, because of changes in pg_amproc.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-26 13:17:58 +01:00
Robert Haas 5db1fd7823 Fix interaction of TOAST compression with expression indexes.
Before, trying to compress a value for insertion into an expression
index would crash.

Dilip Kumar, with some editing by me. Report by Jaime Casanova.

Discussion: http://postgr.es/m/CAJKUy5gcs0zGOp6JXU2mMVdthYhuQpFk=S3V8DOKT=LZC1L36Q@mail.gmail.com
2021-03-25 19:55:32 -04:00
Michael Paquier a1999a01bb Sanitize the term "combo CID" in code comments
Combo CIDs were referred in the code comments using different terms
across various places of the code, so unify a bit the term used with
what is currently in use in some of the READMEs.

Author: "Hou, Zhijie"
Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
2021-03-25 16:08:03 +09:00
Fujii Masao 438fc4a39c Fix bug in WAL replay of COMMIT_TS_SETTS record.
Previously the WAL replay of COMMIT_TS_SETTS record called
TransactionTreeSetCommitTsData() with the argument write_xlog=true,
which generated and wrote new COMMIT_TS_SETTS record.
This should not be acceptable because it's during recovery.

This commit fixes the WAL replay of COMMIT_TS_SETTS record
so that it calls TransactionTreeSetCommitTsData() with write_xlog=false
and doesn't generate new WAL during recovery.

Back-patch to all supported branches.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
2021-03-25 11:23:30 +09:00
Peter Eisentraut 37c99d304d Fix stray double semicolons
Reported-by: John Naylor <john.naylor@enterprisedb.com>
2021-03-24 20:42:51 +01:00
Robert Haas e5595de03e Tidy up more loose ends related to configurable TOAST compression.
Change the default_toast_compression GUC to be an enum rather than
a string. Earlier, uncommitted versions of the patch supported using
CREATE ACCESS METHOD to add new compression methods to a running
system, but that idea was dropped before commit. So, we can simplify
the GUC handling as well, which has the nice side effect of improving
the error messages.

While updating the documentation to reflect the new GUC type, also
move it back to the right place in the list. I moved this while
revising what became commit 24f0e395ac,
but apparently the intended ordering is "alphabetical" rather than
"whatever Robert thinks looks nice."

Rejigger things to avoid having access/toast_compression.h depend on
utils/guc.h, so that we don't end up with every file that includes
it also depending on something largely unrelated. Move a few
inline functions back into the C source file partly to help reduce
dependencies and partly just to avoid clutter. A few very minor
cosmetic fixes.

Original patch by Justin Pryzby, but very heavily edited by me,
and reverse reviewed by him and also reviewed by by Tom Lane.

Discussion: http://postgr.es/m/CA+TgmoYp=GT_ztUCeZg2i4hkHAQv8o=-nVJ1-TKWTG1zQOmOpg@mail.gmail.com
2021-03-24 12:36:08 -04:00
Amit Kapila 26acb54a13 Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."
To allow inserts in parallel-mode this feature has to ensure that all the
constraints, triggers, etc. are parallel-safe for the partition hierarchy
which is costly and we need to find a better way to do that. Additionally,
we could have used existing cached information in some cases like indexes,
domains, etc. to determine the parallel-safety.

List of commits reverted, in reverse chronological order:

ed62d3737c Doc: Update description for parallel insert reloption.
c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f.
e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f.
e4e87a32cc Fix valgrind issue in commit 05c8482f7f.
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".

Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
2021-03-24 11:29:15 +05:30
Michael Paquier 99dd75fb99 Reword slightly logs generated for index stats in autovacuum
Using "remain" is confusing, as it implies that the index file can
shrink.  Instead, use "in total".

Per discussion with Peter Geoghegan.

Discussion: https://postgr.es/m/CAH2-WzkYgHZzpGOwR14CScJsjaQpvJrEkEfkh_=wGhzLb=yVdQ@mail.gmail.com
2021-03-24 09:36:03 +09:00
Peter Geoghegan 5b861baa55 nbtree VACUUM: Cope with buggy opclasses.
Teach nbtree VACUUM to press on with vacuuming in the event of a page
deletion attempt that fails to "re-find" a downlink for its child/target
page.

There is no good reason to treat this as an irrecoverable error.  But
there is a good reason not to: pressing on at this point removes any
question of VACUUM not making progress solely due to misbehavior from
user-defined operator class code.

Discussion: https://postgr.es/m/CAH2-Wzma5G9CTtMjbrXTwOym+U=aWg-R7=-htySuztgoJLvZXg@mail.gmail.com
2021-03-23 16:09:51 -07:00
Tom Lane 9d523119fd Avoid possible crash while finishing up a heap rewrite.
end_heap_rewrite was not careful to ensure that the target relation
is open at the smgr level before performing its final smgrimmedsync.
In ordinary cases this is no problem, because it would have been
opened earlier during the rewrite.  However a crash can be reproduced
by re-clustering an empty table with CLOBBER_CACHE_ALWAYS enabled.

Although that exact scenario does not crash in v13, I think that's
a chance result of unrelated planner changes, and the problem is
likely still reachable with other test cases.  The true proximate
cause of this failure is commit c6b92041d, which replaced a call to
heap_sync (which was careful about opening smgr) with a direct call
to smgrimmedsync.  Hence, back-patch to v13.

Amul Sul, per report from Neha Sharma; cosmetic changes
and test case by me.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
2021-03-23 11:24:16 -04:00
Michael Paquier 5aed6a1fc2 Add per-index stats information in verbose logs of autovacuum
Once a relation's autovacuum is completed, the logs include more
information about this relation state if the threshold of
log_autovacuum_min_duration (or its relation option) is reached, with
for example contents about the statistics of the VACUUM operation for
the relation, WAL and system usage.

This commit adds more information about the statistics of the relation's
indexes, with one line of logs generated for each index.  The index
stats were already calculated, but not printed in the context of
autovacuum yet.  While on it, some refactoring is done to keep track of
the index statistics directly within LVRelStats, simplifying some
routines related to parallel VACUUMs.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAy6SxHiTivh5yAPJSUE4S=QRPpSZUdafOSz0R+fRcM6Q@mail.gmail.com
2021-03-23 13:25:14 +09:00
Bruce Momjian 95d77149c5 Add macro RelationIsPermanent() to report relation permanence
Previously, to check relation permanence, the Relation's Form_pg_class
structure member relpersistence was compared to the value
RELPERSISTENCE_PERMANENT ("p"). This commit adds the macro
RelationIsPermanent() and is used in appropirate places to simplify the
code.  This matches other RelationIs* macros.

This macro will be used in more places in future cluster file encryption
patches.

Discussion: https://postgr.es/m/20210318153134.GH20766@tamriel.snowman.net
2021-03-22 20:23:52 -04:00
Tomas Vondra 8e4b332e88 Optimize allocations in bringetbitmap
The bringetbitmap function allocates memory for various purposes, which
may be quite expensive, depending on the number of scan keys. Instead of
allocating them separately, allocate one bit chunk of memory an carve it
into smaller pieces as needed - all the pieces have the same lifespan,
and it saves quite a bit of CPU and memory overhead.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:47:09 +01:00
Tomas Vondra 72ccf55cb9 Move IS [NOT] NULL handling from BRIN support functions
The handling of IS [NOT] NULL clauses is independent of an opclass, and
most of the code was exactly the same in both minmax and inclusion. So
instead move the code from support procedures to the AM.

This simplifies the code - especially the support procedures - quite a
bit, as they don't need to care about NULL values and flags at all. It
also means the IS [NOT] NULL clauses can be evaluated without invoking
the support procedure.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:45:42 +01:00
Tomas Vondra a1c649d889 Pass all scan keys to BRIN consistent function at once
This commit changes how we pass scan keys to BRIN consistent function.
Instead of passing them one by one, we now pass all scan keys for a
given attribute at once. That makes the consistent function a bit more
complex, as it has to loop through the keys, but it does allow more
elaborate opclasses that can use multiple keys to eliminate ranges much
more effectively.

The existing BRIN opclasses (minmax, inclusion) don't really benefit
from this change. The primary purpose is to allow future opclases to
benefit from seeing all keys at once.

This does change the BRIN API, because the signature of the consistent
function changes (a new parameter with number of scan keys). So this
breaks existing opclasses, and will require supporting two variants of
the code for different PostgreSQL versions. We've considered supporting
two variants of the consistent, but we've decided not to do that.
Firstly, there's another patch that moves handling of NULL values from
the opclass, which means the opclasses need to be updated anyway.
Secondly, we're not aware of any out-of-core BRIN opclasses, so it does
not seem worth the extra complexity.

Bump catversion, because of pg_proc changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:45:03 +01:00
Tom Lane 063dd37ebc Short-circuit slice requests that are for more than the object's size.
substring(), and perhaps other callers, isn't careful to pass a
slice length that is no more than the datum's true size.  Since
toast_decompress_datum_slice's children will palloc the requested
slice length, this can waste memory.  Also, close study of the liblz4
documentation suggests that it is dependent on the caller to not ask
for more than the correct amount of decompressed data; this squares
with observed misbehavior with liblz4 1.8.3.  Avoid these problems
by switching to the normal full-decompression code path if the
slice request is >= datum's decompressed size.

Tom Lane and Dilip Kumar

Discussion: https://postgr.es/m/507597.1616370729@sss.pgh.pa.us
2021-03-22 14:01:20 -04:00
Tom Lane aeb1631ed2 Mostly-cosmetic adjustments of TOAST-related macros.
The authors of bbe0a81db hadn't quite got the idea that macros named
like SOMETHING_4B_C were only meant for internal endianness-related
details in postgres.h.  Choose more legible names for macros that are
intended to be used elsewhere.  Rearrange postgres.h a bit to clarify
the separation between those internal macros and ones intended for
wider use.

Also, avoid using the term "rawsize" for true decompressed size;
we've used "extsize" for that, because "rawsize" generally denotes
total Datum size including header.  This choice seemed particularly
unfortunate in tests that were comparing one of these meanings to
the other.

This patch includes a couple of not-purely-cosmetic changes: be
sure that the shifts aligning compression methods are unsigned
(not critical today, but will be when compression method 2 exists),
and fix broken definition of VARATT_EXTERNAL_GET_COMPRESSION (now
VARATT_EXTERNAL_GET_COMPRESS_METHOD), whose callers worked only
accidentally.

Discussion: https://postgr.es/m/574197.1616428079@sss.pgh.pa.us
2021-03-22 13:43:10 -04:00
Michael Paquier 595b9cba2a Fix timeline assignment in checkpoints with 2PC transactions
Any transactions found as still prepared by a checkpoint have their
state data read from the WAL records generated by PREPARE TRANSACTION
before being moved into their new location within pg_twophase/.  While
reading such records, the WAL reader uses the callback
read_local_xlog_page() to read a page, that is shared across various
parts of the system.  This callback, since 1148e22a, has introduced an
update of ThisTimeLineID when reading a record while in recovery, which
is potentially helpful in the context of cascading WAL senders.

This update of ThisTimeLineID interacts badly with the checkpointer if a
promotion happens while some 2PC data is read from its record, as, by
changing ThisTimeLineID, any follow-up WAL records would be written to
an timeline older than the promoted one.  This results in consistency
issues.  For instance, a subsequent server restart would cause a failure
in finding a valid checkpoint record, resulting in a PANIC, for
instance.

This commit changes the code reading the 2PC data to reset the timeline
once the 2PC record has been read, to prevent messing up with the static
state of the checkpointer.  It would be tempting to do the same thing
directly in read_local_xlog_page().  However, based on the discussion
that has led to 1148e22a, users may rely on the updates of
ThisTimeLineID when a WAL record page is read in recovery, so changing
this callback could break some cases that are working currently.

A TAP test reproducing the issue is added, relying on a PITR to
precisely trigger a promotion with a prepared transaction still
tracked.

Per discussion with Heikki Linnakangas, Kyotaro Horiguchi, Fujii Masao
and myself.

Author: Soumyadeep Chakraborty, Jimmy Yih, Kevin Yeap
Discussion: https://postgr.es/m/CAE-ML+_EjH_fzfq1F3RJ1=XaaNG=-Jz-i3JqkNhXiLAsM3z-Ew@mail.gmail.com
Backpatch-through: 10
2021-03-22 08:30:53 +09:00
Peter Geoghegan 9dd963ae25 Recycle nbtree pages deleted during same VACUUM.
Maintain a simple array of metadata about pages that were deleted during
nbtree VACUUM's current btvacuumscan() call.  Use this metadata at the
end of btvacuumscan() to attempt to place newly deleted pages in the FSM
without further delay.  It might not yet be safe to place any of the
pages in the FSM by then (they may not be deemed recyclable), but we
have little to lose and plenty to gain by trying.  In practice there is
a very good chance that this will work out when vacuuming larger
indexes, where scanning the index naturally takes quite a while.

This commit doesn't change the page recycling invariants; it merely
improves the efficiency of page recycling within the confines of the
existing design.  Recycle safety is a part of nbtree's implementation of
what Lanin & Shasha call "the drain technique".  The design happens to
use transaction IDs (they're stored in deleted pages), but that in
itself doesn't align the cutoff for recycle safety to any of the
XID-based cutoffs used by VACUUM (e.g., OldestXmin).  All that matters
is whether or not _other_ backends might be able to observe various
inconsistencies in the tree structure (that they cannot just detect and
recover from by moving right).  Recycle safety is purely a question of
maintaining the consistency (or the apparent consistency) of a physical
data structure.

Note that running a simple serial test case involving a large range
DELETE followed by a VACUUM VERBOSE will probably show that any newly
deleted nbtree pages are not yet reusable/recyclable.  This is expected
in the absence of even one concurrent XID assignment.  It is an old
implementation restriction.  In practice it's unlikely to be the thing
that makes recycling remain unsafe, at least with larger indexes, where
recycling newly deleted pages during the same VACUUM actually matters.

An important high-level goal of this commit (as well as related recent
commits e5d8a999 and 9f3665fb) is to make expensive deferred cleanup
operations in index AMs rare in general.  If index vacuuming frequently
depends on the next VACUUM operation finishing off work that the current
operation started, then the general behavior of index vacuuming is hard
to predict.  This is relevant to ongoing work that adds a vacuumlazy.c
mechanism to skip index vacuuming in certain cases.  Anything that makes
the real world behavior of index vacuuming simpler and more linear will
also make top-down modeling in vacuumlazy.c more robust.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
2021-03-21 15:25:39 -07:00
Tom Lane 9fb9691a88 Suppress various new compiler warnings.
Compilers that don't understand that elog(ERROR) doesn't return
issued warnings here.  In the cases in libpq_pipeline.c, we were
not exactly helping things by failing to mark pg_fatal() as noreturn.

Per buildfarm.
2021-03-21 11:50:43 -04:00
Tomas Vondra 882b2cdc08 Use valid compression method in brin_form_tuple
When compressing the BRIN summary, we can't simply use the compression
method from the indexed attribute.  The summary may use a different data
type, e.g. fixed-length attribute may have varlena summary, leading to
compression failures.  For the built-in BRIN opclasses this happens to
work, because the summary uses the same data type as the attribute.

When the data types match, we can inherit use the compression method
specified for the attribute (it's copied into the index descriptor).
Otherwise we don't have much choice and have to use the default one.

Author: Tomas Vondra
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/e0367f27-392c-321a-7411-a58e1a7e4817%40enterprisedb.com
2021-03-21 00:28:34 +01:00
Robert Haas bbe0a81db6 Allow configurable LZ4 TOAST compression.
There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option in up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.

In general, TOAST compression means compression of individual column
values, not the whole tuple, and those values can either be compressed
inline within the tuple or compressed and then stored externally in
the TOAST table, so those properties also apply to this feature.

Prior to this commit, a TOAST pointer has two unused bits as part of
the va_extsize field, and a compessed datum has two unused bits as
part of the va_rawsize field. These bits are unused because the length
of a varlena is limited to 1GB; we now use them to indicate the
compression type that was used. This means we only have bit space for
2 more built-in compresison types, but we could work around that
problem, if necessary, by introducing a new vartag_external value for
any further types we end up wanting to add. Hopefully, it won't be
too important to offer a wide selection of algorithms here, since
each one we add not only takes more coding but also adds a build
dependency for every packager. Nevertheless, it seems worth doing
at least this much, because LZ4 gets better compression than PGLZ
with less CPU usage.

It's possible for LZ4-compressed datums to leak into composite type
values stored on disk, just as it is for PGLZ. It's also possible for
LZ4-compressed attributes to be copied into a different table via SQL
commands such as CREATE TABLE AS or INSERT .. SELECT.  It would be
expensive to force such values to be decompressed, so PostgreSQL has
never done so. For the same reasons, we also don't force recompression
of already-compressed values even if the target table prefers a
different compression method than was used for the source data.  These
architectural decisions are perhaps arguable but revisiting them is
well beyond the scope of what seemed possible to do as part of this
project.  However, it's relatively cheap to recompress as part of
VACUUM FULL or CLUSTER, so this commit adjusts those commands to do
so, if the configured compression method of the table happens not to
match what was used for some column value stored therein.

Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those were patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allow a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored.  More recently, Justin Przyby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.

Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
2021-03-19 15:10:38 -04:00
Amit Kapila c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
Commit 05c8482f7f added the implementation of parallel SELECT for
"INSERT INTO ... SELECT ..." which may incur non-negligible overhead in
the additional parallel-safety checks that it performs, even when, in the
end, those checks determine that parallelism can't be used. This is
normally only ever a problem in the case of when the target table has a
large number of partitions.

A new GUC option "enable_parallel_insert" is added, to allow insert in
parallel-mode. The default is on.

In addition to the GUC option, the user may want a mechanism to allow
inserts in parallel-mode with finer granularity at table level. The new
table option "parallel_insert_enabled" allows this. The default is true.

Author: "Hou, Zhijie"
Reviewed-by: Greg Nancarrow, Amit Langote, Takayuki Tsunakawa, Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1K-cW7svLC2D7DHoGHxdAdg3P37BLgebqBOC2ZLc9a6QQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
2021-03-18 07:25:27 +05:30
Tom Lane 8620a7f6db Code review for server's handling of "tablespace map" files.
While looking at Robert Foggia's report, I noticed a passel of
other issues in the same area:

* The scheme for backslash-quoting newlines in pathnames is just
wrong; it will misbehave if the last ordinary character in a pathname
is a backslash.  I'm not sure why we're bothering to allow newlines
in tablespace paths, but if we're going to do it we should do it
without introducing other problems.  Hence, backslashes themselves
have to be backslashed too.

* The author hadn't read the sscanf man page very carefully, because
this code would drop any leading whitespace from the path.  (I doubt
that a tablespace path with leading whitespace could happen in
practice; but if we're bothering to allow newlines in the path, it
sure seems like leading whitespace is little less implausible.)  Using
sscanf for the task of finding the first space is overkill anyway.

* While I'm not 100% sure what the rationale for escaping both \r and
\n is, if the idea is to allow Windows newlines in the file then this
code failed, because it'd throw an error if it saw \r followed by \n.

* There's no cross-check for an incomplete final line in the map file,
which would be a likely apparent symptom of the improper-escaping
bug.

On the generation end, aside from the escaping issue we have:

* If needtblspcmapfile is true then do_pg_start_backup will pass back
escaped strings in tablespaceinfo->path values, which no caller wants
or is prepared to deal with.  I'm not sure if there's a live bug from
that, but it looks like there might be (given the dubious assumption
that anyone actually has newlines in their tablespace paths).

* It's not being very paranoid about the possibility of random stuff
in the pg_tblspc directory.  IMO we should ignore anything without an
OID-like name.

The escaping rule change doesn't seem back-patchable: it'll require
doubling of backslashes in the tablespace_map file, which is basically
a basebackup format change.  The odds of that causing trouble are
considerably more than the odds of the existing bug causing trouble.
The rest of this seems somewhat unlikely to cause problems too,
so no back-patch.
2021-03-17 16:18:46 -04:00
Tom Lane a50e4fd028 Prevent buffer overrun in read_tablespace_map().
Robert Foggia of Trustwave reported that read_tablespace_map()
fails to prevent an overrun of its on-stack input buffer.
Since the tablespace map file is presumed trustworthy, this does
not seem like an interesting security vulnerability, but still
we should fix it just in the name of robustness.

While here, document that pg_basebackup's --tablespace-mapping option
doesn't work with tar-format output, because it doesn't.  To make it
work, we'd have to modify the tablespace_map file within the tarball
sent by the server, which might be possible but I'm not volunteering.
(Less-painful solutions would require changing the basebackup protocol
so that the source server could adjust the map.  That's not very
appetizing either.)
2021-03-17 16:10:37 -04:00
Peter Geoghegan fbe4cb3bd4 Fix comment about promising tuples.
Oversight in commit d168b66682, which added bottom-up index deletion.
2021-03-16 13:38:52 -07:00
Stephen Frost 94d13d474d Improve logging of auto-vacuum and auto-analyze
When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas, Tomas Vondra
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
2021-03-16 14:46:48 -04:00
Fujii Masao d75288fb27 Make archiver process an auxiliary process.
This commit changes WAL archiver process so that it's treated as
an auxiliary process and can use shared memory. This is an infrastructure
patch required for upcoming shared-memory based stats collector patch
series. These patch series basically need any processes including archiver
that can report the statistics to access to shared memory. Since this patch
itself is useful to simplify the code and when users monitor the status of
archiver, it's committed separately in advance.

This commit simplifies the code for WAL archiving. For example, previously
backends need to signal to archiver via postmaster when they notify
archiver that there are some WAL files to archive. On the other hand,
this commit removes that signal to postmaster and enables backends to
notify archier directly using shared latch.

Also, as the side of this change, the information about archiver process
becomes viewable at pg_stat_activity view.

Author: Kyotaro Horiguchi
Reviewed-by: Andres Freund, Álvaro Herrera, Julien Rouhaud, Tomas Vondra, Arthur Zakirov, Fujii Masao
Discussion: https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
2021-03-15 13:13:14 +09:00
Peter Geoghegan 0ea71c93a0 Notice that heap page has dead items during VACUUM.
Consistently set a flag variable that tracks whether the current heap
page has a dead item during lazy vacuum's heap scan.  We missed the
common case where there is an preexisting (or even a new) LP_DEAD heap
line pointer.

Also make it clear that the variable might be affected by an existing
line pointer, say from an earlier opportunistic pruning operation.  This
distinction is important because it's the main reason why we can't just
use the nearby tups_vacuumed variable instead.

No backpatch.  In theory failing to set the page level flag variable had
no consequences.  Currently it is only used to defensively check if a
page marked all visible has dead items, which should never happen anyway
(if it does then the table must be corrupt).

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Diagnosed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
2021-03-14 18:05:57 -07:00
Peter Geoghegan 02b5940dbe Consolidate nbtree VACUUM metapage routines.
Simplify _bt_vacuum_needs_cleanup() functions's signature (it only needs
a single 'rel' argument now), and move it next to its sibling function
in nbtpage.c.

I believe that _bt_vacuum_needs_cleanup() was originally located in
nbtree.c due to an include dependency issue.  That's no longer an issue.

Follow-up to commit 9f3665fb.
2021-03-12 13:11:47 -08:00
Thomas Munro 600f2f50b7 Add condition variable for recovery resume.
Replace a sleep loop with a CV, to get a fast reaction time when
recovery is resumed or the postmaster exits via standard infrastructure.
Unfortunately we still need to wake up every second to perform extra
polling during the recovery pause loop.

Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
2021-03-12 19:45:42 +13:00
Peter Geoghegan 7bb97211a5 Save a few cycles during nbtree VACUUM.
Avoid calling RelationGetNumberOfBlocks() unnecessarily in the common
case where there are no deleted but not yet recycled pages to recycle
during a cleanup-only nbtree VACUUM operation.

Follow-up to commit e5d8a999, which (among other things) taught the
"skip full scan" nbtree VACUUM mechanism to only trigger a full index
scan when the absolute number of deleted pages in the index is
considered excessive.
2021-03-11 14:18:23 -08:00
Peter Geoghegan effdd3f3b6 Add back vacuum_cleanup_index_scale_factor parameter.
Commit 9f3665fb removed the vacuum_cleanup_index_scale_factor storage
parameter.  However, that creates dump/reload hazards when moving across
major versions.

Add back the vacuum_cleanup_index_scale_factor parameter (though not the
GUC of the same name) purely to avoid problems when using tools like
pg_upgrade.  The parameter remains disabled and undocumented.

No backpatch to Postgres 13, since vacuum_cleanup_index_scale_factor was
only disabled by REL_13_STABLE's version of master branch commit
9f3665fb in the first place -- the parameter already looks like this on
REL_13_STABLE.

Discussion: https://postgr.es/m/YEm/a3Ko3nKnBuVq@paquier.xyz
2021-03-11 12:42:46 -08:00
Robert Haas 32fd2b57d7 Be clear about whether a recovery pause has taken effect.
Previously, the code and documentation seem to have essentially
assumed than a call to pg_wal_replay_pause() would take place
immediately, but that's not the case, because we only check for a
pause in certain places. This means that a tool that uses this
function and then wants to do something else afterward that is
dependent on the pause having taken effect doesn't know how long it
needs to wait to be sure that no more WAL is going to be replayed.

To avoid that, add a new function pg_get_wal_replay_pause_state()
which returns either 'not paused', 'paused requested', or 'paused'.
After calling pg_wal_replay_pause() the status will immediate change
from 'not paused' to 'pause requested'; when the startup process
has noticed this, the status will change to 'pause'.  For backward
compatibility, pg_is_wal_replay_paused() still exists and returns
the same thing as before: true if a pause has been requested,
whether or not it has taken effect yet; and false if not.
The documentation is updated to clarify.

To improve the changes that a pause request is quickly confirmed
effective, adjust things so that WaitForWALToBecomeAvailable will
swiftly reach a call to recoveryPausesHere() when a pause request
is made.

Dilip Kumar, reviewed by Simon Riggs, Kyotaro Horiguchi, Yugo Nagata,
Masahiko Sawada, and Bharath Rupireddy.

Discussion: http://postgr.es/m/CAFiTN-vcLLWEm8Zr%3DYK83rgYrT9pbC8VJCfa1kY9vL3AUPfu6g%40mail.gmail.com
2021-03-11 15:07:03 -05:00
Peter Geoghegan 5f8727f5a6 VACUUM ANALYZE: Always update pg_class.reltuples.
vacuumlazy.c sometimes fails to update pg_class entries for each index
(to ensure that pg_class.reltuples is current), even though analyze.c
assumed that that must have happened during VACUUM ANALYZE.  There are
at least a couple of reasons for this.  For example, vacuumlazy.c could
fail to update pg_class when the index AM indicated that its statistics
are merely an estimate, per the contract for amvacuumcleanup() routines
established by commit e57345975c back in 2006.

Stop assuming that pg_class must have been updated with accurate
statistics within VACUUM ANALYZE -- update pg_class for indexes at the
same time as the table relation in all cases.  That way VACUUM ANALYZE
will never fail to keep pg_class.reltuples reasonably accurate.

The only downside of this approach (compared to the old approach) is
that it might inaccurately set pg_class.reltuples for indexes whose heap
relation ends up with the same inaccurate value anyway.  This doesn't
seem too bad.  We already consistently called vac_update_relstats() (to
update pg_class) for the heap/table relation twice during any VACUUM
ANALYZE -- once in vacuumlazy.c, and once in analyze.c.  We now make
sure that we call vac_update_relstats() at least once (though often
twice) for each index.

This is follow up work to commit 9f3665fb, which dealt with issues in
btvacuumcleanup().  Technically this fixes an unrelated issue, though.
btvacuumcleanup() no longer provides an accurate num_index_tuples value
following commit 9f3665fb (when there was no btbulkdelete() call during
the VACUUM operation in question), but hashvacuumcleanup() has worked in
the same way for many years now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzknxdComjhqo4SUxVFk_Q1171GJO2ZgHZ1Y6pion6u8rA@mail.gmail.com
Backpatch: 13-, just like commit 9f3665fb.
2021-03-10 17:07:57 -08:00
Peter Geoghegan 9f3665fbfc Don't consider newly inserted tuples in nbtree VACUUM.
Remove the entire idea of "stale stats" within nbtree VACUUM (stop
caring about stats involving the number of inserted tuples).  Also
remove the vacuum_cleanup_index_scale_factor GUC/param on the master
branch (though just disable them on postgres 13).

The vacuum_cleanup_index_scale_factor/stats interface made the nbtree AM
partially responsible for deciding when pg_class.reltuples stats needed
to be updated.  This seems contrary to the spirit of the index AM API,
though -- it is not actually necessary for an index AM's bulk delete and
cleanup callbacks to provide accurate stats when it happens to be
inconvenient.  The core code owns that.  (Index AMs have the authority
to perform or not perform certain kinds of deferred cleanup based on
their own considerations, such as page deletion and recycling, but that
has little to do with pg_class.reltuples/num_index_tuples.)

This issue was fairly harmless until the introduction of the
autovacuum_vacuum_insert_threshold feature by commit b07642db, which had
an undesirable interaction with the vacuum_cleanup_index_scale_factor
mechanism: it made insert-driven autovacuums perform full index scans,
even though there is no real benefit to doing so.  This has been tied to
a regression with an append-only insert benchmark [1].

Also have remaining cases that perform a full scan of an index during a
cleanup-only nbtree VACUUM indicate that the final tuple count is only
an estimate.  This prevents vacuumlazy.c from setting the index's
pg_class.reltuples in those cases (it will now only update pg_class when
vacuumlazy.c had TIDs for nbtree to bulk delete).  This arguably fixes
an oversight in deduplication-related bugfix commit 48e12913.

[1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com
Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.
2021-03-10 16:27:01 -08:00
Bruce Momjian 845ac7f847 C comments: improve description of GiST NSN and GistBuildLSN
GiST indexes are complex, so adding more details in the code might help
someone.

Discussion: https://postgr.es/m/20210302164021.GA364@momjian.us
2021-03-10 17:03:10 -05:00
Peter Eisentraut bbaf315309 Add bound check before bsearch() for performance
In the current lazy vacuum implementation, some index AMs such as
btree indexes call lazy_tid_reaped() for each index tuple during
ambulkdelete to check if the index tuple points to the (collected)
garbage tuple.  In that function, we simply call bsearch(), but we
should be able to know the result without bsearch() if the index tuple
points to the heap tuple that is out of range of the collected garbage
tuples.  Therefore, add a simple bound check before resorting to
bsearch().  Testing has shown that this can give significant
performance benefits.

Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+fd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A@mail.gmail.com
2021-03-10 15:19:37 +01:00
Amit Kapila 05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".
Parallel SELECT can't be utilized for INSERT in the following cases:
- INSERT statement uses the ON CONFLICT DO UPDATE clause
- Target table has a parallel-unsafe: trigger, index expression or
  predicate, column default expression or check constraint
- Target table has a parallel-unsafe domain constraint on any column
- Target table is a partitioned table with a parallel-unsafe partition key
  expression or support function

The planner is updated to perform additional parallel-safety checks for
the cases listed above, for determining whether it is safe to run INSERT
in parallel-mode with an underlying parallel SELECT. The planner will
consider using parallel SELECT for "INSERT INTO ... SELECT ...", provided
nothing unsafe is found from the additional parallel-safety checks, or
from the existing parallel-safety checks for SELECT.

While checking parallel-safety, we need to check it for all the partitions
on the table which can be costly especially when we decide not to use a
parallel plan. So, in a separate patch, we will introduce a GUC and or a
reloption to enable/disable parallelism for Insert statements.

Prior to entering parallel-mode for the execution of INSERT with parallel
SELECT, a TransactionId is acquired and assigned to the current
transaction state. This is necessary to prevent the INSERT from attempting
to assign the TransactionId whilst in parallel-mode, which is not allowed.
This approach has a disadvantage in that if the underlying SELECT does not
return any rows, then the TransactionId is not used, however that
shouldn't happen in practice in many cases.

Author: Greg Nancarrow, Amit Langote, Amit Kapila
Reviewed-by: Amit Langote, Hou Zhijie, Takayuki Tsunakawa, Antonin Houska, Bharath Rupireddy, Dilip Kumar, Vignesh C, Zhihong Yu, Amit Kapila
Tested-by: Tang, Haiying
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
Discussion: https://postgr.es/m/CAJcOf-fAdj=nDKMsRhQzndm-O13NY4dL6xGcEvdX5Xvbbi0V7g@mail.gmail.com
2021-03-10 07:38:58 +05:30
Fujii Masao ff99918c62 Track total amounts of times spent writing and syncing WAL data to disk.
This commit adds new GUC track_wal_io_timing. When this is enabled,
the total amounts of time XLogWrite writes and issue_xlog_fsync syncs
WAL data to disk are counted in pg_stat_wal. This information would be
useful to check how much WAL write and sync affect the performance.

Enabling track_wal_io_timing will make the server query the operating
system for the current time every time WAL is written or synced,
which may cause significant overhead on some platforms. To avoid such
additional overhead in the server with track_io_timing enabled,
this commit introduces track_wal_io_timing as a separate parameter from
track_io_timing.

Note that WAL write and sync activity by walreceiver has not been tracked yet.

This commit makes the server also track the numbers of times XLogWrite
writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal,
regardless of the setting of track_wal_io_timing. This counters can be
used to calculate the WAL write and sync time per request, for example.

Bump PGSTAT_FILE_FORMAT_ID.

Bump catalog version.

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao
Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
2021-03-09 16:52:06 +09:00
Amit Kapila 8a812e5106 Track replication origin progress for rollbacks.
Commit 1eb6d6527a allowed to track replica origin replay progress for 2PC
but it was not complete. It misses to properly track the progress for
rollback prepared especially it missed updating the code for recovery.
Additionally, we need to allow tracking it on subscriber nodes where
wal_level might not be logical.

It is required to track decoding of 2PC which is committed in PG14
(a271a1b50e) and also nobody complained about this till now so not
backpatching it.

Author: Amit Kapila
Reviewed-by: Michael Paquier and Ajin Cherian
Discussion: https://postgr.es/m/CAA4eK1L-kHmMnSdrRW6UhRbCjR7cgh04c+6psY15qzT6ktcd+g@mail.gmail.com
2021-03-08 07:54:03 +05:30
Heikki Linnakangas 3174d69fb9 Remove server and libpq support for old FE/BE protocol version 2.
Protocol version 3 was introduced in PostgreSQL 7.4. There shouldn't be
many clients or servers left out there without version 3 support. But as
a courtesy, I kept just enough of the old protocol support that we can
still send the "unsupported protocol version" error in v2 format, so that
old clients can display the message properly. Likewise, libpq still
understands v2 ErrorResponse messages when establishing a connection.

The impetus to do this now is that I'm working on a patch to COPY
FROM, to always prefetch some data. We cannot do that safely with the
old protocol, because it requires parsing the input one byte at a time
to detect the end-of-copy marker.

Reviewed-by: Tom Lane, Alvaro Herrera, John Naylor
Discussion: https://www.postgresql.org/message-id/9ec25819-0a8a-d51a-17dc-4150bb3cca3b%40iki.fi
2021-03-04 10:45:55 +02:00
Peter Geoghegan 5b2f2af3d9 nbtree page deletion: Add leaftopparent assertion.
Add documenting assertion.  This makes it easier to follow how we
maintain the top parent link in target subtree's half-dead/leaf level
page.
2021-03-02 14:06:07 -08:00
Peter Geoghegan 3d8d5787a3 Fix nbtree page deletion error messages.
Adjust some "can't happen" error messages that assumed that the page
deletion target page must be a half-dead page.  This assumption was
wrong in the case of an internal target page.  Simply refer to these
pages as the target page instead.

Internal pages are never marked half-dead.  There is exactly one
half-dead page for each subtree undergoing deletion.  The half-dead page
is also the target subtree's leaf-level page.  This has been the case
since commit efada2b8, which totally overhauled nbtree page deletion.
2021-03-02 13:02:24 -08:00
David Rowley bb437f995d Add TID Range Scans to support efficient scanning ranges of TIDs
This adds a new executor node named TID Range Scan.  The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column.  These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.

To support this, two new optional callback functions have been added to
table AM.  scan_set_tidrange is used to set the scan range to just the
given range of TIDs.  scan_getnextslot_tidrange fetches the next tuple
in the given range.

For AMs were scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine.  The query planner won't
generate TID Range Scan Paths in that case.

Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
2021-02-27 22:59:36 +13:00
Peter Geoghegan 2376361839 VACUUM VERBOSE: Count "newly deleted" index pages.
Teach VACUUM VERBOSE to report on pages deleted by the _current_ VACUUM
operation -- these are newly deleted pages.  VACUUM VERBOSE continues to
report on the total number of deleted pages in the entire index (no
change there).  The former is a subset of the latter.

The distinction between each category of deleted index page only arises
with index AMs where page deletion is supported and is decoupled from
page recycling for performance reasons.

This is follow-up work to commit e5d8a999, which made nbtree store
64-bit XIDs (not 32-bit XIDs) in pages at the point at which they're
deleted.  Note that the btm_last_cleanup_num_delpages metapage field
added by that commit usually gets set to pages_newly_deleted.  The
exceptions (the scenarios in which they're not equal) all seem to be
tricky cases for the implementation (of page deletion and recycling) in
general.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX%3Dd5u-tRYhi-F4wnV2uN2zHpMUXw%40mail.gmail.com
2021-02-25 14:32:18 -08:00
Peter Geoghegan e5d8a99903 Use full 64-bit XIDs in deleted nbtree pages.
Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.

Stop storing an XID indicating the oldest bpto.xact across all deleted
though unrecycled pages in nbtree metapages.  There is no longer any
reason to care about that condition/the oldest XID.  It only ever made
sense when wraparound was something _bt_vacuum_needs_cleanup() had to
consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
2021-02-24 18:41:34 -08:00
Michael Paquier bcf2667bf6 Fix some typos, grammar and style in docs and comments
The portions fixing the documentation are backpatched where needed.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20210210235557.GQ20012@telsasoft.com
backpatch-through: 9.6
2021-02-24 16:13:17 +09:00
Alvaro Herrera 8deb6b38dc
Reinstate HEAP_XMAX_LOCK_ONLY|HEAP_KEYS_UPDATED as allowed
Commit 866e24d47d added an assert that HEAP_XMAX_LOCK_ONLY and
HEAP_KEYS_UPDATED cannot appear together, on the faulty assumption that
the latter necessarily referred to an update and not a tuple lock; but
that's wrong, because SELECT FOR UPDATE can use precisely that
combination, as evidenced by the amcheck test case added here.

Remove the Assert(), and also patch amcheck's verify_heapam.c to not
complain if the combination is found.  Also, out of overabundance of
caution, update (across all branches) README.tuplock to be more explicit
about this.

Author: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Mahendra Singh Thalor <mahi6run@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/20210124061758.GA11756@nol
2021-02-23 17:30:21 -03:00
Peter Eisentraut 6f6f284c7e Simplify printing of LSNs
Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs.
Convert all applicable code to use it.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
2021-02-23 10:27:02 +01:00
Alvaro Herrera 0f5505a881
Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls
Pavan Deolasee recently noted that a few of the
HeapTupleHeaderIndicatesMovedPartitions calls added by commit
5db6df0c01 are useless, since they are done after comparing t_self
with t_ctid.  But because t_self can never be set to the magical values
that indicate that the tuple moved partition, this can never succeed: if
the first test fails (so we know t_self equals t_ctid), necessarily the
second test will also fail.

So these checks can be removed and no harm is done.  There's no bug
here, just a code legibility issue.

Reported-by: Pavan Deolasee <pavan.deolasee@gmail.com>
Discussion: https://postgr.es/m/20200929164411.GA15497@alvherre.pgsql
2021-02-22 16:51:44 -03:00
Fujii Masao f05ed5a5cf Initialize atomic variable waitStart in PGPROC, at postmaster startup.
Commit 46d6e5f567 added the atomic variable "waitStart" into PGPROC struct,
to store the time at which wait for lock acquisition started. Previously
this variable was initialized every time each backend started. Instead,
this commit makes postmaster initialize it at the startup, to ensure that
the variable should be initialized before any use of it.

This commit also moves the code to initialize "waitStart" variable for
prepare transaction, from TwoPhaseGetDummyProc() to MarkAsPreparingGuts().
Because MarkAsPreparingGuts() is more proper place to do that since
it initializes other PGPROC variables.

Author: Fujii Masao
Reviewed-by: Atsushi Torikoshi
Discussion: https://postgr.es/m/1df88660-6f08-cc6e-b7e2-f85296a2bdab@oss.nttdata.com
2021-02-22 18:25:00 +09:00
Fujii Masao 8a55cb5ba9 Fix bug in COMMIT AND CHAIN command.
This commit fixes COMMIT AND CHAIN command so that it starts new transaction
immediately even if savepoints are defined within the transaction to commit.
Previously COMMIT AND CHAIN command did not in that case because
commit 280a408b48 forgot to make CommitTransactionCommand() handle
a transaction chaining when the transaction state was TBLOCK_SUBCOMMIT.

Also this commit adds the regression test for COMMIT AND CHAIN command
when savepoints are defined.

Back-patch to v12 where transaction chaining was added.

Reported-by: Arthur Nascimento
Author: Fujii Masao
Reviewed-by: Arthur Nascimento, Vik Fearing
Discussion: https://postgr.es/m/16867-3475744069228158@postgresql.org
2021-02-19 21:57:52 +09:00
Peter Geoghegan b071a31149 Add nbtree README section on page recycling.
Consolidate discussion of how VACUUM places pages in the FSM for
recycling by adding a new section that comes after discussion of page
deletion.  This structure reflects the fact that page recycling is
explicitly decoupled from page deletion in Lanin & Shasha's paper.  Page
recycling in nbtree is an implementation of what the paper calls "the
drain technique".

This decoupling is an important concept for nbtree VACUUM.  Searchers
have to detect and recover from concurrent page deletions, but they will
never have to reason about concurrent page recycling.  Recycling can
almost always be thought of as a low level garbage collection operation
that asynchronously frees the physical space that backs a logical tree
node.  Almost all code need only concern itself with logical tree nodes.
(Note that "logical tree node" is not currently a term of art in the
nbtree code -- this all works implicitly.)

This is preparation for an upcoming patch that teaches nbtree VACUUM to
remember the details of pages that it deletes on the fly, in local
memory.  This enables the same VACUUM operation to consider placing its
own deleted pages in the FSM later on, when it reaches the end of
btvacuumscan().
2021-02-18 21:16:33 -08:00
Peter Geoghegan 128dd901a5 nbtree README: move VACUUM linear scan section.
Discuss VACUUM's linear scan after discussion of tuple deletion by
VACUUM, but before discussion of page deletion by VACUUM.  This
progression is a lot more natural.

Also tweak the wording a little.  It seems unnecessary to talk about how
it worked prior to PostgreSQL 8.2.
2021-02-17 21:13:15 -08:00
Peter Eisentraut 0e392fcc0d Use errmsg_internal for debug messages
An inconsistent set of debug-level messages was not using
errmsg_internal(), thus uselessly exposing the messages to translation
work.  Fix those.
2021-02-17 11:33:25 +01:00
Andres Freund 8001cb77ee Fix heap_page_prune() parameter order confusion introduced in dc7420c2c9.
Both luckily and unluckily the passed values meant the same for all
types. Luckily because that meant my confusion caused no harm,
unluckily because otherwise the compiler might have warned...

In passing, synchronize parameter names between definition and
declaration.

Reported-By: Peter Geoghegan <pg@bowt.ie>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=L=nBoepQdH9b5Qd0nMvepFT2CnT6sjWvvpOXa=K8HVQ@mail.gmail.com
2021-02-15 17:12:12 -08:00
Fujii Masao 46d6e5f567 Display the time when the process started waiting for the lock, in pg_locks, take 2
This commit adds new column "waitstart" into pg_locks view. This column
reports the time when the server process started waiting for the lock
if the lock is not held. This information is useful, for example, when
examining the amount of time to wait on a lock by subtracting
"waitstart" in pg_locks from the current time, and identify the lock
that the processes are waiting for very long.

This feature uses the current time obtained for the deadlock timeout
timer as "waitstart" (i.e., the time when this process started waiting
for the lock). Since getting the current time newly can cause overhead,
we reuse the already-obtained time to avoid that overhead.

Note that "waitstart" is updated without holding the lock table's
partition lock, to avoid the overhead by additional lock acquisition.
This can cause "waitstart" in pg_locks to become NULL for a very short
period of time after the wait started even though "granted" is false.
This is OK in practice because we can assume that users are likely to
look at "waitstart" when waiting for the lock for a long time.

The first attempt of this patch (commit 3b733fcd04) caused the buildfarm
member "rorqual" (built with --disable-atomics --disable-spinlocks) to report
the failure of the regression test. It was reverted by commit 890d2182a2.
The cause of this failure was that the atomic variable for "waitstart"
in the dummy process entry created at the end of prepare transaction was
not initialized. This second attempt fixes that issue.

Bump catalog version.

Author: Atsushi Torikoshi
Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao
Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
2021-02-15 15:13:37 +09:00
Peter Geoghegan 7cde6b13a9 Adjust lazy_scan_heap() accounting comments.
Explain which particular LP_DEAD line pointers get accounted for by the
tups_vacuumed variable.
2021-02-14 19:28:37 -08:00
Thomas Munro c7ecd6af01 ReadNewTransactionId() -> ReadNextTransactionId().
The new name conveys the effect better, is more consistent with similar
functions ReadNextMultiXactId(), ReadNextFullTransactionId(), and
matches the name of the variable that it reads.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmVR4SakBXQUdhhPpMf1aYvZCnna5%3DHKa7DAgEmBAg%2B8g%40mail.gmail.com
2021-02-15 13:17:02 +13:00
Bruce Momjian 8facf1ea00 README/C-comment: document GiST's NSN value 2021-02-13 13:50:49 -05:00
Amit Kapila ce0fdbfe97 Allow multiple xacts during table sync in logical replication.
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronize the
position in the stream with the main apply worker.

There are multiple downsides of this approach: (a) We have to perform the
entire copy operation again if there is any error (network breakdown,
error in the database operation, etc.) while we synchronize the WAL
position between tablesync worker and apply worker; this will be onerous
especially for large copies, (b) Using a single transaction in the
synchronization-phase (where we can receive WAL from multiple
transactions) will have the risk of exceeding the CID limit, (c) The slot
will hold the WAL till the entire sync is complete because we never commit
till the end.

This patch solves all the above downsides by allowing multiple
transactions during the tablesync phase. The initial copy is done in a
single transaction and after that, we commit each transaction as we
receive. To allow recovery after any error or crash, we use a permanent
slot and origin to track the progress. The slot and origin will be removed
once we finish the synchronization of the table. We also remove slot and
origin of tablesync workers if the user performs DROP SUBSCRIPTION .. or
ALTER SUBSCRIPTION .. REFERESH and some of the table syncs are still not
finished.

The commands ALTER SUBSCRIPTION ... REFRESH PUBLICATION and
ALTER SUBSCRIPTION ... SET PUBLICATION ... with refresh option as true
cannot be executed inside a transaction block because they can now drop
the slots for which we have no provision to rollback.

This will also open up the path for logical replication of 2PC
transactions on the subscriber side. Previously, we can't do that because
of the requirement of maintaining a single transaction in tablesync
workers.

Bump catalog version due to change of state in the catalog
(pg_subscription_rel).

Author: Peter Smith, Amit Kapila, and Takamichi Osumi
Reviewed-by: Ajin Cherian, Petr Jelinek, Hou Zhijie and Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1KHJxaZS-fod-0fey=0tq3=Gkn4ho=8N4-5HWiCfu0H1A@mail.gmail.com
2021-02-12 07:41:51 +05:30
Peter Geoghegan 3063eb1759 Remove obsolete IndexBulkDeleteResult stats field.
The pages_removed field is no longer used for anything.  It hasn't been
possible for an index to physically shrink since old-style VACUUM FULL
was removed by commit 0a469c87.
2021-02-11 16:49:41 -08:00
Peter Eisentraut 4ad5611055 Fix lack of message pluralization 2021-02-10 11:35:45 +01:00
Peter Geoghegan 31c7fb41e2 Fix obsolete FSM remarks in nbtree README.
The free space map has used a dedicated relation fork rather than shared
memory segments for over a decade.
2021-02-09 11:36:51 -08:00
Heikki Linnakangas 6214e2b228 Fix permission checks on constraint violation errors on partitions.
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.

The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.

ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.

With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.

NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.

Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.

Reviewed-by: Amit Langote
Security: CVE-2021-3393
2021-02-08 11:01:51 +02:00
Peter Geoghegan 617fffee8a Rename removable xid function for consistency.
GlobalVisIsRemovableFullXid() is now GlobalVisCheckRemovableFullXid().
This is consistent with the general convention for FullTransactionId
equivalents of functions that deal with TransactionId values.  It now
matches the nearby GlobalVisCheckRemovableXid() function, which performs
the same check for callers that use TransactionId values.

Oversight in commit dc7420c2c9.

Discussion: https://postgr.es/m/CAH2-Wzmes12jFNDcVgpU89Vp=r6uLFrE-MT0fjSWGsE70UiNaA@mail.gmail.com
2021-02-07 10:11:14 -08:00
Michael Paquier f7400823c3 Clarify some comments around SharedRecoveryState in xlog.c
SharedRecoveryState has been switched from a boolean to an enum as of
commit 4e87c48, but some comments still referred to it as a boolean.

Author: Amul Sul
Reviewed-by: Dilip Kumar, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAAJ_b97Hf+1SXnm8jySpO+Fhm+-VKFAAce1T_cupUYtnE3Nxig
2021-02-06 10:27:55 +09:00
Peter Geoghegan c34787f910 Harden nbtree page deletion.
Add some additional defensive checks in the second phase of index
deletion to detect and report index corruption during VACUUM, and to
avoid having VACUUM become stuck in more cases.  The code is still not
robust in the presence of a circular chain of sibling links, though it's
not clear whether that really matters.  This is follow-up work to commit
3a01f68e.

The new defensive checks rely on the assumption that there can be no
more than one VACUUM operation running for an index at any given time.
Remove an old comment suggesting that multiple concurrent VACUUMs need
to be considered here.  This concern now seems highly unlikely to have
any real validity, since we clearly rely on the same assumption in
several other places.  For example, there are much more recent comments
that appear in the same function (added by commit efada2b8e9) that make
the same assumption.

Also add a CHECK_FOR_INTERRUPTS() to the relevant code path.  Contrary
to comments added by commit 3a01f68e, it is actually possible to handle
interrupts here, at least in the common case where processing takes
place at the leaf level.  We only hold a pin on leafbuf/target page when
stepping right at the leaf level.

No backpatch due to the lack of complaints following hardening added to
the same area by commit 3a01f68e.
2021-02-04 15:42:36 -08:00
Peter Eisentraut 1d71f3c83c Improve confusing variable names
The prototype calls the second argument of
pgstat_progress_update_multi_param() "index", and some callers name
their local variable that way.  But when the surrounding code deals
with index relations, this is confusing, and in at least one case
shadowed another variable that is referring to an index relation.
Adjust those call sites to have clearer local variable naming, similar
to existing callers in indexcmds.c.
2021-02-02 09:20:22 +01:00
Peter Geoghegan dc43492e46 Remove unused _bt_delitems_delete() argument.
The latestRemovedXid values used by nbtree deletion operations are
determined by _bt_delitems_delete()'s caller, so there is no reason to
pass a separate heapRel argument.

Oversight in commit d168b66682.
2021-01-31 10:10:55 -08:00
Thomas Munro 514b411a2b Retire pg_standby.
pg_standby was useful more than a decade ago, but now it is obsolete.
It has been proposed that we retire it many times.  Now seems like a
good time to finally do it, because "waiting restore commands"
are incompatible with a proposed recovery prefetching feature.

Discussion: https://postgr.es/m/20201029024412.GP5380%40telsasoft.com
Author: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
2021-01-29 14:09:41 +13:00
Robert Haas 69059d3b2f In TrimCLOG(), don't reset XactCtl->shared->latest_page_number.
Since the CLOG page number is not recorded directly in the checkpoint
record, we have to use ShmemVariableCache->nextXid to figure out the
latest CLOG page number at the start of recovery. However, as recovery
progresses, replay of CLOG/EXTEND records will update our notion of
the latest page number, and we should rely on that being accurate
rather than recomputing the value based on an updated notion of
nextXid. ShmemVariableCache->nextXid is only an approximation
during recovery anyway, whereas CLOG/EXTEND records are an
authoritative representation of how the SLRU has been updated.

Commit 0fcc2decd4 makes this
simplification possible, as before that change clog_redo() might
have injected a bogus value here, and we'd want to get rid of
that before entering normal running.

Patch by me, reviewed by Heikki Linnakangas.

Discussion: http://postgr.es/m/CA+TgmoZYig9+AQodhF5sRXuKkJ=RgFDugLr3XX_dz_F-p=TwTg@mail.gmail.com
2021-01-27 15:52:34 -05:00
Robert Haas 0fcc2decd4 In clog_redo(), don't set XactCtl->shared->latest_page_number.
The comment is no longer accurate, and hasn't been entirely accurate
since Hot Standby was introduced. The original idea here was that
StartupCLOG() wouldn't be called until the end of recovery and
therefore this value would be uninitialized when this code is reached,
but Hot Standby made that true only when hot_standby=off, and commit
1f113abdf8 means that this value is now
always initialized before replay even starts.

The original purpose of this code was to bypass the sanity check
in SimpleLruTruncate(), which will no longer occur: now, if something
is wrong, that sanity check might trip during recovery. That's
probably a good thing, because in the current code base
latest_page_number should always be initialized and therefore we
expect that the sanity check should pass. If it doesn't, something
has gone wrong, and complaining about it is appropriate.

Patch by me, reviewed by Heikki Linnakangas.

Discussion: http://postgr.es/m/CA+TgmoZYig9+AQodhF5sRXuKkJ=RgFDugLr3XX_dz_F-p=TwTg@mail.gmail.com
2021-01-27 13:11:30 -05:00
Robert Haas 1f113abdf8 Move StartupCLOG() calls to just after we initialize ShmemVariableCache.
Previously, the hot_standby=off code path did this at end of recovery,
while the hot_standby=on code path did it at the beginning of recovery.
It's better to do this in only one place because (a) it's simpler,
(b) StartupCLOG() is trivial so trying to postpone the work isn't
useful, and (c) this will make it possible to simplify some other
logic.

Patch by me, reviewed by Heikki Linnakangas.

Discussion: http://postgr.es/m/CA+TgmoZYig9+AQodhF5sRXuKkJ=RgFDugLr3XX_dz_F-p=TwTg@mail.gmail.com
2021-01-27 12:20:46 -05:00
Peter Geoghegan e42b3c3bd6 Fix GiST index deletion assert issue.
Avoid calling heap_index_delete_tuples() with an empty deltids array to
avoid an assertion failure.

This issue was arguably an oversight in commit b5f58cf2, though the
failing assert itself was added by my recent commit d168b666.  No
backpatch, though, since the oversight is harmless in the back branches.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jaime Casanova <jcasanov@systemguards.com.ec>
Discussion: https://postgr.es/m/CAJKUy5jscES84n3puE=sYngyF+zpb4wv8UMtuLnLPv5z=6yyNw@mail.gmail.com
2021-01-26 23:24:37 -08:00
Robert Haas d18e75664a Remove CheckpointLock.
Up until now, we've held this lock when performing a checkpoint or
restartpoint, but commit 076a055acf back
in 2004 and commit 7e48b77b1c from 2009,
taken together, have removed all need for this. In the present code,
there's only ever one process entitled to attempt a checkpoint: either
the checkpointer, during normal operation, or the postmaster, during
single-user operation. So, we don't need the lock.

One possible concern in making this change is that it means that
a substantial amount of code where HOLD_INTERRUPTS() was previously
in effect due to the preceding LWLockAcquire() will now be
running without that. This could mean that ProcessInterrupts()
gets called in places from which it didn't before. However, this
seems unlikely to do very much, because the checkpointer doesn't
have any signal mapped to die(), so it's not clear how,
for example, ProcDiePending = true could happen in the first
place. Similarly with ClientConnectionLost and recovery conflicts.

Also, if there are any such problems, we might want to fix them
rather than reverting this, since running lots of code with
interrupt handling suspended is generally bad.

Patch by me, per an inquiry by Amul Sul. Review by Tom Lane
and Michael Paquier.

Discussion: http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
2021-01-25 12:34:38 -05:00
David Rowley 16dfe253e3 Fix hypothetical bug in heap backward scans
Both heapgettup() and heapgettup_pagemode() incorrectly set the first page
to scan in a backward scan in which the number of pages to scan was
specified by heap_setscanlimits().  The code incorrectly started the scan
at the end of the relation when startBlk was 0, or otherwise at
startBlk - 1, neither of which is correct when only scanning a subset of
pages.

The fix here checks if heap_setscanlimits() has changed the number of
pages to scan and if so we set the first page to scan as the final page in
the specified range during backward scans.

Proper adjustment of this code was forgotten when heap_setscanlimits() was
added in 7516f5259 back in 9.5.  However, practice, nowhere in core code
performs backward scans after having used heap_setscanlimits(), yet, it is
possible an extension uses the heap functions in this way, hence
backpatch.

An upcoming patch does use heap_setscanlimits() with backward scans, so
this must be fixed before that can go in.

Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvpGc9h0_oVD2CtgBcxCS1N-qDYZSeBRnUh+0CWJA9cMaA@mail.gmail.com
Backpatch-through: 9.5, all supported versions
2021-01-25 19:52:18 +13:00
Tomas Vondra 39b66a91bd Fix COPY FREEZE with CLOBBER_CACHE_ALWAYS
This adds code omitted from commit 7db0cd2145 by accident, which had
two consequences. Firstly, only rows inserted by heap_multi_insert were
frozen as expected when running COPY FREEZE, while heap_insert left
rows unfrozen. That however includes rows in TOAST tables, so a lot of
data might have been left unfrozen. Secondly, page might have been left
partially empty after relcache invalidation.

This addresses both of those issues.

Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
2021-01-24 01:08:11 +01:00
Heikki Linnakangas 6b4d3046f4 Fix bug in detecting concurrent page splits in GiST insert
In commit 9eb5607e69, I got the condition on checking for split or
deleted page wrong: I used && instead of ||. The comment correctly said
"concurrent split _or_ deletion".

As a result, GiST insertion could miss a concurrent split, and insert to
wrong page. Duncan Sands demonstrated this with a test script that did a
lot of concurrent inserts.

Backpatch to v12, where this was introduced. REINDEX is required to fix
indexes that were affected by this bug.

Backpatch-through: 12
Reported-by: Duncan Sands
Discussion: https://www.postgresql.org/message-id/a9690483-6c6c-3c82-c8ba-dc1a40848f11%40deepbluecap.com
2021-01-20 11:58:03 +02:00
Peter Eisentraut 15251c0a60 Pause recovery for insufficient parameter settings
When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record.  The standby then checks whether its own settings are at least
as big as the ones on the primary.  If not, the standby shuts down
with a fatal error.

This patch changes this behavior for hot standbys to pause recovery at
that point instead.  That allows read traffic on the standby to
continue while database administrators figure out next steps.  When
recovery is unpaused, the server shuts down (as before).  The idea is
to fix the parameters while recovery is paused and then restart when
there is a maintenance window.

Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
2021-01-18 09:04:04 +01:00
Tomas Vondra 7db0cd2145 Set PD_ALL_VISIBLE and visibility map bits in COPY FREEZE
Make sure COPY FREEZE marks the pages as PD_ALL_VISIBLE and updates the
visibility map. Until now we only marked individual tuples as frozen,
but page-level flags were not updated, so the first VACUUM after the
COPY FREEZE had to rewrite the whole table.

This is a fairly old patch, and multiple people worked on it. The first
version was written by Jeff Janes, and then reworked by Pavan Deolasee
and Anastasia Lubennikova.

Author: Anastasia Lubennikova, Pavan Deolasee, Jeff Janes
Reviewed-by: Kuntal Ghosh, Jeff Janes, Tomas Vondra, Masahiko Sawada,
             Andres Freund, Ibrar Ahmed, Robert Haas, Tatsuro Ishii,
             Darafei Praliaskouski
Discussion: https://postgr.es/m/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg@mail.gmail.com
Discussion: https://postgr.es/m/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com
2021-01-17 22:28:26 +01:00
Noah Misch 6db992833c Prevent excess SimpleLruTruncate() deletion.
Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)  The bug fixed here has the same symptoms and user
followup steps as 592a589a04.  Back-patch
to 9.5 (all supported versions).

Reviewed by Andrey Borodin and (in earlier versions) by Tom Lane.

Discussion: https://postgr.es/m/20190202083822.GC32531@gust.leadboat.com
2021-01-16 12:21:35 -08:00
Michael Paquier 5ae1572993 Fix O(N^2) stat() calls when recycling WAL segments
The counter tracking the last segment number recycled was getting
initialized when recycling one single segment, while it should be used
across a full cycle of segments recycled to prevent useless checks
related to entries already recycled.

This performance issue has been introduced by b2a5545, and it was first
implemented in 61b86142.

No backpatch is done per the lack of field complaints.

Reported-by: Andres Freund, Thomas Munro
Author: Michael Paquier
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20170621211016.eln6cxxp3jrv7m4m@alap3.anarazel.de
Discussion: https://postgr.es/m/CA+hUKG+DRiF9z1_MU4fWq+RfJMxP7zjoptfcmuCFPeO4JM2iVg@mail.gmail.com
2021-01-15 10:33:13 +09:00
Peter Geoghegan d168b66682 Enhance nbtree index tuple deletion.
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
2021-01-13 09:21:32 -08:00
Peter Geoghegan 9dc718bdf2 Pass down "logically unchanged index" hint.
Add an executor aminsert() hint mechanism that informs index AMs that
the incoming index tuple (the tuple that accompanies the hint) is not
being inserted by execution of an SQL statement that logically modifies
any of the index's key columns.

The hint is received by indexes when an UPDATE takes place that does not
apply an optimization like heapam's HOT (though only for indexes where
all key columns are logically unchanged).  Any index tuple that receives
the hint on insert is expected to be a duplicate of at least one
existing older version that is needed for the same logical row.  Related
versions will typically be stored on the same index page, at least
within index AMs that apply the hint.

Recognizing the difference between MVCC version churn duplicates and
true logical row duplicates at the index AM level can help with cleanup
of garbage index tuples.  Cleanup can intelligently target tuples that
are likely to be garbage, without wasting too many cycles on less
promising tuples/pages (index pages with little or no version churn).

This is infrastructure for an upcoming commit that will teach nbtree to
perform bottom-up index deletion.  No index AM actually applies the hint
just yet.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFASTdzipdkLTNQQ@mail.gmail.com
2021-01-13 08:11:00 -08:00
Alvaro Herrera a3e51a36b7
Fix thinko in comment
This comment has been wrong since its introduction in commit
2c03216d83.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoAzz6qipFJBbGEaHmyWxvvNDp8httbwLR9tUQWaTjUs2Q@mail.gmail.com
2021-01-12 11:48:45 -03:00
Thomas Munro ce6a71fa53 Use vectored I/O to fill new WAL segments.
Instead of making many block-sized write() calls to fill a new WAL file
with zeroes, make a smaller number of pwritev() calls (or various
emulations).  The actual number depends on the OS's IOV_MAX, which
PG_IOV_MAX currently caps at 32.  That means we'll write 256kB per call
on typical systems.  We may want to tune the number later with more
experience.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
2021-01-11 15:28:31 +13:00
Peter Geoghegan 83e3239ee7 Standardize one aspect of rmgr desc output.
Bring heap and hash rmgr desc output in line with nbtree and GiST desc
output by using the name latestRemovedXid for all fields that output the
contents of the latestRemovedXid field from the WAL record's C struct
(stop using local variants).

This seems like a clear improvement because latestRemovedXid is a symbol
name that already appears across many different source files, and so is
probably much more recognizable.

Discussion: https://postgr.es/m/CAH2-Wzkt_Rs4VqPSCk87nyjPAAEmWL8STU9zgET_83EF5YfrLw@mail.gmail.com
2021-01-04 19:46:11 -08:00
Thomas Munro 034510c820 Replace remaining uses of "whitelist".
Instead describe the action that the list effects, or just use "list"
where the meaning is obvious from context.

Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de
2021-01-05 14:00:16 +13:00
Thomas Munro c0d4f6d897 Rename "enum blacklist" to "uncommitted enums".
We agreed to remove this terminology and use something more descriptive.

Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de
2021-01-05 12:38:48 +13:00
Tom Lane 4bd3fad80e Fix integer-overflow corner cases in substring() functions.
If the substring start index and length overflow when added together,
substring() misbehaved, either throwing a bogus "negative substring
length" error on a case that should succeed, or failing to complain that
a negative length is negative (and instead returning the whole string,
in most cases).  Unsurprisingly, the text, bytea, and bit variants of
the function all had this issue.  Rearrange the logic to ensure that
negative lengths are always rejected, and add an overflow check to
handle the other case.

Also install similar guards into detoast_attr_slice() (nee
heap_tuple_untoast_attr_slice()), since it's far from clear that
no other code paths leading to that function could pass it values
that would overflow.

Patch by myself and Pavel Stehule, per bug #16804 from Rafi Shamim.

Back-patch to v11.  While these bugs are old, the common/int.h
infrastructure for overflow-detecting arithmetic didn't exist before
commit 4d6ad3125, and it doesn't seem like these misbehaviors are bad
enough to justify developing a standalone fix for the older branches.

Discussion: https://postgr.es/m/16804-f4eeeb6c11ba71d4@postgresql.org
2021-01-04 18:32:44 -05:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Peter Geoghegan 32d6287d2e Get heap page max offset with buffer lock held.
On further reflection it seems better to call PageGetMaxOffsetNumber()
after acquiring a buffer lock on the page.  This shouldn't really
matter, but doing it this way is cleaner.

Follow-up to commit 42288174.

Backpatch: 12-, just like commit 42288174
2020-12-30 17:21:42 -08:00
Peter Geoghegan 4228817449 Fix index deletion latestRemovedXid bug.
The logic for determining the latest removed XID for the purposes of
generating recovery conflicts in REDO routines was subtly broken.  It
failed to follow links from HOT chains, and so failed to consider all
relevant heap tuple headers in some cases.

To fix, expand the loop that deals with LP_REDIRECT line pointers to
also deal with HOT chains.  The new version of the loop is loosely based
on a similar loop from heap_prune_chain().

The impact of this bug is probably quite limited, since the horizon code
necessarily deals with heap tuples that are pointed to by LP_DEAD-set
index tuples.  The process of setting LP_DEAD index tuples (e.g. within
the kill_prior_tuple mechanism) is highly correlated with opportunistic
pruning of pointed-to heap tuples.  Plus the question of generating a
recovery conflict usually comes up some time after index tuple LP_DEAD
bits were initially set, unlike heap pruning, where a latestRemovedXid
is generated at the point of the pruning operation (heap pruning has no
deferred "would-be page split" style processing that produces conflicts
lazily).

Only backpatch to Postgres 12, the first version where this logic runs
during original execution (following commit 558a9165e0).  The index
latestRemovedXid mechanism has had the same bug since it first appeared
over 10 years ago (in commit a760893d), but backpatching to all
supported versions now seems like a bad idea on balance.  Running the
new improved code during recovery seems risky, especially given the lack
of complaints from the field.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=Eib393+HHcERK_9MtgNS7Ew1HY=RDC_g6GL46zM5C6Q@mail.gmail.com
Backpatch: 12-
2020-12-30 16:29:05 -08:00
Bruce Momjian 3187ef7c46 Revert "Add key management system" (978f869b99) & later commits
The patch needs test cases, reorganization, and cfbot testing.
Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive)
and 08db7c63f3..ccbe34139b.

Reported-by: Tom Lane, Michael Paquier

Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
2020-12-27 21:37:42 -05:00
Bruce Momjian 978f869b99 Add key management system
This adds a key management system that stores (currently) two data
encryption keys of length 128, 192, or 256 bits.  The data keys are
AES256 encrypted using a key encryption key, and validated via GCM
cipher mode.  A command to obtain the key encryption key must be
specified at initdb time, and will be run at every database server
start.  New parameters allow a file descriptor open to the terminal to
be passed.  pg_upgrade support has also been added.

Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com
Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us

Author: Masahiko Sawada, me, Stephen Frost
2020-12-25 10:19:44 -05:00
Michael Paquier 90fbf7c57d Fix typos and grammar in docs and comments
This fixes several areas of the documentation and some comments in
matters of style, grammar, or even format.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20201222041153.GK30237@telsasoft.com
2020-12-24 17:05:49 +09:00
Fujii Masao 00f690a239 Revert "Get rid of the dedicated latch for signaling the startup process".
Revert ac22929a26, as well as the followup fix 113d3591b8. Because it broke
the assumption that the startup process waiting for the recovery conflict
on buffer pin should be waken up only by buffer unpin or the timeout enabled
in ResolveRecoveryConflictWithBufferPin(). It caused, for example,
SIGHUP signal handler or walreceiver process to wake that startup process
up unnecessarily frequently.

Additionally, add the comments about why that dedicated latch that
the reverted patch tried to get rid of should not be removed.

Thanks to Kyotaro Horiguchi for the discussion.

Author: Fujii Masao
Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
2020-12-17 18:06:51 +09:00
Peter Geoghegan 41ddc27f66 Remove obsolete btrescan() comment.
"Ordering stuff" refered to a _bt_first() call to _bt_orderkeys().
However, the _bt_orderkeys() function was renamed to
_bt_preprocess_keys() by commit fa5c8a055a.

_bt_preprocess_keys() is directly referenced just after the removed
comment already, which seems sufficient.
2020-12-15 15:55:07 -08:00
Tom Lane b3817f5f77 Improve hash_create()'s API for some added robustness.
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).

Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes.  This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)

Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize.  Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.

Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one.  This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not.  We might as well save a few
cycles by standardizing on "not".

Also improve the documentation for hash_create().

In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.

Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
2020-12-15 11:38:53 -05:00
Michael Paquier df9274adf3 Add some checkpoint/restartpoint status to ps display
This is done for end-of-recovery and shutdown checkpoints/restartpoints
(end-of-recovery restartpoints don't exist) rather than all types of
checkpoints, in cases where it may not be possible to rely on
pg_stat_activity to get a status from the startup or checkpointer
processes.

For example, at the end of a crash recovery, this is useful to know if a
checkpoint is running in the startup process, while previously the ps
display may only show some information about "recovering" something,
that can be confusing while a checkpoint runs.

Author: Justin Pryzby
Reviewed-by: Nathan Bossart, Kirk Jamison, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200818225238.GP17022@telsasoft.com
2020-12-14 11:53:58 +09:00
Michael Paquier 947789f1f5 Avoid using tuple from syscache for update of pg_database.datfrozenxid
pg_database.datfrozenxid gets updated using an in-place update at the
end of vacuum or autovacuum.  Since 96cdeae, as pg_database has a toast
relation, it is possible for a pg_database tuple to have toast values
if there is a large set of ACLs in place.  In such a case, the in-place
update would fail because of the flattening of the toast values done for
the catcache entry fetched.  Instead of using a copy from the catcache,
this changes the logic to fetch the copy of the tuple by directly
scanning pg_database.

Per the lack of complaints on the matter, no backpatch is done.  Note
that before 96cdeae, attempting to insert such a tuple to pg_database
would cause a "row is too big" error, so the end-of-vacuum problem was
not reachable.

Author: Ashwin Agrawal, Junfeng Yang
Discussion: https://postgr.es/m/DM5PR0501MB38800D9E4605BCA72DD35557CCE10@DM5PR0501MB3880.namprd05.prod.outlook.com
2020-12-08 12:13:19 +09:00
Peter Eisentraut eb93f3a0b6 Convert elog(LOG) calls to ereport() where appropriate
User-visible log messages should go through ereport(), so they are
subject to translation.  Many remaining elog(LOG) calls are really
debugging calls.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
2020-12-04 14:25:23 +01:00
Fujii Masao 3df51ca8b3 Fix CLUSTER progress reporting of number of blocks scanned.
Previously pg_stat_progress_cluster view reported the current block
number in heap scan as the number of heap blocks scanned (i.e.,
heap_blks_scanned). This reported number could be incorrect when
synchronize_seqscans is enabled, because it allowed the heap scan to
start at block in middle. This could result in wraparounds in the
heap_blks_scanned column when the heap scan wrapped around.
This commit fixes the bug by calculating the number of blocks from
the block that the heap scan starts at to the current block in scan,
and reporting that number in the heap_blks_scanned column.

Also, in pg_stat_progress_cluster view, previously heap_blks_scanned
could not reach heap_blks_total at the end of heap scan phase
if the last pages scanned were empty. This commit fixes the bug by
manually updating heap_blks_scanned to the same value as
heap_blks_total when the heap scan phase finishes.

Back-patch to v12 where pg_stat_progress_cluster view was introduced.

Reported-by: Matthias van de Meent
Author: Matthias van de Meent
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CAEze2WjCBWSGkVfYag001Rc4+-nNLDpWM7QbyD6yPvuhKs-gYQ@mail.gmail.com
2020-11-27 20:16:44 +09:00
Tom Lane 789b938bf2 Centralize logic for skipping useless ereport/elog calls.
While ereport() and elog() themselves are quite cheap when the
error message level is too low to be printed, some places need to do
substantial work before they can call those macros at all.  To allow
optimizing away such setup work when nothing is to be printed, make
elog.c export a new function message_level_is_interesting(elevel)
that reports whether ereport/elog will do anything.  Make use of that
in various places that had ad-hoc direct tests of log_min_messages etc.
Also teach ProcSleep to use it to avoid some work.  (There may well
be other places that could usefully use this; I didn't search hard.)

Within elog.c, refactor a little bit to avoid having duplicate copies
of the policy-setting logic.  When that code was written, we weren't
relying on the availability of inline functions; so it had some
duplications in the name of efficiency, which I got rid of.

Alvaro Herrera and Tom Lane

Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
2020-11-23 19:10:46 -05:00
Tom Lane 0cc9932788 Rename the "point is strictly above/below point" comparison operators.
Historically these were called >^ and <^, but that is inconsistent
with the similar box, polygon, and circle operators, which are named
|>> and <<| respectively.  Worse, the >^ and <^ names are used for
*not* strict above/below tests for the box type.

Hence, invent new operators following the more common naming.  The
old operators remain available for now, and are still accepted by
the relevant index opclasses too.  But there's a deprecation notice,
so maybe we can get rid of them someday.

Emre Hasegeli, reviewed by Pavel Borisov

Discussion: https://postgr.es/m/24348.1587444160@sss.pgh.pa.us
2020-11-23 11:38:37 -05:00
Peter Eisentraut b5acf10cfc Replace a macro by a function
Using a macro is ugly and not justified here.

Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
2020-11-20 11:25:25 +01:00
Fujii Masao a4ef0329c2 Emit log when restore_command succeeds but archived file faills to be restored.
Previously, when restore_command claimed to succeed but failed to restore
the file with the right name, for example, due to mis-configuration of
restore_command, no log message was reported. Then the recovery failed
later with an error message not directly related to the issue.

This commit changes the recovery so that a log message is emitted in
this error case. This would enable us to investigate what happened in
this case more easily.

Author: Jeff Janes, Fujii Masao
Reviewed-by: Pavel Borisov, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAMkU=1xkFs3Omp4JR4wMYWdam_KLuj6LXnTYfU8u3T0h=PLLMQ@mail.gmail.com
2020-11-20 15:42:47 +09:00
Peter Geoghegan cf2acaf4dc Deprecate nbtree's BTP_HAS_GARBAGE flag.
Streamline handling of the various strategies that we have to avoid a
page split in nbtinsert.c.  When it looks like a leaf page is about to
overflow, we now perform deleting LP_DEAD items and deduplication in one
central place.  This greatly simplifies _bt_findinsertloc().

This has an independently useful consequence: nbtree no longer relies on
the BTP_HAS_GARBAGE page level flag/hint for anything important.  We
still set and unset the flag in the same way as before, but it's no
longer treated as a gating condition when considering if we should check
for already-set LP_DEAD bits.  This happens at the point where the page
looks like it might have to be split anyway, so simply checking the
LP_DEAD bits in passing is practically free.  This avoids missing
LP_DEAD bits just because the page-level hint is unset, which is
probably reasonably common (e.g. it happens when VACUUM unsets the
page-level flag without actually removing index tuples whose LP_DEAD-bit
was set recently, after the VACUUM operation began but before it reached
the leaf page in question).

Note that this isn't a big behavioral change compared to PostgreSQL 13.
We were already checking for set LP_DEAD bits regardless of whether the
BTP_HAS_GARBAGE page level flag was set before we considered doing a
deduplication pass.  This commit only goes slightly further by doing the
same check for all indexes, even indexes where deduplication won't be
performed.

We don't completely remove the BTP_HAS_GARBAGE flag.  We still rely on
it as a gating condition with pg_upgrade'd indexes from before B-tree
version 4/PostgreSQL 12.  That makes sense because we sometimes have to
make a choice among pages full of duplicates when inserting a tuple with
pre version 4 indexes.  It probably still pays to avoid accessing the
line pointer array of a page there, since it won't yet be clear whether
we'll insert on to the page in question at all, let alone split it as a
result.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
2020-11-17 09:45:56 -08:00
Peter Geoghegan a034f8b60c nbtree: Rename nbtinsert.c variables for consistency.
Stop naming special area/opaque pointer variables 'lpageop' in contexts
where it doesn't make sense.  This is a holdover from a time when logic
that performs tasks that are now spread across _bt_insertonpg(),
_bt_findinsertloc(), and _bt_split() was more centralized.  'lpageop'
denotes "left page", which doesn't make sense outside of contexts in
which there isn't also a right page.

Also acquire page flag variables up front within _bt_insertonpg().  This
makes it closer to _bt_split() following refactoring commit bc3087b626.
This allows the page split and retail insert paths to both make use of
the same variables.
2020-11-17 09:01:14 -08:00
Alvaro Herrera cd9c1b3e19
Rename PGPROC->vacuumFlags to statusFlags
With more flags associated to a PGPROC entry that are not related to
vacuum (currently existing or planned), the name "statusFlags" describes
its purpose better.

(The same is done to the mirroring PROC_HDR->vacuumFlags.)

No functional changes in this commit.

This was suggested first by Hari Babu Kommi in [1] and then by Michael
Paquier at [2].

[1] https://postgr.es/m/CAJrrPGcsDC-oy1AhqH0JkXYa0Z2AgbuXzHPpByLoBGMxfOZMEQ@mail.gmail.com
[2] https://postgr.es/m/20200820060929.GB3730@paquier.xyz

Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20201116182446.qcg3o6szo2zookyr@localhost
2020-11-16 19:42:55 -03:00
Peter Eisentraut d93ccdea1d Remove unused and deprecated strategy numbers from BRIN code
These were dead code.

Discussion: https://www.postgresql.org/message-id/flat/20201027032511.GF9241@telsasoft.com
2020-11-16 17:25:41 +01:00
Peter Eisentraut 591d282e8d Remove code handling removed deprecated containment operators
This removes the code that was there for handling the operators
removed by 2f70fdb064.

Discussion: https://www.postgresql.org/message-id/flat/20201027032511.GF9241@telsasoft.com
2020-11-16 11:51:12 +01:00
Fujii Masao 2945a488a3 Make the standby server promptly handle interrupt signals.
This commit changes the startup process in the standby server so that
it handles the interrupt signals after waiting for wal_retrieve_retry_interval
on the latch and resetting it, before entering another wait on the latch.
This change causes the standby server to promptly handle interrupt signals.

Otherwise, previously, there was the case where the standby needs to
wait extra five seconds to shutdown when the shutdown request arrived
while the startup process was waiting for wal_retrieve_retry_interval
on the latch.

Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty
Reviewed-by: Soumyadeep Chakraborty
Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
2020-11-16 18:27:51 +09:00
Peter Geoghegan 46cf3c72c3 nbtree: Demote incomplete split "can't happen" error.
Only a basic logic bug in a _bt_insertonpg() caller could lead to a
violation of this invariant (index corruption won't do it).  A "can't
happen" error seems inappropriate (it is arbitrary at best).

Demote the error to a simple assertion.  This matches similar nearby
sanity checks.
2020-11-15 11:53:37 -08:00
Tomas Vondra 42c63ab6e2 Remove duplicate code in brin_memtuple_initialize
Commit 8bf74967da moved some of the code from brin_new_memtuple to
brin_memtuple_initialize, but this resulted in some of the code being
duplicate. Fix by removing the duplicate lines and backpatch to 10.

Author: Tomas Vondra
Backpatch-through: 10
Discussion: https://postgr.es/m/5eb50c97-9a8e-b691-8c40-1b2a55611c4c%40enterprisedb.com
2020-11-11 18:37:59 +01:00
Tom Lane ec29427ce2 Fix and simplify some usages of TimestampDifference().
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format.  This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.

Two of these call sites were in fact wrong:

* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.

* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended.  Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units.  This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.

There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds.  It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.

Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.

Alexey Kondratov and Tom Lane

Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
2020-11-10 22:51:54 -05:00
Peter Geoghegan 180cf876d4 Remove ineffective heapam CHECK_FOR_INTERRUPTS().
Remove a CHECK_FOR_INTERRUPTS() call that could never actually handle an
interrupt.  We always have a heap page buffer lock at this point.
Having a useless CHECK_FOR_INTERRUPTS() call is harmless but misleading.

It is probably possible to work around the immediate problem by moving
the CHECK_FOR_INTERRUPTS() to before the heap page buffer lock is
acquired.  That isn't enough to make the function responsive to
interrupts, though.  The index AM caller will still hold an exclusive
buffer lock of its own.
2020-11-09 09:00:12 -08:00
Noah Misch 0c3185e963 In security-restricted operations, block enqueue of at-commit user code.
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred
triggers within index expressions and materialized view queries.  An
attacker having permission to create non-temp objects in at least one
schema could execute arbitrary SQL functions under the identity of the
bootstrap superuser.  One can work around the vulnerability by disabling
autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE
INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW.  (Don't restore from
pg_dump, since it runs some of those commands.)  Plain VACUUM (without
FULL) is safe, and all commands are fine when a trusted user owns the
target object.  Performance may degrade quickly under this workaround,
however.  Back-patch to 9.5 (all supported versions).

Reviewed by Robert Haas.  Reported by Etienne Stalmans.

Security: CVE-2020-25695
2020-11-09 07:32:09 -08:00
Peter Geoghegan 5a2f154a2e Improve nbtree README's LP_DEAD section.
The description of how LP_DEAD bit setting by index scans works
following commit 2ed5b87f was rather unclear.  Clean that up a bit.

Also refer to LP_DEAD bit setting within _bt_check_unique() at the start
of the same section.  This mechanism may actually be more important than
the generic kill_prior_tuple mechanism that the section focuses on, so
it at least deserves to be mentioned in passing.
2020-11-07 18:51:12 -08:00
Alvaro Herrera 52eec1c53a
Message style improvements
* Avoid pointlessly highlighting that an index vacuum was executed by a
  parallel worker; user doesn't care.

* Don't give the impression that a non-concurrent reindex of an invalid
  index on a TOAST table would work, because it wouldn't.

* Add a "translator:" comment for a mysterious message.

Discussion: https://postgr.es/m/20201107034943.GA16596@alvherre.pgsql
Reviewed-by: Michael Paquier <michael@paquier.xyz>
2020-11-07 19:33:43 -03:00
Tomas Vondra 7577dd8480 Properly detoast data in brin_form_tuple
brin_form_tuple failed to consider the values may be toasted, inserting
the toast pointer into the index. This may easily result in index
corruption, as the toast data may be deleted and cleaned up by vacuum.
The cleanup however does not care about indexes, leaving invalid toast
pointers behind, which triggers errors like this:

  ERROR:  missing chunk number 0 for toast value 16433 in pg_toast_16426

A less severe consequence are inconsistent failures due to the index row
being too large, depending on whether brin_form_tuple operated on plain
or toasted version of the row. For example

    CREATE TABLE t (val TEXT);
    INSERT INTO t VALUES ('... long value ...')
    CREATE INDEX idx ON t USING brin (val);

would likely succeed, as the row would likely include toast pointer.
Switching the order of INSERT and CREATE INDEX would likely fail:

    ERROR:  index row size 8712 exceeds maximum 8152 for index "idx"

because this happens before the row values are toasted.

The bug exists since PostgreSQL 9.5 where BRIN indexes were introduced.
So backpatch all the way back.

Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Backpatch-through: 9.5
Discussion: https://postgr.es/m/20201001184133.oq5uq75sb45pu3aw@development
Discussion: https://postgr.es/m/20201104010544.zexj52mlldagzowv%40development
2020-11-07 00:39:19 +01:00
Peter Geoghegan efc5dcfd8a Fix wal_consistency_checking nbtree bug.
wal_consistency_checking indicated an inconsistency in certain cases
involving nbtree page deletion.  The underlying issue is that there was
a minor difference between the page image produced after a REDO routine
ran and the corresponding page image following original execution.

This harmless inconsistency has been around forever.  We more or less
expect total consistency among even deleted nbtree pages these days,
though, so this won't do anymore.

To fix, tweak the REDO routine to match original execution.

Oversight in commit f47b5e13.
2020-11-05 15:01:40 -08:00
Peter Geoghegan 48e1291342 Fix nbtree cleanup-only VACUUM stats inaccuracies.
Logic for counting heap TIDs from posting list tuples (added by commit
0d861bbb) was faulty.  It didn't count any TIDs/index tuples in the
event of no callback being set.  This meant that we incorrectly counted
no index tuples in clean-up only VACUUMs, which could lead to
pg_class.reltuples being spuriously set to 0 in affected indexes.

To fix, go back to counting items from the page in cases where there is
no callback.  This approach isn't very accurate, but it works well
enough in practice while avoiding the expense of accessing every index
tuple during cleanup-only VACUUMs.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
https://postgr.es/m/20201023174451.69e358f1@firost
Backpatch: 13-, where nbtree deduplication was introduced
2020-11-04 18:42:27 -08:00
Thomas Munro c732c3f8c1 Fix unlinking of SLRU segments.
Commit dee663f7 intended to drop any queued up fsync requests before
unlinking segment files, but missed a code path.  Fix, by centralizing
the forget-and-unlink code into a single function.

Reported-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
2020-11-05 13:49:49 +13:00
Fujii Masao 113d3591b8 Fix segmentation fault that commit ac22929a26 caused.
Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to
NULL at the end of recovery. This change could cause a segmentation fault
in the buildfarm member 'elver'.

Previously the latch was reset to NULL after calling ShutdownWalRcv().
But there could be a window between ShutdownWalRcv() and the actual
exit of walreceiver. If walreceiver set the latch during that window,
the segmentation fault could happen.

To fix the issue, this commit changes walreceiver so that it sets
the latch only when the latch has not been reset to NULL yet.

Author: Fujii Masao
Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
2020-11-04 21:49:00 +09:00
Fujii Masao ac22929a26 Get rid of the dedicated latch for signaling the startup process.
This commit gets rid of the dedicated latch for signaling the startup
process in favor of using its procLatch,  since that comports better
with possible generic signal handlers using that latch.

Commit 1e53fe0e70 changed background processes so that they use standard
SIGHUP handler. Like that, this commit also makes the startup process use
standard SIGHUP handler to simplify the code.

Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
2020-11-04 16:43:43 +09:00
Peter Eisentraut dd26a0ad76 Use PG_GETARG_TRANSACTIONID where appropriate
Some places were using PG_GETARG_UINT32 where PG_GETARG_TRANSACTIONID
would be more appropriate.  (Of course, they are the same internally,
so there is no externally visible effect.)  To do that, export
PG_GETARG_TRANSACTIONID outside of xid.c.  We also export
PG_RETURN_TRANSACTIONID for symmetry, even though there are currently
no external users.

Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/d8f6bdd536df403b9b33816e9f7e0b9d@G08CNEXMBPEKD05.g08.fujitsu.local
2020-11-02 16:48:22 +01:00
Michael Paquier 8a15e735be Fix some grammar and typos in comments and docs
The documentation fixes are backpatched down to where they apply.

Author: Justin Pryzby
Discussion: https://postgr.es/m/20201031020801.GD3080@telsasoft.com
Backpatch-through: 9.6
2020-11-02 15:14:41 +09:00
Noah Misch f90e80b913 Reproduce debug_query_string==NULL on parallel workers.
Certain background workers initiate parallel queries while
debug_query_string==NULL, at which point they attempted strlen(NULL) and
died to SIGSEGV.  Older debug_query_string observers allow NULL, so do
likewise in these newer ones.  Back-patch to v11, where commit
7de4a1bcc5 introduced the first of these.

Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com
2020-10-31 08:43:28 -07:00
Heikki Linnakangas 6f0bc5e1da Fix missing validation for the new GiST sortsupport functions.
Because of this, if you tried to create an operator family with the new
sortsupport function, you got an error:

ERROR:  support function number 11 is invalid for access method gist

We missed this in commit 16fa9b2b30 that added the sortsupport function,
because it only added sortsupport to a built-in operator family.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/3520A18A-5C38-4697-A2E3-F3BDE3496CD5%40yandex-team.ru
2020-10-30 19:30:19 +02:00
Andres Freund 94bc27b576 Centralize horizon determination for temp tables, fixing bug due to skew.
This fixes a bug in the edge case where, for a temp table, heap_page_prune()
can end up with a different horizon than heap_vacuum_rel(). Which can trigger
errors like "ERROR: cannot freeze committed xmax ...".

The bug was introduced due to interaction of a7212be8b9 "Set cutoff xmin more
aggressively when vacuuming a temporary table." with dc7420c2c9 "snapshot
scalability: Don't compute global horizons while building snapshots.".

The problem is caused by lazy_scan_heap() assuming that the only reason its
HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is
a HOT tuple, or if the tuple's inserting transaction has aborted since the
heap_page_prune() call. But after a7212be8b9 that was also possible in other
cases for temp tables, because heap_page_prune() uses a different visibility
test after dc7420c2c9.

The fix is fairly simple: Move the special case logic for temp tables from
vacuum_set_xid_limits() to the infrastructure introduced in dc7420c2c9. That
ensures that the horizon used for pruning is at least as aggressive as the one
used by lazy_scan_heap(). The concrete horizon used for temp tables is
slightly different than the logic in dc7420c2c9, but should always be as
aggressive as before (see comments).

A significant benefit to centralizing the logic procarray.c is that now the
more aggressive horizons for temp tables does not just apply to VACUUM but
also to e.g. HOT pruning and the nbtree killtuples logic.

Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I
undid the the related changes from a7212be8b9.

This commit also adds an isolation test ensuring that the more aggressive
vacuuming and pruning of temp tables keeps working.

Debugged-By: Amit Kapila <amit.kapila16@gmail.com>
Debugged-By: Tom Lane <tgl@sss.pgh.pa.us>
Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de
2020-10-28 18:02:31 -07:00
Robert Haas 866e24d47d Extend amcheck to check heap pages.
Mark Dilger, reviewed by Peter Geoghegan, Andres Freund, Álvaro Herrera,
Michael Paquier, Amul Sul, and by me. Some last-minute cosmetic
revisions by me.

Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
2020-10-22 08:44:18 -04:00
Alvaro Herrera 93f84d59f8 Revert "Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls"
This reverts commit 85adb5e91e.  It was not intended for commit just
yet.
2020-10-15 15:16:11 -03:00
Alvaro Herrera 85adb5e91e Remove pointless HeapTupleHeaderIndicatesMovedPartitions calls
Pavan Deolasee recently noted that a few of the
HeapTupleHeaderIndicatesMovedPartitions calls added by commit
5db6df0c01 are useless, since they are done after comparing t_self
with t_ctid.  But because t_self can never be set to the magical values
that indicate that the tuple moved partition, this can never succeed: if
the first test fails (so we know t_self equals t_ctid), necessarily the
second test will also fail.

So these checks can be removed and no harm is done.

Discussion: https://postgr.es/m/20200929164411.GA15497@alvherre.pgsql
2020-10-15 14:32:34 -03:00
David Rowley 110d81728a Fixup some appendStringInfo and appendPQExpBuffer calls
A number of places were using appendStringInfo() when they could have been
using appendStringInfoString() instead.  While there's no functionality
change there, it's just more efficient to use appendStringInfoString()
when no formatting is required.  Likewise for some
appendStringInfoString() calls which were just appending a single char.
We can just use appendStringInfoChar() for that.

Additionally, many places were using appendPQExpBuffer() when they could
have used appendPQExpBufferStr(). Change those too.

Patch by Zhijie Hou, but further searching by me found significantly more
places that deserved the same treatment.

Author: Zhijie Hou, David Rowley
Discussion: https://postgr.es/m/cb172cf4361e4c7ba7167429070979d4@G08CNEXMBPEKD05.g08.fujitsu.local
2020-10-15 20:35:17 +13:00
Tom Lane 371668a838 Fix GiST buffering build to work when there are included columns.
gistRelocateBuildBuffersOnSplit did not get the memo about which
attribute count to use.  This could lead to a crash if there were
included columns and buffering build was chosen.  (Because there
are random page-split decisions elsewhere in GiST index build,
the crashes are not entirely deterministic.)

Back-patch to v12 where GiST gained support for included columns.

Pavel Borisov

Discussion: https://postgr.es/m/CALT9ZEECCV5m7wvxg46PC-7x-EybUmnpupBGhSFMoAAay+r6HQ@mail.gmail.com
2020-10-12 18:01:34 -04:00
Tom Lane 78c0b6ed27 Re-allow testing of GiST buffered builds.
Commit 16fa9b2b3 broke the ability to reliably test GiST buffered builds,
because it caused sorted builds to be done instead if sortsupport is
available, regardless of any attempt to override that.  While a would-be
test case could try to work around that by choosing an opclass that has
no sortsupport function, coverage would be silently lost the moment
someone decides it'd be a good idea to add a sortsupport function.

Hence, rearrange the logic in gistbuild() so that if "buffering = on"
is specified in CREATE INDEX, we will use that method, sortsupport or no.

Also document the interaction between sorting and the buffering
parameter, as 16fa9b2b3 failed to do.

(Note that in fact we still lack any test coverage of buffered builds,
but this is a prerequisite to adding a non-fragile test.)

Discussion: https://postgr.es/m/3249980.1602532990@sss.pgh.pa.us
2020-10-12 17:09:50 -04:00
Michael Paquier b90b79e140 Fix typo in multixact.c
AtEOXact_MultiXact() was referenced in two places with an incorrect
routine name.

Author: Hou Zhijie
Discussion: https://postgr.es/m/1b41e9311e8f474cb5a360292f0b3cb1@G08CNEXMBPEKD05.g08.fujitsu.local
2020-10-08 14:06:12 +09:00
Michael Paquier 0a3c864c32 Fix compilation warning in xlog.c
Oversight in 9d0bd95.

Reported-by: Andres Freund
Discussion: https://postgr.es/m/20201006023802.qqfi6m5bw5y77zql@alap3.anarazel.de
2020-10-06 15:29:34 +09:00
Fujii Masao 8d9a935965 Add pg_stat_wal statistics view.
This view shows the statistics about WAL activity. Currently it has only
two columns: wal_buffers_full and stats_reset. wal_buffers_full column
indicates the number of times WAL data was written to the disk because
WAL buffers got full. This information is useful when tuning wal_buffers.
stats_reset column indicates the time at which these statistics were
last reset.

pg_stat_wal view is also the basic infrastructure to expose other
various statistics about WAL activity later.

Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format.

Bump catalog version.

Author: Masahiro Ikeda
Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao
Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
2020-10-02 10:17:11 +09:00
Michael Paquier 9d0bd95fa9 Add block information in error context of WAL REDO apply loop
Providing this information can be useful for example when diagnosing
problems related to recovery conflicts or for recovery issues without
having to go through the output generated by pg_waldump to get some
information about the blocks a WAL record works on.

The block information is printed in the same format as pg_waldump.  This
already existed in xlog.c for debugging purposes with -DWAL_DEBUG, so
adding the block information in the callback has required just a small
refactoring.

Author: Bertrand Drouvot
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/c31e2cba-efda-762c-f4ad-5c25e5dac3d0@amazon.com
2020-10-02 09:31:50 +09:00
Heikki Linnakangas 265ea56785 Set right-links during sorted GiST index build.
This is not strictly necessary, as the right-links are only needed by
scans that are concurrent with page splits, and neither scans or page
splits can happen during sorted index build. But it seems like a good
idea to set them anyway, if we e.g. want to add a check to amcheck in
the future to verify that the chain of right-links is complete.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/4D68C21F-9FB9-41DA-B663-FDFC8D143788%40yandex-team.ru
2020-10-01 11:10:43 +03:00
Thomas Munro dee663f784 Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Rearrange things so that data collected in CheckpointStats
includes SLRU activity.

Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.  (I'm not sure if they were ever needed, but
they aren't now.)

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
2020-09-25 19:00:15 +12:00
Peter Eisentraut c005eb00e7 Standardize the printf format for st_size
Existing code used various inconsistent ways to printf struct stat's
st_size member.  The type of that is off_t, which is in most cases a
signed 64-bit integer, so use the long long int format for it.
2020-09-24 21:04:21 +02:00
Thomas Munro aca74843e4 Fix missing fsync of SLRU directories.
Harmonize behavior by moving reponsibility for fsyncing directories down
into slru.c.  In 10 and later, only the multixact directories were
missed (see commit 1b02be21), and in older branches all SLRUs were
missed.

Back-patch to all supported releases.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
2020-09-24 10:39:52 +12:00
Tom Lane 9436041ed8 Copy editing: fix a bunch of misspellings and poor wording.
99% of this is docs, but also a couple of comments.  No code changes.

Justin Pryzby

Discussion: https://postgr.es/m/20200919175804.GE30557@telsasoft.com
2020-09-21 12:43:42 -04:00