Commit Graph

4087 Commits

Author SHA1 Message Date
Fujii Masao 08aa89b326 Remove COMMIT_TS_SETTS record.
Commit 438fc4a39c prevented the WAL replay from writing
COMMIT_TS_SETTS record. By this change there is no code that
generates COMMIT_TS_SETTS record in PostgreSQL core.
Also we can think that there are no extensions using the record
because we've not received so far any complaints about the issue
that commit 438fc4a39c fixed. Therefore this commit removes
COMMIT_TS_SETTS record and its related code. Even without
this record, the timestamp required for commit timestamp feature
can be acquired from the COMMIT record.

Bump WAL page magic.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
2021-04-12 00:04:30 +09:00
Thomas Munro dc88460c24 Doc: Review for "Optionally prefetch referenced data in recovery."
Typos, corrections and language improvements in the docs, and a few in
code comments too.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
2021-04-10 08:21:53 +12:00
Peter Geoghegan 796092fb84 Silence another _bt_check_unique compiler warning.
Per complaint from Tom Lane

Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us
2021-04-08 12:54:31 -07:00
Thomas Munro 1d257577e0 Optionally prefetch referenced data in recovery.
Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro f003d9f872 Add circular WAL decoding buffer.
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros.

 * An alternative new interface XLogNextRecord() is added that returns
   pointers to DecodedXLogRecord structs that can be examined directly.

 * XLogReadAhead() provides a second cursor that lets you see
   further ahead, as long as data is available and there is enough space
   in the decoding buffer.  This returns DecodedXLogRecord pointers to the
   caller, but also adds them to a queue of records that will later be
   consumed by XLogNextRecord()/XLogReadRecord().

The buffer's size is controlled with wal_decode_buffer_size.  The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro 323cbe7c7d Remove read_page callback from XLogReader.
Previously, the XLogReader module would fetch new input data using a
callback function.  Redesign the interface so that it tells the caller
to insert more data with a special return value instead.  This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.

As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
2021-04-08 23:20:42 +12:00
Alvaro Herrera 0827e8af70
autovacuum: handle analyze for partitioned tables
Previously, autovacuum would completely ignore partitioned tables, which
is not good regarding analyze -- failing to analyze those tables means
poor plans may be chosen.  Make autovacuum aware of those tables by
propagating "changes since analyze" counts from the leaf partitions up
the partitioning hierarchy.

This also introduces necessary reloptions support for partitioned tables
(autovacuum_enabled, autovacuum_analyze_scale_factor,
autovacuum_analyze_threshold).  It's unclear how best to document this
aspect.

Author: Yuzuko Hosoya <yuzukohosoya@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
2021-04-08 01:19:36 -04:00
Peter Geoghegan 5100010ee4 Teach VACUUM to bypass unnecessary index vacuuming.
VACUUM has never needed to call ambulkdelete() for each index in cases
where there are precisely zero TIDs in its dead_tuples array by the end
of its first pass over the heap (also its only pass over the heap in
this scenario).  Index vacuuming is simply not required when this
happens.  Index cleanup will still go ahead, but in practice most calls
to amvacuumcleanup() are usually no-ops when there were zero preceding
ambulkdelete() calls.  In short, VACUUM has generally managed to avoid
index scans when there were clearly no index tuples to delete from
indexes.  But cases with _close to_ no index tuples to delete were
another matter -- a round of ambulkdelete() calls took place (one per
index), each of which performed a full index scan.

VACUUM now behaves just as if there were zero index tuples to delete in
cases where there are in fact "virtually zero" such tuples.  That is, it
can now bypass index vacuuming and heap vacuuming as an optimization
(though not index cleanup).  Whether or not VACUUM bypasses indexes is
determined dynamically, based on the just-observed number of heap pages
in the table that have one or more LP_DEAD items (LP_DEAD items in heap
pages have a 1:1 correspondence with index tuples that still need to be
deleted from each index in the worst case).

We only skip index vacuuming when 2% or less of the table's pages have
one or more LP_DEAD items -- bypassing index vacuuming as an
optimization must not noticeably impede setting bits in the visibility
map.  As a further condition, the dead_tuples array (i.e. VACUUM's array
of LP_DEAD item TIDs) must not exceed 32MB at the point that the first
pass over the heap finishes, which is also when the decision to bypass
is made.  (The VACUUM must also have been able to fit all TIDs in its
maintenance_work_mem-bound dead_tuples space, though with a default
maintenance_work_mem setting it can't matter.)

This avoids surprising jumps in the duration and overhead of routine
vacuuming with workloads where successive VACUUM operations consistently
have almost zero dead index tuples.  The number of LP_DEAD items may
well accumulate over multiple VACUUM operations, before finally the
threshold is crossed and VACUUM performs conventional index vacuuming.
Even then, the optimization will have avoided a great deal of largely
unnecessary index vacuuming.

In the future we may teach VACUUM to skip index vacuuming on a per-index
basis, using a much more sophisticated approach.  For now we only
consider the extreme cases, where we can be quite confident that index
vacuuming just isn't worth it using simple heuristics.

Also log information about how many heap pages have one or more LP_DEAD
items when autovacuum logging is enabled.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
2021-04-07 16:14:54 -07:00
Peter Geoghegan 1e55e7d175 Add wraparound failsafe to VACUUM.
Add a failsafe mechanism that is triggered by VACUUM when it notices
that the table's relfrozenxid and/or relminmxid are dangerously far in
the past.  VACUUM checks the age of the table dynamically, at regular
intervals.

When the failsafe triggers, VACUUM takes extraordinary measures to
finish as quickly as possible so that relfrozenxid and/or relminmxid can
be advanced.  VACUUM will stop applying any cost-based delay that may be
in effect.  VACUUM will also bypass any further index vacuuming and heap
vacuuming -- it only completes whatever remaining pruning and freezing
is required.  Bypassing index/heap vacuuming is enabled by commit
8523492d, which made it possible to dynamically trigger the mechanism
already used within VACUUM when it is run with INDEX_CLEANUP off.

It is expected that the failsafe will almost always trigger within an
autovacuum to prevent wraparound, long after the autovacuum began.
However, the failsafe mechanism can trigger in any VACUUM operation.
Even in a non-aggressive VACUUM, where we're likely to not advance
relfrozenxid, it still seems like a good idea to finish off remaining
pruning and freezing.   An aggressive/anti-wraparound VACUUM will be
launched immediately afterwards.  Note that the anti-wraparound VACUUM
that follows will itself trigger the failsafe, usually before it even
begins its first (and only) pass over the heap.

The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and
vacuum_multixact_failsafe_age.  There are no equivalent reloptions,
since that isn't expected to be useful.  The GUCs have rather high
defaults (both default to 1.6 billion), and are expected to generally
only be used to make the failsafe trigger sooner/more frequently.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
2021-04-07 12:37:45 -07:00
Peter Geoghegan 3c3b8a4b26 Truncate line pointer array during VACUUM.
Teach VACUUM to truncate the line pointer array of each heap page when a
contiguous group of LP_UNUSED line pointers appear at the end of the
array -- these unused and unreferenced items are excluded.  This process
occurs during VACUUM's second pass over the heap, right after LP_DEAD
line pointers on the page (those encountered/pruned during the first
pass) are marked LP_UNUSED.

Truncation avoids line pointer bloat with certain workloads,
particularly those involving continual range DELETEs and bulk INSERTs
against the same table.

Also harden heapam code to check for an out-of-range page offset number
in places where we weren't already doing so.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
2021-04-07 08:47:15 -07:00
Tomas Vondra 23607a8156 Don't add non-existent pages to bitmap from BRIN
The code in bringetbitmap() simply added the whole matching page range
to the TID bitmap, as determined by pages_per_range, even if some of the
pages were beyond the end of the heap. The query then might fail with
an error like this:

  ERROR:  could not open file "base/20176/20228.2" (target block
          262144): previous segment is only 131021 blocks

In this case, the relation has 262093 pages (131072 and 131021 pages),
but we're trying to acess block 262144, i.e. first block of the 3rd
segment. At that point _mdfd_getseg() notices the preceding segment is
incomplete, and fails.

Hitting this in practice is rather unlikely, because:

* Most indexes use power-of-two ranges, so segments and page ranges
  align perfectly (segment end is also a page range end).

* The table size has to be just right, with the last segment being
  almost full - less than one page range from full segment, so that the
  last page range actually crosses the segment boundary.

* Prefetch has to be enabled. The regular page access checks that
  pages are not beyond heap end, but prefetch does not. On older
  releases (before 12) the execution stops after hitting the first
  non-existent page, so the prefetch distance has to be sufficient
  to reach the first page in the next segment to trigger the issue.
  Since 12 it's enough to just have prefetch enabled, the prefetch
  distance does not matter.

Fixed by not adding non-existent pages to the TID bitmap. Backpatch
all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that
release is EOL).

Backpatch-through: 9.6
2021-04-07 15:58:36 +02:00
Heikki Linnakangas d92b1cdbab Revert "Add sortsupport for gist_btree opclasses, for faster index builds."
This reverts commit 9f984ba6d2.

It was making the buildfarm unhappy, apparently setting client_min_messages
in a regression test produces different output if log_statement='all'.
Another issue is that I now suspect the bit sortsupport function was in
fact not correct to call byteacmp(). Revert to investigate both of those
issues.
2021-04-07 14:33:21 +03:00
Heikki Linnakangas 9f984ba6d2 Add sortsupport for gist_btree opclasses, for faster index builds.
Commit 16fa9b2b30 introduced a faster way to build GiST indexes, by
sorting all the data. This commit adds the sortsupport functions needed
to make use of that feature for btree_gist.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru
2021-04-07 13:22:05 +03:00
Michael Paquier 4c0239cb7a Remove redundant memset(0) calls for page init of some index AMs
Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the
contents of a page, and this routine fills entirely a page with zeros
for a size of BLCKSZ, including the special space.  Those index AMs have
been using an extra memset() call to fill with zeros the special page
space, or even the whole page, which is not necessary as PageInit()
already does this work, so let's remove them.  GiST was not doing this
extra call, but has commented out a system call that did so since
6236991.

While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care
of that.  This makes the whole page initialization logic more consistent
across all index AMs.

Author: Bharath Rupireddy
Reviewed-by: Vignesh C, Mahendra Singh Thalor
Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com
2021-04-07 14:35:26 +09:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Peter Geoghegan 7ab96cf6b3 Refactor lazy_scan_heap() loop.
Add a lazy_scan_heap() subsidiary function that handles heap pruning and
tuple freezing: lazy_scan_prune().  This is a great deal cleaner.  The
code that remains in lazy_scan_heap()'s per-block loop can now be
thought of as code that either comes before or after the call to
lazy_scan_prune(), which is now the clear focal point.  This division is
enforced by the way in which we now manage state.  lazy_scan_prune()
outputs state (using its own struct) that describes what to do with the
page following pruning and freezing (e.g., visibility map maintenance,
recording free space in the FSM).  It doesn't get passed any special
instructional state from the preamble code, though.

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off
from the logic used by single-heap-pass VACUUMs.  The former case is now
structured as the omission of index and heap vacuuming by a two pass
VACUUM.  The latter case goes back to being used only when the table
happens to have no indexes (just as it was before commit a96c41fe).
This structure is much more natural, since the whole point of
INDEX_CLEANUP=off is to skip the index and heap vacuuming that would
otherwise take place.  The single-heap-pass case doesn't skip any useful
work, though -- it just does heap pruning and heap vacuuming together
when the table happens to have no indexes.

Both of these changes are preparation for an upcoming patch that
generalizes the mechanism used by INDEX_CLEANUP=off.  The later patch
will allow VACUUM to give up on index and heap vacuuming dynamically, as
problems emerge (e.g., with wraparound), so that an affected VACUUM
operation can finish up as soon as possible.

Also fix a very old bug in single-pass VACUUM VERBOSE output.  We were
reporting the number of tuples deleted via pruning as a direct
substitute for reporting the number of LP_DEAD items removed in a
function that deals with the second pass over the heap.  But that
doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered
during pruning, and use that in the report instead.  A single pass
VACUUM will always vacuum away whatever LP_DEAD items a heap page has
immediately after it is pruned, so the total number of LP_DEAD items
encountered during pruning equals the total number vacuumed-away.
(They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay
because skipping index vacuuming is now a totally orthogonal concept to
one-pass VACUUM.)

Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE
output.  This makes the output of VACUUM VERBOSE more consistent with
log_autovacuum's output (because it never showed information about
LP_UNUSED items).  VACUUM VERBOSE reported LP_UNUSED items left behind
by the last VACUUM, and LP_UNUSED items created via pruning HOT chains
during the current VACUUM (it never included LP_UNUSED items left behind
by the current VACUUM's second pass over the heap).  This makes it
useless as an indicator of line pointer bloat, which must have been the
original intention. (Like the first VACUUM VERBOSE issue, this issue was
arguably an oversight in commit 282d2a03, which added the heap-only
tuple optimization.)

Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start
reporting pages_removed instead.  This also makes the output of VACUUM
VERBOSE more consistent with log_autovacuum's output (which does not
show empty_pages, but does show pages_removed).  An empty page isn't
meaningfully different to a page that is almost empty, or a page that is
empty but for only a small number of remaining LP_UNUSED items.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 07:49:39 -07:00
Tom Lane 091e22b2e6 Clean up treatment of missing default and CHECK-constraint records.
Andrew Gierth reported that it's possible to crash the backend if no
pg_attrdef record is found to match an attribute that has atthasdef set.
AttrDefaultFetch warns about this situation, but then leaves behind
a relation tupdesc that has null "adbin" pointer(s), which most places
don't guard against.

We considered promoting the warning to an error, but throwing errors
during relcache load is pretty drastic: it effectively locks one out
of using the relation at all.  What seems better is to leave the
load-time behavior as a warning, but then throw an error in any code
path that wants to use a default and can't find it.  This confines
the error to a subset of INSERT/UPDATE operations on the table, and
in particular will at least allow a pg_dump to succeed.

Also, we should fix AttrDefaultFetch to not leave any null pointers
in the tupdesc, because that just creates an untested bug hazard.

While at it, apply the same philosophy of "warn at load, throw error
only upon use of the known-missing info" to CHECK constraints.
CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch,
but for reasons lost in the mists of time, it was throwing ERROR for
the same cases that AttrDefaultFetch treats as WARNING.  Make the two
functions more nearly alike.

In passing, get rid of potentially-O(N^2) loops in equalTupleDesc
by making AttrDefaultFetch sort the entries after fetching them,
so that equalTupleDesc can assume that entries in two equal tupdescs
must be in matching order.  (CheckConstraintFetch already was sorting
CHECK constraints, but equalTupleDesc hadn't been told about it.)

There's some argument for back-patching this, but with such a small
number of field reports, I'm content to fix it in HEAD.

Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk
2021-04-06 10:34:39 -04:00
Fujii Masao 9de9294b0c Stop archive recovery if WAL generated with wal_level=minimal is found.
Previously if hot standby was enabled, archive recovery exited with
an error when it found WAL generated with wal_level=minimal.
But if hot standby was disabled, it just reported a warning and
continued in that case. Which could lead to data loss or errors
during normal operation. A warning was emitted, but users could
easily miss that and not notice this serious situation until
they encountered the actual errors.

To improve this situation, this commit changes archive recovery
so that it exits with FATAL error when it finds WAL generated with
wal_level=minimal whatever the setting of hot standby. This enables
users to notice the serious situation soon.

The FATAL error is thrown if archive recovery starts from a base
backup taken before wal_level is changed to minimal. When archive
recovery exits with the error, if users have a base backup taken
after setting wal_level to higher than minimal, they can recover
the database by starting archive recovery from that newer backup.
But note that if such backup doesn't exist, there is no easy way to
complete archive recovery, which may make the database server
unstartable and users may lose whole database. The commit adds
the note about this risk into the document.

Even in the case of unstartable database server, previously by just
disabling hot standby users could avoid the error during archive
recovery, forcibly start up the server and salvage data from it.
But note that this commit makes this procedure unavailable at all.

Author: Takamichi Osumi
Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao
Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
2021-04-06 22:56:51 +09:00
Peter Geoghegan f6b8f19a08 Allocate access strategy in parallel VACUUM workers.
Commit 49f49def took entirely the wrong approach to fixing this issue.
Just allocate a local buffer access strategy in each individual worker
instead of trying to propagate state.  This state was never propagated
by parallel VACUUM in the first place.

It looks like the only reason that this worked following commit 40d964ec
was that it involved static global variables, which are initialized to 0
per the C standard.

A more comprehensive fix may be necessary, even on HEAD.  This fix
should at least get the buildfarm green once again.

Thanks once again to Thomas Munro for continued off-list assistance with
the issue.
2021-04-05 17:17:40 -07:00
Tom Lane 09c1c6ab4b Support INCLUDE'd columns in SP-GiST.
Not much to say here: does what it says on the tin.
We steal a previously-always-zero bit from the nextOffset
field of leaf index tuples in order to track whether there
is a nulls bitmap.  Otherwise it works about like included
columns in other index types.

Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova,
and rather heavily editorialized on by me

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-05 18:41:21 -04:00
Peter Geoghegan 49f49defe7 Propagate parallel VACUUM's buffer access strategy.
Parallel VACUUM relied on global variable state from the leader process
being propagated to workers on fork().  Commit b4af70cb removed most
uses of global variables inside vacuumlazy.c, but did not account for
the buffer access strategy state.

To fix, propagate the state through shared memory instead.

Per buildfarm failures on elver, curculio, and morepork.

Many thanks to Thomas Munro for off-list assistance with this issue.
2021-04-05 14:56:56 -07:00
Peter Geoghegan b4af70cb21 Simplify state managed by VACUUM.
Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand.  Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead.  Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output.  We were using double, but that doesn't seem to have any
advantages.  Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap.  These assertions will be added in later
commits.

Finally, adjust the signatures of functions with IndexBulkDeleteResult
pointer arguments in cases where there was ambiguity about whether or
not the argument relates to a single index or all indexes.  Functions
now use the idiom that both ambulkdelete() and amvacuumcleanup() have
always used (where appropriate): accept a mutable IndexBulkDeleteResult
pointer argument, and return a result IndexBulkDeleteResult pointer to
caller.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
2021-04-05 13:21:44 -07:00
Tom Lane dfc843d465 Fix more confusion in SP-GiST.
spg_box_quad_leaf_consistent unconditionally returned the leaf
datum as leafValue, even though in its usage for poly_ops that
value is of completely the wrong type.

In versions before 12, that was harmless because the core code did
nothing with leafValue in non-index-only scans ... but since commit
2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would
unconditionally try to copy the value using the wrong datatype
parameters.  Said copying is a waste of time and space if we're not
going to return the data, but it accidentally failed to fail until
I fixed the datatype confusion in ac9099fc1.

Hence, change spgNewHeapItem to not copy the datum unless we're
actually going to return it later.  This saves cycles and dodges
the question of whether lossy opclasses are returning the right
type.  Also change spg_box_quad_leaf_consistent to not return
data that might be of the wrong type, as insurance against
somebody introducing a similar bug into the core code in future.

It seems like a good idea to back-patch these two changes into
v12 and v13, although I'm afraid to change spgNewHeapItem's
mistaken idea of which datatype to use in those branches.

Per buildfarm results from ac9099fc1.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 17:57:07 -04:00
Tom Lane ac9099fc1d Fix confusion in SP-GiST between attribute type and leaf storage type.
According to the documentation, the attType passed to the opclass
config function (and also relied on by the core code) is the type
of the heap column or expression being indexed.  But what was
actually being passed was the type stored for the index column.
This made no difference for user-defined SP-GiST opclasses,
because we weren't allowing the STORAGE clause of CREATE OPCLASS
to be used, so the two types would be the same.  But it's silly
not to allow that, seeing that the built-in poly_ops opclass
has a different value for opckeytype than opcintype, and that if you
want to do lossy storage then the types must really be different.
(Thus, user-defined opclasses doing lossy storage had to lie about
what type is in the index.)  Hence, remove the restriction, and make
sure that we use the input column type not opckeytype where relevant.

For reasons of backwards compatibility with existing user-defined
opclasses, we can't quite insist that the specified leafType match
the STORAGE clause; instead just add an amvalidate() warning if
they don't match.

Also fix some bugs that would only manifest when trying to return
index entries when attType is different from attLeafType.  It's not
too surprising that these have not been reported, because the only
usual reason for such a difference is to store the leaf value
lossily, rendering index-only scans impossible.

Add a src/test/modules module to exercise cases where attType is
different from attLeafType and yet index-only scan is supported.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 14:28:57 -04:00
Tomas Vondra d9c5b9a9ee Fix bug in brin_minmax_multi_union
When calling sort_expanded_ranges() we need to remember the return
value, because the function sorts and also deduplicates the ranges. So
the number of ranges may decrease. brin_minmax_multi_union failed to do
that, which resulted in crashes due to bogus ranges (equal minval/maxval
but not marked as compacted).

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to
2021-04-04 19:36:12 +02:00
Tomas Vondra 1dad2a5ea3 Fix order of parameters in BRIN minmax-multi calls
The BRIN minmax-multi consistent function incorrectly assumed it can
lookup an operator, and then swap the arguments to get the commutator.
For example <(a,b) would be called as <(b,a) to get >(a,b). This works
when the arguments are of the same type, but with cross-type opclasses
this fails. We can't swap <(float4,float8) arguments, for example.

Fixed by passing arguments in the right order.

Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:25:41 +02:00
Tomas Vondra e1fbe1181c Fix BRIN minmax-multi distance for inet type
The distance calculation ignored the mask, unlike the inet comparator,
which resulted in negative distance in some cases. Fixed by applying the
mask in brin_minmax_multi_distance_inet. I've considered simply calling
inetmi() to calculate the delta, but that does not consider mask either.

Reviewed-by: Zhihong Yu
Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com
2021-04-04 19:23:32 +02:00
Tomas Vondra 7262f2421a Fix BRIN minmax-multi distance for timetz type
The distance calculation ignored the time zone, so the result of (b-a)
might have ended negative even if (b > a). Fixed by considering the time
zone difference.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:22:23 +02:00
Tomas Vondra 2b10e0e3c2 Fix BRIN minmax-multi distance for interval type
The distance calculation for interval type was treating months as having
31 days, which is inconsistent with the interval comparator (using 30
days). Due to this it was possible to get negative distance (b-a) when
(a<b), trigerring an assert.

Fixed by adopting the same logic as interval_cmp_value.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com
2021-04-04 19:19:51 +02:00
Tom Lane 1ebdec8c03 Rethink handling of pass-by-value leaf datums in SP-GiST.
The existing convention in SP-GiST is that any pass-by-value datatype
is stored in Datum representation, i.e. it's of width sizeof(Datum)
even when typlen is less than that.  This is okay, or at least it's
too late to change it, for prefix datums and node-label datums in inner
(upper) tuples.  But it's problematic for leaf datums, because we'd
prefer those to be stored in Postgres' standard on-disk representation
so that we can easily extend leaf tuples to carry additional "included"
columns.

I believe, however, that we can get away with just up and changing that.
This would be an unacceptable on-disk-format break, but there are two
big mitigating factors:

1. It seems quite unlikely that there are any SP-GiST opclasses out
there that use pass-by-value leaf datatypes.  Certainly none of the
ones in core do, nor has codesearch.debian.net heard of any.  Given
what SP-GiST is good for, it's hard to conceive of a use-case where
the leaf-level values would be both small and fixed-width.  (As an
example, if you wanted to index text values with the leaf level being
just a byte, then every text string would have to be represented
with one level of inner tuple per preceding byte, which would be
horrendously space-inefficient and slow to access.  You always want
to use as few inner-tuple levels as possible, leaving as much as
possible in the leaf values.)

2. Even granting that you have such an index, this change only
breaks things on big-endian machines.  On little-endian, the high
order bytes of the Datum format will now just appear to be alignment
padding space.

So, change the code to store pass-by-value leaf datums in their
usual on-disk form.  Inner-tuple datums are not touched.

This is extracted from a larger patch that intends to add support for
"included" columns.  I'm committing it separately for visibility in
our commit logs.

Pavel Borisov and Tom Lane, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-01 17:55:17 -04:00
Noah Misch 0ff8bbdee1 Accept slightly-filled pages for tuples larger than fillfactor.
We always inserted a larger-than-fillfactor tuple into a newly-extended
page, even when existing pages were empty or contained nothing but an
unused line pointer.  This was unnecessary relation extension.  Start
tolerating page usage up to 1/8 the maximum space that could be taken up
by line pointers.  This is somewhat arbitrary, but it should allow more
cases to reuse pages.  This has no effect on tables with fillfactor=100
(the default).

John Naylor and Floris van Nee.  Reviewed by Matthias van de Meent.
Reported by Floris van Nee.

Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
2021-03-30 18:53:44 -07:00
David Rowley af527705ed Adjust design of per-worker parallel seqscan data struct
The design of the data structures which allow storage of the per-worker
memory during parallel seq scans were not ideal. The work done in
56788d215 required an additional data structure to allow workers to
remember the range of pages that had been allocated to them for
processing during a parallel seqscan.  That commit added a void pointer
field to TableScanDescData to allow heapam to store the per-worker
allocation information.  However putting the field there made very little
sense given that we have AM specific structs for that, e.g.
HeapScanDescData.

Here we remove the void pointer field from TableScanDescData and add a
dedicated field for this purpose to HeapScanDescData.

Previously we also allocated memory for this parallel per-worker data for
all scans, regardless if it was a parallel scan or not.  This was just a
wasted allocation for non-parallel scans, so here we make the allocation
conditional on the scan being parallel.

Also, add previously missing pfree() to free the per-worker data in
heap_endscan().

Reported-by: Andres Freund
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de
2021-03-30 10:17:09 +13:00
Tomas Vondra 73b96bad4a Fix alignment in BRIN minmax-multi deserialization
The deserialization failed to ensure correct alignment, as it assumed it
can simply point into the serialized value. The serialization however
ignores alignment and copies just the significant bytes in order to make
the result as small as possible. This caused failures on systems that
are sensitive to mialigned addresses, like sparc, or with address
sanitizer enabled.

Fixed by copying the serialized data to ensure proper alignment. While
at it, fix an issue with serialization on big endian machines, using the
same store_att_byval/fetch_att trick as extended statistics.

Discussion: https://postgr.es/0c8c3304-d3dd-5e29-d5ac-b50589a23c8c%40enterprisedb.com
2021-03-26 16:48:36 +01:00
Tomas Vondra ab596105b5 BRIN minmax-multi indexes
Adds BRIN opclasses similar to the existing minmax, except that instead
of summarizing the page range into a single [min,max] range, the summary
consists of multiple ranges and/or points, allowing gaps. This allows
more efficient handling of data with poor correlation to physical
location within the table and/or outlier values, for which the regular
minmax opclassed tend to work poorly.

It's possible to specify the number of values kept for each page range,
either as a single point or an interval boundary.

  CREATE TABLE t (a int);
  CREATE INDEX ON t
   USING brin (a int4_minmax_multi_ops(values_per_range=16));

When building the summary, the values are combined into intervals with
the goal to minimize the "covering" (sum of interval lengths), using a
support procedure computing distance between two values.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
2021-03-26 13:54:30 +01:00
Tomas Vondra 77b88cd1bb BRIN bloom indexes
Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes
using the new opclasses allow only equality queries (similar to hash
indexes), but that works fine for data like UUID, MAC addresses etc. for
which range queries are not very common. This also means the indexes
work for data that is not well correlated to physical location within
the table, or perhaps even entirely random (which is a common issue with
existing BRIN minmax opclasses).

It's possible to specify opclass parameters with the usual Bloom filter
parameters, i.e. the desired false-positive rate and the expected number
of distinct values per page range.

  CREATE TABLE t (a int);
  CREATE INDEX ON t
   USING brin (a int4_bloom_ops(false_positive_rate = 0.05,
                                n_distinct_per_range = 100));

The opclasses do not operate on the indexed values directly, but compute
a 32-bit hash first, and the Bloom filter is built on the hash value.
Collisions should not be a huge issue though, as the number of distinct
values in a page ranges is usually fairly small.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: Nico Williams <nico@cryptonector.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
2021-03-26 13:35:32 +01:00
Tomas Vondra a681e3c107 Support the old signature of BRIN consistent function
Commit a1c649d889 changed the signature of the BRIN consistent function
by adding a new required parameter.  Treating the parameter as optional,
which would make the change backwards incompatibile, was rejected with
the justification that there are few out-of-core extensions, so it's not
worth adding making the code more complex, and it's better to deal with
that in the extension.

But after further thought, that would be rather problematic, because
pg_upgrade simply dumps catalog contents and the same version of an
extension needs to work on both PostgreSQL versions. Supporting both
variants of the consistent function (with 3 or 4 arguments) makes that
possible.

The signature is not the only thing that changed, as commit 72ccf55cb9
moved handling of IS [NOT] NULL keys from the support procedures. But
this change is backward compatible - handling the keys in exension is
unnecessary, but harmless. The consistent function will do a bit of
unnecessary work, but it should be very cheap.

This also undoes most of the changes to the existing opclasses (minmax
and inclusion), making them use the old signature again. This should
make backpatching simpler.

Catversion bump, because of changes in pg_amproc.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-26 13:17:58 +01:00
Robert Haas 5db1fd7823 Fix interaction of TOAST compression with expression indexes.
Before, trying to compress a value for insertion into an expression
index would crash.

Dilip Kumar, with some editing by me. Report by Jaime Casanova.

Discussion: http://postgr.es/m/CAJKUy5gcs0zGOp6JXU2mMVdthYhuQpFk=S3V8DOKT=LZC1L36Q@mail.gmail.com
2021-03-25 19:55:32 -04:00
Michael Paquier a1999a01bb Sanitize the term "combo CID" in code comments
Combo CIDs were referred in the code comments using different terms
across various places of the code, so unify a bit the term used with
what is currently in use in some of the READMEs.

Author: "Hou, Zhijie"
Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
2021-03-25 16:08:03 +09:00
Fujii Masao 438fc4a39c Fix bug in WAL replay of COMMIT_TS_SETTS record.
Previously the WAL replay of COMMIT_TS_SETTS record called
TransactionTreeSetCommitTsData() with the argument write_xlog=true,
which generated and wrote new COMMIT_TS_SETTS record.
This should not be acceptable because it's during recovery.

This commit fixes the WAL replay of COMMIT_TS_SETTS record
so that it calls TransactionTreeSetCommitTsData() with write_xlog=false
and doesn't generate new WAL during recovery.

Back-patch to all supported branches.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
2021-03-25 11:23:30 +09:00
Peter Eisentraut 37c99d304d Fix stray double semicolons
Reported-by: John Naylor <john.naylor@enterprisedb.com>
2021-03-24 20:42:51 +01:00
Robert Haas e5595de03e Tidy up more loose ends related to configurable TOAST compression.
Change the default_toast_compression GUC to be an enum rather than
a string. Earlier, uncommitted versions of the patch supported using
CREATE ACCESS METHOD to add new compression methods to a running
system, but that idea was dropped before commit. So, we can simplify
the GUC handling as well, which has the nice side effect of improving
the error messages.

While updating the documentation to reflect the new GUC type, also
move it back to the right place in the list. I moved this while
revising what became commit 24f0e395ac,
but apparently the intended ordering is "alphabetical" rather than
"whatever Robert thinks looks nice."

Rejigger things to avoid having access/toast_compression.h depend on
utils/guc.h, so that we don't end up with every file that includes
it also depending on something largely unrelated. Move a few
inline functions back into the C source file partly to help reduce
dependencies and partly just to avoid clutter. A few very minor
cosmetic fixes.

Original patch by Justin Pryzby, but very heavily edited by me,
and reverse reviewed by him and also reviewed by by Tom Lane.

Discussion: http://postgr.es/m/CA+TgmoYp=GT_ztUCeZg2i4hkHAQv8o=-nVJ1-TKWTG1zQOmOpg@mail.gmail.com
2021-03-24 12:36:08 -04:00
Amit Kapila 26acb54a13 Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."
To allow inserts in parallel-mode this feature has to ensure that all the
constraints, triggers, etc. are parallel-safe for the partition hierarchy
which is costly and we need to find a better way to do that. Additionally,
we could have used existing cached information in some cases like indexes,
domains, etc. to determine the parallel-safety.

List of commits reverted, in reverse chronological order:

ed62d3737c Doc: Update description for parallel insert reloption.
c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f.
e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f.
e4e87a32cc Fix valgrind issue in commit 05c8482f7f.
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".

Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
2021-03-24 11:29:15 +05:30
Michael Paquier 99dd75fb99 Reword slightly logs generated for index stats in autovacuum
Using "remain" is confusing, as it implies that the index file can
shrink.  Instead, use "in total".

Per discussion with Peter Geoghegan.

Discussion: https://postgr.es/m/CAH2-WzkYgHZzpGOwR14CScJsjaQpvJrEkEfkh_=wGhzLb=yVdQ@mail.gmail.com
2021-03-24 09:36:03 +09:00
Peter Geoghegan 5b861baa55 nbtree VACUUM: Cope with buggy opclasses.
Teach nbtree VACUUM to press on with vacuuming in the event of a page
deletion attempt that fails to "re-find" a downlink for its child/target
page.

There is no good reason to treat this as an irrecoverable error.  But
there is a good reason not to: pressing on at this point removes any
question of VACUUM not making progress solely due to misbehavior from
user-defined operator class code.

Discussion: https://postgr.es/m/CAH2-Wzma5G9CTtMjbrXTwOym+U=aWg-R7=-htySuztgoJLvZXg@mail.gmail.com
2021-03-23 16:09:51 -07:00
Tom Lane 9d523119fd Avoid possible crash while finishing up a heap rewrite.
end_heap_rewrite was not careful to ensure that the target relation
is open at the smgr level before performing its final smgrimmedsync.
In ordinary cases this is no problem, because it would have been
opened earlier during the rewrite.  However a crash can be reproduced
by re-clustering an empty table with CLOBBER_CACHE_ALWAYS enabled.

Although that exact scenario does not crash in v13, I think that's
a chance result of unrelated planner changes, and the problem is
likely still reachable with other test cases.  The true proximate
cause of this failure is commit c6b92041d, which replaced a call to
heap_sync (which was careful about opening smgr) with a direct call
to smgrimmedsync.  Hence, back-patch to v13.

Amul Sul, per report from Neha Sharma; cosmetic changes
and test case by me.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
2021-03-23 11:24:16 -04:00
Michael Paquier 5aed6a1fc2 Add per-index stats information in verbose logs of autovacuum
Once a relation's autovacuum is completed, the logs include more
information about this relation state if the threshold of
log_autovacuum_min_duration (or its relation option) is reached, with
for example contents about the statistics of the VACUUM operation for
the relation, WAL and system usage.

This commit adds more information about the statistics of the relation's
indexes, with one line of logs generated for each index.  The index
stats were already calculated, but not printed in the context of
autovacuum yet.  While on it, some refactoring is done to keep track of
the index statistics directly within LVRelStats, simplifying some
routines related to parallel VACUUMs.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAy6SxHiTivh5yAPJSUE4S=QRPpSZUdafOSz0R+fRcM6Q@mail.gmail.com
2021-03-23 13:25:14 +09:00
Bruce Momjian 95d77149c5 Add macro RelationIsPermanent() to report relation permanence
Previously, to check relation permanence, the Relation's Form_pg_class
structure member relpersistence was compared to the value
RELPERSISTENCE_PERMANENT ("p"). This commit adds the macro
RelationIsPermanent() and is used in appropirate places to simplify the
code.  This matches other RelationIs* macros.

This macro will be used in more places in future cluster file encryption
patches.

Discussion: https://postgr.es/m/20210318153134.GH20766@tamriel.snowman.net
2021-03-22 20:23:52 -04:00
Tomas Vondra 8e4b332e88 Optimize allocations in bringetbitmap
The bringetbitmap function allocates memory for various purposes, which
may be quite expensive, depending on the number of scan keys. Instead of
allocating them separately, allocate one bit chunk of memory an carve it
into smaller pieces as needed - all the pieces have the same lifespan,
and it saves quite a bit of CPU and memory overhead.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:47:09 +01:00
Tomas Vondra 72ccf55cb9 Move IS [NOT] NULL handling from BRIN support functions
The handling of IS [NOT] NULL clauses is independent of an opclass, and
most of the code was exactly the same in both minmax and inclusion. So
instead move the code from support procedures to the AM.

This simplifies the code - especially the support procedures - quite a
bit, as they don't need to care about NULL values and flags at all. It
also means the IS [NOT] NULL clauses can be evaluated without invoking
the support procedure.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:45:42 +01:00
Tomas Vondra a1c649d889 Pass all scan keys to BRIN consistent function at once
This commit changes how we pass scan keys to BRIN consistent function.
Instead of passing them one by one, we now pass all scan keys for a
given attribute at once. That makes the consistent function a bit more
complex, as it has to loop through the keys, but it does allow more
elaborate opclasses that can use multiple keys to eliminate ranges much
more effectively.

The existing BRIN opclasses (minmax, inclusion) don't really benefit
from this change. The primary purpose is to allow future opclases to
benefit from seeing all keys at once.

This does change the BRIN API, because the signature of the consistent
function changes (a new parameter with number of scan keys). So this
breaks existing opclasses, and will require supporting two variants of
the code for different PostgreSQL versions. We've considered supporting
two variants of the consistent, but we've decided not to do that.
Firstly, there's another patch that moves handling of NULL values from
the opclass, which means the opclasses need to be updated anyway.
Secondly, we're not aware of any out-of-core BRIN opclasses, so it does
not seem worth the extra complexity.

Bump catversion, because of pg_proc changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
2021-03-23 00:45:03 +01:00