Commit Graph

4104 Commits

Author SHA1 Message Date
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Peter Eisentraut ec6e70c79f Refactor some error messages for easier translation 2021-05-12 07:42:51 +02:00
Fujii Masao d780d7c088 Change data type of counters in BufferUsage and WalUsage from long to int64.
Previously long was used as the data type for some counters in BufferUsage
and WalUsage. But long is only four bytes, e.g., on Windows, and it's entirely
possible to wrap a four-byte counter. For example, emitting more than
four billion WAL records in one transaction isn't actually particularly rare.

To avoid the overflows of those counters, this commit changes the data type
of them from long to int64.
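
(As an illustration of the width problem only -- a standalone sketch,
not part of the patch: on an LLP64 platform such as 64-bit Windows,
sizeof(long) is 4, so a long counter tops out around 2.1 billion,
while an int64 counter does not.)

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* A count that is plausible for WAL records in one transaction,
         * but already too large for a 4-byte long on LLP64 systems. */
        int64_t wal_records = INT64_C(4000000000);

        printf("sizeof(long) = %zu, sizeof(int64_t) = %zu\n",
               sizeof(long), sizeof(int64_t));
        printf("wal_records = %lld\n", (long long) wal_records);
        return 0;
    }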

Suggested-by: Andres Freund
Author: Masahiro Ikeda
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20201221211650.k7b53tcnadrciqjo@alap3.anarazel.de
Discussion: https://postgr.es/m/af0964ac-7080-1984-dc23-513754987716@oss.nttdata.com
2021-05-12 09:56:34 +09:00
Tom Lane 049e1e2edb Fix mishandling of resjunk columns in ON CONFLICT ... UPDATE tlists.
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns.  That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.

This had escaped notice through a confluence of missing sanity checks,
including

* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype.  While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns (see the sketch just after this
list).

* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers.  Add assertions there too.

* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist.  That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.
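
(Illustrative only -- a hedged sketch of the kind of cheap assertion
described in the first bullet above, not the committed code:)

    /* A tuple headed for heap_insert/heap_update should never carry
     * more attributes than the table's rowtype allows. */
    Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
           RelationGetNumberOfAttributes(relation));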

In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns.  This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.

In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot.  (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10.  In 9.6, projections work much differently and
we can't cheaply give them such an option.  The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.

In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table.  Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated.  Fix by preserving
resjunk columns in that routine.  (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)

Per report from Andres Freund.  Back-patch to all supported branches.

Security: CVE-2021-32028
2021-05-10 11:02:29 -04:00
Thomas Munro c2dc19342e Revert recovery prefetching feature.
This set of commits has some bugs with known fixes, but at this late
stage in the release cycle it seems best to revert and resubmit next
time, along with some new automated test coverage for this whole area.

Commits reverted:

dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery."
1d257577: Optionally prefetch referenced data in recovery.
f003d9f8: Add circular WAL decoding buffer.
323cbe7c: Remove read_page callback from XLogReader.

Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the
corresponding section of config.sgml is now reverted.

Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
2021-05-10 16:06:09 +12:00
Peter Geoghegan c9787385db Remove overzealous VACUUM visibility map assertion.
The all_visible_according_to_vm variable's value is inherently prone to
becoming invalidated concurrently, since it is set before we even
acquire a lock on a related heap page buffer.

Oversight in commit 7136bf34, which added the assertion in passing.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Tang <tanghy.fnst@fujitsu.com>
Diagnosed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDzgc8_MYrA5m1fyydomw_eVKtQiYh7sfDK4KEhdMsf_g@mail.gmail.com
2021-05-06 13:17:39 -07:00
Michael Paquier 4aba61b870 Add some forgotten LSN_FORMAT_ARGS() in xlogreader.c
Commit 6f6f284 introduced a specific macro to make printf()-ing of LSNs
easier.  This takes care of what looks like the remaining code paths
that did not get the call.
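
(For context, a typical call site looks roughly like this; the message
text here is made up, only the macro usage is the point:)

    /* LSN_FORMAT_ARGS() expands to the two uint32 halves that the
     * conventional %X/%X format expects. */
    elog(DEBUG1, "reading WAL record at %X/%X",
         LSN_FORMAT_ARGS(state->ReadRecPtr));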

Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Tom Lane
Discussion: https://postgr.es/m/YIJS9x6K8ruizN7j@paquier.xyz
2021-04-24 09:09:02 +09:00
Michael Paquier 62aa2bb293 Remove use of [U]INT64_FORMAT in some translatable strings
%lld with a (long long) cast, or %llu with (unsigned long long), is
more appropriate.  This is similar to commit 3286065.
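
(The pattern, sketched with a made-up message and variable name, is to
put %lld in the translatable string and cast the argument explicitly:)

    ereport(ERROR,
            (errmsg("could not process %lld records",
                    (long long) nrecords)));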

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20210421.200000.1462448394029407895.horikyota.ntt@gmail.com
2021-04-23 13:25:49 +09:00
Peter Eisentraut f0ec598b43 Fix typo 2021-04-21 08:07:37 +02:00
Tom Lane 783be78ca9 Improve WAL record descriptions for SP-GiST records.
While tracking down the bug fixed in the preceding commit, I got quite
annoyed by the low quality of spg_desc's output.  Add missing fields,
try to make the formatting consistent.
2021-04-20 17:01:49 -04:00
Peter Geoghegan 7136bf34f2 Document LP_DEAD accounting issues in VACUUM.
Document VACUUM's soft assumption that any LP_DEAD items encountered
during pruning will become LP_UNUSED items before VACUUM finishes up.
This is integral to the accounting used by VACUUM to generate its final
report on the table to the stats collector.  It also affects how VACUUM
determines which heap pages are truncatable.  In both cases VACUUM is
concerned with the likely contents of the page in the near future, not
the current contents of the page.

This state of affairs created the false impression that VACUUM's dead
tuple accounting had significant difference with similar accounting used
during ANALYZE.  There were and are no substantive differences, at least
when the soft assumption completely works out.  This is far clearer now.

Also document cases where things don't quite work out for VACUUM's dead
tuple accounting.  It's possible that a significant number of LP_DEAD
items will be left behind by VACUUM, and won't be recorded as remaining
dead tuples in VACUUM's statistics collector report.  This behavior
dates back to commit a96c41fe, which taught VACUUM to run without index
and heap vacuuming at the user's request.  The failsafe mechanism added
to VACUUM more recently by commit 1e55e7d1 takes the same approach to
dead tuple accounting.

Reported-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Jmtu18PrsYq3EvvZJGOmZqSO2u3bvKpx9xJa5uhNp=Q@mail.gmail.com
2021-04-19 18:55:31 -07:00
Michael Paquier 7ef8b52cf0 Fix typos and grammar in comments and docs
Author: Justin Pryzby
Discussion: https://postgr.es/m/20210416070310.GG3315@telsasoft.com
2021-04-19 11:32:30 +09:00
Peter Eisentraut f59b58e2a1 Use correct format placeholder for block numbers
Should be %u rather than %d.
2021-04-17 09:40:50 +02:00
Peter Eisentraut 07e5e66742 Improve quoting in some error messages 2021-04-14 09:11:29 +02:00
Peter Geoghegan 60f1f09ff4 Don't truncate heap when VACUUM's failsafe is in effect.
It seems like a good idea to bypass heap truncation when the wraparound
failsafe mechanism (which was added in commit 1e55e7d1) is in effect.

Deliberately don't bypass heap truncation in the INDEX_CLEANUP=off case,
even though it is similar to the failsafe case.  There is already a
separate reloption (and related VACUUM parameter) for that.

Reported-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDWRh6oTN5T8wa+cpZUVpHXET8BJ8Da7WHVHpwkPP6KLg@mail.gmail.com
2021-04-13 12:58:31 -07:00
Tom Lane 34f581c39e Avoid improbable PANIC during heap_update.
heap_update needs to clear any existing "all visible" flag on
the old tuple's page (and on the new page too, if different).
Per coding rules, to do this it must acquire pin on the appropriate
visibility-map page while not holding exclusive buffer lock;
which creates a race condition since someone else could set the
flag whenever we're not holding the buffer lock.  The code is
supposed to handle that by re-checking the flag after acquiring
buffer lock and retrying if it became set.  However, one code
path through heap_update itself, as well as one in its subroutine
RelationGetBufferForTuple, failed to do this.  The end result,
in the unlikely event that a concurrent VACUUM did set the flag
while we're transiently not holding lock, is a non-recurring
"PANIC: wrong buffer passed to visibilitymap_clear" failure.

This has been seen a few times in the buildfarm since recent VACUUM
changes that added code paths that could set the all-visible flag
while holding only exclusive buffer lock.  Previously, the flag
was (usually?) set only after doing LockBufferForCleanup, which
would insist on buffer pin count zero, thus preventing the flag
from becoming set partway through heap_update.  However, it's
clear that it's heap_update not VACUUM that's at fault here.

What's less clear is whether there is any hazard from these bugs
in released branches.  heap_update is certainly violating API
expectations, but if there is no code path that can set all-visible
without a cleanup lock then it's only a latent bug.  That's not
100% certain though, besides which we should worry about extensions
or future back-patch fixes that could introduce such code paths.

I chose to back-patch to v12.  Fixing RelationGetBufferForTuple
before that would require also back-patching portions of older
fixes (notably 0d1fe9f74), which is more code churn than seems
prudent to fix a hypothetical issue.

Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
2021-04-13 12:17:24 -04:00
Thomas Munro b1df6b696b Fix potential SSI hazard in heap_update().
Commit 6f38d4dac3 failed to heed a warning about the stability of the
value pointed to by "otid".  The caller is allowed to pass in a pointer to
newtup->t_self, which will be updated during the execution of the
function.  Instead, the SSI check should use the value we copy into
oldtup.t_self near the top of the function.

Not a live bug, because newtup->t_self doesn't really get updated until
a bit later, but it was confusing and broke the rule established by the
comment.

Back-patch to 13.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2689164.1618160085%40sss.pgh.pa.us
2021-04-13 13:02:56 +12:00
Fujii Masao 08aa89b326 Remove COMMIT_TS_SETTS record.
Commit 438fc4a39c prevented WAL replay from writing the COMMIT_TS_SETTS
record, so there is no longer any code in PostgreSQL core that
generates it. We can also assume that no extensions use the record,
because we have received no complaints about the issue that commit
438fc4a39c fixed. Therefore this commit removes the COMMIT_TS_SETTS
record and its related code. Even without this record, the timestamp
required for the commit timestamp feature can be acquired from the
COMMIT record.

Bump WAL page magic.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
2021-04-12 00:04:30 +09:00
Thomas Munro dc88460c24 Doc: Review for "Optionally prefetch referenced data in recovery."
Typos, corrections and language improvements in the docs, and a few in
code comments too.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
2021-04-10 08:21:53 +12:00
Peter Geoghegan 796092fb84 Silence another _bt_check_unique compiler warning.
Per complaint from Tom Lane

Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us
2021-04-08 12:54:31 -07:00
Thomas Munro 1d257577e0 Optionally prefetch referenced data in recovery.
Introduce a new GUC recovery_prefetch, disabled by default.  When
enabled, look ahead in the WAL and try to initiate asynchronous reading
of referenced data blocks that are not yet cached in our buffer pool.
For now, this is done with posix_fadvise(), which has several caveats.
Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of
concurrent I/Os we allow ourselves to initiate, based on pessimistic
heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we
are prepared to read ahead in the WAL to find uncached blocks.
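
(The essence of the hint, stripped of all PostgreSQL machinery -- a
minimal sketch assuming a file descriptor fd and block number blkno:)

    /* Ask the kernel to start reading the block so a later read() is
     * likely to hit the page cache.  Purely advisory: there is no
     * completion notification, hence the pessimistic heuristics. */
    posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                  POSIX_FADV_WILLNEED);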

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro f003d9f872 Add circular WAL decoding buffer.
Teach xlogreader.c to decode its output into a circular buffer, to
support optimizations based on looking ahead.

 * XLogReadRecord() works as before, consuming records one by one, and
   allowing them to be examined via the traditional XLogRecGetXXX()
   macros.

 * An alternative new interface XLogNextRecord() is added that returns
   pointers to DecodedXLogRecord structs that can be examined directly.

 * XLogReadAhead() provides a second cursor that lets you see
   further ahead, as long as data is available and there is enough space
   in the decoding buffer.  This returns DecodedXLogRecord pointers to the
   caller, but also adds them to a queue of records that will later be
   consumed by XLogNextRecord()/XLogReadRecord().

The buffer's size is controlled with wal_decode_buffer_size.  The buffer
could potentially be placed into shared memory, for future projects.
Large records that don't fit in the circular buffer are called
"oversized" and allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
2021-04-08 23:20:42 +12:00
Thomas Munro 323cbe7c7d Remove read_page callback from XLogReader.
Previously, the XLogReader module would fetch new input data using a
callback function.  Redesign the interface so that it tells the caller
to insert more data with a special return value instead.  This API suits
later patches for prefetching, encryption and maybe other future
projects that would otherwise require continually extending the callback
interface.

As incidental cleanup work, move global variables readOff, readLen and
readSegNo inside XlogReaderState.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
2021-04-08 23:20:42 +12:00
Alvaro Herrera 0827e8af70 autovacuum: handle analyze for partitioned tables
Previously, autovacuum would completely ignore partitioned tables, which
is not good regarding analyze -- failing to analyze those tables means
poor plans may be chosen.  Make autovacuum aware of those tables by
propagating "changes since analyze" counts from the leaf partitions up
the partitioning hierarchy.

This also introduces necessary reloptions support for partitioned tables
(autovacuum_enabled, autovacuum_analyze_scale_factor,
autovacuum_analyze_threshold).  It's unclear how best to document this
aspect.

Author: Yuzuko Hosoya <yuzukohosoya@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
2021-04-08 01:19:36 -04:00
Peter Geoghegan 5100010ee4 Teach VACUUM to bypass unnecessary index vacuuming.
VACUUM has never needed to call ambulkdelete() for each index in cases
where there are precisely zero TIDs in its dead_tuples array by the end
of its first pass over the heap (also its only pass over the heap in
this scenario).  Index vacuuming is simply not required when this
happens.  Index cleanup will still go ahead, but in practice most calls
to amvacuumcleanup() are usually no-ops when there were zero preceding
ambulkdelete() calls.  In short, VACUUM has generally managed to avoid
index scans when there were clearly no index tuples to delete from
indexes.  But cases with _close to_ no index tuples to delete were
another matter -- a round of ambulkdelete() calls took place (one per
index), each of which performed a full index scan.

VACUUM now behaves just as if there were zero index tuples to delete in
cases where there are in fact "virtually zero" such tuples.  That is, it
can now bypass index vacuuming and heap vacuuming as an optimization
(though not index cleanup).  Whether or not VACUUM bypasses indexes is
determined dynamically, based on the just-observed number of heap pages
in the table that have one or more LP_DEAD items (LP_DEAD items in heap
pages have a 1:1 correspondence with index tuples that still need to be
deleted from each index in the worst case).

We only skip index vacuuming when 2% or less of the table's pages have
one or more LP_DEAD items -- bypassing index vacuuming as an
optimization must not noticeably impede setting bits in the visibility
map.  As a further condition, the dead_tuples array (i.e. VACUUM's array
of LP_DEAD item TIDs) must not exceed 32MB at the point that the first
pass over the heap finishes, which is also when the decision to bypass
is made.  (The VACUUM must also have been able to fit all TIDs in its
maintenance_work_mem-bound dead_tuples space, though with a default
maintenance_work_mem setting it can't matter.)
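
(Condensed into a sketch with hypothetical variable names -- not the
committed function -- the decision amounts to:)

    #define BYPASS_THRESHOLD_PAGES  0.02    /* 2% of rel_pages */

    /* Skip index/heap vacuuming when almost no pages have LP_DEAD
     * items and the TID array is cheap enough to just keep around. */
    bypass = (lpdead_item_pages < rel_pages * BYPASS_THRESHOLD_PAGES &&
              dead_tuples_bytes < 32L * 1024L * 1024L);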

This avoids surprising jumps in the duration and overhead of routine
vacuuming with workloads where successive VACUUM operations consistently
have almost zero dead index tuples.  The number of LP_DEAD items may
well accumulate over multiple VACUUM operations, before finally the
threshold is crossed and VACUUM performs conventional index vacuuming.
Even then, the optimization will have avoided a great deal of largely
unnecessary index vacuuming.

In the future we may teach VACUUM to skip index vacuuming on a per-index
basis, using a much more sophisticated approach.  For now we only
consider the extreme cases, where we can be quite confident that index
vacuuming just isn't worth it using simple heuristics.

Also log information about how many heap pages have one or more LP_DEAD
items when autovacuum logging is enabled.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
2021-04-07 16:14:54 -07:00
Peter Geoghegan 1e55e7d175 Add wraparound failsafe to VACUUM.
Add a failsafe mechanism that is triggered by VACUUM when it notices
that the table's relfrozenxid and/or relminmxid are dangerously far in
the past.  VACUUM checks the age of the table dynamically, at regular
intervals.

When the failsafe triggers, VACUUM takes extraordinary measures to
finish as quickly as possible so that relfrozenxid and/or relminmxid can
be advanced.  VACUUM will stop applying any cost-based delay that may be
in effect.  VACUUM will also bypass any further index vacuuming and heap
vacuuming -- it only completes whatever remaining pruning and freezing
is required.  Bypassing index/heap vacuuming is enabled by commit
8523492d, which made it possible to dynamically trigger the mechanism
already used within VACUUM when it is run with INDEX_CLEANUP off.
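
(When the failsafe fires, the immediate effects are roughly these.
VacuumCostActive and VacuumCostBalance are the existing cost-delay
globals; the other names are hypothetical -- this is a sketch only:)

    if (relfrozenxid_age > vacuum_failsafe_age ||
        relminmxid_age > vacuum_multixact_failsafe_age)
    {
        VacuumCostActive = false;   /* stop cost-based delays now */
        VacuumCostBalance = 0;

        do_index_vacuuming = false; /* bypass remaining index vacuuming */
        do_heap_vacuuming = false;  /* ... and heap vacuuming */
    }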

It is expected that the failsafe will almost always trigger within an
autovacuum to prevent wraparound, long after the autovacuum began.
However, the failsafe mechanism can trigger in any VACUUM operation.
Even in a non-aggressive VACUUM, where we're likely to not advance
relfrozenxid, it still seems like a good idea to finish off remaining
pruning and freezing.   An aggressive/anti-wraparound VACUUM will be
launched immediately afterwards.  Note that the anti-wraparound VACUUM
that follows will itself trigger the failsafe, usually before it even
begins its first (and only) pass over the heap.

The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and
vacuum_multixact_failsafe_age.  There are no equivalent reloptions,
since that isn't expected to be useful.  The GUCs have rather high
defaults (both default to 1.6 billion), and are expected to generally
only be used to make the failsafe trigger sooner/more frequently.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
2021-04-07 12:37:45 -07:00
Peter Geoghegan 3c3b8a4b26 Truncate line pointer array during VACUUM.
Teach VACUUM to truncate the line pointer array of each heap page when a
contiguous group of LP_UNUSED line pointers appear at the end of the
array -- these unused and unreferenced items are excluded.  This process
occurs during VACUUM's second pass over the heap, right after LP_DEAD
line pointers on the page (those encountered/pruned during the first
pass) are marked LP_UNUSED.
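
(In outline -- a sketch of the idea, not the committed routine --
truncation walks back from the last line pointer and pulls pd_lower in:)

    OffsetNumber offnum;

    /* Find the last still-used line pointer on the page. */
    for (offnum = PageGetMaxOffsetNumber(page);
         offnum >= FirstOffsetNumber;
         offnum--)
    {
        if (ItemIdIsUsed(PageGetItemId(page, offnum)))
            break;
    }

    /* Shrink pd_lower so the trailing LP_UNUSED entries disappear. */
    ((PageHeader) page)->pd_lower =
        SizeOfPageHeaderData + offnum * sizeof(ItemIdData);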

Truncation avoids line pointer bloat with certain workloads,
particularly those involving continual range DELETEs and bulk INSERTs
against the same table.

Also harden heapam code to check for an out-of-range page offset number
in places where we weren't already doing so.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
2021-04-07 08:47:15 -07:00
Tomas Vondra 23607a8156 Don't add non-existent pages to bitmap from BRIN
The code in bringetbitmap() simply added the whole matching page range
to the TID bitmap, as determined by pages_per_range, even if some of the
pages were beyond the end of the heap. The query then might fail with
an error like this:

  ERROR:  could not open file "base/20176/20228.2" (target block
          262144): previous segment is only 131021 blocks

In this case, the relation has 262093 pages (131072 and 131021 pages),
but we're trying to access block 262144, i.e. the first block of the 3rd
segment. At that point _mdfd_getseg() notices the preceding segment is
incomplete, and fails.

Hitting this in practice is rather unlikely, because:

* Most indexes use power-of-two ranges, so segments and page ranges
  align perfectly (segment end is also a page range end).

* The table size has to be just right, with the last segment being
  almost full - less than one page range from full segment, so that the
  last page range actually crosses the segment boundary.

* Prefetch has to be enabled. The regular page access checks that
  pages are not beyond heap end, but prefetch does not. On older
  releases (before 12) the execution stops after hitting the first
  non-existent page, so the prefetch distance has to be sufficient
  to reach the first page in the next segment to trigger the issue.
  Since 12 it's enough to just have prefetch enabled, the prefetch
  distance does not matter.

Fixed by not adding non-existent pages to the TID bitmap. Backpatch
all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that
release is EOL).
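
(The shape of the fix, as a minimal sketch with hypothetical variable
names rather than the exact patch:)

    /* Never put pages past the end of the heap into the bitmap. */
    BlockNumber nblocks = RelationGetNumberOfBlocks(heapRel);

    for (blkno = range_start;
         blkno < range_start + pages_per_range && blkno < nblocks;
         blkno++)
        tbm_add_page(tbm, blkno);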

Backpatch-through: 9.6
2021-04-07 15:58:36 +02:00
Heikki Linnakangas d92b1cdbab Revert "Add sortsupport for gist_btree opclasses, for faster index builds."
This reverts commit 9f984ba6d2.

It was making the buildfarm unhappy; apparently setting client_min_messages
in a regression test produces different output if log_statement='all'.
Another issue is that I now suspect the bit sortsupport function was in
fact not correct to call byteacmp(). Revert to investigate both of those
issues.
2021-04-07 14:33:21 +03:00
Heikki Linnakangas 9f984ba6d2 Add sortsupport for gist_btree opclasses, for faster index builds.
Commit 16fa9b2b30 introduced a faster way to build GiST indexes, by
sorting all the data. This commit adds the sortsupport functions needed
to make use of that feature for btree_gist.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru
2021-04-07 13:22:05 +03:00
Michael Paquier 4c0239cb7a Remove redundant memset(0) calls for page init of some index AMs
Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the
contents of a page, and this routine fills entirely a page with zeros
for a size of BLCKSZ, including the special space.  Those index AMs have
been using an extra memset() call to fill with zeros the special page
space, or even the whole page, which is not necessary as PageInit()
already does this work, so let's remove them.  GiST was not doing this
extra call, but has carried a commented-out call that did so since
6236991.
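
(For reference, the redundant pattern looked schematically like this,
using GIN's opaque type as an example:)

    PageInit(page, BLCKSZ, sizeof(GinPageOpaqueData));
    /* Removed: PageInit() has already zeroed the entire page, special
     * space included, so the following added nothing. */
    /* memset(GinPageGetOpaque(page), 0, sizeof(GinPageOpaqueData)); */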

While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care
of that.  This makes the whole page initialization logic more consistent
across all index AMs.

Author: Bharath Rupireddy
Reviewed-by: Vignesh C, Mahendra Singh Thalor
Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com
2021-04-07 14:35:26 +09:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).
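
(Structurally the retry is simple -- sketched here with a hypothetical
pruning wrapper; the real function carries far more state:)

    retry:
        npruned = prune_one_page(relation, buffer);  /* hypothetical */

        /* If a concurrently-aborted xact made a tuple DEAD between
         * pruning and this re-test, just prune again rather than
         * keeping a special "tupgone" code path. */
        if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buffer) ==
            HEAPTUPLE_DEAD)
            goto retry;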

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Peter Geoghegan 7ab96cf6b3 Refactor lazy_scan_heap() loop.
Add a lazy_scan_heap() subsidiary function that handles heap pruning and
tuple freezing: lazy_scan_prune().  This is a great deal cleaner.  The
code that remains in lazy_scan_heap()'s per-block loop can now be
thought of as code that either comes before or after the call to
lazy_scan_prune(), which is now the clear focal point.  This division is
enforced by the way in which we now manage state.  lazy_scan_prune()
outputs state (using its own struct) that describes what to do with the
page following pruning and freezing (e.g., visibility map maintenance,
recording free space in the FSM).  It doesn't get passed any special
instructional state from the preamble code, though.
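
(The output state is along these lines -- a condensed, partly
hypothetical rendering of the struct described, not its definition:)

    typedef struct LVPagePruneState
    {
        bool        hastup;             /* page can't be truncated away */
        bool        has_lpdead_items;   /* LP_DEAD items remain after pruning */
        bool        all_visible;        /* can set the VM bit? */
        bool        all_frozen;         /* can set the all-frozen VM bit? */
        TransactionId visibility_cutoff_xid;    /* for VM WAL-logging */
    } LVPagePruneState;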

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off
from the logic used by single-heap-pass VACUUMs.  The former case is now
structured as the omission of index and heap vacuuming by a two pass
VACUUM.  The latter case goes back to being used only when the table
happens to have no indexes (just as it was before commit a96c41fe).
This structure is much more natural, since the whole point of
INDEX_CLEANUP=off is to skip the index and heap vacuuming that would
otherwise take place.  The single-heap-pass case doesn't skip any useful
work, though -- it just does heap pruning and heap vacuuming together
when the table happens to have no indexes.

Both of these changes are preparation for an upcoming patch that
generalizes the mechanism used by INDEX_CLEANUP=off.  The later patch
will allow VACUUM to give up on index and heap vacuuming dynamically, as
problems emerge (e.g., with wraparound), so that an affected VACUUM
operation can finish up as soon as possible.

Also fix a very old bug in single-pass VACUUM VERBOSE output.  We were
reporting the number of tuples deleted via pruning as a direct
substitute for reporting the number of LP_DEAD items removed in a
function that deals with the second pass over the heap.  But that
doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered
during pruning, and use that in the report instead.  A single pass
VACUUM will always vacuum away whatever LP_DEAD items a heap page has
immediately after it is pruned, so the total number of LP_DEAD items
encountered during pruning equals the total number vacuumed-away.
(They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay
because skipping index vacuuming is now a totally orthogonal concept to
one-pass VACUUM.)

Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE
output.  This makes the output of VACUUM VERBOSE more consistent with
log_autovacuum's output (because it never showed information about
LP_UNUSED items).  VACUUM VERBOSE reported LP_UNUSED items left behind
by the last VACUUM, and LP_UNUSED items created via pruning HOT chains
during the current VACUUM (it never included LP_UNUSED items left behind
by the current VACUUM's second pass over the heap).  This makes it
useless as an indicator of line pointer bloat, which must have been the
original intention. (Like the first VACUUM VERBOSE issue, this issue was
arguably an oversight in commit 282d2a03, which added the heap-only
tuple optimization.)

Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start
reporting pages_removed instead.  This also makes the output of VACUUM
VERBOSE more consistent with log_autovacuum's output (which does not
show empty_pages, but does show pages_removed).  An empty page isn't
meaningfully different to a page that is almost empty, or a page that is
empty but for only a small number of remaining LP_UNUSED items.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 07:49:39 -07:00
Tom Lane 091e22b2e6 Clean up treatment of missing default and CHECK-constraint records.
Andrew Gierth reported that it's possible to crash the backend if no
pg_attrdef record is found to match an attribute that has atthasdef set.
AttrDefaultFetch warns about this situation, but then leaves behind
a relation tupdesc that has null "adbin" pointer(s), which most places
don't guard against.

We considered promoting the warning to an error, but throwing errors
during relcache load is pretty drastic: it effectively locks one out
of using the relation at all.  What seems better is to leave the
load-time behavior as a warning, but then throw an error in any code
path that wants to use a default and can't find it.  This confines
the error to a subset of INSERT/UPDATE operations on the table, and
in particular will at least allow a pg_dump to succeed.
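
(The use-time check amounts to something like this -- a sketch; the
actual message wording and call site may differ:)

    if (attrdef == NULL || attrdef->adbin == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("default value for column \"%s\" of relation \"%s\" not found",
                        NameStr(att->attname),
                        RelationGetRelationName(rel))));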

Also, we should fix AttrDefaultFetch to not leave any null pointers
in the tupdesc, because that just creates an untested bug hazard.

While at it, apply the same philosophy of "warn at load, throw error
only upon use of the known-missing info" to CHECK constraints.
CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch,
but for reasons lost in the mists of time, it was throwing ERROR for
the same cases that AttrDefaultFetch treats as WARNING.  Make the two
functions more nearly alike.

In passing, get rid of potentially-O(N^2) loops in equalTupleDesc
by making AttrDefaultFetch sort the entries after fetching them,
so that equalTupleDesc can assume that entries in two equal tupdescs
must be in matching order.  (CheckConstraintFetch already was sorting
CHECK constraints, but equalTupleDesc hadn't been told about it.)

There's some argument for back-patching this, but with such a small
number of field reports, I'm content to fix it in HEAD.

Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk
2021-04-06 10:34:39 -04:00
Fujii Masao 9de9294b0c Stop archive recovery if WAL generated with wal_level=minimal is found.
Previously if hot standby was enabled, archive recovery exited with
an error when it found WAL generated with wal_level=minimal.
But if hot standby was disabled, it just reported a warning and
continued in that case, which could lead to data loss or errors
during normal operation. The warning was easy to miss, so users might
not notice this serious situation until they encountered the actual
errors.

To improve this situation, this commit changes archive recovery
so that it exits with a FATAL error when it finds WAL generated with
wal_level=minimal, regardless of the hot standby setting. This enables
users to notice the serious situation sooner.
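
(Schematically -- a sketch of the new behavior, not the exact code or
message text:)

    if (record_wal_level == WAL_LEVEL_MINIMAL)  /* seen in the WAL stream */
        ereport(FATAL,
                (errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
                 errhint("Use a backup taken after setting wal_level to higher than minimal.")));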

The FATAL error is thrown if archive recovery starts from a base
backup taken before wal_level is changed to minimal. When archive
recovery exits with the error, if users have a base backup taken
after setting wal_level to higher than minimal, they can recover
the database by starting archive recovery from that newer backup.
But note that if no such backup exists, there is no easy way to
complete archive recovery, which may leave the database server
unstartable and cause users to lose the whole database. The commit
adds a note about this risk to the documentation.

Even with an unstartable database server, users could previously
avoid the error during archive recovery just by disabling hot standby,
forcibly start up the server, and salvage data from it. Note that this
commit makes that procedure entirely unavailable.

Author: Takamichi Osumi
Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao
Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
2021-04-06 22:56:51 +09:00
Peter Geoghegan f6b8f19a08 Allocate access strategy in parallel VACUUM workers.
Commit 49f49def took entirely the wrong approach to fixing this issue.
Just allocate a local buffer access strategy in each individual worker
instead of trying to propagate state.  This state was never propagated
by parallel VACUUM in the first place.

It looks like the only reason that this worked following commit 40d964ec
was that it involved static global variables, which are initialized to 0
per the C standard.

A more comprehensive fix may be necessary, even on HEAD.  This fix
should at least get the buildfarm green once again.

Thanks once again to Thomas Munro for continued off-list assistance with
the issue.
2021-04-05 17:17:40 -07:00
Tom Lane 09c1c6ab4b Support INCLUDE'd columns in SP-GiST.
Not much to say here: does what it says on the tin.
We steal a previously-always-zero bit from the nextOffset
field of leaf index tuples in order to track whether there
is a nulls bitmap.  Otherwise it works about like included
columns in other index types.
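
(The bit-stealing trick looks roughly like this; the macro names are
assumptions for illustration, not necessarily the committed ones:)

    /* nextOffset never needs more than 15 bits, so the top bit of the
     * field can record whether the leaf tuple has a nulls bitmap. */
    #define SGLT_GET_NEXTOFFSET(t)   ((t)->nextOffset & 0x7FFF)
    #define SGLT_GET_HASNULLMASK(t)  (((t)->nextOffset & 0x8000) != 0)
    #define SGLT_SET_HASNULLMASK(t)  ((t)->nextOffset |= 0x8000)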

Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova,
and rather heavily editorialized on by me

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-05 18:41:21 -04:00
Peter Geoghegan 49f49defe7 Propagate parallel VACUUM's buffer access strategy.
Parallel VACUUM relied on global variable state from the leader process
being propagated to workers on fork().  Commit b4af70cb removed most
uses of global variables inside vacuumlazy.c, but did not account for
the buffer access strategy state.

To fix, propagate the state through shared memory instead.

Per buildfarm failures on elver, curculio, and morepork.

Many thanks to Thomas Munro for off-list assistance with this issue.
2021-04-05 14:56:56 -07:00
Peter Geoghegan b4af70cb21 Simplify state managed by VACUUM.
Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand.  Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead.  Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output.  We were using double, but that doesn't seem to have any
advantages.  Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap.  These assertions will be added in later
commits.

Finally, adjust the signatures of functions with IndexBulkDeleteResult
pointer arguments in cases where there was ambiguity about whether or
not the argument relates to a single index or all indexes.  Functions
now use the idiom that both ambulkdelete() and amvacuumcleanup() have
always used (where appropriate): accept a mutable IndexBulkDeleteResult
pointer argument, and return a result IndexBulkDeleteResult pointer to
caller.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
2021-04-05 13:21:44 -07:00
Tom Lane dfc843d465 Fix more confusion in SP-GiST.
spg_box_quad_leaf_consistent unconditionally returned the leaf
datum as leafValue, even though in its usage for poly_ops that
value is of completely the wrong type.

In versions before 12, that was harmless because the core code did
nothing with leafValue in non-index-only scans ... but since commit
2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would
unconditionally try to copy the value using the wrong datatype
parameters.  Said copying is a waste of time and space if we're not
going to return the data, but it accidentally failed to fail until
I fixed the datatype confusion in ac9099fc1.

Hence, change spgNewHeapItem to not copy the datum unless we're
actually going to return it later.  This saves cycles and dodges
the question of whether lossy opclasses are returning the right
type.  Also change spg_box_quad_leaf_consistent to not return
data that might be of the wrong type, as insurance against
somebody introducing a similar bug into the core code in future.

It seems like a good idea to back-patch these two changes into
v12 and v13, although I'm afraid to change spgNewHeapItem's
mistaken idea of which datatype to use in those branches.

Per buildfarm results from ac9099fc1.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 17:57:07 -04:00
Tom Lane ac9099fc1d Fix confusion in SP-GiST between attribute type and leaf storage type.
According to the documentation, the attType passed to the opclass
config function (and also relied on by the core code) is the type
of the heap column or expression being indexed.  But what was
actually being passed was the type stored for the index column.
This made no difference for user-defined SP-GiST opclasses,
because we weren't allowing the STORAGE clause of CREATE OPCLASS
to be used, so the two types would be the same.  But it's silly
not to allow that, seeing that the built-in poly_ops opclass
has a different value for opckeytype than opcintype, and that if you
want to do lossy storage then the types must really be different.
(Thus, user-defined opclasses doing lossy storage had to lie about
what type is in the index.)  Hence, remove the restriction, and make
sure that we use the input column type not opckeytype where relevant.

For reasons of backwards compatibility with existing user-defined
opclasses, we can't quite insist that the specified leafType match
the STORAGE clause; instead just add an amvalidate() warning if
they don't match.

Also fix some bugs that would only manifest when trying to return
index entries when attType is different from attLeafType.  It's not
too surprising that these have not been reported, because the only
usual reason for such a difference is to store the leaf value
lossily, rendering index-only scans impossible.

Add a src/test/modules module to exercise cases where attType is
different from attLeafType and yet index-only scan is supported.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
2021-04-04 14:28:57 -04:00
Tomas Vondra d9c5b9a9ee Fix bug in brin_minmax_multi_union
When calling sort_expanded_ranges() we need to remember the return
value, because the function sorts and also deduplicates the ranges. So
the number of ranges may decrease. brin_minmax_multi_union failed to do
that, which resulted in crashes due to bogus ranges (equal minval/maxval
but not marked as compacted).

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to
2021-04-04 19:36:12 +02:00
Tomas Vondra 1dad2a5ea3 Fix order of parameters in BRIN minmax-multi calls
The BRIN minmax-multi consistent function incorrectly assumed it can
lookup an operator, and then swap the arguments to get the commutator.
For example <(a,b) would be called as <(b,a) to get >(a,b). This works
when the arguments are of the same type, but with cross-type opclasses
this fails. We can't swap <(float4,float8) arguments, for example.

Fixed by passing arguments in the right order.

Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:25:41 +02:00
Tomas Vondra e1fbe1181c Fix BRIN minmax-multi distance for inet type
The distance calculation ignored the mask, unlike the inet comparator,
which resulted in negative distance in some cases. Fixed by applying the
mask in brin_minmax_multi_distance_inet. I've considered simply calling
inetmi() to calculate the delta, but that does not consider mask either.

Reviewed-by: Zhihong Yu
Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com
2021-04-04 19:23:32 +02:00
Tomas Vondra 7262f2421a Fix BRIN minmax-multi distance for timetz type
The distance calculation ignored the time zone, so the result of (b-a)
might have ended up negative even if (b > a). Fixed by considering the
time zone difference.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
2021-04-04 19:22:23 +02:00
Tomas Vondra 2b10e0e3c2 Fix BRIN minmax-multi distance for interval type
The distance calculation for interval type was treating months as having
31 days, which is inconsistent with the interval comparator (using 30
days). Due to this it was possible to get a negative distance (b-a)
when (a<b), triggering an assert.

Fixed by adopting the same logic as interval_cmp_value.
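
(interval_cmp_value reduces an interval to a single microsecond count
using 30-day months; sketched below for a hypothetical Interval *iv:)

    INT128      span = int64_to_int128(iv->time);

    int128_add_int64_mul_int64(&span, iv->month,
                               INT64CONST(30) * USECS_PER_DAY);
    int128_add_int64_mul_int64(&span, iv->day, USECS_PER_DAY);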

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com
2021-04-04 19:19:51 +02:00
Tom Lane 1ebdec8c03 Rethink handling of pass-by-value leaf datums in SP-GiST.
The existing convention in SP-GiST is that any pass-by-value datatype
is stored in Datum representation, i.e. it's of width sizeof(Datum)
even when typlen is less than that.  This is okay, or at least it's
too late to change it, for prefix datums and node-label datums in inner
(upper) tuples.  But it's problematic for leaf datums, because we'd
prefer those to be stored in Postgres' standard on-disk representation
so that we can easily extend leaf tuples to carry additional "included"
columns.

I believe, however, that we can get away with just up and changing that.
This would be an unacceptable on-disk-format break, but there are two
big mitigating factors:

1. It seems quite unlikely that there are any SP-GiST opclasses out
there that use pass-by-value leaf datatypes.  Certainly none of the
ones in core do, nor has codesearch.debian.net heard of any.  Given
what SP-GiST is good for, it's hard to conceive of a use-case where
the leaf-level values would be both small and fixed-width.  (As an
example, if you wanted to index text values with the leaf level being
just a byte, then every text string would have to be represented
with one level of inner tuple per preceding byte, which would be
horrendously space-inefficient and slow to access.  You always want
to use as few inner-tuple levels as possible, leaving as much as
possible in the leaf values.)

2. Even granting that you have such an index, this change only
breaks things on big-endian machines.  On little-endian, the high
order bytes of the Datum format will now just appear to be alignment
padding space.

So, change the code to store pass-by-value leaf datums in their
usual on-disk form.  Inner-tuple datums are not touched.
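
(Concretely -- a sketch with hypothetical surrounding variables --
writing the datum in on-disk form means honoring the type's true
length for by-value types:)

    if (att->attbyval)
        store_att_byval(leafptr, datum, att->attlen);   /* 1, 2, 4 or 8 bytes */
    else
        memcpy(leafptr, DatumGetPointer(datum), datalen);   /* as before */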

This is extracted from a larger patch that intends to add support for
"included" columns.  I'm committing it separately for visibility in
our commit logs.

Pavel Borisov and Tom Lane, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
2021-04-01 17:55:17 -04:00
Noah Misch 0ff8bbdee1 Accept slightly-filled pages for tuples larger than fillfactor.
We always inserted a larger-than-fillfactor tuple into a newly-extended
page, even when existing pages were empty or contained nothing but an
unused line pointer.  This was unnecessary relation extension.  Start
tolerating page usage up to 1/8 the maximum space that could be taken up
by line pointers.  This is somewhat arbitrary, but it should allow more
cases to reuse pages.  This has no effect on tables with fillfactor=100
(the default).
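
(Numerically, the new tolerance works out to roughly this -- a sketch
of the arithmetic, not the committed code:)

    /* Treat a page as still "empty enough" for an over-fillfactor
     * tuple if no more than 1/8 of the maximum possible line-pointer
     * space is already used up. */
    Size    nearlyEmptyFreeSpace =
        MaxHeapTupleSize - (MaxHeapTuplesPerPage / 8) * sizeof(ItemIdData);

    if (len > nearlyEmptyFreeSpace)
        targetFreeSpace = Max(len, nearlyEmptyFreeSpace);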

John Naylor and Floris van Nee.  Reviewed by Matthias van de Meent.
Reported by Floris van Nee.

Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
2021-03-30 18:53:44 -07:00
David Rowley af527705ed Adjust design of per-worker parallel seqscan data struct
The design of the data structures which allow storage of the per-worker
memory during parallel seq scans was not ideal. The work done in
56788d215 required an additional data structure to allow workers to
remember the range of pages that had been allocated to them for
processing during a parallel seqscan.  That commit added a void pointer
field to TableScanDescData to allow heapam to store the per-worker
allocation information.  However putting the field there made very little
sense given that we have AM specific structs for that, e.g.
HeapScanDescData.

Here we remove the void pointer field from TableScanDescData and add a
dedicated field for this purpose to HeapScanDescData.

Previously we also allocated memory for this parallel per-worker data for
all scans, regardless of whether the scan was parallel or not.  This was just a
wasted allocation for non-parallel scans, so here we make the allocation
conditional on the scan being parallel.

Also, add previously missing pfree() to free the per-worker data in
heap_endscan().

Reported-by: Andres Freund
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de
2021-03-30 10:17:09 +13:00
Tomas Vondra 73b96bad4a Fix alignment in BRIN minmax-multi deserialization
The deserialization failed to ensure correct alignment, as it assumed
it could simply point into the serialized value. The serialization,
however, ignores alignment and copies just the significant bytes in
order to make the result as small as possible. This caused failures on
systems that are sensitive to misaligned addresses, like sparc, or with
address sanitizer enabled.

Fixed by copying the serialized data to ensure proper alignment. While
at it, fix an issue with serialization on big endian machines, using the
same store_att_byval/fetch_att trick as extended statistics.

Discussion: https://postgr.es/m/0c8c3304-d3dd-5e29-d5ac-b50589a23c8c%40enterprisedb.com
2021-03-26 16:48:36 +01:00