Backward and forward scans share much of the same page acquisition code.
Here we consolidate that code to reduce some duplication.
Additionally, add a new rs_coffset field to HeapScanDescData to track the
offset of the current tuple. The new field fits nicely into the padding
between a bool and BlockNumber field and saves having to look at the last
returned tuple to figure out which offset we should be looking at for the
current tuple.
Author: Melanie Plageman
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
Here remove some dead code from heapgettup() and heapgettup_pagemode()
which was trying to support NoMovementScanDirection scans. This code can
never be reached as standard_ExecutorRun() never calls ExecutePlan with
NoMovementScanDirection.
Additionally, plans which were scanning an unordered index would use
NoMovementScanDirection rather than ForwardScanDirection. There was no
real need for this, so here we adjust this so we use ForwardScanDirection
for unordered index scans. A comment in pathnodes.h claimed that
NoMovementScanDirection was used for PathKey reasons, but if that was
true, it no longer is, per code in build_index_paths().
This does change the non-text format of the EXPLAIN output so that
unordered index scans now have a "Forward" scan direction rather than
"NoMovement". The text format of EXPLAIN has not changed.
Author: Melanie Plageman
Reviewed-by: Tom Lane, David Rowley
Discussion: https://postgr.es/m/CAAKRu_bvkhka0CZQun28KTqhuUh5ZqY=_T8QEqZqOL02rpi2bw@mail.gmail.com
Eager freezing strategy avoids large build-ups of all-visible pages. It
makes VACUUM trigger page-level freezing whenever doing so will enable
the page to become all-frozen in the visibility map. This is useful for
tables that experience continual growth, particularly strict append-only
tables such as pgbench's history table. Eager freezing significantly
improves performance stability by spreading out the cost of freezing
over time, rather than doing most freezing during aggressive VACUUMs.
It complements the insert autovacuum mechanism added by commit b07642db.
VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables.
Tables that exceed the size threshold use the eager freezing strategy.
Unlogged tables and temp tables always use eager freezing strategy,
since the added cost is negligible there. Non-permanent relations won't
incur any extra overhead in WAL written (for the obvious reason), nor in
pages dirtied (since any extra freezing will only take place on pages
whose PD_ALL_VISIBLE bit needed to be set either way).
VACUUM uses lazy freezing strategy for logged tables that fall under the
GUC size threshold. Page-level freezing triggers based on the criteria
established in commit 1de58df4, which added basic page-level freezing.
Eager freezing is strictly more aggressive than lazy freezing. Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use. The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.
Note that even lazy freezing strategy will trigger freezing whenever a
page happens to have required that an FPI be written during pruning,
provided that the page will thereby become all-frozen in the visibility
map afterwards (due to the FPI optimization from commit 1de58df4).
The vacuum_freeze_strategy_threshold default setting is 4GB. This is a
relatively low setting that prioritizes performance stability. It will
be reviewed at the end of the Postgres 16 beta period.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
An upcoming patch by Melanie Plageman does some refactoring work in this
area. Run pgindent on that file now before making any changes so that
it's easier to maintain/evolve each of the individual patches doing the
refactor work. Additionally, add a few new required typedefs to the list
to make it easier to do future pgindent runs on this file during the
refactor work.
Discussion: https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
Tighten up the way that visibilitymap_set() is called: request that both
the all-visible and all-frozen bits get set whenever the all-frozen bit
is set, regardless of what we think we know about the present state of
the all-visible bit. Also make sure that the page level PD_ALL_VISIBLE
flag is set in the same code path.
In practice there doesn't seem to be a concrete scenario in which the
previous approach could lead to inconsistencies. It was almost possible
in scenarios involving concurrent HOT updates from transactions that
abort, but (unlike pruning) freezing can never remove XIDs > VACUUM's
OldestXmin, even those from transactions that are known to have aborted.
That was protective here.
These issues have been around since commit a892234f83, which added the
all-frozen bit to the VM fork. There is no known live bug here, so no
backpatch.
In passing, add some defensive assertions to catch the issue, and stop
reading the existing state of the VM when setting the VM in VACUUM's
final heap pass. We already know that affected pages must have had at
least one LP_DEAD item before we set it LP_UNUSED, so there is no point
in reading the VM when it is set like this.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
Rename the heapam.c freeze plan deduplication routines added by commit
9e540599 to names that follow conventions for functions in heapam.c.
Also relocate the functions so that they're next to their caller, which
runs during original execution, when FREEZE_PAGE WAL records are built.
The routines were initially placed next to (and followed the naming
conventions of) conceptually related REDO routine code, but that scheme
turned out to be kind of jarring when considered in a wider context.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230109214308.icz26oqvt3k2274c@awork3.anarazel.de
Document that TransactionIdDidAbort() won't indicate that transactions
that were in-progress during a crash have aborted. Tie this to existing
discussion of the TransactionIdDidCommit() and TransactionIdDidCommit()
protocol that code in heapam_visibility.c (and a few other places) must
observe.
Follow-up to bugfix commit eb5ad4ff.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzn4bEEqgmaUQL3aJ73yM9gAeK-wE4ngi7kjRjLztb+P0w@mail.gmail.com
We cannot rely on TransactionIdDidAbort here, since in general it may
report transactions that were in-progress at the time of an earlier hard
crash as not aborted, effectively behaving as if they were still in
progress even after crash recovery completes. Go back to defensively
verifying that xmax didn't commit instead.
Oversight in commit 79d4bf4e.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230104035636.hy5djyr2as4gbc4q@awork3.anarazel.de
pg_xact lookups are relatively expensive. Move the xmin/xmax commit
status checks from the point that freeze plans are prepared to the point
that they're actually executed. Otherwise we'll repeat many commit
status checks whenever multiple successive VACUUM operations scan the
same pages and decide against freezing each time, which is a waste of
cycles.
Oversight in commit 1de58df4, which added page-level freezing.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkZpe4K6qMfEt8H4qYJCKc2R7TPvKsBva7jc9w7iGXQSw@mail.gmail.com
Improve comments added by commit 1de58df4 which describe the
lazy_scan_prune "freeze the page" path. These newly revised comments
are based on suggestions from Jeff Davis.
In passing, remove nearby visibility_cutoff_xid comments left over from
commit 6daeeb1f.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/ebc857107fe3edd422ef8a65191ca4a8da568b9b.camel@j-davis.com
The term "truncation" has been ambiguous since commit 10a8d13823 added
line pointer array truncation during heap pruning. Clear things up by
specifying that we're talking about rel truncation here, to match nearby
comments that apply to tuples with storage.
Don't allow VACUUM to WAL-log the value FrozenTransactionId as the
snapshotConflictHorizon of freezing or visibility map related WAL
records.
The only special XID value that's an allowable snapshotConflictHorizon
is InvalidTransactionId, which is interpreted as "record definitely
doesn't require a recovery conflict".
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages. Individual XIDs and MXIDs fields from tuple
headers now trigger freezing of whole pages, rather than independently
triggering freezing of each individual tuple header field.
Managing the cost of freezing over time now significantly influences
when and how VACUUM freezes. The overall amount of WAL written is the
single most important freezing related cost, in general. Freezing each
page's tuples together in batch allows VACUUM to take full advantage of
the freeze plan WAL deduplication optimization added by commit 9e540599.
Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI. We'll have already written a large
amount of WAL just to do that much, so it's very likely a good idea to
get freezing out of the way for the page early. This only happens in
cases where it will directly lead to marking the page all-frozen in the
visibility map.
In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new multi is especially worth avoiding here).
FreezeMultiXactId is eager when processing is cheap (as it usually is),
and lazy in the event of an individual multi that happens to require
expensive second pass processing. This avoids regressions related to
processing of multis that page-level freezing might otherwise cause.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
When VACUUM determines that an existing MultiXact should use a freeze
plan that sets xmax to InvalidTransactionId, the original Multi may or
may not be before OldestMxact. Remove an incorrect assertion that
expected it to always be from before OldestMxact.
Oversight in commit 4ce3af.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/TYAPR01MB5866B24104FD80B5D7E65C3EF5ED9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Perform a failsafe check every time VACUUM's first heap scan scans a
further FAILSAFE_EVERY_PAGES pages, rather than using an approach based
on the number of physical blocks that our current blkno is from the
blkno at the time of the previous failsafe check. That way VACUUM will
perform a failsafe check every time it has scanned a uniform number of
pages, without it mattering when or how VACUUM skipped pages using the
visibility map.
Sami Imseih, with changes to FAILSAFE_EVERY_PAGES comments added by me.
Author: Sami Imseih <simseih@amazon.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/401CE010-4049-4B94-9961-0B610A5D254D%40amazon.com
Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin. This state is initialized in vacuum.c, and
then passed around by code from vacuumlazy.c to heapam.c freezing
related routines. The new convention is that everybody works off of the
same cutoff state, which is passed around via pointers to const.
Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before. That way the routine's xmin handling
code is symmetrical with the existing xmax handling code. This is
preparation for an upcoming commit that will add page level freezing.
Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks. We now test OldestXmin directly, instead of
using FreezeLimit as an inexact proxy for OldestXmin. This is further
preparation for the page level freezing work, which will make the
function's caller cede control of page level freezing to the function
where appropriate (where heap_prepare_freeze_tuple sees a tuple that
happens to contain a MultiXactId in its xmax).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
Because we added StaticAssertStmt() first before StaticAssertDecl(),
some uses as well as the instructions in c.h are now a bit backwards
from the "native" way static assertions are meant to be used in C.
This updates the guidance and moves some static assertions to better
places.
Specifically, since the addition of StaticAssertDecl(), we can put
static assertions at the file level. This moves a number of static
assertions out of function bodies, where they might have been stuck
out of necessity, to perhaps better places at the file level or in
header files.
Also, when the static assertion appears in a position where a
declaration is allowed, then using StaticAssertDecl() is more native
than StaticAssertStmt().
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
The same code pattern is repeated 17 times for int64 counters (0 for
missing entry) and 5 times for timestamps (NULL for missing entry) on
table entries. This code is switched to use a macro for the basic code
instead, shaving a few hundred lines of originally-duplicated code. The
function names remain the same, but some fields of PgStat_StatTabEntry
have to be renamed to cope with the new style.
Author: Bertrand Drouvot
Reviewed-by: Nathan Bossart
Discussion: https:/postgr.es/m/20221204173207.GA2669116@nathanxps13
Pass VACUUM parameters (VacuumParams state) to vacuum_set_xid_limits()
directly, rather than passing most individual VacuumParams fields as
separate arguments.
Also make vacuum_set_xid_limits() output parameter symbol names match
those used by its vacuumlazy.c caller.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=TE7gW5DgSahDkf0UEZigFGAoHNNN6EvSrdzC=Kn+hrA@mail.gmail.com
We shouldn't ever need to rely on whether HEAP_XMAX_INVALID is set in
t_infomask when considering whether or not an xmax should be deemed
already frozen, since that status flag is just a hint. The only
acceptable representation for an "xmax_already_frozen" raw xmax field is
the transaction ID value zero (also known as InvalidTransactionId).
Adjust code that superficially appeared to rely on HEAP_XMAX_INVALID to
make the rule about xmax_already_frozen clear. Also avoid needlessly
rereading the tuple's raw xmax.
Oversight in bugfix commit d2599ecf. There is no evidence that this
ever led to incorrect behavior, so no backpatch. The worst consequence
of this bug was that VACUUM could hypothetically fail to notice and
report on certain kinds of corruption, which seems fairly benign.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzkh3DMCDRPfhZxj9xCq9v3WmzvmbiCpf1dNKUBPadhCbQ@mail.gmail.com
Standardize on the name snapshotConflictHorizon for all XID fields from
WAL records that generate recovery conflicts when in hot standby mode.
This supersedes the previous latestRemovedXid naming convention.
The new naming convention places emphasis on how the values are actually
used by REDO routines. How the values are generated during original
execution (details of which vary by record type) is deemphasized. Users
of tools like pg_waldump can now grep for snapshotConflictHorizon to see
all potential sources of recovery conflicts in a standardized way,
without necessarily having to consider which specific record types might
be involved.
Also bring a couple of WAL record types that didn't follow any kind of
naming convention into line. These are heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type. Now every WAL record whose REDO
routine calls ResolveRecoveryConflictWithSnapshot() passes through the
snapshotConflictHorizon field from its WAL record. This is follow-up
work to the refactoring from commit 9e540599 that made FREEZE_PAGE WAL
records use a standard snapshotConflictHorizon style XID cutoff.
No bump in XLOG_PAGE_MAGIC, since the underlying format of affected WAL
records doesn't change.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzm2CQUmViUq7Opgk=McVREHSOorYaAjR1ZpLYkRN7_dPw@mail.gmail.com
Make heapam WAL records that describe freezing performed by VACUUM more
space efficient by storing each distinct "freeze plan" once, alongside
an array of associated page offset numbers (one per freeze plan). The
freeze plans required for most heap pages tend to naturally have a great
deal of redundancy, so this technique is very effective in practice. It
often leads to freeze WAL records that are less than 20% of the size of
equivalent WAL records generated using the previous approach.
The freeze plan concept was introduced by commit 3b97e6823b, which fixed
bugs in VACUUM's handling of MultiXacts. We retain the concept of
freeze plans, but go back to using page offset number arrays. There is
no loss of generality here because deduplication is an additive process
that gets applied mechanically when FREEZE_PAGE WAL records are built.
More than anything else, freeze plan deduplication is an optimization
that reduces the marginal cost of freezing additional tuples on pages
that will need to have at least one or two tuples frozen in any case.
Ongoing work that adds page-level freezing to VACUUM will take full
advantage of the improved cost profile through batching.
Also refactor some of the details surrounding recovery conflicts needed
to REDO freeze records in passing: make original execution responsible
for generating a standard latestRemovedXid cutoff, rather than working
backwards to get the same cutoff in the REDO routine. Bugfix commit
66fbcb0d2e did it the other way around, which is equivalent but obscures
what's going on.
Also rename the cutoff field from the WAL record/struct (rename the
field cutoff_xid to latestRemovedXid to match similar WAL records).
Processing of conflicts by REDO routines is already completely uniform,
so tools like pg_waldump should present the information driving the
process uniformly. There are two remaining WAL record types that still
don't quite follow this convention (heapam's VISIBLE record type and
SP-GiST's VACUUM_REDIRECT record type). They can be brought into line
by later work that totally standardizes how the cutoffs are presented.
Bump XLOG_PAGE_MAGIC.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-Wz=XytErMnb8FAyFd+OQEbiipB0Q2FmFdXrggPL4VBnRYQ@mail.gmail.com
The original report was concerned with a possible inconsistency
between the heap and the visibility map, which I was unable to
confirm. The concern has been retracted.
However, there did seem to be a torn page hazard when using
checksums. By not setting the heap page LSN during redo, the
protections of minRecoveryPoint were bypassed. Fixed, along with a
misleading comment.
It may have been impossible to hit this problem in practice, because
it would require a page tear between the checksum and the flags, so I
am marking this as a theoretical risk. But, as discussed, it did
violate expectations about the page LSN, so it may have other
consequences.
Backpatch to all supported versions.
Reported-by: Konstantin Knizhnik
Reviewed-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/fed17dac-8cb8-4f5b-d462-1bb4908c029e@garret.ru
Backpatch-through: 11
We have various requirements when using a dlist_head to keep track of the
number of items in the list. This, traditionally, has been done by
maintaining a counter variable in the calling code. Here we tidy this up
by adding "dclist", which is very similar to dlist but also keeps track of
the number of items stored in the list.
Callers may use the new dclist_count() function when they need to know how
many items are stored. Obtaining the count is an O(1) operation.
For simplicity reasons, dclist and dlist both use dlist_node as their node
type and dlist_iter/dlist_mutable_iter as their iterator type. dclists
have all of the same functionality as dlists except there is no function
named dclist_delete(). To remove an item from a list dclist_delete_from()
must be used. This requires knowing which dclist the given item is stored
in.
Additionally, here we also convert some dlists where additional code
exists to keep track of the number of items stored and to make these use
dclists instead.
Author: David Rowley
Reviewed-by: Bharath Rupireddy, Aleksander Alekseev
Discussion: https://postgr.es/m/CAApHDvrtVxr+FXEX0VbViCFKDGxA3tWDgw9oFewNXCJMmwLjLg@mail.gmail.com
Here we add a new 'copy' parameter to tuplesort_getdatum so that we can
instruct the function not to datumCopy() byref Datums before returning.
Similar to 91e9e89dc, this can provide significant performance
improvements in nodeSort when sorting by a single byref column and the
sort's targetlist contains only that column.
This allows us to re-enable Datum sorts for byref types which was disabled
in 3a5817695 due to a reported memory leak.
Additionally, here we slightly optimize DISTINCT aggregates so that we no
longer perform any datumCopy() when we find the current value not to be
distinct from the previous value. Previously the code would always take a
copy of the most recent Datum and pfree the previous value, even when the
values were the same. Testing shows a small but noticeable performance
increase when aggregate transitions are skipped due to the current
transition value being the same as the prior one.
Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvqS6wC5U==k9Hd26E4EQXH3QR67-T4=Q1rQ36NGvjfVSg@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvqHonfe9G1cVaKeHbDx70R_zCrM3qP2AGXpGrieSKGnhA@mail.gmail.com
In a similar effort to f01592f91, here we mostly rename shadowed local
variables to remove the warnings produced when compiling with
-Wshadow=compatible-local.
This fixes 63 warnings and leaves just 5.
Author: Justin Pryzby, David Rowley
Reviewed-by: Justin Pryzby
Discussion https://postgr.es/m/20220817145434.GC26426%40telsasoft.com
Commit 34f581c39 intended to ensure that RelationGetBufferForTuple
would acquire a visibility-map page pin in case the otherBuffer's
all-visible bit had become set since we last had lock on that page.
But I missed a case: when we're extending the relation, VM concerns
were dealt with only in the relatively-less-likely case that we
fail to conditionally lock the otherBuffer. I think I'd believed
that we couldn't need to worry about it if the conditional lock
succeeds, which is true for the target buffer; but the otherBuffer
was unlocked for awhile so its bit might be set anyway. So we need
to do the GetVisibilityMapPins dance, and then also recheck the
page's free space, in both cases.
Per report from Jaime Casanova. Back-patch to v12 as the previous
patch was (although there's still no evidence that the bug is
reachable pre-v14).
Discussion: https://postgr.es/m/E1lWLjP-00006Y-Ml@gemulon.postgresql.org
Commits cf112c12 and a0dc8271 were a little too hasty in getting rid of
the pg_ prefixes where we use pread(), pwrite() and vectored variants.
We dropped support for ancient Unixes where we needed to use lseek() to
implement replacements for those, but it turns out that Windows also
changes the current position even when you pass in an offset to
ReadFile() and WriteFile() if the file handle is synchronous, despite
its documentation saying otherwise.
Switching to asynchronous file handles would fix that, but have other
complications. For now let's just put back the pg_ prefix and add some
comments to highlight the non-standard side-effect, which we can now
describe as Windows-only.
Reported-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
Autoconf is showing its age, fewer and fewer contributors know how to wrangle
it. Recursive make has a lot of hard to resolve dependency issues and slow
incremental rebuilds. Our home-grown MSVC build system is hard to maintain for
developers not using Windows and runs tests serially. While these and other
issues could individually be addressed with incremental improvements, together
they seem best addressed by moving to a more modern build system.
After evaluating different build system choices, we chose to use meson, to a
good degree based on the adoption by other open source projects.
We decided that it's more realistic to commit a relatively early version of
the new build system and mature it in tree.
This commit adds an initial version of a meson based build system. It supports
building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD,
Solaris and Windows (however only gcc is supported on aix, solaris). For
Windows/MSVC postgres can now be built with ninja (faster, particularly for
incremental builds) and msbuild (supporting the visual studio GUI, but
building slower).
Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM
bitcode generation, documentation adjustments) are done in subsequent commits
requiring further review. Other aspects (e.g. not installing test-only
extensions) are not yet addressed.
When building on Windows with msbuild, builds are slower when using a visual
studio version older than 2019, because those versions do not support
MultiToolTask, required by meson for intra-target parallelism.
The plan is to remove the MSVC specific build system in src/tools/msvc soon
after reaching feature parity. However, we're not planning to remove the
autoconf/make build system in the near future. Likely we're going to keep at
least the parts required for PGXS to keep working around until all supported
versions build with meson.
Some initial help for postgres developers is at
https://wiki.postgresql.org/wiki/Meson
With contributions from Thomas Munro, John Naylor, Stone Tickle and others.
Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
Make sure that function declarations use names that exactly match the
corresponding names from function definitions. Having parameter names
that are reliably consistent in this way will make it easier to reason
about groups of related C functions from the same translation unit as a
module. It will also make certain refactoring tasks easier.
Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy. Later commits will do the
same for other parts of the codebase.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
Add a new line to log reports from autovacuum (as well as VACUUM VERBOSE
output) that shows information about freezing. Emphasis is placed on
the total number of heap pages that had one or more tuples frozen by
VACUUM. The total number of tuples frozen is also shown.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Janes <jeff.janes@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznTY6D0zyE8VLrC6Gd4kh_HGAXxnTPtcOQOOsxzLx9zog@mail.gmail.com
Several backend-side loops scanning one or more directories with
ReadDir() (WAL segment recycle/removal in xlog.c, backend-side directory
copy, temporary file removal, configuration file parsing, some logical
decoding logic and some pgtz stuff) already know the type of the entry
being scanned thanks to the dirent structure associated to the entry, on
platforms where we know about DT_REG, DT_DIR and DT_LNK to make the
difference between a regular file, a directory and a symbolic link.
Relying on the direct structure of an entry saves a few system calls to
stat() and lstat() in the loops updated here, shaving some code while on
it. The logic of the code remains the same, calling stat() or lstat()
depending on if it is necessary to look through symlinks.
Authors: Nathan Bossart, Bharath Rupireddy
Reviewed-by: Andres Freund, Thomas Munro, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACV8n-J-f=yiLUOx2=HrQGPSOZM3nWzyQQvLPcccPXxEdg@mail.gmail.com
Before now, the cutoffs that VACUUM used to determine which XIDs/MXIDs
to freeze were determined at the start of each VACUUM by taking related
cutoffs that represent which XIDs/MXIDs VACUUM should treat as still
running, and subtracting an XID/MXID age based value controlled by GUCs
like vacuum_freeze_min_age. The FreezeLimit cutoff (XID freeze cutoff)
was derived by subtracting an XID age value from OldestXmin, while the
MultiXactCutoff cutoff (MXID freeze cutoff) was derived by subtracting
an MXID age value from OldestMxact. This approach didn't match the
approach used nearby to determine whether this VACUUM operation should
be an aggressive VACUUM or not.
VACUUM now uses the standard approach instead: it subtracts the same
age-based values from next XID/next MXID (rather than subtracting from
OldestXmin/OldestMxact). This approach is simpler and more uniform.
Most of the time it will have only a negligible impact on how and when
VACUUM freezes. It will occasionally make VACUUM more robust in the
event of problems caused by long running transaction. These are cases
where OldestXmin and OldestMxact are held back by so much that they
attain an age that is a significant fraction of the value of age-based
settings like vacuum_freeze_min_age.
There is no principled reason why freezing should be affected in any way
by the presence of a long-running transaction -- at least not before the
point that the OldestXmin and OldestMxact limits used by each VACUUM
operation attain an age that makes it unsafe to freeze some of the
XIDs/MXIDs whose age exceeds the value of the relevant age-based
settings. The new approach should at least make freezing degrade more
gracefully than before, even in the most extreme cases.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkOv5CEeyOO=c91XnT5WBR_0gii0Wn5UbZhJ=4TTykDYg@mail.gmail.com
pread() and pwrite() are in SUSv2, and all targeted Unix systems have
them.
Previously, we defined pg_pread and pg_pwrite to emulate these function
with lseek() on old Unixen. The names with a pg_ prefix were a reminder
of a portability hazard: they might change the current file position.
That hazard is gone, so we can drop the prefixes.
Since the remaining replacement code is Windows-only, move it into
src/port/win32p{read,write}.c, and move the declarations into
src/include/port/win32_port.h.
No need for vestigial HAVE_PREAD, HAVE_PWRITE macros as they were only
used for declarations in port.h which have now moved into win32_port.h.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Greg Stark <stark@mit.edu>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
This reverts commits 5753d4ee32 and fe60b67250 that modified HOT to
ignore BRIN indexes. The commit message for 5753d4ee32 claims that:
When determining whether an index update may be skipped by using
HOT, we can ignore attributes indexed only by BRIN indexes. There
are no index pointers to individual tuples in BRIN, and the page
range summary will be updated anyway as it relies on visibility
info.
This is partially incorrect - it's true BRIN indexes don't point to
individual tuples, so HOT chains are not an issue, but the visibitlity
info is not sufficient to keep the index up to date. This can easily
result in corrupted indexes, as demonstrated in the hackers thread.
This does not mean relaxing the HOT restrictions for BRIN is a lost
cause, but it needs to handle the two aspects (allowing HOT chains and
updating the page range summaries) as separate. But that requires a
major changes, and it's too late for that in the current dev cycle.
Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
Per-backend global variables like VacuumPageHit are initialized once per
VACUUM command. This was missed by commit 49c9d9fc, which unified
VACUUM VERBOSE and autovacuum logging. As a result of that oversight,
incorrect values were shown when multiple relations were processed by a
single VACUUM VERBOSE command.
Relations that happened to be processed later on would show "buffer
usage:" values that incorrectly included buffer accesses made while
processing earlier unrelated relations. The same accesses were counted
multiple times.
To fix, take initial values for the tracker variables at the start of
heap_vacuum_rel(), and report delta values later on.
Be consistent about the lines that VACUUM VERBOSE outputs by including
an "index scan not needed: " line for completely empty tables. This
makes the output more readable, especially with multiple distinct VACUUM
operations processed by the same VACUUM command. It's also more
consistent; even empty tables can use the failsafe, which wasn't
reported in the standard way until now.
Follow-up to commit 6e20f460, which taught VACUUM VERBOSE to be more
consistent about reporting on scanned pages with empty tables.
The age of OldestXmin (a.k.a. "removable cutoff") when VACUUM ends often
indicates the approximate number of XIDs consumed while VACUUM ran.
However, there is at least one important exception: the cutoff could be
held back by a snapshot that was acquired before our VACUUM even began.
Successive VACUUM operations may even use exactly the same old cutoff in
extreme cases involving long held snapshots.
The log messages that described how removable cutoff aged (which were
added by commit 872770fd) created the impression that we were reporting
on how VACUUM's usable cutoff advanced while VACUUM ran, which was
misleading in these extreme cases. Fix by using a more general wording.
Per gripe from Tom Lane.
In passing, relocate related instrumentation code for clarity.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/1643035.1650035653@sss.pgh.pa.us
heap_fetch() used to have a "keep_buf" parameter that told it to return
ownership of the buffer pin to the caller after finding that the
requested tuple TID exists but is invisible to the specified snapshot.
This was thoughtlessly removed in commit 5db6df0c0, which broke
heapam_tuple_lock() (formerly EvalPlanQualFetch) because that function
needs to do more accesses to the tuple even if it's invisible. The net
effect is that we would continue to touch the page for a microsecond or
two after releasing pin on the buffer. Usually no harm would result;
but if a different session decided to defragment the page concurrently,
we could see garbage data and mistakenly conclude that there's no newer
tuple version to chain up to. (It's hard to say whether this has
happened in the field. The bug was actually found thanks to a later
change that allowed valgrind to detect accesses to non-pinned buffers.)
The most reasonable way to fix this is to reintroduce keep_buf,
although I made it behave slightly differently: buffer ownership
is passed back only if there is a valid tuple at the requested TID.
In HEAD, we can just add the parameter back to heap_fetch().
To avoid an API break in the back branches, introduce an additional
function heap_fetch_extended() in those branches.
In HEAD there is an additional, less obvious API change: tuple->t_data
will be set to NULL in all cases where buffer ownership is not returned,
in particular when the tuple exists but fails the time qual (and
!keep_buf). This is to defend against any other callers attempting to
access non-pinned buffers. We concluded that making that change in back
branches would be more likely to introduce problems than cure any.
In passing, remove a comment about heap_fetch that was obsoleted by
9a8ee1dc6.
Per bug #17462 from Daniil Anisimov. Back-patch to v12 where the bug
was introduced.
Discussion: https://postgr.es/m/17462-9c98a0f00df9bd36@postgresql.org
All but a few existing callers assume without checking that this
function succeeds. While it probably will, that's a poor excuse for
not checking. Let's make it return void and instead throw an error
if it doesn't find the block reference. Callers that actually need
to handle the no-such-block case must now use the underlying function
XLogRecGetBlockTagExtended.
In addition to being a bit less error-prone, this should also serve
to suppress some Coverity complaints about XLogRecGetBlockRefInfo.
While at it, clean up some inconsistency about use of the
XLogRecHasBlockRef macro: make XLogRecGetBlockTagExtended use
that instead of open-coding the same condition, and avoid calling
XLogRecHasBlockRef twice in relevant code paths. (That is,
calling XLogRecHasBlockRef followed by XLogRecGetBlockTag is now
deprecated: use XLogRecGetBlockTagExtended instead.)
Patch HEAD only; this doesn't seem to have enough value to consider
a back-branch API break.
Discussion: https://postgr.es/m/425039.1649701221@sss.pgh.pa.us
Remove comment block about how heap page vacuuming used to set tuples
with storage to LP_UNUSED in a rare edge case that can no longer happen
following commit 8523492d4e. The comments seem unnecessary now, since
it's now generally clear that heap vacuuming only applies to LP_DEAD
items from VACUUM's first heap pass following more recent work from
commits 12b5ade902 and 4f8d9d1217.
Reclaim space from the line pointer array when heap pruning leaves
behind a contiguous group of LP_UNUSED items at the end of the array.
This happens during subsequent page defragmentation. Certain kinds of
heap line pointer bloat are ameliorated by this new optimization.
Follow-up work to commit 3c3b8a4b26, which taught VACUUM to truncate the
line pointer array in about the same way during VACUUM's second pass
over the heap. We now apply line pointer array truncation during both
the first and the second pass over the heap made by VACUUM. We can also
perform line pointer array truncation during opportunistic pruning.
Matthias van de Meent, with small tweaks by me.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAEze2Wg36%2B4at2eWJNcYNiW2FJmht34x3YeX54ctUSs7kKoNcA%40mail.gmail.com
Soon the stats collector will be no more, with statistics instead getting
stored in shared memory. There are a lot of references to the stats collector
in comments. This commit replaces most of these references with "cumulative
statistics system", with the remaining ones getting replaced as part of
subsequent commits.
This is done separately from the - quite large - shared memory statistics
patch to make review easier.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
This replaces the bool flag for randomAccess. An upcoming patch requires
adding another option, so instead of breaking the API for that, then
breaking it again one day if we add more options, let's just break it
once. Any boolean options we add in the future will just make use of an
unused bit in the flags.
Any extensions making use of tuplesorts will need to update their code
to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess.
TUPLESORT_NONE can be used for a set of empty options.
Author: David Rowley
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid and relminmxid before
now. The issue only came up when concurrent activity unset some heap
page's visibility map bit right as VACUUM was considering if the page
should get counted in frozenskipped_pages. The non-aggressive case
would recheck the all-frozen bit at this point. The aggressive case
reasoned that the page (a skippable page) must have at least been
all-frozen in the recent past, so skipping it won't make relfrozenxid
advancement unsafe (which is never okay for aggressive VACUUMs).
The recheck created a window for some other backend to confuse matters
for VACUUM. If the page's VM bit turned out to be unset, VACUUM would
conclude that the page was _never_ all-frozen. frozenskipped_pages was
not incremented, and yet VACUUM couldn't back out of skipping at this
late stage (it couldn't choose to scan the page instead). This made it
unsafe to advance relfrozenxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped.
It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
(unfrozen XIDs) to be able to safely advance relfrozenxid. Tuples
concurrently inserted on "skipped" pages can be thought of as equivalent
to tuples concurrently inserted on a block >= rel_pages.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Skipping an all-visible range is now explicitly structured as a choice
made by non-aggressive VACUUMs, by weighing known costs (scanning extra
skippable pages to freeze their tuples early) against known benefits
(advancing relfrozenxid early). This works in essentially the same way
as it always has (don't skip ranges < SKIP_PAGES_THRESHOLD). We could
do much better here in the future by considering other relevant factors.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA%2BTgmoZiSOY6H7aadw5ZZGm7zYmfDzL6nwmL5V7GL4HgJgLF_w%40mail.gmail.com
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive. The relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.
VACUUM now carefully tracks the oldest remaining XID/MultiXactId as it
goes (the oldest remaining values _after_ lazy_scan_prune processing).
The final values are set as the table's new relfrozenxid and new
relminmxid in pg_class at the end of each VACUUM. The oldest XID might
come from a tuple's xmin, xmax, or xvac fields. It might even come from
one of the table's remaining MultiXacts.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit still acts as a lower bound on the final value that
aggressive VACUUM can set relfrozenxid to). Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
Move some of the heap_vacuum_rel() instrumentation related variables to
the scope where they're actually needed. Also reorder some of the
variable declarations at the start of heap_vacuum_rel() so that related
variables appear together.
Because this strategy logs changes on a block-by-block basis, it
avoids the need to checkpoint before and after the operation.
However, because it logs each changed block individually, it might
generate a lot of extra write-ahead logging if the template database
is large. Therefore, the older strategy remains available via a new
STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy
option to createdb.
Somewhat controversially, this patch assembles the list of relations
to be copied to the new database by reading the pg_class relation of
the template database. Cross-database access like this isn't normally
possible, but it can be made to work here because there can't be any
connections to the database being copied, nor can it contain any
in-doubt transactions. Even so, we have to use lower-level interfaces
than normal, since the table scan and relcache interfaces will not
work for a database to which we're not connected. The advantage of
this approach is that we do not need to rely on the filesystem to
determine what ought to be copied, but instead on PostgreSQL's own
knowledge of the database structure. This avoids, for example,
copying stray files that happen to be located in the source database
directory.
Dilip Kumar, with a fairly large number of cosmetic changes by me.
Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor,
Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian,
Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro
Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others.
Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
They were macros previously, but recent callsite additions made Coverity
complain about one of the assertions being always true. This change
could have been made a long time ago, but the Coverity complain broke
the inertia.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/202203241021.uts52sczx3al@alvherre.pgsql
Commit 872770fd6c taught VACUUM VERBOSE and autovacuum logging to
display the total number of pages scanned by VACUUM. This information
was also displayed as a percentage of rel_pages in parenthesis, which
makes it easy to spot trends over time and across tables.
The instrumentation displayed "0 scanned (0.00% of total)" for totally
empty tables. Tweak the instrumentation: have it show "0 scanned
(100.00% of total)" for empty tables instead. This approach is clearer
and more consistent.
VACUUM's rel_pages field indicates the size of the target heap rel just
after the table_relation_vacuum() operation began. There are specific
expectations around how rel_pages can be related to other nearby state.
In particular, the range of rel_pages must contain every tuple in the
relation whose tuple headers might contain an XID < OldestXmin.
Consistently refer to the field as rel_pages to make this clearer and
more discoverable.
This is follow-up work to commit 73f6ec3d from earlier today.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220311031351.sbge5m2bpvy2ttxg@alap3.anarazel.de
Explain the relationship between vacuumlazy.c's vistest and OldestXmin
cutoffs. These closely related cutoffs are different in subtle but
important ways. Also document a closely related rule: we must establish
rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be
missed by lazy_scan_heap().
It's easier to explain these issues by initializing everything together,
so consolidate initialization of vacrel state. Now almost every vacrel
field is initialized by heap_vacuum_rel(). The only remaining exception
is the dead_items array, which is still managed by lazy_scan_heap() due
to interactions with how we initialize parallel VACUUM.
Also move the process that updates pg_class entries for each index into
heap_vacuum_rel(), and adjust related assertions. All pg_class updates
now take place after lazy_scan_heap() returns, which seems clearer.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de
Discussion: https://postgr.es/m/CAH2-WznYsUxVT156rCQ+q=YD4S4=1M37hWvvHLz-H1pwSM8-Ew@mail.gmail.com
We called the argument totally_frozen in its function prototype as well
as in code comments, even though totally_frozen_p was used in the
function definition. Standardize on totally_frozen.
Most of these are cases where we could call memcpy() or other libc
functions with a NULL pointer and a zero count, which is forbidden
by POSIX even though every production version of libc allows it.
We've fixed such things before in a piecemeal way, but apparently
never made an effort to try to get them all. I don't claim that
this patch does so either, but it gets every failure I observe in
check-world, using clang 12.0.1 on current RHEL8.
numeric.c has a different issue that the sanitizer doesn't like:
"ln(-1.0)" will compute log10(0) and then try to assign the
resulting -Inf to an integer variable. We don't actually use the
result in such a case, so there's no live bug.
Back-patch to all supported branches, with the idea that we might
start running a buildfarm member that tests this case. This includes
back-patching c1132aae3 (Check the size in COPY_POINTER_FIELD),
which previously silenced some of these issues in copyfuncs.c.
Discussion: https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com
Commit 49c9d9fc unified VACUUM VERBOSE and autovacuum logging. It
neglected to remove an old vacrel field that was only used by the old
VACUUM VERBOSE, so remove it now.
The previous num_tuples approach doesn't seem to have any real advantage
over the approach VACUUM VERBOSE takes now (also the approach used by
the autovacuum logging code), which is to show new_rel_tuples.
new_rel_tuples is the possibly-estimated total number of tuples left in
the table, whereas num_tuples meant the number of tuples encountered
during the VACUUM operation, after pruning, without regard for tuples
from pages skipped via the visibility map.
In passing, reorder a related vacrel field for consistency.
The buffer argument hasn't been used since the function was first added
by commit bbb6e559c4. The sibling heap_prepare_freeze_tuple function
doesn't have such an argument either. Remove it.
There is a very good (though non-obvious) reason to avoid relation
truncation during a VACUUM that has triggered the failsafe mechanism,
which was missed before now. Update related comments, so this isn't
forgotten.
Reported-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAFBsxsFiMPxQ-dHZ8tOgktn=+ffeJT3+GinZ4zdOGbmAnCYadA@mail.gmail.com
Currently, during UPDATE, the unchanged replica identity key attributes
are not logged separately because they are getting logged as part of the
new tuple. But if they are stored externally then the untoasted values are
not getting logged as part of the new tuple and logical replication won't
be able to replicate such UPDATEs. So we need to log such attributes as
part of the old_key_tuple during UPDATE.
Reported-by: Haiying Tang
Author: Dilip Kumar and Amit Kapila
Reviewed-by: Alvaro Herrera, Haiying Tang, Andres Freund
Backpatch-through: 10
Discussion: https://postgr.es/m/OS0PR01MB611342D0A92D4F4BF26C0F47FB229@OS0PR01MB6113.jpnprd01.prod.outlook.com
Depending on compiler version and optimization level, we might
get a complaint that lazy_scan_heap's "freespace" is used
uninitialized.
Compilers not aware that ereport(ERROR) doesn't return complained
about bbsink_lz4_new().
Assigning "-1" to a uint64 value has unportable results; fortunately,
the value of xlogreadsegno is unimportant when xlogreadfd is -1.
(It looks to me like there is no need for xlogreadsegno to be static
in the first place, but I didn't venture to change that.)
Report on scanned pages within VACUUM VERBOSE and autovacuum logging.
These are pages that were physically examined during the VACUUM
operation. Note that this can include a small number of pages that were
marked all-visible in the visibility map by some earlier VACUUM
operation. VACUUM won't skip all-visible pages that aren't part of a
range of all-visible pages that's at least 32 blocks in length (partly
to avoid missing out on opportunities to advance relfrozenxid during
non-aggressive VACUUMs).
Commit 44fa8488 simplified the definition of scanned pages. It became
the complement of the pages (of those pages from rel_pages) that were
skipped using the visibility map. And so scanned pages precisely
indicates how effective the visibility map was at saving work. (Before
now we displayed the number of pages skipped via the visibility map when
happened to be frozen pages, but not when they were merely all-visible,
which was less useful to users.)
Rename the user-visible OldestXmin output field to "removal cutoff", and
show some supplementary information: how far behind the cutoff is
(number of XIDs behind) by the time the VACUUM operation finished. This
will help users to figure out what's _not_ working in extreme cases
where VACUUM is fundamentally unable to remove dead tuples or freeze
older tuples (e.g., due to a leaked replication slot). Also report when
relfrozenxid is advanced by VACUUM in output that immediately follows
"removal cutoff". This structure is intended to highlight the
relationship between the new relfrozenxid value for the table, and the
VACUUM operation's removal cutoff.
Finally, add instrumentation of "missed dead tuples", and the number of
pages that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to
failure to acquire a cleanup lock on a heap page. This is a replacement
for the "skipped due to pin" instrumentation removed by commit 44fa8488.
It shows more details than before for pages where failing to get a
cleanup lock actually resulted in VACUUM missing out on useful work, but
usually shows nothing at all instead (the mere fact that we couldn't get
a cleanup lock is usually of no consequence whatsoever now).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap, regardless of whether or not the page
was cleanup locked. Although it's fundamentally impossible to prune a
heap page without a cleanup lock (since we cannot safely defragment the
page), we can do just about everything else. The only notable further
exception is freezing tuples, though even that is arguably a consequence
of not being able to prune (not a separate issue).
VACUUM now does as much of the same processing as possible for pages
that could not be cleanup locked. Any failure to do specific required
processing is treated as a special case exception, which will be rare in
practice. We now collect any preexisting LP_DEAD items (left behind by
earlier opportunistic pruning) in the dead_items array for these heap
pages, and count their tuples in the usual way. Steps used to decide if
we'll attempt relation truncation are performed in the usual way for
no-cleanup-lock scanned pages, too.
Although eliminating these special cases is intrinsically useful, it's
even more useful as an enabler of further simplifications. The only
essential difference between aggressive and non-aggressive is that only
aggressive is _guaranteed_ to be able to advance relfrozenxid up to
FreezeLimit. Advancing relfrozenxid is always useful, but before now
non-aggressive VACUUMs threw away the opportunity to do so whenever a
cleanup lock could not be acquired on any page, no matter what the
details were. This was very pessimistic.
It isn't actually necessary to "behave aggressively" to maintain the
ability to advance relfrozenxid when a cleanup lock isn't immediately
available (most of the time). The non-aggressive case will now make
sure that it isn't safe to advance relfrozenxid (without waiting) using
only a share lock. It will usually notice that there are no tuples that
need to be frozen anyway, just like in the aggressive case -- and so it
no longer wastes an opportunity to advance relfrozenxid over nothing.
(The non-aggressive case still won't wait for a cleanup lock when there
really are tuples on the page that need to be frozen, since that really
would amount to "behaving aggressively".)
VACUUM currently has a tendency to set heap pages to all-visible in the
visibility map before it freezes all of the tuples on the page. Only a
subsequent aggressive VACUUM will visit these pages to freeze their
tuples, usually only when the tuple XIDs are much older than the
vacuum_freeze_min_age GUC (FreezeLimit cutoff) is supposed to allow.
And so non-aggressive VACUUMs are still far less likely to be able to
advance relfrozenxid in practice, even with the enhancements from this
commit. This remaining issue will be addressed by future work that
overhauls the criteria for freezing tuples. Once that's in place,
almost every VACUUM operation will be able to advance relfrozenxid in
practice.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com
xlog.h is directly and indirectly #included in a lot of places. With
this change, xloginsert.h is no longer unnecessarily included in the
large number of them that don't need it.
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CALj2ACVe-W+WM5P44N7eG9C2_FmaeM8Dq5aCnD3fHt0Ba=WR6w@mail.gmail.com
This fixes a set of issues that have accumulated over the past months
(or years) in various code areas. Most fixes are related to some recent
additions, as of the development of v15.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20220124030001.GQ23027@telsasoft.com
While individual logical rewrite files were synced to disk, the directory was
not. On some filesystems that could lead to loosing directory entries after a
crash.
Reported-By: Tom Lane <tgl@sss.pgh.pa.us>
Author: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/867F2E29-2782-4869-970E-B984C6D35A8F@amazon.com
Backpatch: 10-
BufferGetBlockNumber() is not that cheap and obviously cannot change during
one heap_prune_page(), so only call it once. We might be able to do better and
pass the block number from the caller, but that'd be a larger change...
Discussion: https://postgr.es/m/20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de
The log_autovacuum_min_duration instrumentation used its own dedicated
code for logging, which was not reused by VACUUM VERBOSE. This was
highly duplicative, and sometimes led to each code path using slightly
different accounting for essentially the same information.
Clean things up by making VACUUM VERBOSE reuse the same instrumentation
code. This code restructuring changes the structure of the VACUUM
VERBOSE output itself, but that seems like an overall improvement. The
most noticeable change in VACUUM VERBOSE output is that it no longer
outputs a distinct message per index per round of index vacuuming. Most
of the same information (about each index) is now shown in its new
per-operation summary message. This is far more legible.
A few details are no longer displayed by VACUUM VERBOSE, but that's no
real loss in practice, especially in the common case where we don't need
multiple index scans/rounds of vacuuming. This super fine-grained
information is still available via DEBUG2 messages, which might still be
useful in debugging scenarios.
VACUUM VERBOSE now shows new instrumentation, which is typically very
useful: all of the log_autovacuum_min_duration instrumentation that it
missed out on before now. This includes information about WAL overhead,
buffers hit/missed/dirtied information, and I/O timing information.
VACUUM VERBOSE still retains a few INFO messages of its own. This is
limited to output concerning the progress of heap rel truncation, as
well as some basic information about parallel workers. These details
are still potentially quite useful. They aren't a good fit for the log
output, which must summarize the whole operation.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzmW4Me7_qR4X4ka7pxP-jGmn7=Npma_-Z-9Y1eD0MQRLw@mail.gmail.com
Corruption of redirect item pointers often only becomes visible well after
being corrupted, as e.g. bug #17255 shows: In the original reproducer,
gigabyte of WAL were between the source of the corruption and the corruption
becoming visible.
To make it easier to find / prevent such bugs, verify whether redirect
pointers are sensible at the end of heap_page_prune_execute(). 5cd7eb1f1c
introduced related assertions while modifying the page, but they can't easily
detect marking the target of an existing redirect as unused. Sometimes the
corruption will be detected later, but that's harder to diagnose.
Author: Andres Freund <andres@andres@anarazel.de>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
Since dc7420c2c9 the horizon used for pruning is determined "lazily". A more
accurate horizon is built on-demand, rather than in GetSnapshotData(). If a
horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls
for the same tuple, the result can change from RECENTLY_DEAD to DEAD.
heap_page_prune() can process the same tid multiple times (once following an
update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum()
of a tuple changes from RECENTLY_DEAD during the first access, to DEAD in the
second, the "tuple is DEAD and doesn't chain to anything else" path in
heap_prune_chain() can end up marking the target of a LP_REDIRECT ItemId
unused.
Initially not easily visible,
Once the target of a LP_REDIRECT ItemId is marked unused, a new tuple version
can reuse it. At that point the corruption may become visible, as index
entries pointing to the "original" redirect item, now point to a unrelated
tuple.
To fix, compute HTSV for all tuples on a page only once. This fixes the entire
class of problems of HTSV changing inside heap_page_prune(). However,
visibility changes can obviously still occur between HTSV checks inside
heap_page_prune() and outside (e.g. in lazy_scan_prune()).
The computation of HTSV is now done in bulk, in heap_page_prune(), rather than
on-demand in heap_prune_chain(). Besides being a bit simpler, it also is
faster: Memory accesses can happen sequentially, rather than in the order of
HOT chains.
There are other causes of HeapTupleSatisfiesVacuum() results changing between
two visibility checks for the same tuple, even before dc7420c2c9. E.g.
HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction
aborts between the two checks. None of the these other visibility status
changes are known to cause corruption, but heap_page_prune()'s approach makes
it hard to be confident.
A patch implementing a more fundamental redesign of heap_page_prune(), which
fixes this bug and simplifies pruning substantially, has been proposed by
Peter Geoghegan in
https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com
However, that redesign is larger change than desirable for backpatching. As
the new design still benefits from the batched visibility determination
introduced in this commit, it makes sense to commit this narrower fix to 14
and master, and then commit Peter's improvement in master.
The precise sequence required to trigger the bug is complicated and hard to do
exercise in an isolation test (until we have wait points). Due to that the
isolation test initially posted at
https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de
and updated in
https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de
isn't committable.
A followup commit will introduce additional assertions, to detect problems
like this more easily.
Bug: #17255
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Debugged-By: Andres Freund <andres@anarazel.de>
Debugged-By: Peter Geoghegan <pg@bowt.ie>
Author: Andres Freund <andres@andres@anarazel.de>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
Backpatch: 14-, the oldest branch containing dc7420c2c9
Under concurrency, it is possible for two sessions to be merrily locking
and releasing a tuple and marking it again as HEAP_XMAX_INVALID all the
while a third session attempts to lock it, miserably fails at it, and
then contemplates life, the universe and everything only to eventually
fail an assertion that said bit is not set. Before SKIP LOCKED that was
indeed a reasonable expectation, but alas! commit df630b0dd5 falsified
it.
This bug is as old as time itself, and even older, if you think time
begins with the oldest supported branch. Therefore, backpatch to all
supported branches.
Author: Simon Riggs <simon.riggs@enterprisedb.com>
Discussion: https://postgr.es/m/CANbhV-FeEwMnN8yuMyss7if1ZKjOKfjcgqB26n8pqu1e=q0ebg@mail.gmail.com
This commit moves parallel vacuum related code to a new file
commands/vacuumparallel.c so that any table AM supporting indexes can
utilize parallel vacuum in order to call index AM callbacks (ambulkdelete
and amvacuumcleanup) with parallel workers.
Another reason for this refactoring is that the parallel vacuum isn't
specific to heap so it doesn't make sense to keep this code in
heap/vacuumlazy.c.
Author: Masahiko Sawada, based on suggestion from Andres Freund
Reviewed-by: Hou Zhijie, Amit Kapila, Haiying Tang
Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de
Previously, in parallel vacuum, we allocated shmem area of
IndexBulkDeleteResult only for indexes where parallel index vacuuming is
safe and had null-bitmap in shmem area to access them. This logic was too
complicated with a small benefit of saving only a few bits per indexes.
In this commit, we allocate a dedicated shmem area for the array of
LVParallelIndStats that includes a parallel-safety flag, the index vacuum
status, and IndexBulkdeleteResult. There is one array element for every
index, even those indexes where parallel index vacuuming is unsafe or not
worthwhile. This commit makes the code clear by removing all
bitmap-related code.
Also, add the check each index vacuum status after parallel index vacuum
to make sure that all indexes have been processed.
Finally, rename parallel vacuum functions to parallel_vacuum_* for
consistency.
Author: Masahiko Sawada, based on suggestions by Andres Freund
Reviewed-by: Hou Zhijie, Amit Kapila
Discussion: https://www.postgresql.org/message-id/20211030212101.ae3qcouatwmy7tbr%40alap3.anarazel.de
The term "super-exclusive lock" is a synonym for "buffer cleanup lock"
that first appeared in nbtree many years ago. Standardize things by
consistently using the term cleanup lock. This finishes work started by
commit 276db875.
There is no good reason to have two terms. But there is a good reason
to only have one: to avoid confusion around why VACUUM acquires a full
cleanup lock (not just an ordinary exclusive lock) in index AMs, during
ambulkdelete calls. This has nothing to do with protecting the physical
index data structure itself. It is needed to implement a locking
protocol that ensures that TIDs pointing to the heap/table structure
cannot get marked for recycling by VACUUM before it is safe (which is
somewhat similar to how VACUUM uses cleanup locks during its first heap
pass). Note that it isn't strictly necessary for index AMs to implement
this locking protocol -- several index AMs use an MVCC snapshot as their
sole interlock to prevent unsafe TID recycling.
In passing, update the nbtree README. Cleanly separate discussion of
the aforementioned index vacuuming locking protocol from discussion of
the "drop leaf page pin" optimization added by commit 2ed5b87f. We now
structure discussion of the latter by describing how individual index
scans may safely opt out of applying the standard locking protocol (and
so can avoid blocking progress by VACUUM). Also document why the
optimization is not safe to apply during nbtree index-only scans.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzngHgQa92tz6NQihf4nxJwRzCV36yMJO_i8dS+2mgEVKw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkHPgsBBvGWjz=8PjNhDefy7XRkDKiT5NxMs-n5ZCf2dA@mail.gmail.com
When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed only by BRIN indexes. There are no index
pointers to individual tuples in BRIN, and the page range summary will
be updated anyway as it relies on visibility info.
This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.
Patch by Josef Simanek, various fixes and improvements by me.
Author: Josef Simanek
Reviewed-by: Tomas Vondra, Alvaro Herrera
Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
Commit 8523492d simplified what it meant for an item to be considered
"dead" to VACUUM: TIDs collected in memory (in preparation for index
vacuuming) must always come from LP_DEAD stub line pointers in heap
pages, found following pruning. This formalized the idea that index
vacuuming (and heap vacuuming) are optional processes. Unlike pruning,
they can be delayed indefinitely, without any risk of that violating
fundamental invariants. For example, leaving LP_DEAD items behind
clearly won't add to the risk of transaction ID wraparound. You can't
have transaction ID wraparound without transaction IDs. Renaming
anything that references DEAD tuples (tuples with storage) reinforces
all this.
Code outside vacuumlazy.c continues to fudge the distinction between
dead/deleted tuples, and LP_DEAD items. This is necessary because
autovacuum scheduling is still mostly driven by "dead items/tuples"
statistics. In the future we may find it useful to replace this model
with something more sophisticated, as a step towards teaching autovacuum
to perform more frequent vacuuming that targeting individual indexes
that happen to be more prone to becoming bloated through version churn.
In passing, simplify some function signatures that deal with VACUUM's
dead_items array.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzktGBg4si6DEdmq3q6SoXSDqNi6MtmB8CmmTmvhsxDTLA@mail.gmail.com
The term "super-exclusive lock" is an acceptable synonym of "cleanup
lock". Even still, switching from one term to the other in the same
file is confusing. Standardize on "cleanup lock" within vacuumlazy.c.
Per a complaint from Andres Freund.
Update vacuumlazy.c file header comments (as well as comments above the
lazy_scan_heap function) that were largely written before the
introduction of the HOT optimization, when lazy_scan_heap did far less,
and didn't actually prune during its initial heap pass.
Since lazy_scan_heap now outsources far more work to lower level
functions, it makes sense to introduce the function by talking about the
high level invariant that dictates the order in which each phase takes
place. Also deemphasize the case where we run out of memory for TIDs,
since delaying that discussion makes it easier to talk about issues of
central importance.
Finally, remove discussion of parallel VACUUM from header comments.
These don't add much, and are in the wrong place.
Commit 2fd8685e7f simplified the checking of modified attributes that
takes place within heap_update(). This included a micro-optimization
affecting pages marked PD_PAGE_FULL: don't even try to use HOT to save a
few cycles on determining HOT safety. The assumption was that it won't
work out this time around, since it can't have worked out last time
around.
Remove the micro-optimization. It could only ever save cycles that are
consumed by the vast majority of heap_update() calls, which hardly seems
worth the added complexity. It also seems quite possible that there are
workloads that will do worse over time by repeated application of the
micro-optimization, despite saving some cycles on average, in the short
term.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAH2-WznU1L3+DMPr1F7o2eJBT7=3bAJoY6ZkWABAxNt+-afyTA@mail.gmail.com
This doesn't belong next to very high level discussion of the tasks that
lazy_scan_heap performs. There is already a similar, longer comment
block at the top of vacuumlazy.c that mentions lazy_scan_heap directly.
While determining xid horizons, we skip over backends that are running
Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently
for the purposes of computing Xmin for Vacuum. But we were not setting the
flags corresponding to these operations when they are performed in
parallel which was preventing Xid horizon from advancing.
The optimization related to skipping Create Index Concurrently, or Reindex
Concurrently operations was implemented in PG-14 but the fix is the same
for the Parallel Vacuum as well so back-patched till PG-13.
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com
Add a comment explaining why the pgstats accounting used during
opportunistic heap pruning operations (to maintain the current number of
dead tuples in the relation) needs to compensate by subtracting away the
number of new LP_DEAD items. This is needed so it can avoid completely
forgetting about tuples that become LP_DEAD items during pruning -- they
should still count.
It seems more natural to discuss this issue at the only relevant call
site (opportunistic pruning), since the same issue does not apply to the
only other caller (the VACUUM call site). Move everything there too.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzm7f+A6ej650gi_ifTgbhsadVW5cujAL3punpupHff5Yg@mail.gmail.com
It is up to the heap_page_prune() caller to decide what to do about
updating the FSM for a page following pruning. Update old comments that
address what we might want to do as if it was the responsibility of
heap_page_prune() itself. heap_page_prune() doesn't have enough
high-level context to make a sensible choice.
Add new comments that spell out what VACUUM expects from heap pruning:
pruning must never leave behind DEAD tuples that still have tuple
storage. This has at least been the case since commit 8523492d, which
established the principle that vacuumlazy.c doesn't have to deal with
DEAD tuples that still have tuple storage directly, except perhaps by
simply retrying pruning (to handle a rare corner case involving
concurrent transaction abort).
In passing, update some references to old symbol names that were missed
by the snapshot scalability work (specifically commit dc7420c2c9).
Add hardening to the heapam index tuple deletion path to catch TIDs in
index pages that point to a heap item that index tuples should never
point to. The corruption we're trying to catch here is particularly
tricky to detect, since it typically involves "extra" (corrupt) index
tuples, as opposed to the absence of required index tuples in the index.
For example, a heap TID from an index page that turns out to point to an
LP_UNUSED item in the heap page has a good chance of being caught by one
of the new checks. There is a decent chance that the recently fixed
parallel VACUUM bug (see commit 9bacec15) would have been caught had
that particular check been in place for Postgres 14. No backpatch of
this extra hardening for now, though.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzk-4_raTzawWGaiqNvkpwDXxv3y1AQhQyUeHfkU=tFCeA@mail.gmail.com
These assertions document (and verify) our high level assumptions about
how pruning can and cannot affect existing items from target heap pages.
For example, one of the new assertions verifies that pruning does not
set a heap-only tuple to LP_DEAD.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=vhvBx1GjF+oueHh8YQcHoQYrMi0F0zFMHEr8yc4sCoA@mail.gmail.com
Commit b4af70cb inverted the return value of the function
parallel_processing_is_safe(), but missed the amvacuumcleanup test.
Index AMs that don't support parallel cleanup at all were affected.
The practical consequences of this bug were not very serious. Hash
indexes are affected, but since they just return the number of blocks
during hashvacuumcleanup anyway, it can't have had much impact.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoA-Em+aeVPmBbL_s1V-ghsJQSxYL-i3JP8nTfPiD1wjKw@mail.gmail.com
Backpatch: 14-, where commit b4af70cb appears.
Commit b4af70cb, which simplified state managed by VACUUM, performed
refactoring of parallel VACUUM in passing. Confusion about the exact
details of the tasks that the leader process is responsible for led to
code that made it possible for parallel VACUUM to miss a subset of the
table's indexes entirely. Specifically, indexes that fell under the
min_parallel_index_scan_size size cutoff were missed. These indexes are
supposed to be vacuumed by the leader (alongside any parallel unsafe
indexes), but weren't vacuumed at all. Affected indexes could easily
end up with duplicate heap TIDs, once heap TIDs were recycled for new
heap tuples. This had generic symptoms that might be seen with almost
any index corruption involving structural inconsistencies between an
index and its table.
To fix, make sure that the parallel VACUUM leader process performs any
required index vacuuming for indexes that happen to be below the size
cutoff. Also document the design of parallel VACUUM with these
below-size-cutoff indexes.
It's unclear how many users might be affected by this bug. There had to
be at least three indexes on the table to hit the bug: a smaller index,
plus at least two additional indexes that themselves exceed the size
cutoff. Cases with just one additional index would not run into
trouble, since the parallel VACUUM cost model requires two
larger-than-cutoff indexes on the table to apply any parallel
processing. Note also that autovacuum was not affected, since it never
uses parallel processing.
Test case based on tests from a larger patch to test parallel VACUUM by
Masahiko Sawada.
Many thanks to Kamigishi Rei for her invaluable help with tracking this
problem down.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Kamigishi Rei <iijima.yun@koumakan.jp>
Reported-By: Andrew Gierth <andrew@tao11.riddles.org.uk>
Diagnosed-By: Andres Freund <andres@anarazel.de>
Bug: #17245
Discussion: https://postgr.es/m/17245-ddf06aaf85735f36@postgresql.org
Discussion: https://postgr.es/m/20211030023740.qbnsl2xaoh2grq3d@alap3.anarazel.de
Backpatch: 14-, where the refactoring commit appears.
Commit d168b66682, which overhauled index deletion, added a
pg_unreachable() to the end of a sort comparator used when sorting heap
TIDs from an index page. This allows the compiler to apply
optimizations that assume that the heap TIDs from the index AM must
always be unique.
That doesn't seem like a good idea now, given recent reports of
corruption involving duplicate TIDs in indexes on Postgres 14. Demote
to an assertion, just in case.
Backpatch: 14-, where index deletion was overhauled.
Remove obsolete reference to lazy_vacuum()'s onecall argument. The
function argument was removed by commit 3499df0dee.
Also remove adjoining comment block that introduces the wraparound
failsafe concept. Talking about the failsafe here no longer makes
sense, since lazy_vacuum() (and related functions) are no longer the
only place where the failsafe might be triggered. This has been the
case since commit c242baa4a8 taught VACUUM to consider triggering the
failsafe mechanism during its initial heap scan.
Checking that an offset number isn't past the end of a heap page's line
pointer array was just a defensive sanity check for HOT-chain traversal
code before commit 3c3b8a4b. It's etrictly necessary now, though. Add
comments that reference the issue to code in heapam that needs to get it
right.
Per suggestion from Alexander Lakhin.
Discussion: https://postgr.es/m/f76a292c-9170-1aef-91a0-59d9443b99a3@gmail.com
A broken HOT chain is not an unexpected condition, even when the offset
number points past the end of the page's line pointer array.
heap_prune_chain() does not (and never has) treated this condition as
unexpected, so derivative code in heap_index_delete_tuples() shouldn't
do so either.
Oversight in commit 4228817449.
The assertion can probably only fail on Postgres 14 and master. Earlier
releases don't have commit 3c3b8a4b, which taught VACUUM to truncate the
line pointer array of heap pages. Backpatch all the same, just to be
consistent.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/17197-9438f31f46705182@postgresql.org
Backpatch: 12-, just like commit 4228817449.
All the code paths simplified here were already using a boolean or used
an expression that led to zero or one, making the extra bits
unnecessary.
Author: Justin Pryzby
Reviewed-by: Tom Lane, Michael Paquier, Peter Smith
Discussion: https://postgr.es/m/20210428182936.GE27406@telsasoft.com
It doesn't make any sense to report this information, since VACUUM
VERBOSE reports on heap relation truncation directly. This was an
oversight in commit 7ab96cf6, which made VACUUM VERBOSE output a little
more consistent with nearby autovacuum-specific log output. Adjust
comments that describe how this is supposed to work in passing.
Also bring truncation-related VACUUM VERBOSE output in line with the
convention established for VACUUM VERBOSE output by commit f4f4a649.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch: 14-, where VACUUM VERBOSE's output changed.
The field hasn't been used since commit 3d351d91, which redefined
pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
Also rename a local variable of the same name ("old_rel_pages"). This is
used by relation truncation to represent the original relation size at
the start of the ongoing VACUUM operation. Rename it to orig_rel_pages,
since that's a lot clearer. (This name matches similar nearby code.)
Most data-corruption reports mention the location of the problem, but
this one failed to. Add it.
Backpatch all the way back. In 12 and older, also assign the
ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf8 for
13 and later.
Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql
Adjust track_io_timing related logging code added by commit 94d13d474d.
Make it consistent with other nearby autovacuum and autoanalyze logging
code by removing logic that suppressed zero millisecond outputs.
log_autovacuum_min_duration log output now reliably shows "read:" and
"write:" millisecond-based values in its report (when track_io_timing is
enabled).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Stephen Frost <sfrost@snowman.net>
Discussion: https://postgr.es/m/CAH2-WznW0FNxSVQMSRazAMYNfZ6DR_gr5WE78hc6E1CBkkJpzw@mail.gmail.com
Backpatch: 14-, where the track_io_timing logging was introduced.
This order seems more natural. It starts with details that are
particular to heap and index data structures, and ends with system-level
costs incurred during the autovacuum worker's VACUUM/ANALYZE operation.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkzxK6ahA9xxsOftRtBX_R0swuHZsvo4QUbak1Bz7hb7Q@mail.gmail.com
Backpatch: 14-, which enhanced the log output in various ways.
This was arguably a minor oversight in commit b4af70cb, which cleaned up
the function signatures of functions that modify IndexBulkDeleteResult
variables.
During a VACUUM or CLUSTER command, the initial output emits a
fully qualified relation path with namespace. The post-action
errmsg only emitted the relation name however, which may lead
to hard to parse output when using multiple jobs with vacuumdb
as the output from different jobs may be interleaved. Include
the full path in the post-action errmsg to be consistent with
the initial errmsg.
Author: Mike Fiedler <miketheman@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CAMerE0oz+8G-aORZL_BJcPxnBqewZAvND4bSUysjz+r-oT1BxQ@mail.gmail.com
They are used in code that runs both during normal operation and during
WAL replay, and needs to behave differently during replay. Move them to
xlogutils.c, because that's where we have other helper functions used by
redo routines.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/b3b71061-4919-e882-4857-27e370ab134a%40iki.fi
The idea behind this patch is to design out bugs like the one fixed
by commit 9d523119f. Previously, once one did RelationOpenSmgr(rel),
it was considered okay to access rel->rd_smgr directly for some
not-very-clear interval. But since that pointer will be cleared by
relcache flushes, we had bugs arising from overreliance on a previous
RelationOpenSmgr call still being effective.
Now, very little code except that in rel.h and relcache.c should ever
touch the rd_smgr field directly. The normal coding rule is to use
RelationGetSmgr(rel) and not expect the result to be valid for longer
than one smgr function call. There are a couple of places where using
the function every single time seemed like overkill, but they are now
annotated with large warning comments.
Amul Sul, after an idea of mine.
Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
This has the advantage to make a process more responsive when the
postmaster dies, even if the wait time was rather limited as there was
only a 50ms timeout here. Another advantage of this change is for
monitoring, as we gain a new wait event for the end-of-vacuum
truncation.
Author: Bharath Rupireddy
Reviewed-by: Aleksander Alekseev, Thomas Munro, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACU4AdPCq6NLfcA-ZGwX7pPCK5FgEj-CAU0xCKzkASSy_A@mail.gmail.com
The failsafe can trigger when index processing is already disabled.
This can happen when VACUUM's INDEX_CLEANUP parameter is "off" and the
failsafe happens to trigger. Remove assertions that assume that index
processing is directly tied to the failsafe.
Oversight in commit c242baa4, which made it possible for the failsafe to
trigger in a two-pass strategy VACUUM that has yet to make its first
call to lazy_vacuum_all_indexes().
Generalize the INDEX_CLEANUP VACUUM parameter (and the corresponding
reloption): make it into a ternary style boolean parameter. It now
exposes a third option, "auto". The "auto" option (which is now the
default) enables the "bypass index vacuuming" optimization added by
commit 1e55e7d1.
"VACUUM (INDEX_CLEANUP TRUE)" is redefined to once again make VACUUM
simply do any required index vacuuming, regardless of how few dead
tuples are encountered during the first scan of the target heap relation
(unless there are exactly zero). This gives users a way of opting out
of the "bypass index vacuuming" optimization, if for whatever reason
that proves necessary. It is also expected to be used by PostgreSQL
developers as a testing option from time to time.
"VACUUM (INDEX_CLEANUP FALSE)" does the same thing as it always has: it
forcibly disables both index vacuuming and index cleanup. It's not
expected to be used much in PostgreSQL 14. The failsafe mechanism added
by commit 1e55e7d1 addresses the same problem in a simpler way.
INDEX_CLEANUP can now be thought of as a testing and compatibility
option.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-WznrBoCST4_Gxh_G9hA8NzGUbeBGnOUC8FcXcrhqsv6OHQ@mail.gmail.com
Bugfix commit 5fc89376 effectively made the lock_waiter_detected field
from vacuumlazy.c's global state struct into private state owned by
lazy_truncate_heap(). Finish this off by replacing the struct field
with a local variable.
The extra checks added by the recompression of toast data introduced in
bbe0a81 is proving to have a performance impact on VACUUM or CLUSTER
even if no recompression is done. This is more noticeable with more
toastable columns that contain non-NULL values.
Improvements could be done to make those extra checks less expensive,
but that's not material for 14 at this stage, and we are not sure either
if the code path of VACUUM FULL/CLUSTER is adapted for this job.
Per discussion with several people, including Andres Freund, Robert
Haas, Álvaro Herrera, Tom Lane and myself.
Discussion: https://postgr.es/m/20210527003144.xxqppojoiwurc2iz@alap3.anarazel.de
Fix a few places that were using appendStringInfo() when they should have
been using appendStringInfoString(). Also some cases of
appendPQExpBuffer() that would have been better suited to use
appendPQExpBufferChar(), and finally, some places that used
appendPQExpBuffer() when appendPQExpBufferStr() would have suited better.
There are no bugs are being fixed here. The aim is just to make the code
use the most optimal function for the job.
All the code being changed here is new to PG14. It makes sense to fix
these before we branch for PG15. There are a few other places that we
could fix, but those cases are older code so fixing those seems less
worthwhile as it may cause unnecessary back-patching pain in the future.
Author: Hou Zhijie
Discussion: https://postgr.es/m/OS0PR01MB5716732158B1C4142C6FE375943D9@OS0PR01MB5716.jpnprd01.prod.outlook.com
Redefine '\0' (InvalidCompressionMethod) as meaning "if we need to
compress, use the current setting of default_toast_compression".
This allows '\0' to be a suitable default choice regardless of
datatype, greatly simplifying code paths that initialize tupledescs
and the like. It seems like a more user-friendly approach as well,
because now the default compression choice doesn't migrate into table
definitions, meaning that changing default_toast_compression is
usually sufficient to flip an installation's behavior; one needn't
tediously issue per-column ALTER SET COMPRESSION commands.
Along the way, fix a few minor bugs and documentation issues
with the per-column-compression feature. Adopt more robust
APIs for SetIndexStorageProperties and GetAttributeCompression.
Bump catversion because typical contents of attcompression will now
be different. We could get away without doing that, but it seems
better to ensure v14 installations all agree on this. (We already
forced initdb for beta2, anyway.)
Discussion: https://postgr.es/m/626613.1621787110@sss.pgh.pa.us