Commit Graph

904 Commits

Author SHA1 Message Date
Peter Eisentraut 94aa7cc5f7 Add UNIQUE null treatment option
The SQL standard has been ambiguous about whether null values in
unique constraints should be considered equal or not.  Different
implementations have different behaviors.  In the SQL:202x draft, this
has been formalized by making this implementation-defined and adding
an option on unique constraint definitions UNIQUE [ NULLS [NOT]
DISTINCT ] to choose a behavior explicitly.

This patch adds this option to PostgreSQL.  The default behavior
remains UNIQUE NULLS DISTINCT.  Making this happen in the btree code
is pretty easy; most of the patch is just to carry the flag around to
all the places that need it.

The CREATE UNIQUE INDEX syntax extension is not from the standard,
it's my own invention.

I named all the internal flags, catalog columns, etc. in the negative
("nulls not distinct") so that the default PostgreSQL behavior is the
default if the flag is false.

Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com
2022-02-03 11:48:21 +01:00
Alvaro Herrera b3d7d6e462
Remove xloginsert.h from xlog.h
xlog.h is directly and indirectly #included in a lot of places.  With
this change, xloginsert.h is no longer unnecessarily included in the
large number of them that don't need it.

Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CALj2ACVe-W+WM5P44N7eG9C2_FmaeM8Dq5aCnD3fHt0Ba=WR6w@mail.gmail.com
2022-01-30 12:25:24 -03:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Peter Geoghegan bcf60585e6 Standardize cleanup lock terminology.
The term "super-exclusive lock" is a synonym for "buffer cleanup lock"
that first appeared in nbtree many years ago.  Standardize things by
consistently using the term cleanup lock.  This finishes work started by
commit 276db875.

There is no good reason to have two terms.  But there is a good reason
to only have one: to avoid confusion around why VACUUM acquires a full
cleanup lock (not just an ordinary exclusive lock) in index AMs, during
ambulkdelete calls.  This has nothing to do with protecting the physical
index data structure itself.  It is needed to implement a locking
protocol that ensures that TIDs pointing to the heap/table structure
cannot get marked for recycling by VACUUM before it is safe (which is
somewhat similar to how VACUUM uses cleanup locks during its first heap
pass).  Note that it isn't strictly necessary for index AMs to implement
this locking protocol -- several index AMs use an MVCC snapshot as their
sole interlock to prevent unsafe TID recycling.

In passing, update the nbtree README.  Cleanly separate discussion of
the aforementioned index vacuuming locking protocol from discussion of
the "drop leaf page pin" optimization added by commit 2ed5b87f.  We now
structure discussion of the latter by describing how individual index
scans may safely opt out of applying the standard locking protocol (and
so can avoid blocking progress by VACUUM).  Also document why the
optimization is not safe to apply during nbtree index-only scans.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzngHgQa92tz6NQihf4nxJwRzCV36yMJO_i8dS+2mgEVKw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkHPgsBBvGWjz=8PjNhDefy7XRkDKiT5NxMs-n5ZCf2dA@mail.gmail.com
2021-12-08 17:24:45 -08:00
Tomas Vondra 5753d4ee32 Ignore BRIN indexes when checking for HOT udpates
When determining whether an index update may be skipped by using HOT, we
can ignore attributes indexed only by BRIN indexes. There are no index
pointers to individual tuples in BRIN, and the page range summary will
be updated anyway as it relies on visibility info.

This also removes rd_indexattr list, and replaces it with rd_attrsvalid
flag. The list was not used anywhere, and a simple flag is sufficient.

Patch by Josef Simanek, various fixes and improvements by me.

Author: Josef Simanek
Reviewed-by: Tomas Vondra, Alvaro Herrera
Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
2021-11-30 20:04:38 +01:00
Tom Lane 3804539e48 Replace random(), pg_erand48(), etc with a better PRNG API and algorithm.
Standardize on xoroshiro128** as our basic PRNG algorithm, eliminating
a bunch of platform dependencies as well as fundamentally-obsolete PRNG
code.  In addition, this API replacement will ease replacing the
algorithm again in future, should that become necessary.

xoroshiro128** is a few percent slower than the drand48 family,
but it can produce full-width 64-bit random values not only 48-bit,
and it should be much more trustworthy.  It's likely to be noticeably
faster than the platform's random(), depending on which platform you
are thinking about; and we can have non-global state vectors easily,
unlike with random().  It is not cryptographically strong, but neither
are the functions it replaces.

Fabien Coelho, reviewed by Dean Rasheed, Aleksander Alekseev, and myself

Discussion: https://postgr.es/m/alpine.DEB.2.22.394.2105241211230.165418@pseudo
2021-11-28 21:33:07 -05:00
Amit Kapila 0f0cfb4940 Fix parallel operations that prevent oldest xmin from advancing.
While determining xid horizons, we skip over backends that are running
Vacuum. We also ignore Create Index Concurrently, or Reindex Concurrently
for the purposes of computing Xmin for Vacuum. But we were not setting the
flags corresponding to these operations when they are performed in
parallel which was preventing Xid horizon from advancing.

The optimization related to skipping Create Index Concurrently, or Reindex
Concurrently operations was implemented in PG-14 but the fix is the same
for the Parallel Vacuum as well so back-patched till PG-13.

Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAD21AoCLQqgM1sXh9BrDFq0uzd3RBFKi=Vfo6cjjKODm0Onr5w@mail.gmail.com
2021-11-19 09:04:40 +05:30
Peter Geoghegan e7428a99a1 Add hardening to catch invalid TIDs in indexes.
Add hardening to the heapam index tuple deletion path to catch TIDs in
index pages that point to a heap item that index tuples should never
point to.  The corruption we're trying to catch here is particularly
tricky to detect, since it typically involves "extra" (corrupt) index
tuples, as opposed to the absence of required index tuples in the index.

For example, a heap TID from an index page that turns out to point to an
LP_UNUSED item in the heap page has a good chance of being caught by one
of the new checks.  There is a decent chance that the recently fixed
parallel VACUUM bug (see commit 9bacec15) would have been caught had
that particular check been in place for Postgres 14.  No backpatch of
this extra hardening for now, though.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wzk-4_raTzawWGaiqNvkpwDXxv3y1AQhQyUeHfkU=tFCeA@mail.gmail.com
2021-11-04 19:54:05 -07:00
Peter Geoghegan 4c6afd805b Remove obsolete nbtree LP_DEAD item comments.
Comments above _bt_findinsertloc() that talk about LP_DEAD items are now
out of place.  We already discuss index tuple deletion at an earlier
point in the same comment block.

Oversight in commit d168b666.
2021-10-27 14:35:21 -07:00
Peter Geoghegan c2381b5104 Fix ordering of items in nbtree error message.
Oversight in commit a5213adf.

Backpatch: 13-, just like commit a5213adf.
2021-10-27 13:09:24 -07:00
Peter Geoghegan a5213adf3d Further harden nbtree posting split code.
Add more defensive checks around posting list split code.  These should
detect corruption involving duplicate table TIDs earlier and more
reliably than any existing check.

Follow up to commit 8f72bbac.

Discussion: https://postgr.es/m/CAH2-WzkrSY_kjyd1_M5xJK1uM0govJXMxPn8JUSvwcUOiHuWVw@mail.gmail.com
Backpatch: 13-, where nbtree deduplication was introduced.
2021-10-27 12:10:47 -07:00
Peter Geoghegan b76c1d6e84 Remove obsolete nbtree deduplication comments.
Follow up to commit 2903f140.
2021-10-15 15:25:20 -07:00
Peter Geoghegan 2903f1404d Enable deduplication in system catalog indexes.
The "equality implies image equality" opclass infrastructure disallowed
deduplication in system catalog indexes and TOAST indexes before now.
That seemed like the right approach back when the infrastructure was
added by commit 612a1ab7, since ALTER INDEX cannot set deduplicate_items
to 'off' (due to an old implementation restriction).  But that decision
now seems arbitrary at best.  Remove special case handling implementing
this policy.

No catversion bump, since existing catalog indexes will still work.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=rYQHFaJ3WYBdK=xgwxKzaiGMSSrh-ZCREa-pS-7Zjew@mail.gmail.com
2021-10-02 17:12:59 -07:00
Peter Geoghegan 895267a326 Remove unneeded nbtree latestRemovedXid comments.
Discussing the low level issue of nbtree VACUUM and recovery conflicts
in btvacuumpage() now seems inappropriate.  The same issue is discussed
in nbtxlog.h, as well as in a comment block above _bt_delitems_vacuum().

The comment block made more sense when it was part of a broader
discussion of nbtree VACUUM "pin scans".  These were removed by commit
9f83468b.
2021-09-26 20:25:14 -07:00
Peter Geoghegan ce2a860533 Update obsolete nbtree deletion comments.
_bt_delitems_delete() is no longer the high-level entry point used by
index tuple deletion driven by index tuples whose LP_DEAD bits are set
(now called "simple index tuple deletion").  It became a lower level
routine that's only called by _bt_delitems_delete_check() following
commit d168b66682.
2021-09-25 15:05:56 -07:00
Peter Geoghegan 48064a8d33 nbtree README: Add note about latestRemovedXid.
Point out that index tuple deletion generally needs a latestRemovedXid
value for the deletion operation's WAL record.  This is bound to be the
most expensive part of the whole deletion operation now that it takes
place up front, during original execution.

This was arguably an oversight in commit 558a9165e0, which moved the
work required to generate these values from index deletion REDO routines
to original execution of index deletion operations.
2021-09-24 13:53:48 -07:00
Peter Geoghegan dd94c2852e Fix "single value strategy" index deletion issue.
It is not appropriate for deduplication to apply single value strategy
when triggered by a bottom-up index deletion pass.  This wastes cycles
because later bottom-up deletion passes will overinterpret older
duplicate tuples that deduplication actually just skipped over "by
design".  It also makes bottom-up deletion much less effective for low
cardinality indexes that happen to cross a meaningless "index has single
key value per leaf page" threshold.

To fix, slightly narrow the conditions under which deduplication's
single value strategy is considered.  We already avoided the strategy
for a unique index, since our high level goal must just be to buy time
for VACUUM to run (not to buy space).  We'll now also avoid it when we
just had a bottom-up pass that reported failure.  The two cases share
the same high level goal, and already overlapped significantly, so this
approach is quite natural.

Oversight in commit d168b666, which added bottom-up index deletion.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznaOvM+Gyj-JQ0X=JxoMDxctDTYjiEuETdAGbF5EUc3MA@mail.gmail.com
Backpatch: 14-, where bottom-up deletion was introduced.
2021-09-21 18:57:32 -07:00
Peter Geoghegan 0f6aa893cb Remove obsolete nbtree relation extension comment.
Commit 0d1fe9f7 improved the approach that vacuumlazy.c takes when it
encounters an empty heap page.  It no acquires the relation extension
lock.
2021-08-31 16:55:39 -07:00
Tom Lane f10f0ae420 Replace RelationOpenSmgr() with RelationGetSmgr().
The idea behind this patch is to design out bugs like the one fixed
by commit 9d523119f.  Previously, once one did RelationOpenSmgr(rel),
it was considered okay to access rel->rd_smgr directly for some
not-very-clear interval.  But since that pointer will be cleared by
relcache flushes, we had bugs arising from overreliance on a previous
RelationOpenSmgr call still being effective.

Now, very little code except that in rel.h and relcache.c should ever
touch the rd_smgr field directly.  The normal coding rule is to use
RelationGetSmgr(rel) and not expect the result to be valid for longer
than one smgr function call.  There are a couple of places where using
the function every single time seemed like overkill, but they are now
annotated with large warning comments.

Amul Sul, after an idea of mine.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
2021-07-12 17:01:36 -04:00
Heikki Linnakangas 4c64b51dc5 Remove dead assignment to local variable.
This should have been removed in commit 7e30c186da, which split the loop
into two. Only the first loop uses the 'from' variable; updating it in
the second loop is bogus. It was never read after the first loop, so this
was harmless and surely optimized away by the compiler, but let's be tidy.

Backpatch to all supported versions.

Author: Ranier Vilela
Discussion: https://www.postgresql.org/message-id/CAEudQAoWq%2BAL3BnELHu7gms2GN07k-np6yLbukGaxJ1vY-zeiQ%40mail.gmail.com
2021-07-12 11:13:33 +03:00
Alvaro Herrera 5cc1cd5028
Report sort phase progress in parallel btree build
We were already reporting it, but only after the parallel workers were
finished, which is visibly much later than what happens in a serial
build.

With this change we report it when the leader starts its own sort phase
when participating in the build (the normal case).  Now this might
happen a little later than when the workers start their sorting phases,
but a) communicating the actual phase start from workers is likely to be
a hassle, and b) the sort phase start is pretty fuzzy anyway, since
sorting per se is actually initiated by tuplesort.c internally earlier
than tuplesort_performsort() is called.

Backpatch to pg12, where the progress reporting code for CREATE INDEX
went in.

Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Greg Nancarrow <gregn4422@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/1128176d-1eee-55d4-37ca-e63644422adb
2021-06-11 19:07:32 -04:00
Peter Geoghegan 8f72bbac3e Harden nbtree deduplication posting split code.
Add a defensive "can't happen" error to code that handles nbtree posting
list splits (promote an existing assertion).  This avoids a segfault in
the event of an insertion of a newitem that is somehow identical to an
existing non-pivot tuple in the index.  An nbtree index should never
have two index tuples with identical TIDs.

This scenario is not particular unlikely in the event of any kind of
corruption that leaves the index in an inconsistent state relative to
the heap relation that is indexed.  There are two known reports of
preventable hard crashes.  Doing nothing seems unacceptable given the
general expectation that nbtree will cope reasonably well with corrupt
data.

Discussion: https://postgr.es/m/CAH2-Wz=Jr_d-dOYEEmwz0-ifojVNWho01eAqewfQXgKfoe114w@mail.gmail.com
Backpatch: 13-, where nbtree deduplication was introduced.
2021-05-14 15:08:02 -07:00
Tom Lane def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Michael Paquier 7ef8b52cf0 Fix typos and grammar in comments and docs
Author: Justin Pryzby
Discussion: https://postgr.es/m/20210416070310.GG3315@telsasoft.com
2021-04-19 11:32:30 +09:00
Peter Geoghegan 796092fb84 Silence another _bt_check_unique compiler warning.
Per complaint from Tom Lane

Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us
2021-04-08 12:54:31 -07:00
Peter Geoghegan 8523492d4e Remove tupgone special case from vacuumlazy.c.
Retry the call to heap_prune_page() in rare cases where there is
disagreement between the heap_prune_page() call and the call to
HeapTupleSatisfiesVacuum() that immediately follows.  Disagreement is
possible when a concurrently-aborted transaction makes a tuple DEAD
during the tiny window between each step.  This was the only case where
a tuple considered DEAD by VACUUM still had storage following pruning.
VACUUM's definition of dead tuples is now uniformly simple and
unambiguous: dead tuples from each page are always LP_DEAD line pointers
that were encountered just after we performed pruning (and just before
we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off
style skipping of index vacuuming that takes place based on flexible,
dynamic criteria.  The INDEX_CLEANUP=off case had to know about skipping
indexes up-front before now, due to a subtle interaction with the
special case (see commit dd695979) -- this was a special case unto
itself.  Now there are no special cases.  And so now it won't matter
when or how we decide to skip index vacuuming: it won't affect how
pruning behaves, and it won't be affected by any of the implementation
details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning taking care of
recovery conflicts.  There is no longer any need to generate recovery
conflicts for DEAD tuples that pruning just missed.  This also means
that heap vacuuming now uses exactly the same strategy for recovery
conflicts as index vacuuming always has: REDO routines never need to
process a latestRemovedXid from the WAL record, since earlier REDO of
the WAL record from pruning is sufficient in all cases.  The generic
XLOG_HEAP2_CLEAN record type is now split into two new record types to
reflect this new division (these are called XLOG_HEAP2_PRUNE and
XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with HOT chains).  Index vacuuming and
heap vacuuming are now only concerned with recycling garbage items from
physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
2021-04-06 08:49:22 -07:00
Peter Geoghegan 5b861baa55 nbtree VACUUM: Cope with buggy opclasses.
Teach nbtree VACUUM to press on with vacuuming in the event of a page
deletion attempt that fails to "re-find" a downlink for its child/target
page.

There is no good reason to treat this as an irrecoverable error.  But
there is a good reason not to: pressing on at this point removes any
question of VACUUM not making progress solely due to misbehavior from
user-defined operator class code.

Discussion: https://postgr.es/m/CAH2-Wzma5G9CTtMjbrXTwOym+U=aWg-R7=-htySuztgoJLvZXg@mail.gmail.com
2021-03-23 16:09:51 -07:00
Peter Geoghegan 9dd963ae25 Recycle nbtree pages deleted during same VACUUM.
Maintain a simple array of metadata about pages that were deleted during
nbtree VACUUM's current btvacuumscan() call.  Use this metadata at the
end of btvacuumscan() to attempt to place newly deleted pages in the FSM
without further delay.  It might not yet be safe to place any of the
pages in the FSM by then (they may not be deemed recyclable), but we
have little to lose and plenty to gain by trying.  In practice there is
a very good chance that this will work out when vacuuming larger
indexes, where scanning the index naturally takes quite a while.

This commit doesn't change the page recycling invariants; it merely
improves the efficiency of page recycling within the confines of the
existing design.  Recycle safety is a part of nbtree's implementation of
what Lanin & Shasha call "the drain technique".  The design happens to
use transaction IDs (they're stored in deleted pages), but that in
itself doesn't align the cutoff for recycle safety to any of the
XID-based cutoffs used by VACUUM (e.g., OldestXmin).  All that matters
is whether or not _other_ backends might be able to observe various
inconsistencies in the tree structure (that they cannot just detect and
recover from by moving right).  Recycle safety is purely a question of
maintaining the consistency (or the apparent consistency) of a physical
data structure.

Note that running a simple serial test case involving a large range
DELETE followed by a VACUUM VERBOSE will probably show that any newly
deleted nbtree pages are not yet reusable/recyclable.  This is expected
in the absence of even one concurrent XID assignment.  It is an old
implementation restriction.  In practice it's unlikely to be the thing
that makes recycling remain unsafe, at least with larger indexes, where
recycling newly deleted pages during the same VACUUM actually matters.

An important high-level goal of this commit (as well as related recent
commits e5d8a999 and 9f3665fb) is to make expensive deferred cleanup
operations in index AMs rare in general.  If index vacuuming frequently
depends on the next VACUUM operation finishing off work that the current
operation started, then the general behavior of index vacuuming is hard
to predict.  This is relevant to ongoing work that adds a vacuumlazy.c
mechanism to skip index vacuuming in certain cases.  Anything that makes
the real world behavior of index vacuuming simpler and more linear will
also make top-down modeling in vacuumlazy.c more robust.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
2021-03-21 15:25:39 -07:00
Peter Geoghegan 02b5940dbe Consolidate nbtree VACUUM metapage routines.
Simplify _bt_vacuum_needs_cleanup() functions's signature (it only needs
a single 'rel' argument now), and move it next to its sibling function
in nbtpage.c.

I believe that _bt_vacuum_needs_cleanup() was originally located in
nbtree.c due to an include dependency issue.  That's no longer an issue.

Follow-up to commit 9f3665fb.
2021-03-12 13:11:47 -08:00
Peter Geoghegan 7bb97211a5 Save a few cycles during nbtree VACUUM.
Avoid calling RelationGetNumberOfBlocks() unnecessarily in the common
case where there are no deleted but not yet recycled pages to recycle
during a cleanup-only nbtree VACUUM operation.

Follow-up to commit e5d8a999, which (among other things) taught the
"skip full scan" nbtree VACUUM mechanism to only trigger a full index
scan when the absolute number of deleted pages in the index is
considered excessive.
2021-03-11 14:18:23 -08:00
Peter Geoghegan effdd3f3b6 Add back vacuum_cleanup_index_scale_factor parameter.
Commit 9f3665fb removed the vacuum_cleanup_index_scale_factor storage
parameter.  However, that creates dump/reload hazards when moving across
major versions.

Add back the vacuum_cleanup_index_scale_factor parameter (though not the
GUC of the same name) purely to avoid problems when using tools like
pg_upgrade.  The parameter remains disabled and undocumented.

No backpatch to Postgres 13, since vacuum_cleanup_index_scale_factor was
only disabled by REL_13_STABLE's version of master branch commit
9f3665fb in the first place -- the parameter already looks like this on
REL_13_STABLE.

Discussion: https://postgr.es/m/YEm/a3Ko3nKnBuVq@paquier.xyz
2021-03-11 12:42:46 -08:00
Peter Geoghegan 9f3665fbfc Don't consider newly inserted tuples in nbtree VACUUM.
Remove the entire idea of "stale stats" within nbtree VACUUM (stop
caring about stats involving the number of inserted tuples).  Also
remove the vacuum_cleanup_index_scale_factor GUC/param on the master
branch (though just disable them on postgres 13).

The vacuum_cleanup_index_scale_factor/stats interface made the nbtree AM
partially responsible for deciding when pg_class.reltuples stats needed
to be updated.  This seems contrary to the spirit of the index AM API,
though -- it is not actually necessary for an index AM's bulk delete and
cleanup callbacks to provide accurate stats when it happens to be
inconvenient.  The core code owns that.  (Index AMs have the authority
to perform or not perform certain kinds of deferred cleanup based on
their own considerations, such as page deletion and recycling, but that
has little to do with pg_class.reltuples/num_index_tuples.)

This issue was fairly harmless until the introduction of the
autovacuum_vacuum_insert_threshold feature by commit b07642db, which had
an undesirable interaction with the vacuum_cleanup_index_scale_factor
mechanism: it made insert-driven autovacuums perform full index scans,
even though there is no real benefit to doing so.  This has been tied to
a regression with an append-only insert benchmark [1].

Also have remaining cases that perform a full scan of an index during a
cleanup-only nbtree VACUUM indicate that the final tuple count is only
an estimate.  This prevents vacuumlazy.c from setting the index's
pg_class.reltuples in those cases (it will now only update pg_class when
vacuumlazy.c had TIDs for nbtree to bulk delete).  This arguably fixes
an oversight in deduplication-related bugfix commit 48e12913.

[1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com
Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.
2021-03-10 16:27:01 -08:00
Peter Geoghegan 5b2f2af3d9 nbtree page deletion: Add leaftopparent assertion.
Add documenting assertion.  This makes it easier to follow how we
maintain the top parent link in target subtree's half-dead/leaf level
page.
2021-03-02 14:06:07 -08:00
Peter Geoghegan 3d8d5787a3 Fix nbtree page deletion error messages.
Adjust some "can't happen" error messages that assumed that the page
deletion target page must be a half-dead page.  This assumption was
wrong in the case of an internal target page.  Simply refer to these
pages as the target page instead.

Internal pages are never marked half-dead.  There is exactly one
half-dead page for each subtree undergoing deletion.  The half-dead page
is also the target subtree's leaf-level page.  This has been the case
since commit efada2b8, which totally overhauled nbtree page deletion.
2021-03-02 13:02:24 -08:00
Peter Geoghegan 2376361839 VACUUM VERBOSE: Count "newly deleted" index pages.
Teach VACUUM VERBOSE to report on pages deleted by the _current_ VACUUM
operation -- these are newly deleted pages.  VACUUM VERBOSE continues to
report on the total number of deleted pages in the entire index (no
change there).  The former is a subset of the latter.

The distinction between each category of deleted index page only arises
with index AMs where page deletion is supported and is decoupled from
page recycling for performance reasons.

This is follow-up work to commit e5d8a999, which made nbtree store
64-bit XIDs (not 32-bit XIDs) in pages at the point at which they're
deleted.  Note that the btm_last_cleanup_num_delpages metapage field
added by that commit usually gets set to pages_newly_deleted.  The
exceptions (the scenarios in which they're not equal) all seem to be
tricky cases for the implementation (of page deletion and recycling) in
general.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX%3Dd5u-tRYhi-F4wnV2uN2zHpMUXw%40mail.gmail.com
2021-02-25 14:32:18 -08:00
Peter Geoghegan e5d8a99903 Use full 64-bit XIDs in deleted nbtree pages.
Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.

Stop storing an XID indicating the oldest bpto.xact across all deleted
though unrecycled pages in nbtree metapages.  There is no longer any
reason to care about that condition/the oldest XID.  It only ever made
sense when wraparound was something _bt_vacuum_needs_cleanup() had to
consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
2021-02-24 18:41:34 -08:00
Peter Geoghegan b071a31149 Add nbtree README section on page recycling.
Consolidate discussion of how VACUUM places pages in the FSM for
recycling by adding a new section that comes after discussion of page
deletion.  This structure reflects the fact that page recycling is
explicitly decoupled from page deletion in Lanin & Shasha's paper.  Page
recycling in nbtree is an implementation of what the paper calls "the
drain technique".

This decoupling is an important concept for nbtree VACUUM.  Searchers
have to detect and recover from concurrent page deletions, but they will
never have to reason about concurrent page recycling.  Recycling can
almost always be thought of as a low level garbage collection operation
that asynchronously frees the physical space that backs a logical tree
node.  Almost all code need only concern itself with logical tree nodes.
(Note that "logical tree node" is not currently a term of art in the
nbtree code -- this all works implicitly.)

This is preparation for an upcoming patch that teaches nbtree VACUUM to
remember the details of pages that it deletes on the fly, in local
memory.  This enables the same VACUUM operation to consider placing its
own deleted pages in the FSM later on, when it reaches the end of
btvacuumscan().
2021-02-18 21:16:33 -08:00
Peter Geoghegan 128dd901a5 nbtree README: move VACUUM linear scan section.
Discuss VACUUM's linear scan after discussion of tuple deletion by
VACUUM, but before discussion of page deletion by VACUUM.  This
progression is a lot more natural.

Also tweak the wording a little.  It seems unnecessary to talk about how
it worked prior to PostgreSQL 8.2.
2021-02-17 21:13:15 -08:00
Thomas Munro c7ecd6af01 ReadNewTransactionId() -> ReadNextTransactionId().
The new name conveys the effect better, is more consistent with similar
functions ReadNextMultiXactId(), ReadNextFullTransactionId(), and
matches the name of the variable that it reads.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmVR4SakBXQUdhhPpMf1aYvZCnna5%3DHKa7DAgEmBAg%2B8g%40mail.gmail.com
2021-02-15 13:17:02 +13:00
Peter Geoghegan 31c7fb41e2 Fix obsolete FSM remarks in nbtree README.
The free space map has used a dedicated relation fork rather than shared
memory segments for over a decade.
2021-02-09 11:36:51 -08:00
Peter Geoghegan c34787f910 Harden nbtree page deletion.
Add some additional defensive checks in the second phase of index
deletion to detect and report index corruption during VACUUM, and to
avoid having VACUUM become stuck in more cases.  The code is still not
robust in the presence of a circular chain of sibling links, though it's
not clear whether that really matters.  This is follow-up work to commit
3a01f68e.

The new defensive checks rely on the assumption that there can be no
more than one VACUUM operation running for an index at any given time.
Remove an old comment suggesting that multiple concurrent VACUUMs need
to be considered here.  This concern now seems highly unlikely to have
any real validity, since we clearly rely on the same assumption in
several other places.  For example, there are much more recent comments
that appear in the same function (added by commit efada2b8e9) that make
the same assumption.

Also add a CHECK_FOR_INTERRUPTS() to the relevant code path.  Contrary
to comments added by commit 3a01f68e, it is actually possible to handle
interrupts here, at least in the common case where processing takes
place at the leaf level.  We only hold a pin on leafbuf/target page when
stepping right at the leaf level.

No backpatch due to the lack of complaints following hardening added to
the same area by commit 3a01f68e.
2021-02-04 15:42:36 -08:00
Peter Eisentraut 1d71f3c83c Improve confusing variable names
The prototype calls the second argument of
pgstat_progress_update_multi_param() "index", and some callers name
their local variable that way.  But when the surrounding code deals
with index relations, this is confusing, and in at least one case
shadowed another variable that is referring to an index relation.
Adjust those call sites to have clearer local variable naming, similar
to existing callers in indexcmds.c.
2021-02-02 09:20:22 +01:00
Peter Geoghegan dc43492e46 Remove unused _bt_delitems_delete() argument.
The latestRemovedXid values used by nbtree deletion operations are
determined by _bt_delitems_delete()'s caller, so there is no reason to
pass a separate heapRel argument.

Oversight in commit d168b66682.
2021-01-31 10:10:55 -08:00
Peter Geoghegan d168b66682 Enhance nbtree index tuple deletion.
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
2021-01-13 09:21:32 -08:00
Peter Geoghegan 9dc718bdf2 Pass down "logically unchanged index" hint.
Add an executor aminsert() hint mechanism that informs index AMs that
the incoming index tuple (the tuple that accompanies the hint) is not
being inserted by execution of an SQL statement that logically modifies
any of the index's key columns.

The hint is received by indexes when an UPDATE takes place that does not
apply an optimization like heapam's HOT (though only for indexes where
all key columns are logically unchanged).  Any index tuple that receives
the hint on insert is expected to be a duplicate of at least one
existing older version that is needed for the same logical row.  Related
versions will typically be stored on the same index page, at least
within index AMs that apply the hint.

Recognizing the difference between MVCC version churn duplicates and
true logical row duplicates at the index AM level can help with cleanup
of garbage index tuples.  Cleanup can intelligently target tuples that
are likely to be garbage, without wasting too many cycles on less
promising tuples/pages (index pages with little or no version churn).

This is infrastructure for an upcoming commit that will teach nbtree to
perform bottom-up index deletion.  No index AM actually applies the hint
just yet.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFASTdzipdkLTNQQ@mail.gmail.com
2021-01-13 08:11:00 -08:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Peter Geoghegan 41ddc27f66 Remove obsolete btrescan() comment.
"Ordering stuff" refered to a _bt_first() call to _bt_orderkeys().
However, the _bt_orderkeys() function was renamed to
_bt_preprocess_keys() by commit fa5c8a055a.

_bt_preprocess_keys() is directly referenced just after the removed
comment already, which seems sufficient.
2020-12-15 15:55:07 -08:00
Peter Eisentraut eb93f3a0b6 Convert elog(LOG) calls to ereport() where appropriate
User-visible log messages should go through ereport(), so they are
subject to translation.  Many remaining elog(LOG) calls are really
debugging calls.

Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
2020-12-04 14:25:23 +01:00
Peter Geoghegan cf2acaf4dc Deprecate nbtree's BTP_HAS_GARBAGE flag.
Streamline handling of the various strategies that we have to avoid a
page split in nbtinsert.c.  When it looks like a leaf page is about to
overflow, we now perform deleting LP_DEAD items and deduplication in one
central place.  This greatly simplifies _bt_findinsertloc().

This has an independently useful consequence: nbtree no longer relies on
the BTP_HAS_GARBAGE page level flag/hint for anything important.  We
still set and unset the flag in the same way as before, but it's no
longer treated as a gating condition when considering if we should check
for already-set LP_DEAD bits.  This happens at the point where the page
looks like it might have to be split anyway, so simply checking the
LP_DEAD bits in passing is practically free.  This avoids missing
LP_DEAD bits just because the page-level hint is unset, which is
probably reasonably common (e.g. it happens when VACUUM unsets the
page-level flag without actually removing index tuples whose LP_DEAD-bit
was set recently, after the VACUUM operation began but before it reached
the leaf page in question).

Note that this isn't a big behavioral change compared to PostgreSQL 13.
We were already checking for set LP_DEAD bits regardless of whether the
BTP_HAS_GARBAGE page level flag was set before we considered doing a
deduplication pass.  This commit only goes slightly further by doing the
same check for all indexes, even indexes where deduplication won't be
performed.

We don't completely remove the BTP_HAS_GARBAGE flag.  We still rely on
it as a gating condition with pg_upgrade'd indexes from before B-tree
version 4/PostgreSQL 12.  That makes sense because we sometimes have to
make a choice among pages full of duplicates when inserting a tuple with
pre version 4 indexes.  It probably still pays to avoid accessing the
line pointer array of a page there, since it won't yet be clear whether
we'll insert on to the page in question at all, let alone split it as a
result.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
2020-11-17 09:45:56 -08:00
Peter Geoghegan a034f8b60c nbtree: Rename nbtinsert.c variables for consistency.
Stop naming special area/opaque pointer variables 'lpageop' in contexts
where it doesn't make sense.  This is a holdover from a time when logic
that performs tasks that are now spread across _bt_insertonpg(),
_bt_findinsertloc(), and _bt_split() was more centralized.  'lpageop'
denotes "left page", which doesn't make sense outside of contexts in
which there isn't also a right page.

Also acquire page flag variables up front within _bt_insertonpg().  This
makes it closer to _bt_split() following refactoring commit bc3087b626.
This allows the page split and retail insert paths to both make use of
the same variables.
2020-11-17 09:01:14 -08:00
Peter Geoghegan 46cf3c72c3 nbtree: Demote incomplete split "can't happen" error.
Only a basic logic bug in a _bt_insertonpg() caller could lead to a
violation of this invariant (index corruption won't do it).  A "can't
happen" error seems inappropriate (it is arbitrary at best).

Demote the error to a simple assertion.  This matches similar nearby
sanity checks.
2020-11-15 11:53:37 -08:00
Peter Geoghegan 5a2f154a2e Improve nbtree README's LP_DEAD section.
The description of how LP_DEAD bit setting by index scans works
following commit 2ed5b87f was rather unclear.  Clean that up a bit.

Also refer to LP_DEAD bit setting within _bt_check_unique() at the start
of the same section.  This mechanism may actually be more important than
the generic kill_prior_tuple mechanism that the section focuses on, so
it at least deserves to be mentioned in passing.
2020-11-07 18:51:12 -08:00
Peter Geoghegan efc5dcfd8a Fix wal_consistency_checking nbtree bug.
wal_consistency_checking indicated an inconsistency in certain cases
involving nbtree page deletion.  The underlying issue is that there was
a minor difference between the page image produced after a REDO routine
ran and the corresponding page image following original execution.

This harmless inconsistency has been around forever.  We more or less
expect total consistency among even deleted nbtree pages these days,
though, so this won't do anymore.

To fix, tweak the REDO routine to match original execution.

Oversight in commit f47b5e13.
2020-11-05 15:01:40 -08:00
Peter Geoghegan 48e1291342 Fix nbtree cleanup-only VACUUM stats inaccuracies.
Logic for counting heap TIDs from posting list tuples (added by commit
0d861bbb) was faulty.  It didn't count any TIDs/index tuples in the
event of no callback being set.  This meant that we incorrectly counted
no index tuples in clean-up only VACUUMs, which could lead to
pg_class.reltuples being spuriously set to 0 in affected indexes.

To fix, go back to counting items from the page in cases where there is
no callback.  This approach isn't very accurate, but it works well
enough in practice while avoiding the expense of accessing every index
tuple during cleanup-only VACUUMs.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
https://postgr.es/m/20201023174451.69e358f1@firost
Backpatch: 13-, where nbtree deduplication was introduced
2020-11-04 18:42:27 -08:00
Noah Misch f90e80b913 Reproduce debug_query_string==NULL on parallel workers.
Certain background workers initiate parallel queries while
debug_query_string==NULL, at which point they attempted strlen(NULL) and
died to SIGSEGV.  Older debug_query_string observers allow NULL, so do
likewise in these newer ones.  Back-patch to v11, where commit
7de4a1bcc5 introduced the first of these.

Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com
2020-10-31 08:43:28 -07:00
Amit Kapila b7f2dd959a Update parallel BTree scan state when the scan keys can't be satisfied.
For parallel btree scan to work for array of scan keys, it should reach
BTPARALLEL_DONE state once for every distinct combination of array keys.
This is required to ensure that the parallel workers don't try to seize
blocks at the same time for different scan keys. We missed to update this
state when we discovered that the scan keys can't be satisfied.

Author: James Hunter
Reviewed-by: Amit Kapila
Tested-by: Justin Pryzby
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
2020-09-17 16:11:48 +05:30
Tom Lane a5cc4dab6d Yet more elimination of dead stores and useless initializations.
I'm not sure what tool Ranier was using, but the ones I contributed
were found by using a newer version of scan-build than I tried before.

Ranier Vilela and Tom Lane

Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com
2020-09-05 13:17:32 -04:00
Tom Lane 3d351d916b Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
Historically, we've considered the state with relpages and reltuples
both zero as indicating that we do not know the table's tuple density.
This is problematic because it's impossible to distinguish "never yet
vacuumed" from "vacuumed and seen to be empty".  In particular, a user
cannot use VACUUM or ANALYZE to override the planner's normal heuristic
that an empty table should not be believed to be empty because it is
probably about to get populated.  That heuristic is a good safety
measure, so I don't care to abandon it, but there should be a way to
override it if the table is indeed intended to stay empty.

Hence, represent the initial state of ignorance by setting reltuples
to -1 (relpages is still set to zero), and apply the minimum-ten-pages
heuristic only when reltuples is still -1.  If the table is empty,
VACUUM or ANALYZE (but not CREATE INDEX) will override that to
reltuples = relpages = 0, and then we'll plan on that basis.

This requires a bunch of fiddly little changes, but we can get rid of
some ugly kluges that were formerly needed to maintain the old definition.

One notable point is that FDWs' GetForeignRelSize methods will see
baserel->tuples = -1 when no ANALYZE has been done on the foreign table.
That seems like a net improvement, since those methods were formerly
also in the dark about what baserel->tuples = 0 really meant.  Still,
it is an API change.

I bumped catversion because code predating this change would get confused
by seeing reltuples = -1.

Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
2020-08-30 12:21:51 -04:00
Andres Freund 1f51c17c68 snapshot scalability: Move PGXACT->xmin back to PGPROC.
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.

After the changes in dc7420c2c9, this is a very straight-forward change.

For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';.  The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 16:25:21 -07:00
Andres Freund b8443eae72 Fix out-of-date version reference, grammar.
Time appears to be passing fast.

Reported-By: Peter Geoghegan <pg@bowt.ie>
2020-08-12 17:04:51 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Peter Geoghegan d129c07499 Correct nbtree page split lock coupling comment.
There is no reason to distinguish between readers and writers here.
2020-08-09 12:01:15 -07:00
Peter Geoghegan 0a7d771f0f Make nbtree split REDO locking match original execution.
Make the nbtree page split REDO routine consistent with original
execution in its approach to acquiring and releasing buffer locks (at
least for pages on the tree level of the page being split).  This brings
btree_xlog_split() in line with btree_xlog_unlink_page(), which was
taught to couple buffer locks by commit 9a9db08a.

Note that the precise order in which we both acquire and release sibling
buffer locks in btree_xlog_split() now matches original execution
exactly (the precise order in which the locks are released probably
doesn't matter much, but we might as well be consistent about it).

The rule for nbtree REDO routines from here on is that same-level locks
should be acquired in an order that's consistent with original
execution.  It's not practical to have a similar rule for cross-level
page locks, since for the most part original execution holds those locks
for a period that spans multiple atomic actions/WAL records.  It's also
not necessary, because clearly the cross-level lock coupling is only
truly needed during original execution because of the presence of
concurrent inserters.

This is not a bug fix (unlike the similar aforementioned commit, commit
9a9db08a).  The immediate reason to tighten things up in this area is to
enable an upcoming enhancement to contrib/amcheck that allows it to
verify that sibling links are in agreement with only an AccessShareLock
(this check produced false positives when run on a replica server on
account of the inconsistency fixed by this commit).  But that's not the
only reason to be stricter here.

It is generally useful to make locking on replicas be as close to what
happens during original execution as practically possible.  It makes it
less likely that hard to catch bugs will slip in in the future.  The
previous state of affairs seems to be a holdover from before the
introduction of Hot Standby, when buffer lock acquisitions during
recovery were totally unnecessary.  See also: commit 3bbf668d, which
tightened things up in this area a few years after the introduction of
Hot Standby.

Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
2020-08-07 15:27:56 -07:00
Peter Geoghegan 3df92bbd1d Rename nbtree split REDO routine variables.
Make the nbtree page split REDO routine variable names consistent with
_bt_split() (which handles the original execution of page splits).
These names make the code easier to follow by making the distinction
between the original page and the left half of the split clear.  (The
left half of the split page is a temp page that REDO creates to replace
the origpage contents.)

Also reduce the elevel used when adding a new high key to the temp page
from PANIC to ERROR to be consistent.  We already only raise an ERROR
when data item PageAddItem() temp page calls fail.
2020-08-07 09:53:27 -07:00
Alexander Korotkov f47b5e1395 Remove btree page items after page unlink
Currently, page unlink leaves remaining items "as is", but replay of
corresponding WAL-record re-initializes page leaving it with no items.
For the sake of consistency, this commit makes primary delete all the items
during page unlink as well.

Thanks to this change, we now don't mask contents of deleted btree page for
WAL consistency checking.

Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
2020-08-05 02:16:13 +03:00
Peter Geoghegan 9a9db08ae4 Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason
about concurrent page splits/deletions in its own peculiar way) to
become confused when running on a replica.  Concurrent replay of a WAL
record that describes the second phase of page deletion could cause
_bt_walk_left() to get confused.  btree_xlog_unlink_page() simply failed
to adhere to the same locking protocol that we use on the primary, which
is obviously wrong once you consider these two disparate functions
together.  This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left()
from observing inconsistencies between the deletion's target page and
its original sibling pages when running on a replica.  This is true even
though the second phase of page deletion is supposed to work as a single
atomic action.  Queries running on replicas raised "could not find left
sibling of block %u in index %s" can't-happen errors when they went back
to their scan's "original" page and observed that the page has not been
marked deleted (even though it really was concurrently deleted).

There is no evidence that this actually happened in the real world.  The
issue came to light during unrelated feature development work.  Note
that _bt_walk_left() is the only code that cares about the difference
between a half-dead page and a fully deleted page that isn't also
exclusively used by nbtree VACUUM (unless you include contrib/amcheck
code).  It seems very likely that backward scans are the only thing that
could become confused by the inconsistency.  Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
and right sibling pages in that order before releasing any locks (just
like _bt_unlink_halfdead_page()).  This is the simplest possible
approach.  There doesn't seem to be any opportunity to be more clever
about lock acquisition in the REDO routine, and it hardly seems worth
the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling
links with only an AccessShareLock on the relation.  An amcheck patch
from Andrey Borodin was rejected back in January because it clashed with
btree_xlog_unlink_page()'s lax approach to locking pages.  It now seems
likely that the real problem was with btree_xlog_unlink_page(), not the
patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
2020-08-03 15:54:38 -07:00
Peter Geoghegan a451b7d442 Add nbtree page deletion assertion.
Add a documenting assertion that's similar to the nearby assertion added
by commit cd8c73a3.  This conveys that the entire call to _bt_pagedel()
does no work if it isn't possible to get a descent stack for the initial
scanblkno page.
2020-08-03 13:04:42 -07:00
Tom Lane 9f9682783b Invent "amadjustmembers" AM method for validating opclass members.
This allows AM-specific knowledge to be applied during creation of
pg_amop and pg_amproc entries.  Specifically, the AM knows better than
core code which entries to consider as required or optional.  Giving
the latter entries the appropriate sort of dependency allows them to
be dropped without taking out the whole opclass or opfamily; which
is something we'd like to have to correct obsolescent entries in
extensions.

This callback also opens the door to performing AM-specific validity
checks during opclass creation, rather than hoping than an opclass
developer will remember to test with "amvalidate".  For the most part
I've not actually added any such checks yet; that can happen in a
follow-on patch.  (Note that we shouldn't remove any tests from
"amvalidate", as those are still needed to cross-check manually
constructed entries in the initdb data.  So adding tests to
"amadjustmembers" will be somewhat duplicative, but it seems like
a good idea anyway.)

Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and
Anastasia Lubennikova.

Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
2020-08-01 17:12:47 -04:00
Peter Geoghegan 4a70f829d8 Add nbtree Valgrind buffer lock checks.
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page
provides very weak guarantees, especially compared to heapam, where it's
often safe to read a page while only holding a buffer pin.  This commit
has Valgrind enforce the following rule: it is never okay to access an
nbtree buffer without holding both a pin and a lock on the buffer.

A draft version of this patch detected questionable code that was
cleaned up by commits fa7ff642 and 7154aa16.  The code in question used
to access an nbtree buffer page's special/opaque area with no buffer
lock (only a buffer pin).  This practice (which isn't obviously unsafe)
is hereby formally disallowed in nbtree.  There doesn't seem to be any
reason to allow it, and banning it keeps things simple for Valgrind.

The new checks are implemented by adding custom nbtree client requests
(located in LockBuffer() wrapper functions); these requests are
"superimposed" on top of the generic bufmgr.c Valgrind client requests
added by commit 1e0dfd16.  No custom resource management cleanup code is
needed to undo the effects of marking buffers as non-accessible under
this scheme.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-07-21 15:50:58 -07:00
Peter Geoghegan 5da8bf8bbb Avoid CREATE INDEX unique index deduplication.
There is no advantage to attempting deduplication for a unique index
during CREATE INDEX, since there cannot possibly be any duplicates.
Doing so wastes cycles due to unnecessary copying.  Make sure that we
avoid it consistently.

We already avoided unique index deduplication in the case where there
were some spool2 tuples to merge.  That didn't account for the fact that
spool2 is removed early/unset in the common case where it has no tuples
that need to be merged (i.e. it failed to account for the "spool2 turns
out to be unnecessary" optimization in _bt_spools_heapscan()).

Oversight in commit 0d861bbb, which added nbtree deduplication

Backpatch: 13-, where nbtree deduplication was introduced.
2020-07-17 09:50:48 -07:00
Andres Freund 5e7bbb5286 code: replace 'master' with 'primary' where appropriate.
Also changed "in the primary" to "on the primary", and added a few
"the" before "primary".

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 12:57:23 -07:00
Peter Geoghegan 28c16f4947 Remove unnecessary PageIsEmpty() nbtree build check.
nbtree index builds cannot write out an empty page.  That would mean
that there was no way to create a pivot tuple pointing to the page one
level up, since _bt_truncate() generates one based on page's firstright
tuple.

Replace the unnecessary PageIsEmpty() check with an assertion that
checks that the page has space for at least two line pointers (the
would-be high key line pointer, plus at least one valid "data item"
tuple line pointer).

The PageIsEmpty() check was added by commit 5d9f146c over 20 years ago.
It looks like it has always been unnecessary.
2020-07-06 13:47:29 -07:00
Peter Geoghegan e25d462a38 nbtree: Rename _bt_search() variables.
Make some of the variable names in _bt_search() consistent with
corresponding variables within _bt_getstackbuf().  This naming scheme is
clearer because the variable names always express a relationship between
the currently locked buffer/page and some other page.
2020-07-02 14:54:55 -07:00
Peter Geoghegan f7a476f0d6 nbtree: Correct inaccurate split location comment.
Minor oversight in commit fab2502433.
2020-06-29 12:30:39 -07:00
Peter Geoghegan 10f1ab2cb8 Fix misuse of table_index_fetch_tuple_check().
Commit 0d861bbb, which added deduplication to nbtree, had
_bt_check_unique() pass a TID to table_index_fetch_tuple_check() that
isn't safe to mutate.  table_index_fetch_tuple_check()'s tid argument is
modified when the TID in question is not the latest visible tuple in a
hot chain, though this wasn't documented.

To fix, go back to using a local copy of the TID in _bt_check_unique(),
and update comments above table_index_fetch_tuple_check().

Backpatch: 13-, where B-Tree deduplication was introduced.
2020-06-25 10:55:28 -07:00
Peter Geoghegan be14f884d5 Fix deduplication "single value" strategy bug.
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits.  This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.

To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple.  This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.

Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
2020-06-19 08:57:24 -07:00
Peter Geoghegan d64f1cdf2f Silence _bt_check_unique compiler warning.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/841649.1592065060@sss.pgh.pa.us
2020-06-13 09:33:33 -07:00
Peter Geoghegan 67b0b2dbf9 Reconsider nbtree page deletion assertion.
Commit 624686abcf added an assertion that verified that _bt_search
successfully relocated the leaf page undergoing deletion.  Page deletion
cannot deal with the case where the descent stack is to the right of the
page, so this seemed critical (deletion can only handle the case where
the descent stack is to the left of the leaf/target page).  However, the
assertion went a bit too far.

Since only a buffer pin is held on the leaf page throughout the call to
_bt_search, nothing guarantees that it can't have split during this
small window.  And if does actually split, _bt_search may end up
"relocating" a page to the right of the original target leaf page.  This
scenario seems extremely unlikely, but it must still be considered.
Remove the assertion, and document how we cope in this scenario.
2020-05-19 15:04:34 -07:00
Tom Lane e02ad575d8 Final pgindent run with pg_bsd_indent version 2.1.
This is just to provide a clean basis for comparison of the results
of the new version.  I did fix a typo that crept into 242dfcbaf.

Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
2020-05-16 11:49:14 -04:00
Alvaro Herrera 242dfcbafa
Avoid killing btree items that are already dead
_bt_killitems marks btree items dead when a scan leaves the page where
they live, but it does so with only share lock (to improve concurrency).
This was historicall okay, since killing a dead item has no
consequences.  However, with the advent of data checksums and
wal_log_hints, this action incurs a WAL full-page-image record of the
page.  Multiple concurrent processes would write the same page several
times, leading to WAL bloat.  The probability of this happening can be
reduced by only killing items if they're not already dead, so change the
code to do that.

The problem could eliminated completely by having _bt_killitems upgrade
to exclusive lock upon seeing a killable item, but that would reduce
concurrency so it's considered a cure worse than the disease.

Backpatch all the way back to 9.5, since wal_log_hints was introduced in
9.4.

Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
2020-05-15 16:50:34 -04:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Peter Geoghegan 624686abcf Adjust "root of to-be-deleted subtree" function.
Restructure the function that locates the root of the to-be-deleted
subtree during nbtree page deletion.  Handle the conditions that make
page deletion unsafe in a slightly more uniform way, and acknowledge the
fact that the behavior with incomplete splits on internal pages is
different (as pointed out in the nbtree README as of commit 35bc0ec7).
Also invent new terminology that avoids ambiguity around which pages are
about to be deleted.  Consistently use the term "to-be-deleted subtree",
not the ambiguous term "branch".

We were calling the subtree parent page the "top parent page", but that
was quite misleading.  The top parent page usually refers to a page
unlinked from its siblings and marked deleted (during the second stage
of page deletion).  There was one kind of top parent page that we merely
removed a downlink from, and another kind of top parent page that we
actually marked deleted.  Eliminate the ambiguity by inventing a new
term ("subtree parent page") that refers to the former kind of page
only.
2020-05-11 11:01:07 -07:00
Peter Geoghegan cd8c73a38a Refactor nbtree deletion INCOMPLETE_SPLIT check.
Factor out code common to _bt_lock_branch_parent() and _bt_pagedel()
into a new utility function.  This new function is used to check that
the left sibling of a deletion target page does not have the
INCOMPLETE_SPLIT page flag set.  If it is set then deletion is unsafe;
there won't be a usable pivot tuple (with a downlink) in the parent page
that points to the deletion target page.  The page deletion algorithm is
not prepared to deal with that.  Also restructure an existing, related
utility function that checks if the right sibling of the target page has
the ISHALFDEAD page flag set.

This organization highlights the symmetry between the two cases.  The
goal is to make the design of page deletion clearer.  Both functions
involve a sibling page with a flag that indicates that there was an
interrupted operation (a page split or a page deletion) that resulted in
a page pointed to by sibling pages, but not pointed to in the parent.
And, both functions indicate if page deletion is unsafe due to the
absence of a particular downlink in the parent page.
2020-05-07 16:08:54 -07:00
Peter Geoghegan 9dc7251417 Refactor btvacuumpage().
Remove one of the arguments to btvacuumpage(), and give up on the idea
that it's a recursive function.  We now use the term "backtracking" to
refer to the case where an earlier block must be visited to make sure no
tuples that need to be removed were missed.

Advertising btvacuumpage() as a recursive function was unhelpful.  In
reality the function always simulates recursion with a loop (it doesn't
actually call itself).  This wasn't just necessary as a precaution (per
the comments mentioning tail recursion), though.  There is no reliable
natural limit on the number of times we can backtrack.

There are important behavioral difference when "recursing"/backtracking,
mostly related to page deletion.  We don't perform page deletion when
backtracking due to the extra complexity.  And when we recurse, we're
not performing a physical order scan anymore, so we expect fairly
different conditions to hold for the page.  Structuring the code like
this makes it clearer how _bt_pagedel() cooperates with btvacuumpage()
and btvacuumscan() (as established in commit b0229f26 and commit
73a076b0).

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com
2020-05-02 14:04:33 -07:00
Peter Geoghegan 69cf853fe7 Clear up issue with FSM and oldest bpto.xact.
On further reflection, code comments added by commit b0229f26 slightly
misrepresented how we determine the oldest bpto.xact for the index.
btvacuumpage() does not treat the bpto.xact of a page that it put in the
FSM as a candidate to be the oldest deleted page (the delete-marked page
that has the oldest bpto.xact XID among all pages encountered).

The definition of a deleted page for the purposes of the bpto.xact
calculation is different from the definition used by the bulk delete
statistics.  The bulk delete statistics don't distinguish between pages
that were deleted by the current VACUUM, pages deleted by a previous
VACUUM operation but not yet recyclable/reusable, and pages that are
reusable (though reusable pages are counted separately).

Backpatch: 11-, just like commit b0229f26.
2020-05-01 12:19:44 -07:00
Peter Geoghegan 4e21f8b633 Reorder function prototypes for consistency. 2020-05-01 10:03:38 -07:00
Peter Geoghegan 73a076b03f Fix undercounting in VACUUM VERBOSE output.
The logic for determining how many nbtree pages in an index are deleted
pages sometimes undercounted pages.  Pages that were deleted by the
current VACUUM operation (as opposed to some previous VACUUM operation
whose deleted pages have yet to be reused) were sometimes overlooked.
The final count is exposed to users through VACUUM VERBOSE's "%u index
pages have been deleted" output.

btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
than one page by assuming that only one page was deleted, and that the
additional deleted pages would get picked up during a future call to
btvacuumpage() by the same VACUUM operation.  _bt_pagedel() can
legitimately delete pages that the btvacuumscan() scan will not visit
again, though, so that assumption was slightly faulty.

Fix the accounting by teaching _bt_pagedel() about its caller's
requirements.  It now only reports on pages that it knows btvacuumscan()
won't visit again (including the current btvacuumpage() page), so
everything works out in the end.

This bug has been around forever.  Only backpatch to v11, though, to
keep _bt_pagedel() is sync on the branches that have today's bugfix
commit b0229f26da.  Note that this commit changes the signature of
_bt_pagedel(), just like commit b0229f26da.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-
2020-05-01 09:51:09 -07:00
Peter Geoghegan b0229f26da Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c36cd (which taught nbtree VACUUM to skip a scan of the
index from btcleanup in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole.  It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).

The most immediate problem was that the special area of the page was
examined without a buffer pin at one point.  More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors.  For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page.  The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.

It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for.  The issue was spotted during
unrelated code review.

To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel().  btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()).  Note that
_bt_pagedel() should never encounter an existing deleted page.

This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel().  However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM).  It seems highly unlikely that such an
extension actually exists, in any case.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
2020-05-01 08:39:52 -07:00
Peter Geoghegan dd1f645cc8 Fix AddressSanitizer use-after-scope complaint.
XLogRegisterBufData() does not copy data pointed to by caller's pointer
argument.

Oversight in commit 0d861bbb70.

Author: Peter Eisentraut
Reported-By: Peter Eisentraut
Discussion: https://postgr.es/m/21800dbe-a13e-22f7-d423-b81db9d249f5@2ndquadrant.com
2020-04-30 12:31:56 -07:00
Peter Geoghegan ab2343d4cb Remove redundant _bt_killitems() buffer check.
_bt_getbuf() cannot return an invalid buffer.

Oversight in commit 2ed5b87f96.
2020-04-29 18:17:49 -07:00
Peter Geoghegan 7154aa16a6 Fix another minor page deletion buffer lock issue.
Avoid accessing the leaf page's top parent tuple without a buffer lock
held during the second phase of nbtree page deletion.  The old approach
was safe, though only because VACUUM never drops its buffer pin (and
because only VACUUM itself can modify a half-dead page).  Even still, it
seems like a good idea to be strict here.  Tighten things up by copying
the top parent page's block number to a local variable before releasing
the buffer lock on the leaf page -- not after.

This is a follow-up to commit fa7ff642, which fixed a similar issue in
the first phase of nbtree page deletion.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 16:45:20 -07:00
Peter Geoghegan fa7ff642c2 Fix minor nbtree page deletion buffer lock issue.
Avoid accessing the deletion target page's special area during nbtree
page deletion at a point where there is no buffer lock held.  This issue
was detected by a patch that extends Valgrind's memcheck tool to mark
nbtree pages that are unsafe to access (due to not having a buffer lock
or buffer pin) as NOACCESS.

We do hold a buffer pin at this point, and only access the special area,
so the old approach was safe.  Even still, it seems like a good idea to
tighten up the rules in this area.  There is no reason to not simply
insist on always holding a buffer lock (not just a pin) when accessing
nbtree pages.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 14:17:02 -07:00
Peter Geoghegan 1542e16f2c Consider outliers in split interval calculation.
Commit 0d861bbb, which introduced deduplication to nbtree, added some
logic to take large posting list tuples into account when choosing a
split point.  We subtract firstright posting list overhead from the
projected new high key size when calculating leftfree/rightfree values
for an affected candidate split point.  Posting list tuples aren't
special to nbtsplitloc.c, but taking them into account like this makes a
huge difference in practice.  Posting list tuples are frequently tuple
size outliers.

However, commit 0d861bbb missed a closely related issue: split interval
itself is calculated based on the assumption that tuples on the page
being split are roughly equisized.  That assumption was acceptable back
when commit fab25024 taught the logic for choosing a split point about
suffix truncation, but it's pretty questionable now that very large
tuple sizes are common.  This oversight led to unbalanced page splits in
low cardinality multi-column indexes when deduplication was used: page
splits that don't give sufficient weight to how unbalanced the split is
when the interval happens to include some large posting list tuples (and
when most other tuples on the page are not so large).

Nail this down by calculating an initial split interval in a way that's
attuned to the actual cost that we want to keep under control (not a
fuzzy proxy for the cost): apply a leftfree + rightfree evenness test to
each candidate split point that actually gets included in the split
interval (for the default strategy).  This replaces logic that used a
percentage of all legal split points for the page as the basis of the
initial split interval.

Discussion: https://postgr.es/m/CAH2-WznJt5aT2uUB2Bs+JBLdwe0XTX67+xeLFcaNvCKxO=QBVQ@mail.gmail.com
2020-04-21 09:59:24 -07:00
Peter Geoghegan f0ca378d4c Slightly simplify nbtree split point choice loop.
Spotted during post-commit review of the nbtree deduplication commit
(commit 0d861bbb).
2020-04-15 15:47:26 -07:00
Peter Geoghegan 4a05a64095 Remove obsolete "hole in center of page" comment.
A comment from the Berkeley days incorrectly claimed that the page
management code cares about the contents of the hole in the center of
the page (at least in the case of the left half of an nbtree page
split).  Commit 8fa30f906b added an addendum that stated that the
original comment was "probably obsolete".  It's definitely obsolete,
though, so remove the original comment plus the addendum.
2020-04-14 14:38:28 -07:00
Peter Geoghegan 80634e3b18 Rearrange _bt_insertonpg() "update metapage" code.
Nest the "update metapage as part of insert into root-like page" branch
inside the broader "insert into internal page" branch.  This improves
readability.
2020-04-14 09:33:18 -07:00
Peter Geoghegan f762b2feba Add defensive "split_only_page" nbtree assertion.
Clearly it's not okay for nbtree to split a page that is the only page
on its level, and then find that it has to split the parent one level up
in turn.  There is simply no code to handle the split_only_page case in
the _bt_insertonpg() "newitem won't fit" branch (only the "newitem fits"
branch handles split_only_page).  Add a defensive assertion that will
fail if a split_only_page call to _bt_insertonpg() somehow ends up
splitting the target/parent page.

I (pgeoghegan) believe that we don't need split_only_page handling for
the "newitem won't fit" branch because anybody calling _bt_insertonpg()
like this would have to hold a lock on the same one and only child page.
2020-04-13 21:11:03 -07:00
Peter Geoghegan 826ee1a019 Make _bt_insertonpg() more like _bt_split().
It seems like a good idea for nbtree's retail insert code to be
absolutely consistent with nbtree's page split code for anything that
naturally requires equivalent handling.  Anything that concerns
inserting newitem (which is handled as part of the page split atomic
action when a page split is required) should work in exactly the same
way.  With that in mind, make _bt_insertonpg() handle 'cbuf' in a way
that matches _bt_split().
2020-04-13 19:26:41 -07:00
Peter Geoghegan bc3087b626 Harmonize nbtree page split point code.
An nbtree split point can be thought of as a point between two adjoining
tuples from an imaginary version of the page being split that includes
the incoming/new item (in addition to the items that really are on the
page).  These adjoining tuples are called the lastleft and firstright
tuples.

The variables that represent split points contained a field called
firstright, which is an offset number of the first data item from the
original page that goes on the new right page.  The corresponding tuple
from origpage was usually the same thing as the actual firstright tuple,
but not always: the firstright tuple is sometimes the new/incoming item
instead.  This situation seems unnecessarily confusing.

Make things clearer by renaming the origpage offset returned by
_bt_findsplitloc() to "firstrightoff".  We now have a firstright tuple
and a firstrightoff offset number which are comparable to the
newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers
respectively.  Also make sure that we are consistent about how we
describe nbtree page split point state.

Push the responsibility for dealing with pg_upgrade'd !heapkeyspace
indexes down to lower level code, relieving _bt_split() from dealing
with it directly.  This means that we always have a palloc'd left page
high key on the leaf level, no matter what.  This enables simplifying
some of the code (and code comments) within _bt_split().

Finally, restructure the page split code to make it clearer why suffix
truncation (which only takes place during leaf page splits) is
completely different to the first data item truncation that takes place
during internal page splits.  Tuples are marked as having fewer
attributes stored in both cases, and the firstright tuple is truncated
in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-13 16:39:55 -07:00
Amit Kapila 5c71362174 Allow parallel create index to accumulate buffer usage stats.
Currently, we don't account for buffer usage incurred by parallel workers
for parallel create index.  This commit allows each worker to record the
buffer usage stats and leader backend to accumulate that stats at the
end of the operation.  This will allow pg_stat_statements to display
correct buffer usage stats for (parallel) create index command.

Reported-by: Julien Rouhaud
Author: Sawada Masahiko
Reviewed-by: Dilip Kumar, Julien Rouhaud and Amit Kapila
Backpatch-through: 11, where this was introduced
Discussion: https://postgr.es/m/20200328151721.GB12854@nol
2020-04-09 09:49:30 +05:30