Commit Graph

214 Commits

Author SHA1 Message Date
Peter Geoghegan cd8c73a38a Refactor nbtree deletion INCOMPLETE_SPLIT check.
Factor out code common to _bt_lock_branch_parent() and _bt_pagedel()
into a new utility function.  This new function is used to check that
the left sibling of a deletion target page does not have the
INCOMPLETE_SPLIT page flag set.  If it is set then deletion is unsafe;
there won't be a usable pivot tuple (with a downlink) in the parent page
that points to the deletion target page.  The page deletion algorithm is
not prepared to deal with that.  Also restructure an existing, related
utility function that checks if the right sibling of the target page has
the ISHALFDEAD page flag set.

This organization highlights the symmetry between the two cases.  The
goal is to make the design of page deletion clearer.  Both functions
involve a sibling page with a flag that indicates that there was an
interrupted operation (a page split or a page deletion) that resulted in
a page pointed to by sibling pages, but not pointed to in the parent.
And, both functions indicate if page deletion is unsafe due to the
absence of a particular downlink in the parent page.
2020-05-07 16:08:54 -07:00
Peter Geoghegan 4e21f8b633 Reorder function prototypes for consistency. 2020-05-01 10:03:38 -07:00
Peter Geoghegan 73a076b03f Fix undercounting in VACUUM VERBOSE output.
The logic for determining how many nbtree pages in an index are deleted
pages sometimes undercounted pages.  Pages that were deleted by the
current VACUUM operation (as opposed to some previous VACUUM operation
whose deleted pages have yet to be reused) were sometimes overlooked.
The final count is exposed to users through VACUUM VERBOSE's "%u index
pages have been deleted" output.

btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
than one page by assuming that only one page was deleted, and that the
additional deleted pages would get picked up during a future call to
btvacuumpage() by the same VACUUM operation.  _bt_pagedel() can
legitimately delete pages that the btvacuumscan() scan will not visit
again, though, so that assumption was slightly faulty.

Fix the accounting by teaching _bt_pagedel() about its caller's
requirements.  It now only reports on pages that it knows btvacuumscan()
won't visit again (including the current btvacuumpage() page), so
everything works out in the end.

This bug has been around forever.  Only backpatch to v11, though, to
keep _bt_pagedel() is sync on the branches that have today's bugfix
commit b0229f26da.  Note that this commit changes the signature of
_bt_pagedel(), just like commit b0229f26da.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-
2020-05-01 09:51:09 -07:00
Peter Geoghegan b0229f26da Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c36cd (which taught nbtree VACUUM to skip a scan of the
index from btcleanup in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole.  It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).

The most immediate problem was that the special area of the page was
examined without a buffer pin at one point.  More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors.  For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page.  The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.

It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for.  The issue was spotted during
unrelated code review.

To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel().  btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()).  Note that
_bt_pagedel() should never encounter an existing deleted page.

This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel().  However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM).  It seems highly unlikely that such an
extension actually exists, in any case.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
2020-05-01 08:39:52 -07:00
Peter Geoghegan 7154aa16a6 Fix another minor page deletion buffer lock issue.
Avoid accessing the leaf page's top parent tuple without a buffer lock
held during the second phase of nbtree page deletion.  The old approach
was safe, though only because VACUUM never drops its buffer pin (and
because only VACUUM itself can modify a half-dead page).  Even still, it
seems like a good idea to be strict here.  Tighten things up by copying
the top parent page's block number to a local variable before releasing
the buffer lock on the leaf page -- not after.

This is a follow-up to commit fa7ff642, which fixed a similar issue in
the first phase of nbtree page deletion.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 16:45:20 -07:00
Peter Geoghegan fa7ff642c2 Fix minor nbtree page deletion buffer lock issue.
Avoid accessing the deletion target page's special area during nbtree
page deletion at a point where there is no buffer lock held.  This issue
was detected by a patch that extends Valgrind's memcheck tool to mark
nbtree pages that are unsafe to access (due to not having a buffer lock
or buffer pin) as NOACCESS.

We do hold a buffer pin at this point, and only access the special area,
so the old approach was safe.  Even still, it seems like a good idea to
tighten up the rules in this area.  There is no reason to not simply
insist on always holding a buffer lock (not just a pin) when accessing
nbtree pages.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 14:17:02 -07:00
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
Peter Geoghegan d2e5e20e57 Add xl_btree_delete optimization.
Commit 558a9165e0 taught _bt_delitems_delete() to produce its own XID
horizon on the primary.  Standbys no longer needed to generate their own
latestRemovedXid, since they could just use the explicitly logged value
from the primary instead.  The deleted offset numbers array from the
xl_btree_delete WAL record was no longer used by the REDO routine for
anything other than deleting the items.

This enables a minor optimization:  We now treat the array as buffer
state, not generic WAL data, following _bt_delitems_vacuum()'s example.
This should be a minor win, since it allows us to avoid including the
deleted items array in cases where XLogInsert() stores the whole buffer
anyway.  The primary goal here is to make the code more maintainable,
though.  Removing inessential differences between the two functions
highlights the fundamental differences that remain.

Also change xl_btree_delete to use uint32 for the size of the array of
item offsets being deleted.  This brings xl_btree_delete closer to
xl_btree_vacuum.  Furthermore, it seems like a good idea to use an
explicit-width integer type (the field was previously an "int").

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

Discussion: https://postgr.es/m/CAH2-Wzkz4TjmezzfAbaV1zYrh=fr0bCpzuJTvBe5iUQ3aUPsCQ@mail.gmail.com
2020-01-03 12:18:13 -08:00
Peter Geoghegan 44e44bd258 Correct _bt_delitems_vacuum() lock comments.
The expectation within _bt_delitems_vacuum() is that caller has a
super-exclusive/cleanup buffer lock (not just a pin and a write lock).
2020-01-02 13:30:40 -08:00
Peter Geoghegan 4b25f5d0ba Revise BTP_HAS_GARBAGE nbtree VACUUM comments.
_bt_delitems_vacuum() comments claimed that it isn't worth another scan
of the page to avoid falsely unsetting the BTP_HAS_GARBAGE page flag
hint (this happens to be the same wording that was removed from
_bt_delitems_delete() by my recent commit fe97c61c).  The comments made
little sense, though.  The issue can't have much to do with performing a
second scan of the target leaf page, since an LP_DEAD test could easily
be performed in the first scan of the page anyway (the scan that takes
place in btvacuumpage() caller).

Revise the explanation.  It makes much more sense to frame this as an
issue about recovery conflicts.  _bt_delitems_vacuum() cannot easily
generate an XID cutoff in the same way that _bt_delitems_delete() is
designed to.

Falsely unsetting the page flag is not ideal, and is likely to happen
more often than was supposed by the original comments.  Explain why it
usually isn't a problem in practice.  There may be an argument for
_bt_delitems_vacuum() not clearing the BTP_HAS_GARBAGE bit, removing the
question of it being falsely unset by VACUUM (there may even be an
argument for not using a page level hint at all).  This can be revisited
later.
2020-01-01 17:29:41 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Peter Geoghegan fe97c61c87 Update nbtree LP_DEAD item deletion comments.
Comments about the consequences of clearing the BTP_HAS_GARBAGE page
flag bit that apply only to VACUUM were added to code that deals with
opportunistic deletion of LP_DEAD items by commit a760893d.  The same
comment block was added to both _bt_delitems_vacuum() and
_bt_delitems_delete().  Correct _bt_delitems_delete()'s copy of the
comment block.

_bt_delitems_delete() reliably deletes items that were found by caller
to have their LP_DEAD bit set.  There is no question about whether or
not unsetting the BTP_HAS_GARBAGE bit can miss some LP_DEAD items that
were set recently.

Also tweak a related section of the nbtree README.
2019-12-22 19:57:35 -08:00
Peter Geoghegan 9f83468b35 Remove unneeded "pin scan" nbtree VACUUM code.
The REDO routine for nbtree's xl_btree_vacuum record type hasn't
performed a "pin scan" since commit 3e4b7d87 went in, so clearly there
isn't any point in VACUUM WAL-logging information that won't actually be
used.  Finish off the work of commit 3e4b7d87 (and the closely related
preceding commit 687f2cd7) by removing the code that generates this
unused information.  Also remove the REDO routine code disabled by
commit 3e4b7d87.

Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a
new "ndeleted" field.  The new field isn't actually needed right now,
since we could continue to infer the array length from the overall
record length.  However, an upcoming patch to add deduplication to
nbtree needs to add an "items updated" field to xl_btree_vacuum, so we
might as well start being explicit about the number of items now.
(Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum
struct without any fields; the C standard says that that's undefined.)

nbtree VACUUM no longer forces writing a WAL record for the last block
in the index.  Writing out a WAL record with no items for the final
block was supposed to force processing of a lastBlockVacuumed field by a
pin scan.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
2019-12-19 11:35:55 -08:00
Bruce Momjian b93e9a5c94 revert: Remove meaningless assignments in nbtree code
Reverts commit 05684c8255.

Reported-by: Tom Lane

Discussion: https://postgr.es/m/404.1576770942@sss.pgh.pa.us

Backpatch-through: master
2019-12-19 11:19:10 -05:00
Bruce Momjian 05684c8255 Remove meaningless assignments in nbtree code
Reported-by: Ranier Vilela

Discussion: https://postgr.es/m/MN2PR18MB2927BB876D12A70FDBE8F35AE3450@MN2PR18MB2927.namprd18.prod.outlook.com

Backpatch-through: master
2019-12-19 10:33:48 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan b8b3a276d4 Remove obsolete nbtree page deletion comment.
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed the concept of a half-dead internal page.  Remove a
comment about half dead parent pages that was overlooked.
2019-08-27 14:01:43 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Geoghegan af0ba49809 Use PageIndexTupleOverwrite() within nbtree.
Use the PageIndexTupleOverwrite() bufpage.c routine within nbtree
instead of deleting a tuple and re-inserting its replacement.  This
makes the intent of affected code slightly clearer.  It also makes
CREATE INDEX slightly faster, since there is no longer a need to shift
every leaf page's line pointer array back and forth during index builds.

Author: Peter Geoghegan, Anastasia Lubennikova
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wz=Zk=B9+Vwm376WuO7YTjFc2SSskifQm4Nme3RRRPtOSQ@mail.gmail.com
2019-08-13 11:54:26 -07:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Peter Geoghegan d004147eb3 Fix nbtree metapage cache upgrade bug.
Commit 857f9c36cd, which taught nbtree VACUUM to avoid unnecessary
index scans, bumped the nbtree version number from 2 to 3, while adding
the ability for nbtree indexes to be upgraded on-the-fly.  Various
assertions that assumed that an nbtree index was always on version 2 had
to be changed to accept any supported version (version 2 or 3 on
Postgres 11).

However, a few assertions were missed in the initial commit, all of
which were in code paths that cache a local copy of the metapage
metadata, where the index had been expected to be on the current version
(no longer version 2) as a generic sanity check.  Rather than simply
update the assertions, follow-up commit 0a64b45152 intentionally made
the metapage caching code update the per-backend cached metadata version
without changing the on-disk version at the same time.  This could even
happen when the planner needed to determine the height of a B-Tree for
costing purposes.  The assertions only fail on Postgres v12 when
upgrading from v10, because they were adjusted to use the authoritative
shared memory metapage by v12's commit dd299df8.

To fix, remove the cache-only upgrade mechanism entirely, and update the
assertions themselves to accept any supported version (go back to using
the cached version in v12).  The fix is almost a full revert of commit
0a64b45152 on the v11 branch.

VACUUM only considers the authoritative metapage, and never bothers with
a locally cached version, whereas everywhere else isn't interested in
the metapage fields that were added by commit 857f9c36cd.  It seems
unlikely that this bug has affected any user on v11.

Reported-By: Christoph Berg
Bug: #15896
Discussion: https://postgr.es/m/15896-5b25e260fdb0b081%40postgresql.org
Backpatch: 11-, where VACUUM was taught to avoid unnecessary index scans.
2019-07-18 13:22:56 -07:00
Peter Geoghegan 66c5bd3a6f Remove obsolete nbtree "get root" comment.
Remove a very old Berkeley era comment that doesn't seem to have
anything to do with the current locking considerations within
_bt_getroot().

Discussion: https://postgr.es/m/CAH2-WzmA2H+rL-xxF5o6QhMD+9x6cJTnz2Mr3Li_pbPBmqoTBQ@mail.gmail.com
2019-07-01 22:28:08 -07:00
Thomas Munro 89ff7c08ee Remove unnecessary comment.
Author: Vik Fearing
Discussion: https://postgr.es/m/150d3e9f-c7ec-3fb3-4fdb-def47c4144af%402ndquadrant.com
2019-06-23 22:19:59 +12:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Peter Geoghegan 3f58cc6dd8 Remove extra nbtree half-dead internal page check.
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page.  Logic to prevent this from happening was
added by commit 8da3183780, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e9).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.

However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion.  Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.

The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt.  Commit
efada2b8e9 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).

It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX.  However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).

Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
2019-05-16 15:11:58 -07:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to looks like. That was
   easy enough before the introducing of tableam (we knew it had to be
   heap, although some trickery around logging the heap relfilenodes
   was required). But to properly handle table AMs we need
   per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old. But
   computing the horizon by visiting the heap only works once
   consistency is reached, but we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Peter Geoghegan 2ab23445bc Remove unneeded argument from _bt_getstackbuf().
_bt_getstackbuf() is called at exactly two points following commit
efada2b8e9 (one call site is concerned with page splits, while the
other is concerned with page deletion).  The parent buffer returned by
_bt_getstackbuf() is write-locked in both cases.  Remove the 'access'
argument and make _bt_getstackbuf() assume that callers require a
write-lock.
2019-02-25 17:47:43 -08:00
Peter Geoghegan 067786cea0 Correct obsolete nbtree page deletion comment.
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed _bt_getstackbuf() calls from _bt_pagedel().  It failed
to update a comment that referenced the earlier approach.  Update the
comment to explain that the _bt_getstackbuf() page deletion call site
mirrors the only other remaining _bt_getstackbuf() call site, which is
reached during page splits.
2019-02-25 16:54:18 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Tom Lane 0905fe8911 Avoid emitting a bogus WAL record when recycling an all-zero btree page.
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle.  However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.

Per reports from Antonin Houska and 自己.  Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.

This is an old bug, so back-patch to all supported branches.

Discussion: https://postgr.es/m/2628.1474272158@localhost
Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com
2018-07-09 19:26:19 -04:00
Andres Freund 3a01f68e35 Check for interrupts inside the nbtree page deletion code.
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible.  This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.

Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.

Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.

Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
2018-07-04 14:58:25 -07:00
Teodor Sigaev 08186dc05b Move _bt_upgrademetapage() into critical section.
Any changes on page should be done in critical section, so move
_bt_upgrademetapage into critical section. Improve comment. Found by Amit
Kapila during post-commit review of 857f9c36.

Author: Amit Kapila
2018-05-30 19:45:39 +03:00
Teodor Sigaev 8e12f4a250 Various improvements of skipping index scan during vacuum technics
- Change vacuum_cleanup_index_scale_factor GUC to PGC_USERSET.
  vacuum_cleanup_index_scale_factor GUC was defined as PGC_SIGHUP.  But this
  GUC affects not only autovacuum.  So it might be useful to change it from user
  session in order to influence manually runned VACUUM.
- Add missing tab-complete support for vacuum_cleanup_index_scale_factor
  reloption.
- Fix condition for B-tree index cleanup.
  Zero value of vacuum_cleanup_index_scale_factor means that user wants B-tree
  index cleanup to be never skipped.
- Documentation and comment improvements

Authors: Justin Pryzby, Alexander Korotkov, Liudmila Mantrova
Reviewed by: all authors and Robert Haas
Discussion: https://www.postgresql.org/message-id/flat/20180502023025.GD7631%40telsasoft.com
2018-05-10 13:31:47 +03:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Teodor Sigaev 6db4b49986 Fix wrong validation of top-parent pointer during page deletion in Btree.
After introducing usage of t_tid of inner or page high key for storing
number of attributes of tuple, validation of tuple's ItemPointer with
ItemPointerIsValid becomes incorrect, it's need to validate only blocknumber of
ItemPointer. Missing this causes a incorrect page deletion, fix that. Test is
added.

BTW, current contrib/amcheck doesn't fail on index corrupted by this way.

Also introduce BTreeTupleGetTopParent/BTreeTupleSetTopParent macroses to improve
code readability and to avoid possible confusion with page high key: high key
is used to store top-parent link for branch to remove.

Bug found by Michael Paquier, but bug doesn't exist in previous versions because
t_tid was set to P_HIKEY.

Author: Teodor Sigaev
Reviewer: Peter Geoghegan
Discussion: https://www.postgresql.org/message-id/flat/20180419052436.GA16000%40paquier.xyz
2018-04-23 15:55:10 +03:00
Teodor Sigaev 075aade436 Adjust INCLUDE index truncation comments and code.
Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected.  Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature.  A
similar check will be performed for the page highkey within
_bt_moveright() in most cases.  Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.

Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer).  This isn't strictly necessary,
but seems more future-proof than the old approach.  It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation.  Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.

In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX.  Also pfree during a page split, just to be
consistent.

Refactor _bt_check_natts() to be more readable.

Author: Peter Geoghegan with some editorization by me
Reviewed by: Alexander Korotkov, Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
2018-04-19 08:45:58 +03:00
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Teodor Sigaev 0a64b45152 Fix handling of non-upgraded B-tree metapages
857f9c36 bumps B-tree metapage version while upgrade is performed "on the fly"
when needed. However, some asserts fired when old version metapage was
cached to rel->rd_amcache. Despite new metadata fields are never used from
rel->rd_amcache, that needs to be fixed. This patch introduces metadata
upgrade during its caching, which fills unavailable fields with their default
values. contrib/pageinspect is also patched to handle non-upgraded metapages
in the same way.

Author: Alexander Korotkov
2018-04-05 17:56:00 +03:00
Teodor Sigaev 17d8beb4f5 Remove unused vars and mark assert-only vars
Kyotaro HORIGUCHI
2018-04-05 13:16:15 +03:00
Teodor Sigaev 51e6562324 Fix typo
Masahiko Sawada
2018-04-05 13:04:18 +03:00
Teodor Sigaev 857f9c36cd Skip full index scan during cleanup of B-tree indexes when possible
Vacuum of index consists from two stages: multiple (zero of more) ambulkdelete
calls and one amvacuumcleanup call. When workload on particular table
is append-only, then autovacuum isn't intended to touch this table. However,
user may run vacuum manually in order to fill visibility map and get benefits
of index-only scans. Then ambulkdelete wouldn't be called for indexes
of such table (because no heap tuples were deleted), only amvacuumcleanup would
be called In this case, amvacuumcleanup would perform full index scan for
two objectives: put recyclable pages into free space map and update index
statistics.

This patch allows btvacuumclanup to skip full index scan when two conditions
are satisfied: no pages are going to be put into free space map and index
statistics isn't stalled. In order to check first condition, we store
oldest btpo_xact in the meta-page. When it's precedes RecentGlobalXmin, then
there are some recyclable pages. In order to check second condition we store
number of heap tuples observed during previous full index scan by cleanup.
If fraction of newly inserted tuples is less than
vacuum_cleanup_index_scale_factor, then statistics isn't considered to be
stalled. vacuum_cleanup_index_scale_factor can be defined as both reloption and GUC (default).

This patch bumps B-tree meta-page version. Upgrade of meta-page is performed
"on the fly": during VACUUM meta-page is rewritten with new version. No special
handling in pg_upgrade is required.

Author: Masahiko Sawada, Alexander Korotkov
Review by: Peter Geoghegan, Kyotaro Horiguchi, Alexander Korotkov, Yura Sokolov
Discussion: https://www.postgresql.org/message-id/flat/CAD21AoAX+d2oD_nrd9O2YkpzHaFr=uQeGr9s1rKC3O4ENc568g@mail.gmail.com
2018-04-04 19:29:00 +03:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Tom Lane 4c11d2c559 Flag index metapages as standard-format in xlog.c calls.
btree, hash, and bloom indexes all set up their metapages in standard
format (that is, with pd_lower and pd_upper correctly delimiting the
unused area); but they mostly didn't inform the xlog routines of this.
When calling log_newpage[_buffer], this is bad because it loses the
opportunity to compress unused data out of the WAL record.  When
calling XLogRegisterBuffer, it's not such a performance problem because
all of these call sites also use REGBUF_WILL_INIT, preventing an FPI
image from being written.  But it's still a good idea to provide the
flag when relevant, because that aids WAL consistency checking.

This completes the project of getting all the in-core index AMs to
handle their metapage WAL operations similarly.

Amit Kapila, reviewed by Michael Paquier

Discussion: https://postgr.es/m/0d273805-0e9e-ec1a-cb84-d4da400b8f85@lab.ntt.co.jp
2017-11-03 16:31:32 -04:00
Tom Lane eb5c404b17 Minor code-cleanliness improvements for btree.
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool.  Use them
in a few places that were randomly referencing the flag bits directly.

In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ.  (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)

Masahiko Sawada, reviewed by Doug Doole

Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
2017-09-18 16:36:28 -04:00
Tom Lane 382ceffdf7 Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.

By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis.  However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent.  That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.

This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.

This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.

Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 15:35:54 -04:00
Robert Haas 8da9a22636 Split index xlog headers from other private index headers.
The xlog-specific headers need to be included in both frontend code -
specifically, pg_waldump - and the backend, but the remainder of the
private headers for each index are only needed by the backend.  By
splitting the xlog stuff out into separate headers, pg_waldump pulls
in fewer backend headers, which is a good thing.

Patch by me, reviewed by Michael Paquier and Andres Freund, per a
complaint from Dilip Kumar.

Discussion: http://postgr.es/m/CA+TgmoZ=F=GkxV0YEv-A8tb+AEGy_Qa7GSiJ8deBKFATnzfEug@mail.gmail.com
2017-02-14 15:37:59 -05:00
Bruce Momjian 1d25779284 Update copyright via script for 2017 2017-01-03 13:48:53 -05:00
Tom Lane e89526d4f3 In B-tree page deletion, clean up properly after page deletion failure.
In _bt_unlink_halfdead_page(), we might fail to find an immediate left
sibling of the target page, perhaps because of corruption of the page
sibling links.  The code intends to cope with this by just abandoning
the deletion attempt; but what actually happens is that it fails outright
due to releasing the same buffer lock twice.  (And error recovery masks
a second problem, which is possible leakage of a pin on another page.)
Seems to have been introduced by careless refactoring in commit efada2b8e.
Since there are multiple cases to consider, let's make releasing the buffer
lock in the failure case the responsibility of _bt_unlink_halfdead_page()
not its caller.

Also, avoid fetching the leaf page's left-link again after we've dropped
lock on the page.  This is probably harmless, but it's not exactly good
coding practice.

Per report from Kyotaro Horiguchi.  Back-patch to 9.4 where the faulty code
was introduced.

Discussion: <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
2016-08-06 14:28:37 -04:00