290 Commits

Author SHA1 Message Date
Peter Geoghegan 80634e3b18 Rearrange _bt_insertonpg() "update metapage" code.
Nest the "update metapage as part of insert into root-like page" branch
inside the broader "insert into internal page" branch.  This improves
readability.
2020-04-14 09:33:18 -07:00
Peter Geoghegan f762b2feba Add defensive "split_only_page" nbtree assertion.
Clearly it's not okay for nbtree to split a page that is the only page
on its level, and then find that it has to split the parent one level up
in turn.  There is simply no code to handle the split_only_page case in
the _bt_insertonpg() "newitem won't fit" branch (only the "newitem fits"
branch handles split_only_page).  Add a defensive assertion that will
fail if a split_only_page call to _bt_insertonpg() somehow ends up
splitting the target/parent page.

I (pgeoghegan) believe that we don't need split_only_page handling for
the "newitem won't fit" branch because anybody calling _bt_insertonpg()
like this would have to hold a lock on the same one and only child page.
2020-04-13 21:11:03 -07:00
Peter Geoghegan 826ee1a019 Make _bt_insertonpg() more like _bt_split().
It seems like a good idea for nbtree's retail insert code to be
absolutely consistent with nbtree's page split code for anything that
naturally requires equivalent handling.  Anything that concerns
inserting newitem (which is handled as part of the page split atomic
action when a page split is required) should work in exactly the same
way.  With that in mind, make _bt_insertonpg() handle 'cbuf' in a way
that matches _bt_split().
2020-04-13 19:26:41 -07:00
Peter Geoghegan bc3087b626 Harmonize nbtree page split point code.
An nbtree split point can be thought of as a point between two adjoining
tuples from an imaginary version of the page being split that includes
the incoming/new item (in addition to the items that really are on the
page).  These adjoining tuples are called the lastleft and firstright
tuples.

The variables that represent split points contained a field called
firstright, which is an offset number of the first data item from the
original page that goes on the new right page.  The corresponding tuple
from origpage was usually the same thing as the actual firstright tuple,
but not always: the firstright tuple is sometimes the new/incoming item
instead.  This situation seems unnecessarily confusing.

Make things clearer by renaming the origpage offset returned by
_bt_findsplitloc() to "firstrightoff".  We now have a firstright tuple
and a firstrightoff offset number which are comparable to the
newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers
respectively.  Also make sure that we are consistent about how we
describe nbtree page split point state.

Push the responsibility for dealing with pg_upgrade'd !heapkeyspace
indexes down to lower level code, relieving _bt_split() from dealing
with it directly.  This means that we always have a palloc'd left page
high key on the leaf level, no matter what.  This enables simplifying
some of the code (and code comments) within _bt_split().

Finally, restructure the page split code to make it clearer why suffix
truncation (which only takes place during leaf page splits) is
completely different to the first data item truncation that takes place
during internal page splits.  Tuples are marked as having fewer
attributes stored in both cases, and the firstright tuple is truncated
in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-13 16:39:55 -07:00
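The split point terminology from commit bc3087b626 can be illustrated with a small model. This is only a sketch under stated assumptions: the page is a plain Python list, offsets are 0-based list indexes rather than real OffsetNumbers, and the helper name `split_point_tuples` and the `newitemonleft` flag are illustrative rather than taken from the commit message. It shows why the firstright tuple is not always the tuple at firstrightoff on the original page.

```python
def split_point_tuples(origpage, newitem, newitemoff, firstrightoff, newitemonleft):
    """Return (lastleft, firstright) for an imaginary page that includes newitem.

    origpage       -- items really on the page, in key order (hypothetical model)
    newitemoff     -- index in origpage where newitem would be inserted
    firstrightoff  -- index of the first origpage item that goes to the right page
    newitemonleft  -- only matters when newitemoff == firstrightoff
    """
    at_split_point = (newitemoff == firstrightoff)

    # The firstright tuple is the new item when it lands exactly at the
    # split point and is assigned to the right half.
    if at_split_point and not newitemonleft:
        firstright = newitem
    else:
        firstright = origpage[firstrightoff]

    # The lastleft tuple is the new item when it lands exactly at the
    # split point and is assigned to the left half.
    if at_split_point and newitemonleft:
        lastleft = newitem
    else:
        lastleft = origpage[firstrightoff - 1]

    return lastleft, firstright
```

With `origpage = [10, 20, 40, 50]`, `newitem = 30` and `newitemoff = firstrightoff = 2`, the firstright tuple is the new item when it goes right, and the lastleft tuple is the new item when it goes left.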
Peter Geoghegan 60cbd7751c Remove nbtree BTreeTupleSetAltHeapTID() function.
Since heap TID is supposed to be just another key attribute to the
implementation, it doesn't make much sense to have separate
BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions.  Merge the
two functions together.  This slightly simplifies _bt_truncate().
2020-04-07 15:56:52 -07:00
Peter Geoghegan 9945ad6e90 Justify nbtree page split locking in code comment.
Delaying unlocking the right child page until after the point that the
left child's parent page has been refound is no longer truly necessary.
Commit 40dae7ec made nbtree tolerant of interrupted page splits.  VACUUM
was taught to avoid deleting a page that happens to be the right half of
an incomplete split.  As long as page splits don't unlock the left child
page until the end of the second/final phase, it should be safe to
unlock the right child page earlier (at the end of the first phase).

It probably isn't actually useful to release the right child's lock
earlier like this (it probably won't improve performance).  Even still,
pointing out that it ought to be safe to do so should make it easier to
understand the overall design.
2020-03-27 16:44:52 -07:00
Peter Geoghegan b27e1b3418 nbtree: Remove obsolete _bt_pgaddtup() comments.
Remove comments that are a throwback to a time when nbtree cared about
write-ordering dependencies.  The comments are similar to those removed
by commit 9ee7414e, among others.
2020-03-19 14:56:56 -07:00
Peter Geoghegan 6312c08a29 nbtree: Use raw PageAddItem() for retail inserts.
Only internal page splits need to call _bt_pgaddtup() instead of
PageAddItem(), and only for data items, one of which will end up at the
first offset (or first offset after the high key offset) on the new
right page.  This data item alone will need to be truncated in
_bt_pgaddtup().

Since there is no reason why retail inserts ever need to truncate the
incoming item, use a raw PageAddItem() call there instead.  Even
_bt_split() uses raw PageAddItem() calls for left page and right page
high keys.  Clearly the _bt_pgaddtup() shim function wasn't really
encapsulating anything.  _bt_pgaddtup() should now be thought of as a
_bt_split() helper function.

Note that the assertions from commit d1e241c2 verify that retail inserts
never insert an item at an internal page's negative infinity offset.
This invariant could only ever be violated as a result of a basic logic
error in nbtinsert.c.
2020-03-18 18:17:37 -07:00
Peter Geoghegan b029395f5e Refactor nbtree fastpath optimization.
Commit 2b272734, which added the fastpath rightmost leaf page cache
insert optimization, added code to _bt_doinsert() to handle using and
invalidating the backend local block cache.  It doesn't seem like a good
place to handle these low level details, though.  _bt_doinsert() is
supposed to be a high level function -- it is the main entry point to
nbtinsert.c.

Restructure the code by placing handling of the rightmost block cache at
the start of a new _bt_search() shim function, _bt_search_insert().  The
new function is called from _bt_doinsert(), which uses it as a
_bt_search() variant that conveniently accepts its BTInsertState state
as an argument.  _bt_doinsert() no longer needs to directly consider the
fastpath optimization.

Discussion: https://postgr.es/m/CAH2-Wzk59cxKJRd=rfbyub6-V4yWRjsOYRkUNHBLT1P1GdtCQQ@mail.gmail.com
2020-03-18 14:42:49 -07:00
Peter Geoghegan b897b3aae6 nbtree: Remove useless local variables.
Copying block and offset numbers to local variables in _bt_insertonpg()
made the code less readable.  Remove the variables.  There is already
code that conditionally calls BufferGetBlockNumber() in the same block,
so consistently do it that way instead.

Calling BufferGetBlockNumber() is very cheap, but we might as well avoid
it when it isn't truly necessary.  It isn't truly necessary for
_bt_insertonpg() to call BufferGetBlockNumber() in almost all cases.

Spotted while working on a patch that refactors the fastpath rightmost
leaf page cache optimization, which was added by commit 2b272734.
2020-03-17 18:39:26 -07:00
Peter Geoghegan 013c1f6af6 nbtree: Pass down MAXALIGN()'d itemsz for new item.
Refactor nbtinsert.c so that the final itemsz of each new non-pivot
tuple (the MAXALIGN()'d size) is determined once.  Most of the functions
used by leaf page inserts used the insertstate.itemsz value already.
This commit makes everything use insertstate.itemsz as standard
practice.  The goal is to decouple tuple size from "effective" tuple
size.  Making this distinction isn't truly necessary right now, but that
might change in the future.

Also explain why we consistently apply MAXALIGN() to get an effective
index tuple size.  This was rather unclear, in part because it isn't
actually strictly necessary right now.
2020-03-16 12:00:10 -07:00
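The "effective" tuple size described in commit 013c1f6af6 is the raw size rounded up to the platform's maximum alignment. The following is a minimal model of that rounding, not the PostgreSQL macro itself: it assumes a maximum alignment of 8 bytes (typical on 64-bit platforms), and the names `MAXIMUM_ALIGNOF_ASSUMED` and `maxalign` are chosen for illustration.

```python
# Assumption: 8-byte maximum alignment, as on most 64-bit platforms.
MAXIMUM_ALIGNOF_ASSUMED = 8


def maxalign(size: int) -> int:
    """Round size up to the next multiple of the assumed maximum alignment."""
    mask = MAXIMUM_ALIGNOF_ASSUMED - 1
    return (size + mask) & ~mask


# Computing the aligned size once and passing it down mirrors the idea of
# every function consulting a single itemsz value.
raw_size = 17
itemsz = maxalign(raw_size)
print(raw_size, "->", itemsz)
```

Computing the value once keeps every free space check working from the same number, which is the consistency the commit message is after.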
Peter Geoghegan f207bb0b8f nbtree: Reorder nbtinsert.c prototypes.
Relocate _bt_newroot() prototype, so that the order that prototypes
appear in matches the order that the functions are defined in.
2020-03-15 20:53:12 -07:00
Peter Geoghegan 39eabec904 nbtree: Move fastpath NULL descent stack assertion.
Commit 074251db added an assertion that verified the fastpath/rightmost
page insert optimization's assumption about free space: There should
always be enough free space on the page to insert the new item without
splitting the page.  Otherwise, we end up using the "concurrent root
page split" phony/fake stack path in _bt_insert_parent().  This does not
lead to incorrect behavior, but it is likely to be far slower than
simply using the regular _bt_search() path.  The assertion catches
serious performance bugs that would probably take a long time to detect
any other way.

It seems much more natural to make this assertion just before the point
that we generate a fake/phony descent stack.  Move the assert there.
This also makes _bt_insertonpg() a bit more readable.
2020-03-10 17:25:47 -07:00
Peter Geoghegan d1e241c226 nbtree: Demote minus infinity "can't happen" error.
Only a very basic logic bug in a _bt_insertonpg() caller could lead to a
violation of this invariant.  Besides, any newitemoff used for an
internal page is sanitized using other "can't happen" errors in
_bt_getstackbuf() or its callers, before _bt_insertonpg() even gets
called.

Also, move the error/assertion from the insert-without-split path of
_bt_insertonpg() to the top of the same function.  There is no reason
why this invariant only applies to insertions that happen to not result
in a page split; cover every insertion.  The assertion naturally belongs
next to the existing generic assertions that document relatively
high-level invariants for the item being inserted.
2020-03-10 14:15:41 -07:00
Peter Geoghegan 1e07f5e0a1 Remove overzealous _bt_split() assertions.
_bt_split() is passed NULL as its insertion scankey for internal page
splits.  Two recently added Assert() statements failed to consider this,
leading to a crash with pg_upgrade'd BTREE_VERSION < 4 indexes.  Remove
the assertions.

The assertions in question were added by commit 0d861bbb, which added
nbtree deduplication.  It would be possible to fix the assertions
directly instead, but they weren't adding much anyway.
2020-03-02 21:40:11 -08:00
Peter Geoghegan 2c0797da2c Silence another compiler warning in nbtinsert.c.
Per complaint from Álvaro Herrera.
2020-02-26 15:15:45 -08:00
Peter Geoghegan 2d8a6fad18 Silence compiler warning in nbtinsert.c.
Per buildfarm member longfin.
2020-02-26 13:17:36 -08:00
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
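The claim in commit 0d861bbb70 that deduplication changes the physical representation of a leaf page but not its logical contents can be shown with a toy model. This is a sketch only: a leaf page is modeled as a sorted list of (key, TID) pairs, a posting list tuple as (key, [TIDs]), and the function names are hypothetical. Real posting lists are also subject to size limits that this model ignores.

```python
from itertools import groupby


def deduplicate(items):
    """Merge adjacent duplicates into posting list tuples.

    items -- list of (key, tid) pairs sorted by (key, tid)
    Returns a list of (key, [tid, ...]) posting list tuples.
    """
    return [
        (key, [tid for _, tid in group])
        for key, group in groupby(items, key=lambda item: item[0])
    ]


def logical_contents(posting_items):
    """Expand posting list tuples back into individual (key, tid) pairs."""
    return [(key, tid) for key, tids in posting_items for tid in tids]


leaf = [(1, (0, 1)), (1, (0, 2)), (1, (0, 7)), (2, (0, 3))]
merged = deduplicate(leaf)
print(len(leaf), "items ->", len(merged), "items")
```

Expanding the merged page gives back exactly the original (key, TID) pairs, so index scans see the same logical entries before and after a deduplication pass.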
Thomas Munro 6f38d4dac3 Remove dependency on HeapTuple from predicate locking functions.
The following changes make the predicate locking functions more
generic and suitable for use by future access methods:

- PredicateLockTuple() is renamed to PredicateLockTID().  It takes
  ItemPointer and inserting transaction ID instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer.

- CheckForSerializableConflictOut() no longer takes HeapTuple or buffer.

Author: Ashwin Agrawal
Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2020-01-28 13:13:04 +13:00
Peter Geoghegan fc31001123 Remove redundant incomplete split assertion.
The fastpath insert optimization's incomplete split flag Assert() is
redundant.  We'll reach the more general Assert() within
_bt_findinsertloc() in all cases. (Besides, Assert()'ing that the
rightmost page doesn't have the flag set never made much sense.)
2020-01-05 17:42:13 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan 2110f71696 nbtree: Tweak _bt_pgaddtup() comments.
Make it clear that _bt_pgaddtup() truncates the first data item on an
internal page because its key is supposed to be treated as minus
infinity within _bt_compare().
2019-11-18 13:04:53 -08:00
Peter Geoghegan 867d25ccb4 Explain subtlety in nbtree locking protocol.
The Postgres approach to coupling locks during an ascent of the tree is
slightly different to the approach taken by Lehman and Yao.  Add a new
paragraph to the "Differences to the Lehman & Yao algorithm" section of
the nbtree README that explains the similarities and differences.
2019-08-23 20:24:49 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Amit Kapila 9679345f3c Fix typos.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-26 18:28:18 +05:30
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Peter Geoghegan 489e431ba5 Remove obsolete nbtree insertion comment.
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes.  This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.

I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards.  The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
2019-05-15 16:53:11 -07:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Peter Geoghegan 9b42e71376 Don't leave behind junk nbtree pages during split.
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR.  At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section.  This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions.  The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog().  There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.

Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes.  This happened at a point
that broke the rule around zeroing the right page in _bt_split().  If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents.  This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.

To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired.  A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.

Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
2019-05-13 10:27:59 -07:00
Peter Geoghegan 7b37f4b02e Correct more obsolete nbtree page split comments.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself.  Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust.  Fix the comments that were missed the first time
around now.

In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert a new downlink.
2019-05-03 13:34:45 -07:00
Peter Geoghegan 9ee7414ed0 Remove obsolete _bt_insert_parent() comment.
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty().  It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
2019-04-29 14:14:38 -07:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
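The fix in commit 9b10926263 amounts to a decision made before descending the tree. The sketch below models only that decision, under the assumption that key values are given as a Python sequence with `None` standing in for SQL NULL; the function name and the returned field names are illustrative, not PostgreSQL identifiers.

```python
def plan_insertion(index_is_unique, key_values):
    """Decide how an insertion should descend the tree (illustrative model).

    A unique index insertion normally omits the heap TID (scantid) during
    the descent so that unique checking can visit every leaf page that
    might hold a duplicate.  Tuples containing NULLs can never conflict,
    so they are treated like non-unique insertions: scantid stays set and
    unique checking is bypassed.
    """
    anynull = any(value is None for value in key_values)
    check_unique = index_is_unique and not anynull
    return {
        "check_unique": check_unique,
        "scantid_set_for_descent": not check_unique,
    }
```

Keeping scantid set for NULL-containing tuples lets the descent land directly on the one correct leaf page, instead of stepping right across many pages of NULL "duplicates".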
Peter Geoghegan 74eb2176bf Invalidate binary search bounds consistently.
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78.  This resulted in
problems after waiting for the conflicting xact to commit or abort.  The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search.  Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.

Ashutosh Sharma, with a few additional tweaks by me.

Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
2019-04-04 09:38:08 -07:00
Andres Freund 71bdc99d0d tableam: Add helper for indexes to check if corresponding table tuples exist.
This is, likely exclusively, useful to verify that conflicts detected
in a unique index are with live tuples, rather than dead ones.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-25 16:52:55 -07:00
Peter Geoghegan fab2502433 Consider secondary factors during nbtree splits.
Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
2019-03-20 10:12:19 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
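Suffix truncation at attribute granularity, as described in commit dd299df818, can be modeled as keeping the shortest prefix of attributes that still separates the two halves of a leaf page split. This is a hedged sketch rather than the actual implementation: tuples are modeled as (key attributes, heap TID) pairs, the function name is hypothetical, and using the lastleft tuple's heap TID when every key attribute is equal is an assumption about how the tiebreaker is filled in.

```python
def truncate_pivot(lastleft, firstright):
    """Build a new pivot (high key) separating lastleft from firstright.

    Each tuple is (key_attrs, heap_tid).  Returns (kept_attrs, tid_or_None).
    """
    left_keys, left_tid = lastleft
    right_keys, _ = firstright

    # Keep attributes up to and including the first one that differs;
    # everything after it, including the heap TID, is truncated away.
    for i, (left, right) in enumerate(zip(left_keys, right_keys)):
        if left != right:
            return right_keys[: i + 1], None

    # All key attributes are equal, so the heap TID tiebreaker is needed
    # to tell the two halves apart (assumption: taken from lastleft).
    return right_keys, left_tid
```

In a two-column index, a split between `("a", 1)` and `("b", 1)` needs only the first attribute in the pivot, which is how truncation increases fan-out in multi-column indexes.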
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
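The cached binary search bounds from commit e5adcb789d can be sketched as two searches over the same leaf page. The model below is an illustration under assumptions: the page is a sorted list of (key, TID) pairs, the first search ignores the TID and records a lower bound and a strict upper bound, and the second search is confined to those bounds while comparing on the TID as well. The function names are hypothetical, and the real code records its bounds during a single binary search rather than calling `bisect` twice.

```python
from bisect import bisect_left, bisect_right


def search_key_bounds(page, key):
    """First search (TID omitted): bounds of the run of items equal to key."""
    keys = [k for k, _ in page]
    low = bisect_left(keys, key)          # first item >= key
    stricthigh = bisect_right(keys, key)  # first item > key
    return low, stricthigh


def insert_offset(page, key, tid, low, stricthigh):
    """Second search: restore the bounds and use heap TID as a tiebreaker."""
    return bisect_left(page, (key, tid), low, stricthigh)


page = [(1, (0, 1)), (2, (0, 1)), (2, (0, 4)), (3, (0, 2))]
low, stricthigh = search_key_bounds(page, 2)
offset = insert_offset(page, 2, (0, 3), low, stricthigh)
```

The lower bound alone is not a valid insert position here: the new tuple `(2, (0, 3))` belongs between the two existing duplicates, which is why the restored search has to compare on the TID.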
Peter Geoghegan 3f34283973 Correct obsolete nbtree page split comment.
Commit 40dae7ec53, which made the nbtree page split algorithm more
robust, made _bt_insert_parent() only unlock the right child of the
parent page before inserting a new downlink into the parent.  Update a
comment from the Berkeley days claiming that both left and right child
pages are unlocked before the new downlink actually gets inserted.

The claim that it is okay to release both locks early based on Lehman
and Yao's say-so never made much sense.  Lehman and Yao must sometimes
"couple" buffer locks across a pair of internal pages when relocating a
downlink, unlike the corresponding code within _bt_getstack().
2019-03-12 16:40:05 -07:00
Peter Geoghegan 72c7c4e386 Correct obsolete nbtree page split WAL comment.
Commit 2c03216d83, which revamped the WAL record format, failed to
update a comment referencing the old API.  Update the comment.
2019-03-04 12:32:40 -08:00
Peter Geoghegan 2ab23445bc Remove unneeded argument from _bt_getstackbuf().
_bt_getstackbuf() is called at exactly two points following commit
efada2b8e9 (one call site is concerned with page splits, while the
other is concerned with page deletion).  The parent buffer returned by
_bt_getstackbuf() is write-locked in both cases.  Remove the 'access'
argument and make _bt_getstackbuf() assume that callers require a
write-lock.
2019-02-25 17:47:43 -08:00
Andres Freund e7cc78ad43 Remove superfluous tqual.h includes.
Most of these had been obsoleted by 568d4138c / the SnapshotNow
removal.

This is in preparation for moving most of tqual.[ch] into either
snapmgr.h or heapam.h, which in turn is in preparation for pluggable
table AMs.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 12:15:02 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Peter Geoghegan 61a4480a68 Remove obsolete nbtree duplicate entries comment.
Remove a comment from the Berkeley days claiming that nbtree must
disambiguate duplicate keys within _bt_moveright().  There is no special
care taken around duplicates within _bt_moveright(), at least since
commit 9e85183bfc removed inscrutable _bt_moveright() code to handle
pages full of duplicates.
2018-12-18 21:40:38 -08:00
Peter Geoghegan 60f3cc9553 Correct obsolete nbtree recovery comments.
Commit 40dae7ec53, which made the handling of interrupted nbtree page
splits more robust, removed an nbtree-specific end-of-recovery cleanup
step.  This meant that it was no longer possible to complete an
interrupted page split during recovery.  However, a reference to
recovery as a reason for using a NULL stack while inserting into a
parent page was missed.  Remove the reference.

Remove a similar obsolete reference to recovery that was introduced much
more recently, as part of the btree fastpath optimization enhancement
that made it into Postgres 11 (commit 2b272734, and follow-up commits).

Backpatch: 11-, where the fastpath optimization was introduced.
2018-12-18 16:59:50 -08:00
Alexander Korotkov d2086b08b0 Reduce path length for locking leaf B-tree pages during insertion
In our B-tree implementation, the appropriate leaf page for a new tuple
insertion is acquired using the _bt_search() function.  This function always
returns the leaf page locked in shared mode.  In order to obtain an exclusive
lock, the caller has to relock the page.

This commit makes the _bt_search() function lock the leaf page immediately in
exclusive mode when needed.  That removes the unnecessary relock and, in
turn, reduces lock contention for B-tree leaf pages.  Our experiments
on multi-core systems showed a speedup of up to 4.5 times in a corner
case.
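
As a toy model only, the difference between the two locking schemes can be
sketched by counting lock acquisitions during a descent.  The names below
(DemoLockMode, DemoAccess, and the demo_* functions) are hypothetical and do
not correspond to the real nbtree API; the model assumes that level 0 is the
leaf level and that the level of the next page is known before it is locked.

```c
#include <assert.h>

typedef enum { DEMO_LOCK_SHARED, DEMO_LOCK_EXCLUSIVE } DemoLockMode;
typedef enum { DEMO_READ, DEMO_WRITE } DemoAccess;

/* Choose the lock mode for a page before locking it, given its level. */
static DemoLockMode
demo_lock_mode(unsigned level, DemoAccess access)
{
    if (level == 0 && access == DEMO_WRITE)
        return DEMO_LOCK_EXCLUSIVE;
    return DEMO_LOCK_SHARED;
}

/*
 * Relock scheme: every page, including the leaf, is first locked in shared
 * mode; a writer must then release the leaf and lock it again exclusively.
 */
static unsigned
demo_locks_with_relock(unsigned root_level, DemoAccess access)
{
    unsigned    nlocks = root_level + 1;    /* one shared lock per level */

    if (access == DEMO_WRITE)
        nlocks++;                           /* extra relock of the leaf */
    return nlocks;
}

/*
 * Direct scheme: the leaf is locked in its final mode right away, so each
 * level costs exactly one lock acquisition.
 */
static unsigned
demo_locks_direct(unsigned root_level, DemoAccess access)
{
    unsigned    nlocks = 0;
    unsigned    level = root_level;

    for (;;)
    {
        DemoLockMode mode = demo_lock_mode(level, access);

        nlocks++;
        if (level == 0)
        {
            /* A writer reaches the leaf already holding an exclusive lock. */
            assert(access == DEMO_READ || mode == DEMO_LOCK_EXCLUSIVE);
            (void) mode;
            break;
        }
        level--;
    }
    return nlocks;
}
```

In this model a writer descending a three-level tree saves one acquisition on
the leaf, which is the page where concurrent inserters are most likely to
contend.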

Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
2018-07-28 00:31:40 +03:00
Peter Geoghegan e915fed291 Correct obsolete unique index insertion comment.
Commit bc292937ae failed to update a comment about unique index
checking.  _bt_insertonpg() is no longer responsible for finding an
insertion location while preventing conflicting insertions.
2018-07-08 10:50:13 -07:00
Teodor Sigaev 08186dc05b Move _bt_upgrademetapage() into critical section.
Any changes to a page should be made in a critical section, so move
_bt_upgrademetapage() into the critical section. Improve comment. Found by Amit
Kapila during post-commit review of 857f9c36.

Author: Amit Kapila
2018-05-30 19:45:39 +03:00