Commit Graph

208 Commits

Author SHA1 Message Date
Amit Kapila 7cc2f59dd5 Remove duplicate words in docs and code comments.
Additionally, add a missing "the" in a couple of places.

Author: Vignesh C, Dagfinn Ilmari Mannsåker
Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
2023-10-09 09:18:47 +05:30
Alexander Korotkov 82a7132f53 Fix another typo in e0b1ee17dc
Reported-by: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4_kHMJDak75y1kBTirv-drS1-knT-7Mpg5LprAjqRJDVA%40mail.gmail.com
2023-10-07 20:36:47 +03:00
Alexander Korotkov e8c334c47a Fix typos in e0b1ee17dc
Reported-by: Alexander Lakhin
2023-10-07 11:55:55 +03:00
Alexander Korotkov e0b1ee17dc Skip checking of scan keys required for directional scan in B-tree
Currently, B-tree code matches every scan key to every item on the page.
Imagine the ordered B-tree scan for the query like this.

SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col;

The (col > 'a') scan key will be always matched once we find the location to
start the scan.  The (col < 'b') scan key will match every item on the page
as long as it matches the last item on the page.

This patch implements prechecking of the scan keys required for directional
scan on beginning of page scan.  If precheck is successful we can skip this
scan keys check for the items on the page.  That could lead to significant
acceleration especially if the comparison operator is expensive.

Idea from patch by Konstantin Knizhnik.

Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru
Reviewed-by: Peter Geoghegan, Pavel Borisov
2023-10-06 10:40:51 +03:00
Thomas Munro 9f0602539d Remove some more "snapshot too old" vestiges.
Commit f691f5b8 removed the logic, but left behind some now-useless
Snapshot arguments to various AM-internal functions, and missed a couple
of comments.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wznj9qSNXZ1P1uWTUD_FeaTezbUazb416EPwi4Qr_jR_6A%40mail.gmail.com
2023-09-08 17:12:12 +12:00
Thomas Munro f691f5b80a Remove the "snapshot too old" feature.
Remove the old_snapshot_threshold setting and mechanism for producing
the error "snapshot too old", originally added by commit 848ef42b.
Unfortunately it had a number of known problems in terms of correctness
and performance, mostly reported by Andres in the course of his work on
snapshot scalability.  We agreed to remove it, after a long period
without an active plan to fix it.

This is certainly a desirable feature, and someone might propose a new
or improved implementation in the future.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com
Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de
Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
2023-09-05 19:53:43 +12:00
Thomas Munro f9b7fc651a Fix race in SSI interaction with empty btrees.
When predicate-locking btrees, we have a special case for completely
empty btrees, since there is no page to lock.  This was racy, because,
without buffer lock held, a matching key could be inserted between the
_bt_search() and the PredicateLockRelation() calls.

Fix, by rechecking _bt_search() after taking the relation-level SIREAD
lock, if using SERIALIZABLE isolation and an empty btree is discovered.

Back-patch to all supported releases.  Fixes one aspect of bug #17949.

Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
2023-07-04 09:07:31 +12:00
Peter Geoghegan d088ba5a5a nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function is two: code that read locks or write
locks existing pages remains in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function called
_bt_allocbuf.  This simplifies most _bt_getbuf callers, since it is no
longer necessary for them to pass a heaprel argument.  Many of the
changes to nbtree from commit 61b313e4 can be reverted.  This minimizes
the divergence between HEAD/PostgreSQL 16 and earlier release branches.

_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to
_bt_getbuf.  There are only 3 affected call sites, all of which continue
to pass a heaprel for recovery conflict purposes.  Note that nbtree's
use of P_NEW was superficial; nbtree never actually relied on the P_NEW
code paths in bufmgr.c, so this change is strictly mechanical.

GiST already took the same approach; it has a dedicated function for
allocating new pages called gistNewBuffer().  That factor allowed commit
61b313e4 to make much more targeted changes to GiST.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
2023-06-10 14:08:25 -07:00
Andres Freund 61b313e47e Pass down table relation into more index relation functions
This is done in preparation for logical decoding on standby, which needs to
include whether visibility affecting WAL records are about a (user) catalog
table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
2023-04-01 20:18:29 -07:00
Bruce Momjian c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Michael Paquier d16773cdc8 Add macros in hash and btree AMs to get the special area of their pages
This makes the code more consistent with SpGiST, GiST and GIN, that
already use this style, and the idea is to make easier the introduction
of more sanity checks for each of these AM-specific macros.  BRIN uses a
different set of macros to get a page's type and flags, so it has no
need for something similar.

Author: Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
2022-04-01 13:24:50 +09:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Peter Geoghegan bcf60585e6 Standardize cleanup lock terminology.
The term "super-exclusive lock" is a synonym for "buffer cleanup lock"
that first appeared in nbtree many years ago.  Standardize things by
consistently using the term cleanup lock.  This finishes work started by
commit 276db875.

There is no good reason to have two terms.  But there is a good reason
to only have one: to avoid confusion around why VACUUM acquires a full
cleanup lock (not just an ordinary exclusive lock) in index AMs, during
ambulkdelete calls.  This has nothing to do with protecting the physical
index data structure itself.  It is needed to implement a locking
protocol that ensures that TIDs pointing to the heap/table structure
cannot get marked for recycling by VACUUM before it is safe (which is
somewhat similar to how VACUUM uses cleanup locks during its first heap
pass).  Note that it isn't strictly necessary for index AMs to implement
this locking protocol -- several index AMs use an MVCC snapshot as their
sole interlock to prevent unsafe TID recycling.

In passing, update the nbtree README.  Cleanly separate discussion of
the aforementioned index vacuuming locking protocol from discussion of
the "drop leaf page pin" optimization added by commit 2ed5b87f.  We now
structure discussion of the latter by describing how individual index
scans may safely opt out of applying the standard locking protocol (and
so can avoid blocking progress by VACUUM).  Also document why the
optimization is not safe to apply during nbtree index-only scans.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzngHgQa92tz6NQihf4nxJwRzCV36yMJO_i8dS+2mgEVKw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkHPgsBBvGWjz=8PjNhDefy7XRkDKiT5NxMs-n5ZCf2dA@mail.gmail.com
2021-12-08 17:24:45 -08:00
Peter Geoghegan a5213adf3d Further harden nbtree posting split code.
Add more defensive checks around posting list split code.  These should
detect corruption involving duplicate table TIDs earlier and more
reliably than any existing check.

Follow up to commit 8f72bbac.

Discussion: https://postgr.es/m/CAH2-WzkrSY_kjyd1_M5xJK1uM0govJXMxPn8JUSvwcUOiHuWVw@mail.gmail.com
Backpatch: 13-, where nbtree deduplication was introduced.
2021-10-27 12:10:47 -07:00
Peter Geoghegan e5d8a99903 Use full 64-bit XIDs in deleted nbtree pages.
Otherwise we risk "leaking" deleted pages by making them non-recyclable
indefinitely.  Commit 6655a729 did the same thing for deleted pages in
GiST indexes.  That work was used as a starting point here.

Stop storing an XID indicating the oldest bpto.xact across all deleted
though unrecycled pages in nbtree metapages.  There is no longer any
reason to care about that condition/the oldest XID.  It only ever made
sense when wraparound was something _bt_vacuum_needs_cleanup() had to
consider.

The btm_oldest_btpo_xact metapage field has been repurposed and renamed.
It is now btm_last_cleanup_num_delpages, which is used to remember how
many non-recycled deleted pages remain from the last VACUUM (in practice
its value is usually the precise number of pages that were _newly
deleted_ during the specific VACUUM operation that last set the field).

The general idea behind storing btm_last_cleanup_num_delpages is to use
it to give _some_ consideration to non-recycled deleted pages inside
_bt_vacuum_needs_cleanup() -- though never too much.  We only really
need to avoid leaving a truly excessive number of deleted pages in an
unrecycled state forever.  We only do this to cover certain narrow cases
where no other factor makes VACUUM do a full scan, and yet the index
continues to grow (and so actually misses out on recycling existing
deleted pages).

These metapage changes result in a clear user-visible benefit: We no
longer trigger full index scans during VACUUM operations solely due to
the presence of only 1 or 2 known deleted (though unrecycled) blocks
from a very large index.  All that matters now is keeping the costs and
benefits in balance over time.

Fix an issue that has been around since commit 857f9c36, which added the
"skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup()
logic).  The accuracy of btm_last_cleanup_num_heap_tuples accidentally
hinged upon _when_ the source value gets stored.  We now always store
btm_last_cleanup_num_heap_tuples in btvacuumcleanup().  This fixes the
issue because IndexVacuumInfo.num_heap_tuples (the source field) is
expected to accurately indicate the state of the table _after_ the
VACUUM completes inside btvacuumcleanup().

A backpatchable fix cannot easily be extracted from this commit.  A
targeted fix for the issue will follow in a later commit, though that
won't happen today.

I (pgeoghegan) have chosen to remove any mention of deleted pages in the
documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
the presence of deleted (though unrecycled) pages is no longer of much
concern to users.  The vacuum_cleanup_index_scale_factor description in
the docs now seems rather unclear in any case, and it should probably be
rewritten in the near future.  Perhaps some passing mention of page
deletion will be added back at the same time.

Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
2021-02-24 18:41:34 -08:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Amit Kapila b7f2dd959a Update parallel BTree scan state when the scan keys can't be satisfied.
For parallel btree scan to work for array of scan keys, it should reach
BTPARALLEL_DONE state once for every distinct combination of array keys.
This is required to ensure that the parallel workers don't try to seize
blocks at the same time for different scan keys. We missed to update this
state when we discovered that the scan keys can't be satisfied.

Author: James Hunter
Reviewed-by: Amit Kapila
Tested-by: Justin Pryzby
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
2020-09-17 16:11:48 +05:30
Tom Lane a5cc4dab6d Yet more elimination of dead stores and useless initializations.
I'm not sure what tool Ranier was using, but the ones I contributed
were found by using a newer version of scan-build than I tried before.

Ranier Vilela and Tom Lane

Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com
2020-09-05 13:17:32 -04:00
Peter Geoghegan 4a70f829d8 Add nbtree Valgrind buffer lock checks.
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page
provides very weak guarantees, especially compared to heapam, where it's
often safe to read a page while only holding a buffer pin.  This commit
has Valgrind enforce the following rule: it is never okay to access an
nbtree buffer without holding both a pin and a lock on the buffer.

A draft version of this patch detected questionable code that was
cleaned up by commits fa7ff642 and 7154aa16.  The code in question used
to access an nbtree buffer page's special/opaque area with no buffer
lock (only a buffer pin).  This practice (which isn't obviously unsafe)
is hereby formally disallowed in nbtree.  There doesn't seem to be any
reason to allow it, and banning it keeps things simple for Valgrind.

The new checks are implemented by adding custom nbtree client requests
(located in LockBuffer() wrapper functions); these requests are
"superimposed" on top of the generic bufmgr.c Valgrind client requests
added by commit 1e0dfd16.  No custom resource management cleanup code is
needed to undo the effects of marking buffers as non-accessible under
this scheme.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-07-21 15:50:58 -07:00
Peter Geoghegan e25d462a38 nbtree: Rename _bt_search() variables.
Make some of the variable names in _bt_search() consistent with
corresponding variables within _bt_getstackbuf().  This naming scheme is
clearer because the variable names always express a relationship between
the currently locked buffer/page and some other page.
2020-07-02 14:54:55 -07:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Peter Geoghegan 113758155c nbtree: Fix obsolete _bt_search() comment.
Oversight in commit d2086b08b0.
2020-03-16 15:51:06 -07:00
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
Peter Geoghegan fe9b92854e Remove obsolete _bt_compare() comment.
btbuild() has nothing to say about how NULL values compare in nbtree.
Besides, there are _bt_compare() header comments that describe how NULL
values are handled.
2020-02-18 16:07:16 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan 55d015bde0 Add _bt_binsrch() scantid assertion to nbtree.
Assert that _bt_binsrch() binary searches with scantid set in insertion
scankey cannot be performed on leaf pages.  Leaf-level binary searches
where scantid is set must use _bt_binsrch_insert() instead.

_bt_binsrch_insert() is likely to have additional responsibilities in
the future, such as searching within GIN-style posting lists using
scantid.  It seems like a good idea to tighten things up now.
2019-09-09 11:41:19 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Michael Paquier f43608bda2 Fix typos and inconsistencies in code comments
Author: Alexander Lakhin
Discussion: https://postgr.es/m/dec6aae8-2d63-639f-4d50-20e229fb83e3@gmail.com
2019-06-14 09:34:34 +09:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
Peter Geoghegan 9c7fb7e6d8 Tweak some nbtree-related code comments. 2019-03-29 12:29:05 -07:00
Peter Geoghegan 29b64d1de7 Add nbtree high key "continuescan" optimization.
Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so.  The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan orientated"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all.  The new high key check
is complementary.

The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes.  The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page.  A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.

Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans.  Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
2019-03-23 11:01:53 -07:00
Peter Geoghegan dd299df818 Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 10:04:01 -07:00
Peter Geoghegan e5adcb789d Refactor nbtree insertion scankeys.
Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  This is based on a suggestion by Andrey
Lepikhov.

Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner.  This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.

The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc().  This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column.  Unique key binary
searches need to restore lower and upper bounds.  They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
2019-03-20 09:30:57 -07:00
Peter Geoghegan 1009920aaa Tweak nbtsearch.c function prototype order.
nbtsearch.c's static function prototypes were slightly out of order.
Make the order consistent with static function definition order.
2019-03-19 09:59:05 -07:00
Andres Freund c2fe139c20 tableam: Add and use scan APIs.
Too allow table accesses to be not directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with a table_ version.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for a other AMs. Instead there's
   table_scan_getnextslot().  But note that heap_getnext() lives on,
   it's still used widely to access catalog tables.

   This is achieved by new scan_begin, scan_end, scan_rescan,
   scan_getnextslot callbacks.

2) The portion of parallel scans that's shared between backends need
   to be able to do so without the user doing per-AM work. To achieve
   that new parallelscan_{estimate, initialize, reinitialize}
   callbacks are introduced, which operate on a new
   ParallelTableScanDesc, which again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block oriented callbacks that can be shared between such AMs are
   provided and used by heap. table_block_parallelscan_{estimate,
   intiialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, init} for use in AMs. These
   operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{end, begin,
   reset} to create the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple.  Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching the
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls and working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore that seems cleaner.

To be able to sensible adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AMs
   type. table_slot_callbacks() and table_slot_create() are based
   upon that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   also would have been needed to be adapted for
   table_slot_callbacks(), making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a few
   places now don't have a buffer + HeapTuple around, but a
   slot (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will so for the
   foreseeable future, the index API (see 1) above) now only deals with
   slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 12:46:41 -07:00
Andres Freund e7cc78ad43 Remove superfluous tqual.h includes.
Most of these had been obsoleted by 568d4138c / the SnapshotNow
removal.

This is is preparation for moving most of tqual.[ch] into either
snapmgr.h or heapam.h, which in turn is in preparation for pluggable
table AMs.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-01-21 12:15:02 -08:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Magnus Hagander fbec7459aa Fix spelling errors and typos in comments
Author: Daniel Gustafsson <daniel@yesql.se>
2018-11-02 13:56:52 +01:00
Tom Lane c87cb5f7a6 Allow btree comparison functions to return INT_MIN.
Historically we forbade datatype-specific comparison functions from
returning INT_MIN, so that it would be safe to invert the sort order
just by negating the comparison result.  However, this was never
really safe for comparison functions that directly return the result
of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction
on those library functions.  Buildfarm results show that at least on
recent Linux on s390x, memcmp() actually does return INT_MIN sometimes,
causing sort failures.

The agreed-on answer is to remove this restriction and fix relevant
call sites to not make such an assumption; code such as "res = -res"
should be replaced by "INVERT_COMPARE_RESULT(res)".  The same is needed
in a few places that just directly negated the result of memcmp or
strcmp.

To help find places having this problem, I've also added a compile option
to nbtcompare.c that causes some of the commonly used comparators to
return INT_MIN/INT_MAX instead of their usual -1/+1.  It'd likely be
a good idea to have at least one buildfarm member running with
"-DSTRESS_SORT_INT_MIN".  That's far from a complete test of course,
but it should help to prevent fresh introductions of such bugs.

This is a longstanding portability hazard, so back-patch to all supported
branches.

Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
2018-10-05 16:01:29 -04:00
Alexander Korotkov d2086b08b0 Reduce path length for locking leaf B-tree pages during insertion
In our B-tree implementation appropriate leaf page for new tuple
insertion is acquired using _bt_search() function.  This function always
returns leaf page locked in shared mode.  In order to obtain exclusive
lock, caller have to relock the page.

This commit makes _bt_search() function lock leaf page immediately in
exclusive mode when needed.  That removes unnecessary relock and, in
turn reduces lock contention for B-tree leaf pages.  Our experiments
on multi-core systems showed acceleration up to 4.5 times in corner
case.

Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
2018-07-28 00:31:40 +03:00
Amit Kapila 8ce29bb4f0 Fix the buffer release order for parallel index scans.
During parallel index scans, if the current page to be read is deleted, we
skip it and try to get the next page for a scan without releasing the buffer
lock on the current page.  To get the next page, sometimes it needs to wait
for another process to complete its scan and advance it to the next page.
Now, it is quite possible that the master backend has errored out before
advancing the scan and issued a termination signal for all workers.  The
workers failed to notice the termination request during wait because the
interrupts are held due to buffer lock on the previous page.  This lead to
all workers being stuck.

The fix is to release the buffer lock on current page before trying to get
the next page.  We are already doing same in backward scans, but missed
it for forward scans.

Reported-by: Victor Yegorov
Bug: 15290
Diagnosed-by: Thomas Munro and Amit Kapila
Author: Amit Kapila
Reviewed-by: Thomas Munro
Tested-By: Thomas Munro and Victor Yegorov
Backpatch-through: 10 where parallel index scans were introduced
Discussion: https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org
2018-07-27 10:53:00 +05:30
Teodor Sigaev 075aade436 Adjust INCLUDE index truncation comments and code.
Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected.  Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature.  A
similar check will be performed for the page highkey within
_bt_moveright() in most cases.  Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.

Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer).  This isn't strictly necessary,
but seems more future-proof than the old approach.  It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation.  Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.

In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX.  Also pfree during a page split, just to be
consistent.

Refactor _bt_check_natts() to be more readable.

Author: Peter Geoghegan with some editorization by me
Reviewed by: Alexander Korotkov, Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
2018-04-19 08:45:58 +03:00
Teodor Sigaev 8224de4f42 Indexes with INCLUDE columns and their support in B-tree
This patch introduces INCLUDE clause to index definition.  This clause
specifies a list of columns which will be included as a non-key part in
the index.  The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans.  Also, such columns don't need to have
appropriate operator classes.  Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.

Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine.  For now, only B-tree indexes support INCLUDE clause.

In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys).  Therefore, B-tree indexes
now might have variable number of attributes.  This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid.  Free 13th bit of t_info is used for indicating
that.  This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.

Bump catalog version

Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
			 David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
2018-04-07 23:00:39 +03:00
Alvaro Herrera 272c2ab9fd Change some bogus PageGetLSN calls to BufferGetLSNAtomic
As src/backend/access/transam/README says, PageGetLSN may only be called
by processes holding either exclusive lock on buffer, or a shared lock
on buffer plus buffer header lock.  Therefore any place that only holds
a shared buffer lock must use BufferGetLSNAtomic instead of PageGetLSN,
which internally obtains buffer header lock prior to reading the LSN.

A few callsites failed to comply with this rule.  This was detected by
running all tests under a new (not committed) assertion that verifies
PageGetLSN locking contract.  All but one of the callsites that failed
the assertion are fixed by this patch.  Remaining callsites were
inspected manually and determined not to need any change.

The exception (unfixed callsite) is in TestForOldSnapshot, which only
has a Page argument, making it impossible to access the corresponding
Buffer from it.  Fixing that seems a much larger patch that will have to
be done separately; and that's just as well, since it was only
introduced in 9.6 and other bugs are much older.

Some of these bugs are ancient; backpatch all the way back to 9.3.

Authors: Jacob Champion, Asim Praveen, Ashwin Agrawal
Reviewed-by: Michaël Paquier
Discussion: https://postgr.es/m/CABAq_6GXgQDVu3u12mK9O5Xt5abBZWQ0V40LZCE+oUf95XyNFg@mail.gmail.com
2018-01-09 17:06:31 -03:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Robert Haas 884a60840c Fix parallel index scan hang with deleted or half-dead pages.
The previous coding forgot to release the scan before seizing
it again, leading to a lockup.

Report by Patrick Hemmer.  Diagnosis by Thomas Munro.  Patch by
Amit Kapila.

Discussion: http://postgr.es/m/CAEepm=2xZUcOGP9V0O_G0=2P2wwXwPrkF=upWTCJSisUxMnuSg@mail.gmail.com
2017-12-13 16:15:44 -05:00