Commit Graph

903 Commits

Author SHA1 Message Date
Peter Geoghegan 5a2f154a2e Improve nbtree README's LP_DEAD section.
The description of how LP_DEAD bit setting by index scans works
following commit 2ed5b87f was rather unclear.  Clean that up a bit.

Also refer to LP_DEAD bit setting within _bt_check_unique() at the start
of the same section.  This mechanism may actually be more important than
the generic kill_prior_tuple mechanism that the section focuses on, so
it at least deserves to be mentioned in passing.
2020-11-07 18:51:12 -08:00
Peter Geoghegan efc5dcfd8a Fix wal_consistency_checking nbtree bug.
wal_consistency_checking indicated an inconsistency in certain cases
involving nbtree page deletion.  The underlying issue is that there was
a minor difference between the page image produced after a REDO routine
ran and the corresponding page image following original execution.

This harmless inconsistency has been around forever.  We more or less
expect total consistency among even deleted nbtree pages these days,
though, so this won't do anymore.

To fix, tweak the REDO routine to match original execution.

Oversight in commit f47b5e13.
2020-11-05 15:01:40 -08:00
Peter Geoghegan 48e1291342 Fix nbtree cleanup-only VACUUM stats inaccuracies.
Logic for counting heap TIDs from posting list tuples (added by commit
0d861bbb) was faulty.  It didn't count any TIDs/index tuples in the
event of no callback being set.  This meant that we incorrectly counted
no index tuples in clean-up only VACUUMs, which could lead to
pg_class.reltuples being spuriously set to 0 in affected indexes.

To fix, go back to counting items from the page in cases where there is
no callback.  This approach isn't very accurate, but it works well
enough in practice while avoiding the expense of accessing every index
tuple during cleanup-only VACUUMs.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
https://postgr.es/m/20201023174451.69e358f1@firost
Backpatch: 13-, where nbtree deduplication was introduced
2020-11-04 18:42:27 -08:00
Noah Misch f90e80b913 Reproduce debug_query_string==NULL on parallel workers.
Certain background workers initiate parallel queries while
debug_query_string==NULL, at which point they attempted strlen(NULL) and
died to SIGSEGV.  Older debug_query_string observers allow NULL, so do
likewise in these newer ones.  Back-patch to v11, where commit
7de4a1bcc5 introduced the first of these.

Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com
2020-10-31 08:43:28 -07:00
Amit Kapila b7f2dd959a Update parallel BTree scan state when the scan keys can't be satisfied.
For parallel btree scan to work for array of scan keys, it should reach
BTPARALLEL_DONE state once for every distinct combination of array keys.
This is required to ensure that the parallel workers don't try to seize
blocks at the same time for different scan keys. We missed to update this
state when we discovered that the scan keys can't be satisfied.

Author: James Hunter
Reviewed-by: Amit Kapila
Tested-by: Justin Pryzby
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
2020-09-17 16:11:48 +05:30
Tom Lane a5cc4dab6d Yet more elimination of dead stores and useless initializations.
I'm not sure what tool Ranier was using, but the ones I contributed
were found by using a newer version of scan-build than I tried before.

Ranier Vilela and Tom Lane

Discussion: https://postgr.es/m/CAEudQAo1+AcGppxDSg8k+zF4+Kv+eJyqzEDdbpDg58-=MQcerQ@mail.gmail.com
2020-09-05 13:17:32 -04:00
Tom Lane 3d351d916b Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
Historically, we've considered the state with relpages and reltuples
both zero as indicating that we do not know the table's tuple density.
This is problematic because it's impossible to distinguish "never yet
vacuumed" from "vacuumed and seen to be empty".  In particular, a user
cannot use VACUUM or ANALYZE to override the planner's normal heuristic
that an empty table should not be believed to be empty because it is
probably about to get populated.  That heuristic is a good safety
measure, so I don't care to abandon it, but there should be a way to
override it if the table is indeed intended to stay empty.

Hence, represent the initial state of ignorance by setting reltuples
to -1 (relpages is still set to zero), and apply the minimum-ten-pages
heuristic only when reltuples is still -1.  If the table is empty,
VACUUM or ANALYZE (but not CREATE INDEX) will override that to
reltuples = relpages = 0, and then we'll plan on that basis.

This requires a bunch of fiddly little changes, but we can get rid of
some ugly kluges that were formerly needed to maintain the old definition.

One notable point is that FDWs' GetForeignRelSize methods will see
baserel->tuples = -1 when no ANALYZE has been done on the foreign table.
That seems like a net improvement, since those methods were formerly
also in the dark about what baserel->tuples = 0 really meant.  Still,
it is an API change.

I bumped catversion because code predating this change would get confused
by seeing reltuples = -1.

Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
2020-08-30 12:21:51 -04:00
Andres Freund 1f51c17c68 snapshot scalability: Move PGXACT->xmin back to PGPROC.
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.

After the changes in dc7420c2c9, this is a very straight-forward change.

For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';.  The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-13 16:25:21 -07:00
Andres Freund b8443eae72 Fix out-of-date version reference, grammar.
Time appears to be passing fast.

Reported-By: Peter Geoghegan <pg@bowt.ie>
2020-08-12 17:04:51 -07:00
Andres Freund dc7420c2c9 snapshot scalability: Don't compute global horizons while building snapshots.
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.

Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.

The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.

The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions.  These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
   are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
   definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.

When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession.  As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.

To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.

This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.

Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.

Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
2020-08-12 16:03:49 -07:00
Peter Geoghegan d129c07499 Correct nbtree page split lock coupling comment.
There is no reason to distinguish between readers and writers here.
2020-08-09 12:01:15 -07:00
Peter Geoghegan 0a7d771f0f Make nbtree split REDO locking match original execution.
Make the nbtree page split REDO routine consistent with original
execution in its approach to acquiring and releasing buffer locks (at
least for pages on the tree level of the page being split).  This brings
btree_xlog_split() in line with btree_xlog_unlink_page(), which was
taught to couple buffer locks by commit 9a9db08a.

Note that the precise order in which we both acquire and release sibling
buffer locks in btree_xlog_split() now matches original execution
exactly (the precise order in which the locks are released probably
doesn't matter much, but we might as well be consistent about it).

The rule for nbtree REDO routines from here on is that same-level locks
should be acquired in an order that's consistent with original
execution.  It's not practical to have a similar rule for cross-level
page locks, since for the most part original execution holds those locks
for a period that spans multiple atomic actions/WAL records.  It's also
not necessary, because clearly the cross-level lock coupling is only
truly needed during original execution because of the presence of
concurrent inserters.

This is not a bug fix (unlike the similar aforementioned commit, commit
9a9db08a).  The immediate reason to tighten things up in this area is to
enable an upcoming enhancement to contrib/amcheck that allows it to
verify that sibling links are in agreement with only an AccessShareLock
(this check produced false positives when run on a replica server on
account of the inconsistency fixed by this commit).  But that's not the
only reason to be stricter here.

It is generally useful to make locking on replicas be as close to what
happens during original execution as practically possible.  It makes it
less likely that hard to catch bugs will slip in in the future.  The
previous state of affairs seems to be a holdover from before the
introduction of Hot Standby, when buffer lock acquisitions during
recovery were totally unnecessary.  See also: commit 3bbf668d, which
tightened things up in this area a few years after the introduction of
Hot Standby.

Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
2020-08-07 15:27:56 -07:00
Peter Geoghegan 3df92bbd1d Rename nbtree split REDO routine variables.
Make the nbtree page split REDO routine variable names consistent with
_bt_split() (which handles the original execution of page splits).
These names make the code easier to follow by making the distinction
between the original page and the left half of the split clear.  (The
left half of the split page is a temp page that REDO creates to replace
the origpage contents.)

Also reduce the elevel used when adding a new high key to the temp page
from PANIC to ERROR to be consistent.  We already only raise an ERROR
when data item PageAddItem() temp page calls fail.
2020-08-07 09:53:27 -07:00
Alexander Korotkov f47b5e1395 Remove btree page items after page unlink
Currently, page unlink leaves remaining items "as is", but replay of
corresponding WAL-record re-initializes page leaving it with no items.
For the sake of consistency, this commit makes primary delete all the items
during page unlink as well.

Thanks to this change, we now don't mask contents of deleted btree page for
WAL consistency checking.

Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
2020-08-05 02:16:13 +03:00
Peter Geoghegan 9a9db08ae4 Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason
about concurrent page splits/deletions in its own peculiar way) to
become confused when running on a replica.  Concurrent replay of a WAL
record that describes the second phase of page deletion could cause
_bt_walk_left() to get confused.  btree_xlog_unlink_page() simply failed
to adhere to the same locking protocol that we use on the primary, which
is obviously wrong once you consider these two disparate functions
together.  This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left()
from observing inconsistencies between the deletion's target page and
its original sibling pages when running on a replica.  This is true even
though the second phase of page deletion is supposed to work as a single
atomic action.  Queries running on replicas raised "could not find left
sibling of block %u in index %s" can't-happen errors when they went back
to their scan's "original" page and observed that the page has not been
marked deleted (even though it really was concurrently deleted).

There is no evidence that this actually happened in the real world.  The
issue came to light during unrelated feature development work.  Note
that _bt_walk_left() is the only code that cares about the difference
between a half-dead page and a fully deleted page that isn't also
exclusively used by nbtree VACUUM (unless you include contrib/amcheck
code).  It seems very likely that backward scans are the only thing that
could become confused by the inconsistency.  Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
and right sibling pages in that order before releasing any locks (just
like _bt_unlink_halfdead_page()).  This is the simplest possible
approach.  There doesn't seem to be any opportunity to be more clever
about lock acquisition in the REDO routine, and it hardly seems worth
the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling
links with only an AccessShareLock on the relation.  An amcheck patch
from Andrey Borodin was rejected back in January because it clashed with
btree_xlog_unlink_page()'s lax approach to locking pages.  It now seems
likely that the real problem was with btree_xlog_unlink_page(), not the
patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
2020-08-03 15:54:38 -07:00
Peter Geoghegan a451b7d442 Add nbtree page deletion assertion.
Add a documenting assertion that's similar to the nearby assertion added
by commit cd8c73a3.  This conveys that the entire call to _bt_pagedel()
does no work if it isn't possible to get a descent stack for the initial
scanblkno page.
2020-08-03 13:04:42 -07:00
Tom Lane 9f9682783b Invent "amadjustmembers" AM method for validating opclass members.
This allows AM-specific knowledge to be applied during creation of
pg_amop and pg_amproc entries.  Specifically, the AM knows better than
core code which entries to consider as required or optional.  Giving
the latter entries the appropriate sort of dependency allows them to
be dropped without taking out the whole opclass or opfamily; which
is something we'd like to have to correct obsolescent entries in
extensions.

This callback also opens the door to performing AM-specific validity
checks during opclass creation, rather than hoping than an opclass
developer will remember to test with "amvalidate".  For the most part
I've not actually added any such checks yet; that can happen in a
follow-on patch.  (Note that we shouldn't remove any tests from
"amvalidate", as those are still needed to cross-check manually
constructed entries in the initdb data.  So adding tests to
"amadjustmembers" will be somewhat duplicative, but it seems like
a good idea anyway.)

Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and
Anastasia Lubennikova.

Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
2020-08-01 17:12:47 -04:00
Peter Geoghegan 4a70f829d8 Add nbtree Valgrind buffer lock checks.
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page
provides very weak guarantees, especially compared to heapam, where it's
often safe to read a page while only holding a buffer pin.  This commit
has Valgrind enforce the following rule: it is never okay to access an
nbtree buffer without holding both a pin and a lock on the buffer.

A draft version of this patch detected questionable code that was
cleaned up by commits fa7ff642 and 7154aa16.  The code in question used
to access an nbtree buffer page's special/opaque area with no buffer
lock (only a buffer pin).  This practice (which isn't obviously unsafe)
is hereby formally disallowed in nbtree.  There doesn't seem to be any
reason to allow it, and banning it keeps things simple for Valgrind.

The new checks are implemented by adding custom nbtree client requests
(located in LockBuffer() wrapper functions); these requests are
"superimposed" on top of the generic bufmgr.c Valgrind client requests
added by commit 1e0dfd16.  No custom resource management cleanup code is
needed to undo the effects of marking buffers as non-accessible under
this scheme.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-07-21 15:50:58 -07:00
Peter Geoghegan 5da8bf8bbb Avoid CREATE INDEX unique index deduplication.
There is no advantage to attempting deduplication for a unique index
during CREATE INDEX, since there cannot possibly be any duplicates.
Doing so wastes cycles due to unnecessary copying.  Make sure that we
avoid it consistently.

We already avoided unique index deduplication in the case where there
were some spool2 tuples to merge.  That didn't account for the fact that
spool2 is removed early/unset in the common case where it has no tuples
that need to be merged (i.e. it failed to account for the "spool2 turns
out to be unnecessary" optimization in _bt_spools_heapscan()).

Oversight in commit 0d861bbb, which added nbtree deduplication

Backpatch: 13-, where nbtree deduplication was introduced.
2020-07-17 09:50:48 -07:00
Andres Freund 5e7bbb5286 code: replace 'master' with 'primary' where appropriate.
Also changed "in the primary" to "on the primary", and added a few
"the" before "primary".

Author: Andres Freund
Reviewed-By: David Steele
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue@alap3.anarazel.de
2020-07-08 12:57:23 -07:00
Peter Geoghegan 28c16f4947 Remove unnecessary PageIsEmpty() nbtree build check.
nbtree index builds cannot write out an empty page.  That would mean
that there was no way to create a pivot tuple pointing to the page one
level up, since _bt_truncate() generates one based on page's firstright
tuple.

Replace the unnecessary PageIsEmpty() check with an assertion that
checks that the page has space for at least two line pointers (the
would-be high key line pointer, plus at least one valid "data item"
tuple line pointer).

The PageIsEmpty() check was added by commit 5d9f146c over 20 years ago.
It looks like it has always been unnecessary.
2020-07-06 13:47:29 -07:00
Peter Geoghegan e25d462a38 nbtree: Rename _bt_search() variables.
Make some of the variable names in _bt_search() consistent with
corresponding variables within _bt_getstackbuf().  This naming scheme is
clearer because the variable names always express a relationship between
the currently locked buffer/page and some other page.
2020-07-02 14:54:55 -07:00
Peter Geoghegan f7a476f0d6 nbtree: Correct inaccurate split location comment.
Minor oversight in commit fab2502433.
2020-06-29 12:30:39 -07:00
Peter Geoghegan 10f1ab2cb8 Fix misuse of table_index_fetch_tuple_check().
Commit 0d861bbb, which added deduplication to nbtree, had
_bt_check_unique() pass a TID to table_index_fetch_tuple_check() that
isn't safe to mutate.  table_index_fetch_tuple_check()'s tid argument is
modified when the TID in question is not the latest visible tuple in a
hot chain, though this wasn't documented.

To fix, go back to using a local copy of the TID in _bt_check_unique(),
and update comments above table_index_fetch_tuple_check().

Backpatch: 13-, where B-Tree deduplication was introduced.
2020-06-25 10:55:28 -07:00
Peter Geoghegan be14f884d5 Fix deduplication "single value" strategy bug.
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits.  This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.

To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple.  This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.

Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
2020-06-19 08:57:24 -07:00
Peter Geoghegan d64f1cdf2f Silence _bt_check_unique compiler warning.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/841649.1592065060@sss.pgh.pa.us
2020-06-13 09:33:33 -07:00
Peter Geoghegan 67b0b2dbf9 Reconsider nbtree page deletion assertion.
Commit 624686abcf added an assertion that verified that _bt_search
successfully relocated the leaf page undergoing deletion.  Page deletion
cannot deal with the case where the descent stack is to the right of the
page, so this seemed critical (deletion can only handle the case where
the descent stack is to the left of the leaf/target page).  However, the
assertion went a bit too far.

Since only a buffer pin is held on the leaf page throughout the call to
_bt_search, nothing guarantees that it can't have split during this
small window.  And if does actually split, _bt_search may end up
"relocating" a page to the right of the original target leaf page.  This
scenario seems extremely unlikely, but it must still be considered.
Remove the assertion, and document how we cope in this scenario.
2020-05-19 15:04:34 -07:00
Tom Lane e02ad575d8 Final pgindent run with pg_bsd_indent version 2.1.
This is just to provide a clean basis for comparison of the results
of the new version.  I did fix a typo that crept into 242dfcbaf.

Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
2020-05-16 11:49:14 -04:00
Alvaro Herrera 242dfcbafa
Avoid killing btree items that are already dead
_bt_killitems marks btree items dead when a scan leaves the page where
they live, but it does so with only share lock (to improve concurrency).
This was historicall okay, since killing a dead item has no
consequences.  However, with the advent of data checksums and
wal_log_hints, this action incurs a WAL full-page-image record of the
page.  Multiple concurrent processes would write the same page several
times, leading to WAL bloat.  The probability of this happening can be
reduced by only killing items if they're not already dead, so change the
code to do that.

The problem could eliminated completely by having _bt_killitems upgrade
to exclusive lock upon seeing a killable item, but that would reduce
concurrency so it's considered a cure worse than the disease.

Backpatch all the way back to 9.5, since wal_log_hints was introduced in
9.4.

Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
2020-05-15 16:50:34 -04:00
Tom Lane 5cbfce562f Initial pgindent and pgperltidy run for v13.
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.

Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
2020-05-14 13:06:50 -04:00
Peter Geoghegan 624686abcf Adjust "root of to-be-deleted subtree" function.
Restructure the function that locates the root of the to-be-deleted
subtree during nbtree page deletion.  Handle the conditions that make
page deletion unsafe in a slightly more uniform way, and acknowledge the
fact that the behavior with incomplete splits on internal pages is
different (as pointed out in the nbtree README as of commit 35bc0ec7).
Also invent new terminology that avoids ambiguity around which pages are
about to be deleted.  Consistently use the term "to-be-deleted subtree",
not the ambiguous term "branch".

We were calling the subtree parent page the "top parent page", but that
was quite misleading.  The top parent page usually refers to a page
unlinked from its siblings and marked deleted (during the second stage
of page deletion).  There was one kind of top parent page that we merely
removed a downlink from, and another kind of top parent page that we
actually marked deleted.  Eliminate the ambiguity by inventing a new
term ("subtree parent page") that refers to the former kind of page
only.
2020-05-11 11:01:07 -07:00
Peter Geoghegan cd8c73a38a Refactor nbtree deletion INCOMPLETE_SPLIT check.
Factor out code common to _bt_lock_branch_parent() and _bt_pagedel()
into a new utility function.  This new function is used to check that
the left sibling of a deletion target page does not have the
INCOMPLETE_SPLIT page flag set.  If it is set then deletion is unsafe;
there won't be a usable pivot tuple (with a downlink) in the parent page
that points to the deletion target page.  The page deletion algorithm is
not prepared to deal with that.  Also restructure an existing, related
utility function that checks if the right sibling of the target page has
the ISHALFDEAD page flag set.

This organization highlights the symmetry between the two cases.  The
goal is to make the design of page deletion clearer.  Both functions
involve a sibling page with a flag that indicates that there was an
interrupted operation (a page split or a page deletion) that resulted in
a page pointed to by sibling pages, but not pointed to in the parent.
And, both functions indicate if page deletion is unsafe due to the
absence of a particular downlink in the parent page.
2020-05-07 16:08:54 -07:00
Peter Geoghegan 9dc7251417 Refactor btvacuumpage().
Remove one of the arguments to btvacuumpage(), and give up on the idea
that it's a recursive function.  We now use the term "backtracking" to
refer to the case where an earlier block must be visited to make sure no
tuples that need to be removed were missed.

Advertising btvacuumpage() as a recursive function was unhelpful.  In
reality the function always simulates recursion with a loop (it doesn't
actually call itself).  This wasn't just necessary as a precaution (per
the comments mentioning tail recursion), though.  There is no reliable
natural limit on the number of times we can backtrack.

There are important behavioral difference when "recursing"/backtracking,
mostly related to page deletion.  We don't perform page deletion when
backtracking due to the extra complexity.  And when we recurse, we're
not performing a physical order scan anymore, so we expect fairly
different conditions to hold for the page.  Structuring the code like
this makes it clearer how _bt_pagedel() cooperates with btvacuumpage()
and btvacuumscan() (as established in commit b0229f26 and commit
73a076b0).

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com
2020-05-02 14:04:33 -07:00
Peter Geoghegan 69cf853fe7 Clear up issue with FSM and oldest bpto.xact.
On further reflection, code comments added by commit b0229f26 slightly
misrepresented how we determine the oldest bpto.xact for the index.
btvacuumpage() does not treat the bpto.xact of a page that it put in the
FSM as a candidate to be the oldest deleted page (the delete-marked page
that has the oldest bpto.xact XID among all pages encountered).

The definition of a deleted page for the purposes of the bpto.xact
calculation is different from the definition used by the bulk delete
statistics.  The bulk delete statistics don't distinguish between pages
that were deleted by the current VACUUM, pages deleted by a previous
VACUUM operation but not yet recyclable/reusable, and pages that are
reusable (though reusable pages are counted separately).

Backpatch: 11-, just like commit b0229f26.
2020-05-01 12:19:44 -07:00
Peter Geoghegan 4e21f8b633 Reorder function prototypes for consistency. 2020-05-01 10:03:38 -07:00
Peter Geoghegan 73a076b03f Fix undercounting in VACUUM VERBOSE output.
The logic for determining how many nbtree pages in an index are deleted
pages sometimes undercounted pages.  Pages that were deleted by the
current VACUUM operation (as opposed to some previous VACUUM operation
whose deleted pages have yet to be reused) were sometimes overlooked.
The final count is exposed to users through VACUUM VERBOSE's "%u index
pages have been deleted" output.

btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
than one page by assuming that only one page was deleted, and that the
additional deleted pages would get picked up during a future call to
btvacuumpage() by the same VACUUM operation.  _bt_pagedel() can
legitimately delete pages that the btvacuumscan() scan will not visit
again, though, so that assumption was slightly faulty.

Fix the accounting by teaching _bt_pagedel() about its caller's
requirements.  It now only reports on pages that it knows btvacuumscan()
won't visit again (including the current btvacuumpage() page), so
everything works out in the end.

This bug has been around forever.  Only backpatch to v11, though, to
keep _bt_pagedel() is sync on the branches that have today's bugfix
commit b0229f26da.  Note that this commit changes the signature of
_bt_pagedel(), just like commit b0229f26da.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-
2020-05-01 09:51:09 -07:00
Peter Geoghegan b0229f26da Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c36cd (which taught nbtree VACUUM to skip a scan of the
index from btcleanup in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole.  It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).

The most immediate problem was that the special area of the page was
examined without a buffer pin at one point.  More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors.  For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page.  The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.

It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for.  The issue was spotted during
unrelated code review.

To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel().  btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()).  Note that
_bt_pagedel() should never encounter an existing deleted page.

This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel().  However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM).  It seems highly unlikely that such an
extension actually exists, in any case.

Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
2020-05-01 08:39:52 -07:00
Peter Geoghegan dd1f645cc8 Fix AddressSanitizer use-after-scope complaint.
XLogRegisterBufData() does not copy data pointed to by caller's pointer
argument.

Oversight in commit 0d861bbb70.

Author: Peter Eisentraut
Reported-By: Peter Eisentraut
Discussion: https://postgr.es/m/21800dbe-a13e-22f7-d423-b81db9d249f5@2ndquadrant.com
2020-04-30 12:31:56 -07:00
Peter Geoghegan ab2343d4cb Remove redundant _bt_killitems() buffer check.
_bt_getbuf() cannot return an invalid buffer.

Oversight in commit 2ed5b87f96.
2020-04-29 18:17:49 -07:00
Peter Geoghegan 7154aa16a6 Fix another minor page deletion buffer lock issue.
Avoid accessing the leaf page's top parent tuple without a buffer lock
held during the second phase of nbtree page deletion.  The old approach
was safe, though only because VACUUM never drops its buffer pin (and
because only VACUUM itself can modify a half-dead page).  Even still, it
seems like a good idea to be strict here.  Tighten things up by copying
the top parent page's block number to a local variable before releasing
the buffer lock on the leaf page -- not after.

This is a follow-up to commit fa7ff642, which fixed a similar issue in
the first phase of nbtree page deletion.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 16:45:20 -07:00
Peter Geoghegan fa7ff642c2 Fix minor nbtree page deletion buffer lock issue.
Avoid accessing the deletion target page's special area during nbtree
page deletion at a point where there is no buffer lock held.  This issue
was detected by a patch that extends Valgrind's memcheck tool to mark
nbtree pages that are unsafe to access (due to not having a buffer lock
or buffer pin) as NOACCESS.

We do hold a buffer pin at this point, and only access the special area,
so the old approach was safe.  Even still, it seems like a good idea to
tighten up the rules in this area.  There is no reason to not simply
insist on always holding a buffer lock (not just a pin) when accessing
nbtree pages.

Update some related comments in passing.

Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
2020-04-25 14:17:02 -07:00
Peter Geoghegan 1542e16f2c Consider outliers in split interval calculation.
Commit 0d861bbb, which introduced deduplication to nbtree, added some
logic to take large posting list tuples into account when choosing a
split point.  We subtract firstright posting list overhead from the
projected new high key size when calculating leftfree/rightfree values
for an affected candidate split point.  Posting list tuples aren't
special to nbtsplitloc.c, but taking them into account like this makes a
huge difference in practice.  Posting list tuples are frequently tuple
size outliers.

However, commit 0d861bbb missed a closely related issue: split interval
itself is calculated based on the assumption that tuples on the page
being split are roughly equisized.  That assumption was acceptable back
when commit fab25024 taught the logic for choosing a split point about
suffix truncation, but it's pretty questionable now that very large
tuple sizes are common.  This oversight led to unbalanced page splits in
low cardinality multi-column indexes when deduplication was used: page
splits that don't give sufficient weight to how unbalanced the split is
when the interval happens to include some large posting list tuples (and
when most other tuples on the page are not so large).

Nail this down by calculating an initial split interval in a way that's
attuned to the actual cost that we want to keep under control (not a
fuzzy proxy for the cost): apply a leftfree + rightfree evenness test to
each candidate split point that actually gets included in the split
interval (for the default strategy).  This replaces logic that used a
percentage of all legal split points for the page as the basis of the
initial split interval.

Discussion: https://postgr.es/m/CAH2-WznJt5aT2uUB2Bs+JBLdwe0XTX67+xeLFcaNvCKxO=QBVQ@mail.gmail.com
2020-04-21 09:59:24 -07:00
Peter Geoghegan f0ca378d4c Slightly simplify nbtree split point choice loop.
Spotted during post-commit review of the nbtree deduplication commit
(commit 0d861bbb).
2020-04-15 15:47:26 -07:00
Peter Geoghegan 4a05a64095 Remove obsolete "hole in center of page" comment.
A comment from the Berkeley days incorrectly claimed that the page
management code cares about the contents of the hole in the center of
the page (at least in the case of the left half of an nbtree page
split).  Commit 8fa30f906b added an addendum that stated that the
original comment was "probably obsolete".  It's definitely obsolete,
though, so remove the original comment plus the addendum.
2020-04-14 14:38:28 -07:00
Peter Geoghegan 80634e3b18 Rearrange _bt_insertonpg() "update metapage" code.
Nest the "update metapage as part of insert into root-like page" branch
inside the broader "insert into internal page" branch.  This improves
readability.
2020-04-14 09:33:18 -07:00
Peter Geoghegan f762b2feba Add defensive "split_only_page" nbtree assertion.
Clearly it's not okay for nbtree to split a page that is the only page
on its level, and then find that it has to split the parent one level up
in turn.  There is simply no code to handle the split_only_page case in
the _bt_insertonpg() "newitem won't fit" branch (only the "newitem fits"
branch handles split_only_page).  Add a defensive assertion that will
fail if a split_only_page call to _bt_insertonpg() somehow ends up
splitting the target/parent page.

I (pgeoghegan) believe that we don't need split_only_page handling for
the "newitem won't fit" branch because anybody calling _bt_insertonpg()
like this would have to hold a lock on the same one and only child page.
2020-04-13 21:11:03 -07:00
Peter Geoghegan 826ee1a019 Make _bt_insertonpg() more like _bt_split().
It seems like a good idea for nbtree's retail insert code to be
absolutely consistent with nbtree's page split code for anything that
naturally requires equivalent handling.  Anything that concerns
inserting newitem (which is handled as part of the page split atomic
action when a page split is required) should work in exactly the same
way.  With that in mind, make _bt_insertonpg() handle 'cbuf' in a way
that matches _bt_split().
2020-04-13 19:26:41 -07:00
Peter Geoghegan bc3087b626 Harmonize nbtree page split point code.
An nbtree split point can be thought of as a point between two adjoining
tuples from an imaginary version of the page being split that includes
the incoming/new item (in addition to the items that really are on the
page).  These adjoining tuples are called the lastleft and firstright
tuples.

The variables that represent split points contained a field called
firstright, which is an offset number of the first data item from the
original page that goes on the new right page.  The corresponding tuple
from origpage was usually the same thing as the actual firstright tuple,
but not always: the firstright tuple is sometimes the new/incoming item
instead.  This situation seems unnecessarily confusing.

Make things clearer by renaming the origpage offset returned by
_bt_findsplitloc() to "firstrightoff".  We now have a firstright tuple
and a firstrightoff offset number which are comparable to the
newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers
respectively.  Also make sure that we are consistent about how we
describe nbtree page split point state.

Push the responsibility for dealing with pg_upgrade'd !heapkeyspace
indexes down to lower level code, relieving _bt_split() from dealing
with it directly.  This means that we always have a palloc'd left page
high key on the leaf level, no matter what.  This enables simplifying
some of the code (and code comments) within _bt_split().

Finally, restructure the page split code to make it clearer why suffix
truncation (which only takes place during leaf page splits) is
completely different to the first data item truncation that takes place
during internal page splits.  Tuples are marked as having fewer
attributes stored in both cases, and the firstright tuple is truncated
in both cases, so it's easy to imagine somebody missing the distinction.
2020-04-13 16:39:55 -07:00
Amit Kapila 5c71362174 Allow parallel create index to accumulate buffer usage stats.
Currently, we don't account for buffer usage incurred by parallel workers
for parallel create index.  This commit allows each worker to record the
buffer usage stats and leader backend to accumulate that stats at the
end of the operation.  This will allow pg_stat_statements to display
correct buffer usage stats for (parallel) create index command.

Reported-by: Julien Rouhaud
Author: Sawada Masahiko
Reviewed-by: Dilip Kumar, Julien Rouhaud and Amit Kapila
Backpatch-through: 11, where this was introduced
Discussion: https://postgr.es/m/20200328151721.GB12854@nol
2020-04-09 09:49:30 +05:30
Peter Geoghegan 60cbd7751c Remove nbtree BTreeTupleSetAltHeapTID() function.
Since heap TID is supposed to be just another key attribute to the
implementation, it doesn't make much sense to have separate
BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions.  Merge the
two functions together.  This slightly simplifies _bt_truncate().
2020-04-07 15:56:52 -07:00
Peter Geoghegan ce2cee0ade Fix nbtree kill_prior_tuple posting list assert.
An assertion added by commit 0d861bbb checked that _bt_killitems() only
processes a BTScanPosItem whose heap TID is contained in a posting list
tuple when its page offset number still matches what is on the page
(i.e. when it matches the posting list tuple's current offset number).
This was only correct in the common case where the page can't have
changed since we first read it.  It was not correct in cases where we
don't drop the buffer pin (and don't need to verify the page hasn't
changed using its LSN).  The latter category includes scans involving
unlogged tables, and scans that use a non-MVCC snapshot, per the logic
originally introduced by commit 2ed5b87f.

The assertion still seems helpful.  Fix it by taking cases where the
page may have been concurrently modified into account.

Reported-By: Anastasia Lubennikova, Alexander Lakhin
Discussion: https://postgr.es/m/c4e38e9a-0f9c-8e53-e639-adf343f94472@postgrespro.ru
2020-04-06 14:46:33 -07:00
Noah Misch c6b92041d3 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-04-04 12:25:34 -07:00
Amit Kapila df3b181499 Add infrastructure to track WAL usage.
This allows gathering the WAL generation statistics for each statement
execution.  The three statistics that we collect are the number of WAL
records, the number of full page writes and the amount of WAL bytes
generated.

This helps the users who have write-intensive workload to see the impact
of I/O due to WAL.  This further enables us to see approximately what
percentage of overall WAL is due to full page writes.

In the future, we can extend this functionality to allow us to compute the
the exact amount of WAL data due to full page writes.

This patch in itself is just an infrastructure to compute WAL usage data.
The upcoming patches will expose this data via explain, auto_explain,
pg_stat_statements and verbose (auto)vacuum output.

Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
2020-04-04 10:02:08 +05:30
Peter Geoghegan 7dbe290da4 Add CREATE INDEX deduplication assertions.
Add two assertions that verify the assumptions about posting list tuple
space accounting and suffix truncation made within nbtsort.c.
2020-03-31 14:38:39 -07:00
Peter Geoghegan f01157e2ac Further simplify nbtree high key truncation.
Commit 7c2dbc69 reorganized _bt_truncate() in a way that enables a
further simplification that I (pgeoghegan) missed:  Since we mark the
tuple that is returned to the caller as a pivot tuple before the point
where its heap TID is set as of 7c2dbc69, it is possible to use the high
level BTreeTupleGetHeapTID() inline function to get an item pointer.  Do
it that way now.  This approach is clearer and more maintainable.
2020-03-30 17:34:12 -07:00
Peter Geoghegan 7c2dbc691c Refactor nbtree high key truncation.
Simplify _bt_truncate(), the routine that generates truncated leaf page
high keys.  Remove a micro-optimization that avoided a second palloc0()
call (this was used when a heap TID was needed in the final pivot tuple,
though only when the index happened to not be an INCLUDE index).

Removing this dubious micro-optimization allows _bt_truncate() to use
the index_truncate_tuple() indextuple.c utility routine in all cases.
This was already the common case.

This commit is a HEAD-only follow up to bugfix commit 4b42a899.
2020-03-30 15:52:39 -07:00
Peter Geoghegan 4b42a89938 Consistently truncate non-key suffix columns.
INCLUDE indexes failed to have their non-key attributes physically
truncated away in certain rare cases.  This led to physically larger
pivot tuples that contained useless non-key attribute values.  The
impact on users should be negligible, but this is still clearly a
regression (Postgres 11 supports INCLUDE indexes, and yet was not
affected).

The bug appeared in commit dd299df8, which introduced "true" suffix
truncation of key attributes.

Discussion: https://postgr.es/m/CAH2-Wz=E8pkV9ivRSFHtv812H5ckf8s1-yhx61_WrJbKccGcrQ@mail.gmail.com
Backpatch: 12-, where "true" suffix truncation was introduced.
2020-03-30 12:03:59 -07:00
Alexander Korotkov 911e702077 Implement operator class parameters
PostgreSQL provides set of template index access methods, where opclasses have
much freedom in the semantics of indexing.  These index AMs are GiST, GIN,
SP-GiST and BRIN.  There opclasses define representation of keys, operations on
them and supported search strategies.  So, it's natural that opclasses may be
faced some tradeoffs, which require user-side decision.  This commit implements
opclass parameters allowing users to set some values, which tell opclass how to
index the particular dataset.

This commit doesn't introduce new storage in system catalog.  Instead it uses
pg_attribute.attoptions, which is used for table column storage options but
unused for index attributes.

In order to evade changing signature of each opclass support function, we
implement unified way to pass options to opclass support functions.  Options
are set to fn_expr as the constant bytea expression.  It's possible due to the
fact that opclass support functions are executed outside of expressions, so
fn_expr is unused for them.

This commit comes with some examples of opclass options usage.  We parametrize
signature length in GiST.  That applies to multiple opclasses: tsvector_ops,
gist__intbig_ops, gist_ltree_ops, gist__ltree_ops, gist_trgm_ops and
gist_hstore_ops.  Also we parametrize maximum number of integer ranges for
gist__int_ops.  However, the main future usage of this feature is expected
to be json, where users would be able to specify which way to index particular
json parts.

Catversion is bumped.

Discussion: https://postgr.es/m/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru
Author: Nikita Glukhov, revised by me
Reviwed-by: Nikolay Shaplov, Robert Haas, Tom Lane, Tomas Vondra, Alvaro Herrera
2020-03-30 19:17:23 +03:00
Peter Geoghegan a7b9d24e4e Make deduplication use number of key attributes.
Use IndexRelationGetNumberOfKeyAttributes() rather than
IndexRelationGetNumberOfAttributes() when determining whether or not two
index tuples are suitable for merging together into a single posting
list tuple.  This is a little bit tidier.  It brings affected code in
nbtdedup.c a little closer to similar, related code in nbtsplitloc.c.
2020-03-28 20:25:03 -07:00
Peter Geoghegan 9945ad6e90 Justify nbtree page split locking in code comment.
Delaying unlocking the right child page until after the point that the
left child's parent page has been refound is no longer truly necessary.
Commit 40dae7ec made nbtree tolerant of interrupted page splits.  VACUUM
was taught to avoid deleting a page that happens to be the right half of
an incomplete split.  As long as page splits don't unlock the left child
page until the end of the second/final phase, it should be safe to
unlock the right child page earlier (at the end of the first phase).

It probably isn't actually useful to release the right child's lock
earlier like this (it probably won't improve performance).  Even still,
pointing out that it ought to be safe to do so should make it easier to
understand the overall design.
2020-03-27 16:44:52 -07:00
Peter Geoghegan b150a76793 Fix nbtree deduplication README commentary.
Descriptions of some aspects of how deduplication works were unclear in
a couple of places.
2020-03-24 14:58:27 -07:00
Noah Misch de9396326e Revert "Skip WAL for new relfilenodes, under wal_level=minimal."
This reverts commit cb2fd7eac2.  Per
numerous buildfarm members, it was incompatible with parallel query, and
a test case assumed LP64.  Back-patch to 9.5 (all supported versions).

Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
2020-03-22 09:24:09 -07:00
Noah Misch cb2fd7eac2 Skip WAL for new relfilenodes, under wal_level=minimal.
Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
2020-03-21 09:38:26 -07:00
Peter Geoghegan b27e1b3418 nbtree: Remove obsolete _bt_pgaddtup() comments.
Remove comments that are a throw back to a time when nbtree cared about
write-ordering dependencies.  The comments are similar to those removed
by commit 9ee7414e, among others.
2020-03-19 14:56:56 -07:00
Peter Geoghegan 6312c08a29 nbtree: Use raw PageAddItem() for retail inserts.
Only internal page splits need to call _bt_pgaddtup() instead of
PageAddItem(), and only for data items, one of which will end up at the
first offset (or first offset after the high key offset) on the new
right page.  This data item alone will need to be truncated in
_bt_pgaddtup().

Since there is no reason why retail inserts ever need to truncate the
incoming item, use a raw PageAddItem() call there instead.  Even
_bt_split() uses raw PageAddItem() calls for left page and right page
high keys.  Clearly the _bt_pgaddtup() shim function wasn't really
encapsulating anything.  _bt_pgaddtup() should now be thought of as a
_bt_split() helper function.

Note that the assertions from commit d1e241c2 verify that retail inserts
never insert an item at an internal page's negative infinity offset.
This invariant could only ever be violated as a result of a basic logic
error in nbtinsert.c.
2020-03-18 18:17:37 -07:00
Peter Geoghegan b029395f5e Refactor nbtree fastpath optimization.
Commit 2b272734, which added the fastpath rightmost leaf page cache
insert optimization, added code to _bt_doinsert() to handle using and
invalidating the backend local block cache.  It doesn't seem like a good
place to handle these low level details, though.  _bt_doinsert() is
supposed to be a high level function -- it is the main entry point to
nbtinsert.c.

Restructure the code by placing handling of the rightmost block cache at
the start of a new _bt_search() shim function, _bt_search_insert().  The
new function is called from _bt_doinsert(), which uses it as a
_bt_search() variant that conveniently accepts its BTInsertState state
as an argument.  _bt_doinsert() no longer needs to directly consider the
fastpath optimization.

Discussion: https://postgr.es/m/CAH2-Wzk59cxKJRd=rfbyub6-V4yWRjsOYRkUNHBLT1P1GdtCQQ@mail.gmail.com
2020-03-18 14:42:49 -07:00
Peter Geoghegan b897b3aae6 nbtree: Remove useless local variables.
Copying block and offset numbers to local variables in _bt_insertonpg()
made the code less readable.  Remove the variables.  There is already
code that conditionally calls BufferGetBlockNumber() in the same block,
so consistently do it that way instead.

Calling BufferGetBlockNumber() is very cheap, but we might as well avoid
it when it isn't truly necessary.  It isn't truly necessary for
_bt_insertonpg() to call BufferGetBlockNumber() in almost all cases.

Spotted while working on a patch that refactors the fastpath rightmost
leaf page cache optimization, which was added by commit 2b272734.
2020-03-17 18:39:26 -07:00
Peter Geoghegan 113758155c nbtree: Fix obsolete _bt_search() comment.
Oversight in commit d2086b08b0.
2020-03-16 15:51:06 -07:00
Peter Geoghegan 013c1f6af6 nbtree: Pass down MAXALIGN()'d itemsz for new item.
Refactor nbtinsert.c so that the final itemsz of each new non-pivot
tuple (the MAXALIGN()'d size) is determined once.  Most of the functions
used by leaf page inserts used the insertstate.itemsz value already.
This commit makes everything use insertstate.itemsz as standard
practice.  The goal is to decouple tuple size from "effective" tuple
size.  Making this distinction isn't truly necessary right now, but that
might change in the future.

Also explain why we consistently apply MAXALIGN() to get an effective
index tuple size.  This was rather unclear, in part because it isn't
actually strictly necessary right now.
2020-03-16 12:00:10 -07:00
Peter Geoghegan f207bb0b8f nbtree: Reorder nbtinsert.c prototypes.
Relocate _bt_newroot() prototype, so that the order that prototypes
appear in matches the order that the functions are defined in.
2020-03-15 20:53:12 -07:00
Peter Geoghegan 39eabec904 nbtree: Move fastpath NULL descent stack assertion.
Commit 074251db added an assertion that verified the fastpath/rightmost
page insert optimization's assumption about free space: There should
always be enough free space on the page to insert the new item without
splitting the page.  Otherwise, we end up using the "concurrent root
page split" phony/fake stack path in _bt_insert_parent().  This does not
lead to incorrect behavior, but it is likely to be far slower than
simply using the regular _bt_search() path.  The assertion catches
serious performance bugs that would probably take a long time to detect
any other way.

It seems much more natural to make this assertion just before the point
that we generate a fake/phony descent stack.  Move the assert there.
This also makes _bt_insertonpg() a bit more readable.
2020-03-10 17:25:47 -07:00
Peter Geoghegan d1e241c226 nbtree: Demote minus infinity "can't happen" error.
Only a very basic logic bug in a _bt_insertonpg() caller could lead to a
violation of this invariant.  Besides, any newitemoff used for an
internal page is sanitized using other "can't happen" errors in
_bt_getstackbuf() or its callers, before _bt_insertonpg() even gets
called.

Also, move the error/assertion from the insert-without-split path of
_bt_insertonpg() to the top of the same function.  There is no reason
why this invariant only applies to insertions that happen to not result
in a page split; cover every insertion.  The assertion naturally belongs
next to the existing generic assertions that document relatively
high-level invariants for the item being inserted.
2020-03-10 14:15:41 -07:00
Peter Geoghegan 1e07f5e0a1 Remove overzealous _bt_split() assertions.
_bt_split() is passed NULL as its insertion scankey for internal page
splits.  Two recently added Assert() statements failed to consider this,
leading to a crash with pg_upgrade'd BREE_VERSION < 4 indexes.  Remove
the assertions.

The assertions in question were added by commit 0d861bbb, which added
nbtree deduplication.  It would be possible to fix the assertions
directly instead, but they weren't adding much anyway.
2020-03-02 21:40:11 -08:00
Peter Geoghegan 77b88bd5dc Add assertions to _bt_update_posting().
Copy some assertions from _bt_form_posting() to its sibling function,
_bt_update_posting().

Discussion: https://postgr.es/m/CAH2-WzkPR8KMwkL0ap976kmXwBCeukTeHz6fB-U__wvuP1S9Zg@mail.gmail.com
2020-03-02 08:07:16 -08:00
Peter Geoghegan 84ec9b231a Remove dead code from _bt_update_posting().
Discussion: https://postgr.es/m/CAH2-WzmAufHiOku6AGiFD=81VQs5nYJ1L2YkhW1t+BH4CMsgRw@mail.gmail.com
2020-03-01 12:11:26 -08:00
Peter Geoghegan 2c0797da2c Silence another compiler warning in nbtinsert.c.
Per complaint from Álvaro Herrera.
2020-02-26 15:15:45 -08:00
Peter Geoghegan 2d8a6fad18 Silence compiler warning in nbtinsert.c.
Per buildfarm member longfin.
2020-02-26 13:17:36 -08:00
Peter Geoghegan 0d861bbb70 Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 13:05:30 -08:00
Peter Geoghegan 612a1ab767 Add equalimage B-Tree support functions.
Invent the concept of a B-Tree equalimage ("equality implies image
equality") support function, registered as support function 4.  This
indicates whether it is safe (or not safe) to apply optimizations that
assume that any two datums considered equal by an operator class's order
method must be interchangeable without any loss of semantic information.
This is static information about an operator class and a collation.

Register an equalimage routine for almost all of the existing B-Tree
opclasses.  We only need two trivial routines for all of the opclasses
that are included with the core distribution.  There is one routine for
opclasses that index non-collatable types (which returns 'true'
unconditionally), plus another routine for collatable types (which
returns 'true' when the collation is a deterministic collation).

This patch is infrastructure for an upcoming patch that adds B-Tree
deduplication.

Author: Peter Geoghegan, Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
2020-02-26 11:28:25 -08:00
Peter Geoghegan fe9b92854e Remove obsolete _bt_compare() comment.
btbuild() has nothing to say about how NULL values compare in nbtree.
Besides, there are _bt_compare() header comments that describe how NULL
values are handled.
2020-02-18 16:07:16 -08:00
Thomas Munro d9fe702a2c Handle lack of DSM slots in parallel btree build, take 2.
Commit 74618e77 added a new check intended to fix a bug, but put
it in the wrong place so that parallel btree build was always
disabled.  Do the check after we've actually tried to create
a DSM segment.  Back-patch to 11, like the earlier commit.

Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-WzmDABkJzrNnvf%2BOULK-_A_j9gkYg_Dz-H62jzNv4eKQTw%40mail.gmail.com
2020-02-05 12:27:00 +13:00
Thomas Munro 74618e77b4 Handle lack of DSM slots in parallel btree build.
If no DSM slots are available, a ParallelContext can still be
created, but its seg pointer is NULL.  Teach parallel btree build
to cope with that by falling back to a regular non-parallel build,
to avoid crashing with a segmentation fault.

Back-patch to 11, where parallel CREATE INDEX landed.

Reported-by: Nicola Contu
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CA%2BhUKGJgJEBnkuODBVomyK3MWFvDBbMVj%3Dgdt6DnRPU-5sQ6UQ%40mail.gmail.com
2020-01-31 10:25:34 +13:00
Thomas Munro 6f38d4dac3 Remove dependency on HeapTuple from predicate locking functions.
The following changes make the predicate locking functions more
generic and suitable for use by future access methods:

- PredicateLockTuple() is renamed to PredicateLockTID().  It takes
  ItemPointer and inserting transaction ID instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer.

- CheckForSerializableConflictOut() no longer takes HeapTuple or buffer.

Author: Ashwin Agrawal
Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro
Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
2020-01-28 13:13:04 +13:00
Amit Kapila 4d8a8d0c73 Introduce IndexAM fields for parallel vacuum.
Introduce new fields amusemaintenanceworkmem and amparallelvacuumoptions
in IndexAmRoutine for parallel vacuum.  The amusemaintenanceworkmem tells
whether a particular IndexAM uses maintenance_work_mem or not.  This will
help in controlling the memory used by individual workers as otherwise,
each worker can consume memory equal to maintenance_work_mem.  The
amparallelvacuumoptions tell whether a particular IndexAM participates in
a parallel vacuum and if so in which phase (bulkdelete, vacuumcleanup) of
vacuum.

Author: Masahiko Sawada and Amit Kapila
Reviewed-by: Dilip Kumar, Amit Kapila, Tomas Vondra and Robert Haas
Discussion:
https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com
https://postgr.es/m/CAA4eK1LmcD5aPogzwim5Nn58Ki+74a6Edghx4Wd8hAskvHaq5A@mail.gmail.com
2020-01-15 07:24:14 +05:30
Peter Geoghegan fc31001123 Remove redundant incomplete split assertion.
The fastpath insert optimization's incomplete split flag Assert() is
redundant.  We'll reach the more general Assert() within
_bt_findinsertloc() in all cases. (Besides, Assert()'ing that the
rightmost page doesn't have the flag set never made much sense.)
2020-01-05 17:42:13 -08:00
Peter Geoghegan d2e5e20e57 Add xl_btree_delete optimization.
Commit 558a9165e0 taught _bt_delitems_delete() to produce its own XID
horizon on the primary.  Standbys no longer needed to generate their own
latestRemovedXid, since they could just use the explicitly logged value
from the primary instead.  The deleted offset numbers array from the
xl_btree_delete WAL record was no longer used by the REDO routine for
anything other than deleting the items.

This enables a minor optimization:  We now treat the array as buffer
state, not generic WAL data, following _bt_delitems_vacuum()'s example.
This should be a minor win, since it allows us to avoid including the
deleted items array in cases where XLogInsert() stores the whole buffer
anyway.  The primary goal here is to make the code more maintainable,
though.  Removing inessential differences between the two functions
highlights the fundamental differences that remain.

Also change xl_btree_delete to use uint32 for the size of the array of
item offsets being deleted.  This brings xl_btree_delete closer to
xl_btree_vacuum.  Furthermore, it seems like a good idea to use an
explicit-width integer type (the field was previously an "int").

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

Discussion: https://postgr.es/m/CAH2-Wzkz4TjmezzfAbaV1zYrh=fr0bCpzuJTvBe5iUQ3aUPsCQ@mail.gmail.com
2020-01-03 12:18:13 -08:00
Peter Geoghegan 0c41c83d8f Clear up btree_xlog_split() alignment comment.
Adjust a comment that describes how alignment of the new left page high
key works in btree_xlog_split(), the nbtree page split REDO routine.
The wording used before commit 2c03216d83 is much clearer, so go back
to that.
2020-01-02 18:30:25 -08:00
Peter Geoghegan 44e44bd258 Correct _bt_delitems_vacuum() lock comments.
The expectation within _bt_delitems_vacuum() is that caller has a
super-exclusive/cleanup buffer lock (not just a pin and a write lock).
2020-01-02 13:30:40 -08:00
Peter Geoghegan 4b25f5d0ba Revise BTP_HAS_GARBAGE nbtree VACUUM comments.
_bt_delitems_vacuum() comments claimed that it isn't worth another scan
of the page to avoid falsely unsetting the BTP_HAS_GARBAGE page flag
hint (this happens to be the same wording that was removed from
_bt_delitems_delete() by my recent commit fe97c61c).  The comments made
little sense, though.  The issue can't have much to do with performing a
second scan of the target leaf page, since an LP_DEAD test could easily
be performed in the first scan of the page anyway (the scan that takes
place in btvacuumpage() caller).

Revise the explanation.  It makes much more sense to frame this as an
issue about recovery conflicts.  _bt_delitems_vacuum() cannot easily
generate an XID cutoff in the same way that _bt_delitems_delete() is
designed to.

Falsely unsetting the page flag is not ideal, and is likely to happen
more often than was supposed by the original comments.  Explain why it
usually isn't a problem in practice.  There may be an argument for
_bt_delitems_vacuum() not clearing the BTP_HAS_GARBAGE bit, removing the
question of it being falsely unset by VACUUM (there may even be an
argument for not using a page level hint at all).  This can be revisited
later.
2020-01-01 17:29:41 -08:00
Peter Geoghegan c5f3b53b0e Update btree_xlog_delete() comments.
Commit fe97c61c updated LP_DEAD item deletion comments, but missed a
minor discrepancy on the REDO side.  Fix it now.

In passing, don't talk about the btree_xlog_vacuum() behavior within
btree_xlog_delete().  The reliance on XLOG_HEAP2_CLEANUP_INFO records
for recovery conflicts is already discussed within btvacuumpage() and
mentioned again in passing above btree_xlog_vacuum(), which seems
sufficient.
2020-01-01 11:32:07 -08:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Michael Paquier 7854e07f25 Revert "Rename files and headers related to index AM"
This follows multiple complains from Peter Geoghegan, Andres Freund and
Alvaro Herrera that this issue ought to be dug more before actually
happening, if it happens.

Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
2019-12-27 08:09:00 +09:00
Michael Paquier 8ce3aa9b59 Rename files and headers related to index AM
The following renaming is done so as source files related to index
access methods are more consistent with table access methods (the
original names used for index AMs ware too generic, and could be
confused as including features related to table AMs):
- amapi.h -> indexam.h.
- amapi.c -> indexamapi.c.  Here we have an equivalent with
backend/access/table/tableamapi.c.
- amvalidate.c -> indexamvalidate.c.
- amvalidate.h -> indexamvalidate.h.
- genam.c -> indexgenam.c.
- genam.h -> indexgenam.h.

This has been discussed during the development of v12 when table AM was
worked on, but the renaming never happened.

Author: Michael Paquier
Reviewed-by: Fabien Coelho, Julien Rouhaud
Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
2019-12-25 10:23:39 +09:00
Peter Geoghegan fe97c61c87 Update nbtree LP_DEAD item deletion comments.
Comments about the consequences of clearing the BTP_HAS_GARBAGE page
flag bit that apply only to VACUUM were added to code that deals with
opportunistic deletion of LP_DEAD items by commit a760893d.  The same
comment block was added to both _bt_delitems_vacuum() and
_bt_delitems_delete().  Correct _bt_delitems_delete()'s copy of the
comment block.

_bt_delitems_delete() reliably deletes items that were found by caller
to have their LP_DEAD bit set.  There is no question about whether or
not unsetting the BTP_HAS_GARBAGE bit can miss some LP_DEAD items that
were set recently.

Also tweak a related section of the nbtree README.
2019-12-22 19:57:35 -08:00
Peter Geoghegan 9f83468b35 Remove unneeded "pin scan" nbtree VACUUM code.
The REDO routine for nbtree's xl_btree_vacuum record type hasn't
performed a "pin scan" since commit 3e4b7d87 went in, so clearly there
isn't any point in VACUUM WAL-logging information that won't actually be
used.  Finish off the work of commit 3e4b7d87 (and the closely related
preceding commit 687f2cd7) by removing the code that generates this
unused information.  Also remove the REDO routine code disabled by
commit 3e4b7d87.

Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a
new "ndeleted" field.  The new field isn't actually needed right now,
since we could continue to infer the array length from the overall
record length.  However, an upcoming patch to add deduplication to
nbtree needs to add an "items updated" field to xl_btree_vacuum, so we
might as well start being explicit about the number of items now.
(Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum
struct without any fields; the C standard says that that's undefined.)

nbtree VACUUM no longer forces writing a WAL record for the last block
in the index.  Writing out a WAL record with no items for the final
block was supposed to force processing of a lastBlockVacuumed field by a
pin scan.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
2019-12-19 11:35:55 -08:00
Bruce Momjian b93e9a5c94 revert: Remove meaningless assignments in nbtree code
Reverts commit 05684c8255.

Reported-by: Tom Lane

Discussion: https://postgr.es/m/404.1576770942@sss.pgh.pa.us

Backpatch-through: master
2019-12-19 11:19:10 -05:00
Bruce Momjian 05684c8255 Remove meaningless assignments in nbtree code
Reported-by: Ranier Vilela

Discussion: https://postgr.es/m/MN2PR18MB2927BB876D12A70FDBE8F35AE3450@MN2PR18MB2927.namprd18.prod.outlook.com

Backpatch-through: master
2019-12-19 10:33:48 -05:00
Peter Geoghegan fcf3b6917b Rename nbtree tuple macros.
Rename two function-style macros, removing the word "inner".  This makes
things more consistent.
2019-12-16 17:49:45 -08:00
Peter Geoghegan 9067b83955 Update nbtree README's "Scans during Recovery".
get_actual_variable_range() hasn't used a dirty snapshot since commit
3ca930fc3, which invented a new snapshot type specifically to meet
selfuncs.c's requirements (HeapTupleSatisfiesNonVacuumable() type
snapshots were added).

Discussion: https://postgr.es/m/CAH2-Wzn2pSqEOcBDAA40CnO82oEy-EOpE2bNh_XL_cfFoA86jw@mail.gmail.com
2019-12-16 17:11:35 -08:00
Michael Paquier 4cb658af70 Refactor reloption handling for index AMs in-core
This reworks the reloption parsing and build of a couple of index AMs by
creating new structures for each index AM's options.  This split was
already done for BRIN, GIN and GiST (which actually has a fillfactor
parameter), but not for hash, B-tree and SPGiST which relied on
StdRdOptions due to an overlap with the default option set.

This saves a couple of bytes for rd_options in each relcache entry with
indexes making use of relation options, and brings more consistency
between all index AMs.  While on it, add a couple of AssertMacro() calls
to make sure that utility macros to grab values of reloptions are used
with the expected index AM.

Author: Nikolay Shaplov
Reviewed-by: Amit Langote, Michael Paquier, Álvaro Herrera, Dent John
Discussion: https://postgr.es/m/4127670.gFlpRb6XCm@x200m
2019-11-25 09:40:53 +09:00
Peter Geoghegan 2110f71696 nbtree: Tweak _bt_pgaddtup() comments.
Make it clear that _bt_pgaddtup() truncates the first data item on an
internal page because its key is supposed to be treated as minus
infinity within _bt_compare().
2019-11-18 13:04:53 -08:00
Peter Geoghegan 1f55ebae27 Make _bt_keep_natts_fast() use datum_image_eq().
An upcoming patch that adds deduplication to the nbtree AM will rely on
_bt_keep_natts_fast() understanding that differences in TOAST input
state can never affect its answer.  In particular, two opclass-equal
datums (with opclasses deemed safe for deduplication) should never be
treated as unequal by _bt_keep_natts_fast() due to TOAST input
differences.

This also seems like a good idea on general principle.  nbtsplitloc.c
will now occasionally make better decisions about where to split a leaf
page.  The behavior of _bt_keep_natts_fast() is now somewhat closer to
the behavior of _bt_keep_natts().

Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
2019-11-12 13:08:41 -08:00
Amit Kapila 14aec03502 Make the order of the header file includes consistent in backend modules.
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.

In the passing, removed a couple of duplicate inclusions.

Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
2019-11-12 08:30:16 +05:30
Andres Freund aae50236e4 Pass ItemPointer not HeapTuple to IndexBuildCallback.
Not all AMs use HeapTuples internally, making it inconvenient to pass
a HeapTuple. As the index callbacks really only need the TID, not the
full tuple, modify callback to only take ItemPointer.

Author: Ashwin Agrawal
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeis6=8ehuR=VNtHvj3z16cYfCwPdTcpaxU+sfSUJ5QgR3g@mail.gmail.com
2019-11-08 11:49:29 -08:00
Peter Geoghegan e86c8ef243 Use "low key" terminology in nbtsort.c.
nbtree index builds once stashed the "minimum key" for a page, which was
used as the basis of the pivot tuple that gets placed in the next level
up (i.e. the tuple that stores the downlink to the page in question).
It doesn't quite work that way anymore, so the "minimum key" terminology
now seems misleading (these days the minimum key is actually a straight
copy of the high key from the left sibling, which is a distinct thing in
subtle but important ways).  Rename this concept to "low key".  This
name is a lot clearer given that there is now a sharp distinction
between pivot and non-pivot tuples.  Also remove comments that describe
obsolete details about how the minimum key concept used to work.

Rather than generating the minus infinity item for the leftmost page on
a level by copying the new item and truncating that copy, simply
allocate a small buffer.  The old approach confusingly created the
impression that the new item had some kind of significance.  This was
another artifact of how things used to work before commits 8224de4f and
dd299df8.
2019-11-07 17:12:09 -08:00
Thomas Munro 7815e7efdb Add reusable routine for making arrays unique.
Introduce qunique() and qunique_arg(), which can be used after qsort()
and qsort_arg() respectively to remove duplicate values.  Use it where
appropriate.

Author: Thomas Munro
Reviewed-by: Tom Lane (in an earlier version)
Discussion: https://postgr.es/m/CAEepm%3D2vmFTNpAmwbGGD2WaryM6T3hSDVKQPfUwjdD_5XY6vAA%40mail.gmail.com
2019-11-07 17:00:48 +13:00
Andres Freund 01368e5d9d Split all OBJS style lines in makefiles into one-line-per-entry style.
When maintaining or merging patches, one of the most common sources
for conflicts are the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.

By splitting, and alphabetically sorting, OBJS style lines into one
object per line, conflicts should be less frequent, and easier to
resolve when they still occur.

Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
2019-11-05 14:41:07 -08:00
Michael Paquier 6ca86bb7e9 Fix typos in the code
Author: Vignesh C
Reviewed-by: Dilip Kumar, Michael Paquier
Discussion: https://postgr.es/m/CALDaNm0ni+GAOe4+fbXiOxNrVudajMYmhJFtXGX-zBPoN8ixhw@mail.gmail.com
2019-10-30 10:03:00 +09:00
Peter Geoghegan 1b9becd43c Remove redundant _bt_truncate() comment paragraph. 2019-09-12 09:51:27 -07:00
Peter Geoghegan 55d015bde0 Add _bt_binsrch() scantid assertion to nbtree.
Assert that _bt_binsrch() binary searches with scantid set in insertion
scankey cannot be performed on leaf pages.  Leaf-level binary searches
where scantid is set must use _bt_binsrch_insert() instead.

_bt_binsrch_insert() is likely to have additional responsibilities in
the future, such as searching within GIN-style posting lists using
scantid.  It seems like a good idea to tighten things up now.
2019-09-09 11:41:19 -07:00
Peter Geoghegan b8b3a276d4 Remove obsolete nbtree page deletion comment.
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed the concept of a half-dead internal page.  Remove a
comment about half dead parent pages that was overlooked.
2019-08-27 14:01:43 -07:00
Peter Geoghegan 867d25ccb4 Explain subtlety in nbtree locking protocol.
The Postgres approach to coupling locks during an ascent of the tree is
slightly different to the approach taken by Lehman and Yao.  Add a new
paragraph to the "Differences to the Lehman & Yao algorithm" section of
the nbtree README that explains the similarities and differences.
2019-08-23 20:24:49 -07:00
Peter Geoghegan 091bd6befc Update comments on nbtree stack struct.
Adjust the struct comment that describes how page splits use their
descent stack to cascade up the tree from the leaf level.

In passing, fix up some unrelated nbtree comments that had typos or were
obsolete.
2019-08-21 13:50:27 -07:00
Peter Geoghegan 9c02cf5661 Remove block number field from nbtree stack.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place.  This makes the
overall design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
2019-08-14 11:32:35 -07:00
Peter Geoghegan 68ef887842 Remove obsolete nbtree README commentary.
Commit d2086b08b0 removed almost all cases where nbtree must release a
read buffer lock and acquire a write buffer lock instead, so remaining
cases in which that's still necessary are not notable enough to appear
in the nbtree README.

More importantly, holding on to a buffer pin in cases where nbtree must
trade a read lock for a write lock is very unlikely to save any I/O.
This seems to have been a long overlooked throwback to a time when
nbtree cared about write-ordering dependencies, and performed
synchronous buffer writes.  It hasn't worked that way in many years.
2019-08-13 17:16:44 -07:00
Peter Geoghegan af0ba49809 Use PageIndexTupleOverwrite() within nbtree.
Use the PageIndexTupleOverwrite() bufpage.c routine within nbtree
instead of deleting a tuple and re-inserting its replacement.  This
makes the intent of affected code slightly clearer.  It also makes
CREATE INDEX slightly faster, since there is no longer a need to shift
every leaf page's line pointer array back and forth during index builds.

Author: Peter Geoghegan, Anastasia Lubennikova
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wz=Zk=B9+Vwm376WuO7YTjFc2SSskifQm4Nme3RRRPtOSQ@mail.gmail.com
2019-08-13 11:54:26 -07:00
Michael Paquier 66bde49d96 Fix inconsistencies and typos in the tree, take 10
This addresses some issues with unnecessary code comments, fixes various
typos in docs and comments, and removes some orphaned structures and
definitions.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/9aabc775-5494-b372-8bcb-4dfc0bd37c68@gmail.com
2019-08-13 13:53:41 +09:00
Michael Paquier 8548ddc61b Fix inconsistencies and typos in the tree, take 9
This addresses more issues with code comments, variable names and
unreferenced variables.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/7ab243e0-116d-3e44-d120-76b3df7abefd@gmail.com
2019-08-05 12:14:58 +09:00
Peter Eisentraut fd6ec93bf8 Add error codes to some corruption log messages
In some cases we have elog(ERROR) while corruption is certain and we
can give a clear error code ERRCODE_DATA_CORRUPTED or
ERRCODE_INDEX_CORRUPTED.

Author: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
2019-08-01 11:15:26 +02:00
Peter Geoghegan d004147eb3 Fix nbtree metapage cache upgrade bug.
Commit 857f9c36cd, which taught nbtree VACUUM to avoid unnecessary
index scans, bumped the nbtree version number from 2 to 3, while adding
the ability for nbtree indexes to be upgraded on-the-fly.  Various
assertions that assumed that an nbtree index was always on version 2 had
to be changed to accept any supported version (version 2 or 3 on
Postgres 11).

However, a few assertions were missed in the initial commit, all of
which were in code paths that cache a local copy of the metapage
metadata, where the index had been expected to be on the current version
(no longer version 2) as a generic sanity check.  Rather than simply
update the assertions, follow-up commit 0a64b45152 intentionally made
the metapage caching code update the per-backend cached metadata version
without changing the on-disk version at the same time.  This could even
happen when the planner needed to determine the height of a B-Tree for
costing purposes.  The assertions only fail on Postgres v12 when
upgrading from v10, because they were adjusted to use the authoritative
shared memory metapage by v12's commit dd299df8.

To fix, remove the cache-only upgrade mechanism entirely, and update the
assertions themselves to accept any supported version (go back to using
the cached version in v12).  The fix is almost a full revert of commit
0a64b45152 on the v11 branch.

VACUUM only considers the authoritative metapage, and never bothers with
a locally cached version, whereas everywhere else isn't interested in
the metapage fields that were added by commit 857f9c36cd.  It seems
unlikely that this bug has affected any user on v11.

Reported-By: Christoph Berg
Bug: #15896
Discussion: https://postgr.es/m/15896-5b25e260fdb0b081%40postgresql.org
Backpatch: 11-, where VACUUM was taught to avoid unnecessary index scans.
2019-07-18 13:22:56 -07:00
Michael Paquier 0896ae561b Fix inconsistencies and typos in the tree
This is numbered take 7, and addresses a set of issues around:
- Fixes for typos and incorrect reference names.
- Removal of unneeded comments.
- Removal of unreferenced functions and structures.
- Fixes regarding variable name consistency.

Author: Alexander Lakhin
Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com
2019-07-16 13:23:53 +09:00
Peter Geoghegan bfdbac2ab3 Correct nbtsplitloc.c comment.
The logic just added by commit e3899ffd falls back on a 50:50 page split
in the event of a new item that's just to the right of our provisional
"many duplicates" split point.  Fix a comment that incorrectly claimed
that the new item had to be just to the left of our provisional split
point.

Backpatch: 12-, just like commit e3899ffd.
2019-07-15 14:35:06 -07:00
Peter Geoghegan e3899ffd8b Fix pathological nbtree split point choice issue.
Specific ever-decreasing insertion patterns could cause successive
unbalanced nbtree page splits.  Problem cases involve a large group of
duplicates to the left, and ever-decreasing insertions to the right.

To fix, detect the situation by considering the newitem offset before
performing a split using nbtsplitloc.c's "many duplicates" strategy.  If
the new item was inserted just to the right of our provisional "many
duplicates" split point, infer ever-decreasing insertions and fall back
on a 50:50 (space delta optimal) split.  This seems to barely affect
cases that already had acceptable space utilization.

An alternative fix also seems possible.  Instead of changing
nbtsplitloc.c split choice logic, we could instead teach _bt_truncate()
to generate a new value for new high keys by interpolating from the
lastleft and firstright key values.  That would certainly be a more
elegant fix, but it isn't suitable for backpatching.

Discussion: https://postgr.es/m/CAH2-WznCNvhZpxa__GqAa1fgQ9uYdVc=_apArkW2nc-K3O7_NA@mail.gmail.com
Backpatch: 12-, where the nbtree page split enhancements were introduced.
2019-07-15 13:19:13 -07:00
Peter Geoghegan 66c5bd3a6f Remove obsolete nbtree "get root" comment.
Remove a very old Berkeley era comment that doesn't seem to have
anything to do with the current locking considerations within
_bt_getroot().

Discussion: https://postgr.es/m/CAH2-WzmA2H+rL-xxF5o6QhMD+9x6cJTnz2Mr3Li_pbPBmqoTBQ@mail.gmail.com
2019-07-01 22:28:08 -07:00
Michael Paquier c74d49d41c Fix many typos and inconsistencies
Author: Alexander Lakhin
Discussion: https://postgr.es/m/af27d1b3-a128-9d62-46e0-88f424397f44@gmail.com
2019-07-01 10:00:23 +09:00
Thomas Munro 89ff7c08ee Remove unnecessary comment.
Author: Vik Fearing
Discussion: https://postgr.es/m/150d3e9f-c7ec-3fb3-4fdb-def47c4144af%402ndquadrant.com
2019-06-23 22:19:59 +12:00
Michael Paquier f43608bda2 Fix typos and inconsistencies in code comments
Author: Alexander Lakhin
Discussion: https://postgr.es/m/dec6aae8-2d63-639f-4d50-20e229fb83e3@gmail.com
2019-06-14 09:34:34 +09:00
Amit Kapila 9679345f3c Fix typos.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
2019-05-26 18:28:18 +05:30
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane be76af171c Initial pgindent run for v12.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.

Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
2019-05-22 12:55:34 -04:00
Peter Geoghegan 3f58cc6dd8 Remove extra nbtree half-dead internal page check.
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page.  Logic to prevent this from happening was
added by commit 8da3183780, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e9).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.

However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion.  Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.

The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt.  Commit
efada2b8e9 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).

It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX.  However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).

Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
2019-05-16 15:11:58 -07:00
Peter Geoghegan 489e431ba5 Remove obsolete nbtree insertion comment.
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes.  This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.

I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards.  The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
2019-05-15 16:53:11 -07:00
Peter Geoghegan 7505da2f45 Reverse order of newitem nbtree candidate splits.
Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split.  The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).

However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before.  As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page.  This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.

To fix, switch the order in which newitem candidate splits are recorded
in.  This also makes it possible to describe candidate split points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.
2019-05-15 12:22:07 -07:00
Peter Geoghegan ae7291acbc Standardize ItemIdData terminology.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous.  Only
ItemPointerData/ItemPointer variables should be called item pointers.

To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers".  The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.

Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
2019-05-13 15:53:39 -07:00
Peter Geoghegan 9b42e71376 Don't leave behind junk nbtree pages during split.
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR.  At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section.  This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions.  The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog().  There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.

Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes.  This happened at a point
that broke the rule around zeroing the right page in _bt_split().  If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents.  This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.

To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired.  A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.

Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
2019-05-13 10:27:59 -07:00
Peter Geoghegan d95e36dc38 Remove obsolete nbtree split REDO routine comment.
Commit dd299df818, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits.  It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page.  Remove a comment that referred to the previous practice.
2019-05-08 12:47:20 -07:00
Peter Geoghegan d65b5ccad6 Correct obsolete nbtsort.c minimum key comment.
It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.
2019-05-07 21:42:12 -07:00
Peter Geoghegan 7b37f4b02e Correct more obsolete nbtree page split comments.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself.  Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust.  Fix the comments that were missed the first time
around now.

In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert new downlink.
2019-05-03 13:34:45 -07:00
Peter Geoghegan 6dd86c269d Fix nbtsort.c's page space accounting.
Commit dd299df818, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe.  In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page.  This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.

Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on current page.  This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for firstright
tuple, which was just short of what was needed following "truncation").

Several conditions needed to be met all at once for CREATE INDEX to
fail.  The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit.  The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.

To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.

Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
2019-05-02 12:33:35 -07:00
Alvaro Herrera 9a83afecb7 Widen tuple counter variables from long to int64
Mistake in ab0dfc961b6a; progress reporting would have wrapped around
for indexes created with more than 2^31 tuples.

Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com
2019-04-30 10:27:38 -04:00
Peter Geoghegan 9ee7414ed0 Remove obsolete _bt_insert_parent() comment.
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty().  It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
2019-04-29 14:14:38 -07:00
Peter Geoghegan 9b10926263 Prevent O(N^2) unique index insertion edge case.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate.  The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on.  The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.

Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values.  Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.

To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely.  _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.

Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
2019-04-23 10:33:57 -07:00
Alexander Korotkov 1e87198182 Fix division by zero in _bt_vacuum_needs_cleanup()
Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when
metad->btm_last_cleanup_num_heap_tuples == 0.  This commit adjusts the
expression so that no division by zero might happen.

Reported-by: Piotr Stefaniak
Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com
Reviewed-by: Masahiko Sawada
Backpatch-through: 11
2019-04-15 20:20:43 +03:00
Peter Geoghegan 74eb2176bf Invalidate binary search bounds consistently.
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78.  This resulted in
problems after waiting for the conflicting xact to commit or abort.  The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search.  Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.

Ashutosh Sharma, with a few additional tweaks by me.

Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
2019-04-04 09:38:08 -07:00
Alvaro Herrera ab0dfc961b Report progress of CREATE INDEX operations
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.

There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific.  The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.

The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan.  (The index validation index scan requires
patching each AM, which has not been included here.)

Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-04-02 15:18:08 -03:00
Andres Freund 4bb50236eb tableam: Formatting and other minor cleanups.
The superflous heapam_xlog.h includes were reported by Peter
Geoghegan.
2019-03-31 18:16:53 -07:00
Peter Geoghegan 76a39f2295 Fix nbtree high key "continuescan" row compare bug.
Commit 29b64d1d mishandled skipping over truncated high key attributes
during row comparisons.  The row comparison key matching loop would loop
forever when a truncated attribute was encountered for a row compare
subkey.  Fix by following the example of other code in the loop: advance
the current subkey, or break out of the loop when the last subkey is
reached.

Add test coverage for the relevant _bt_check_rowcompare() code path.
The new test case is somewhat tied to nbtree implementation details,
which isn't ideal, but seems unavoidable.
2019-03-31 17:24:04 -07:00
Peter Geoghegan 9c7fb7e6d8 Tweak some nbtree-related code comments. 2019-03-29 12:29:05 -07:00
Andres Freund 2a96909a4a tableam: Support for an index build's initial table scan(s).
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM.  While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.

Currently the range based interface wouldn't make much sense for non
block based table AMs. But that seems aceptable for now.

Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-27 19:59:06 -07:00
Andres Freund 558a9165e0 Compute XID horizon for page level index vacuum on primary.
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to looks like. That was
   easy enough before the introducing of tableam (we knew it had to be
   heap, although some trickery around logging the heap relfilenodes
   was required). But to properly handle table AMs we need
   per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
   decoding on standbys. When on a catalog table, we need to be able
   to conflict with slots that have an xid horizon that's too old. But
   computing the horizon by visiting the heap only works once
   consistency is reached, but we always need to be able to detect
   conflicts.

There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there's no connected backends).

Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.

To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers.  As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency -
thereby using IO concurrency, when not globally enabled.  That's
possibly not the perfect formula, but seems good enough for now.

Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.

Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
    https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
    https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
2019-03-26 16:52:54 -07:00