postgresql/src/backend
Peter Geoghegan 9a9db08ae4 Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason
about concurrent page splits/deletions in its own peculiar way) to
become confused when running on a replica.  Concurrent replay of a WAL
record that describes the second phase of page deletion could cause
_bt_walk_left() to get confused.  btree_xlog_unlink_page() simply failed
to adhere to the same locking protocol that we use on the primary, which
is obviously wrong once you consider these two disparate functions
together.  This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left()
from observing inconsistencies between the deletion's target page and
its original sibling pages when running on a replica.  This is true even
though the second phase of page deletion is supposed to work as a single
atomic action.  Queries running on replicas could raise "could not find
left sibling of block %u in index %s" can't-happen errors when they went
back to their scan's "original" page and observed that the page had not
been marked deleted (even though it really had been concurrently deleted).

There is no evidence that this actually happened in the real world.  The
issue came to light during unrelated feature development work.  Note
that, apart from code used exclusively by nbtree VACUUM (and
contrib/amcheck, if you count it), _bt_walk_left() is the only code that
cares about the difference between a half-dead page and a fully deleted
page.  It seems very likely that backward scans are the only thing that
could become confused by the inconsistency.  Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
and right sibling pages in that order before releasing any locks (just
like _bt_unlink_halfdead_page()).  This is the simplest possible
approach.  There doesn't seem to be any opportunity to be more clever
about lock acquisition in the REDO routine, and it hardly seems worth
the trouble in any case.
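
The lock ordering described above can be sketched with a toy model.  This
is only an illustration under stated assumptions, not the actual REDO
routine: ToyPage, toy_lock(), toy_unlock(), toy_unlink_page() and
lock_order are hypothetical names, the "lock" is a plain flag rather than
a real buffer lock, and the target is assumed to always have a right
sibling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a leaf page; not PostgreSQL's page layout. */
typedef struct ToyPage
{
    int             blkno;
    struct ToyPage *left;
    struct ToyPage *right;
    bool            deleted;
    bool            locked;     /* toy exclusive lock, a flag only */
} ToyPage;

/* Records the block numbers in the order their locks were taken. */
static int lock_order[3];
static int nlocked = 0;

static void
toy_lock(ToyPage *p)
{
    assert(!p->locked);
    p->locked = true;
    if (nlocked < 3)
        lock_order[nlocked] = p->blkno;
    nlocked++;
}

static void
toy_unlock(ToyPage *p)
{
    assert(p->locked);
    p->locked = false;
}

/*
 * Unlink "target" from its siblings.  Locks are taken left, target,
 * right, and none is released until every change has been made, so a
 * reader that must lock a page before examining it can never see a
 * half-updated set of sibling links.
 *
 * Returns true if all required locks were held while the links changed.
 */
static bool
toy_unlink_page(ToyPage *target)
{
    ToyPage *left = target->left;
    ToyPage *right = target->right;   /* assumed non-NULL in this sketch */
    bool     all_held;

    nlocked = 0;
    if (left != NULL)
        toy_lock(left);
    toy_lock(target);
    toy_lock(right);

    all_held = (left == NULL || left->locked) &&
               target->locked && right->locked;

    /* Every change happens while all of the locks are still held. */
    if (left != NULL)
        left->right = right;
    right->left = left;
    target->deleted = true;

    toy_unlock(right);
    toy_unlock(target);
    if (left != NULL)
        toy_unlock(left);

    return all_held;
}
```

In this sketch the sibling links and the deleted flag change together
under all three locks, which is the property that a routine updating one
page at a time, releasing each lock before taking the next, would not
provide.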

This fix might enable contrib/amcheck verification of leaf page sibling
links with only an AccessShareLock on the relation.  An amcheck patch
from Andrey Borodin was rejected back in January because it clashed with
btree_xlog_unlink_page()'s lax approach to locking pages.  It now seems
likely that the real problem was with btree_xlog_unlink_page(), not the
patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
2020-08-03 15:54:38 -07:00
access Fix replica backward scan race condition. 2020-08-03 15:54:38 -07:00
bootstrap Be more careful about marking catalog columns NOT NULL by default. 2020-07-21 13:03:48 -04:00
catalog Minimize slot creation for multi-inserts of pg_shdepend 2020-08-01 11:49:13 +09:00
commands Use int64 instead of long in incremental sort code 2020-08-02 14:24:46 +12:00
executor Add hash_mem_multiplier GUC. 2020-07-29 14:14:58 -07:00
foreign Update copyrights for 2020 2020-01-01 12:21:45 -05:00
jit pgindent run prior to branching v13. 2020-06-07 16:57:08 -04:00
lib Use pg_bitutils for HyperLogLog. 2020-07-30 09:14:23 -07:00
libpq code: replace most remaining uses of 'master'. 2020-07-08 13:24:35 -07:00
main Clean up includes of s_lock.h. 2020-06-18 19:41:05 -07:00
nodes Rename field "relkind" to "objtype" for CTAS and ALTER TABLE nodes 2020-07-11 13:32:28 +09:00
optimizer Add hash_mem_multiplier GUC. 2020-07-29 14:14:58 -07:00
parser Rename field "relkind" to "objtype" for CTAS and ALTER TABLE nodes 2020-07-11 13:32:28 +09:00
partitioning Fix some issues with step generation in partition pruning. 2020-07-28 11:00:00 +09:00
po Translation updates 2020-05-18 12:49:30 +02:00
port Add huge_page_size setting for use on Linux. 2020-07-17 14:33:00 +12:00
postmaster Introduce a WaitEventSet for the stats collector. 2020-07-30 17:44:28 +12:00
regex Dial back -Wimplicit-fallthrough to level 3 2020-05-13 15:31:14 -04:00
replication Extend the logical decoding output plugin API with stream methods. 2020-07-28 08:09:44 +05:30
rewrite Add missing invocations to object access hooks 2020-05-23 14:03:04 +09:00
snowball code: replace most remaining uses of 'master'. 2020-07-08 13:24:35 -07:00
statistics Run pgindent with new pg_bsd_indent version 2.1.1. 2020-05-16 11:54:51 -04:00
storage Preallocate some DSM space at startup. 2020-07-31 17:49:58 +12:00
tcop Rename field "relkind" to "objtype" for CTAS and ALTER TABLE nodes 2020-07-11 13:32:28 +09:00
tsearch Fix recently-introduced performance problem in ts_headline(). 2020-07-31 11:43:12 -04:00
utils Add %P to log_line_prefix for parallel group leader 2020-08-03 13:38:48 +09:00
.gitignore Add .gitignore entries for AIX-specific intermediate build artifacts. 2015-07-08 20:44:22 -04:00
Makefile Update copyrights for 2020 2020-01-01 12:21:45 -05:00
common.mk Remove PARTIAL_LINKING build mode. 2018-03-30 17:33:04 -07:00
nls.mk Add missing gettext triggers 2020-04-28 13:35:40 +02:00