Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
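
The pre-check this patch adds (_bt_parent_deletion_safe in nbtpage.c) applies
one rule: a page may not be deleted if it is the rightmost child of its
parent, unless it is also the only child, in which case the parent must be
removable by the same test. The following is a minimal sketch of that rule on
a toy in-memory tree, not the committed code. ToyPage, its fields, and
toy_deletion_safe are hypothetical names invented for illustration, and the
model ignores locking, the search stack, and WAL recovery.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy page: only the facts the safety rule looks at. */
typedef struct ToyPage
{
    const struct ToyPage *parent;   /* NULL for the root page */
    int         pos;                /* 0-based position among parent's children */
    int         nchildren;          /* number of children this page has */
    bool        rightmost_on_level; /* page has no right sibling on its level */
} ToyPage;

/*
 * Would deleting "target" leave its parent in a valid or deletable state?
 * (Toy model only; assumes the tree does not change while we look at it.)
 */
bool
toy_deletion_safe(const ToyPage *target)
{
    const ToyPage *parent = target->parent;

    if (parent == NULL)
        return false;           /* the root itself is never deleted */

    if (target->pos < parent->nchildren - 1)
        return true;            /* not the rightmost child: safe */

    if (parent->nchildren > 1)
        return false;           /* rightmost child that has siblings: unsafe */

    /*
     * Only child: the parent would become empty, so it must be removable
     * too.  It cannot be the rightmost page of its level or the root, and
     * the same rule then applies one level further up.
     */
    if (parent->rightmost_on_level || parent->parent == NULL)
        return false;

    return toy_deletion_safe(parent);
}
```

In this model, the only child of an interior page passes the check when that
interior page is neither the root nor the rightmost page of its level, and
is itself deletable under the same rule. The rightmost leaf under a page
with several children fails the check.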
Tom Lane 2006-11-01 19:43:17 +00:00
parent 19d0c46def
commit 70ce5c9082
6 changed files with 359 additions and 135 deletions

src/backend/access/nbtree/README

@@ -1,4 +1,4 @@
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.13 2006/07/25 19:13:00 tgl Exp $
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.14 2006/11/01 19:43:17 tgl Exp $
This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
@@ -201,26 +201,25 @@ When we delete the last remaining child of a parent page, we mark the
parent page "half-dead" as part of the atomic update that deletes the
child page. This implicitly transfers the parent's key space to its right
sibling (which it must have, since we never delete the overall-rightmost
page of a level). No future insertions into the parent level are allowed
to insert keys into the half-dead page --- they must move right to its
sibling, instead. The parent remains empty and can be deleted in a
separate atomic action. (However, if it's the rightmost child of its own
parent, it might have to stay half-dead for awhile, until it's also the
only child.)
Note that an empty leaf page is a valid tree state, but an empty interior
page is not legal (an interior page must have children to delegate its
key space to). So an interior page *must* be marked half-dead as soon
as its last child is deleted.
page of a level). Searches ignore the half-dead page and immediately move
right. We need not worry about insertions into a half-dead page --- insertions
into upper tree levels happen only as a result of splits of child pages, and
the half-dead page no longer has any children that could split. Therefore
the page stays empty even when we don't have lock on it, and we can complete
its deletion in a second atomic action.
The notion of a half-dead page means that the key space relationship between
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. We can tolerate this,
however, because insertions and deletions on upper tree levels are always
done by reference to child page numbers, not keys. The only cost is that
searches may sometimes descend to the half-dead page and then have to move
right, rather than going directly to the sibling page.
parent level may really belong to its right sibling. To prevent any possible
problems, we hold lock on the deleted child page until we have finished
deleting any now-half-dead parent page(s). This prevents any insertions into
the transferred keyspace until the operation is complete. The reason for
doing this is that a sufficiently large number of insertions into the
transferred keyspace, resulting in multiple page splits, could propagate keys
from that keyspace into the parent level, resulting in transiently
out-of-order keys in that level. It is thought that that wouldn't cause any
serious problem, but it seems too risky to allow.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the

src/backend/access/nbtree/nbtinsert.c

@@ -8,7 +8,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.144 2006/10/04 00:29:48 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.145 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -1337,8 +1337,8 @@ _bt_insert_parent(Relation rel,
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in \"%s\"",
RelationGetRelationName(rel));
elog(ERROR, "failed to re-find parent key in \"%s\" for split pages %u/%u",
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
_bt_insertonpg(rel, pbuf, stack->bts_parent,

src/backend/access/nbtree/nbtpage.c

@@ -9,7 +9,7 @@
*
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.100 2006/10/04 00:29:49 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.101 2006/11/01 19:43:17 tgl Exp $
*
* NOTES
* Postgres btree pages look like ordinary relation pages. The opaque
@@ -723,7 +723,93 @@ _bt_delitems(Relation rel, Buffer buf,
}
/*
* _bt_pagedel() -- Delete a page from the b-tree.
* Subroutine to pre-check whether a page deletion is safe, that is, its
* parent page would be left in a valid or deletable state.
*
* "target" is the page we wish to delete, and "stack" is a search stack
* leading to it (approximately). Note that we will update the stack
* entry(s) to reflect current downlink positions --- this is harmless and
* indeed saves later search effort in _bt_pagedel.
*
* Note: it's OK to release page locks after checking, because a safe
* deletion can't become unsafe due to concurrent activity. A non-rightmost
* page cannot become rightmost unless there's a concurrent page deletion,
* but only VACUUM does page deletion and we only allow one VACUUM on an index
* at a time. An only child could acquire a sibling (of the same parent) only
* by being split ... but that would make it a non-rightmost child so the
* deletion is still safe.
*/
static bool
_bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
{
BlockNumber parent;
OffsetNumber poffset,
maxoff;
Buffer pbuf;
Page page;
BTPageOpaque opaque;
/*
* In recovery mode, assume the deletion being replayed is valid. We
* can't always check it because we won't have a full search stack,
* and we should complain if there's a problem, anyway.
*/
if (InRecovery)
return true;
/* Locate the parent's downlink (updating the stack entry if needed) */
ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_READ);
if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in \"%s\" for deletion target page %u",
RelationGetRelationName(rel), target);
parent = stack->bts_blkno;
poffset = stack->bts_offset;
page = BufferGetPage(pbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
maxoff = PageGetMaxOffsetNumber(page);
/*
* If the target is the rightmost child of its parent, then we can't
* delete, unless it's also the only child.
*/
if (poffset >= maxoff)
{
/* It's rightmost child... */
if (poffset == P_FIRSTDATAKEY(opaque))
{
/*
* It's only child, so safe if parent would itself be removable.
* We have to check the parent itself, and then recurse to
* test the conditions at the parent's parent.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque))
{
_bt_relbuf(rel, pbuf);
return false;
}
_bt_relbuf(rel, pbuf);
return _bt_parent_deletion_safe(rel, parent, stack->bts_parent);
}
else
{
/* Unsafe to delete */
_bt_relbuf(rel, pbuf);
return false;
}
}
else
{
/* Not rightmost child, so safe to delete */
_bt_relbuf(rel, pbuf);
return true;
}
}
/*
* _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
*
* This action unlinks the page from the b-tree structure, removing all
* pointers leading to it --- but not touching its own left and right links.
@@ -731,19 +817,25 @@ _bt_delitems(Relation rel, Buffer buf,
* may currently be trying to follow links leading to the page; they have to
* be allowed to use its right-link to recover. See nbtree/README.
*
* On entry, the target buffer must be pinned and read-locked. This lock and
* pin will be dropped before exiting.
* On entry, the target buffer must be pinned and locked (either read or write
* lock is OK). This lock and pin will be dropped before exiting.
*
* Returns the number of pages successfully deleted (zero on failure; could
* be more than one if parent blocks were deleted).
* The "stack" argument can be a search stack leading (approximately) to the
* target page, or NULL --- outside callers typically pass NULL since they
* have not done such a search, but internal recursion cases pass the stack
* to avoid duplicated search effort.
*
* Returns the number of pages successfully deleted (zero if page cannot
* be deleted now; could be more than one if parent pages were deleted too).
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
int
_bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
_bt_pagedel(Relation rel, Buffer buf, BTStack stack, bool vacuum_full)
{
int result;
BlockNumber target,
leftsib,
rightsib,
@@ -756,7 +848,6 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
IndexTuple targetkey,
itup;
ScanKey itup_scankey;
BTStack stack;
Buffer lbuf,
rbuf,
pbuf;
@@ -778,6 +869,9 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
_bt_relbuf(rel, buf);
return 0;
}
@@ -793,36 +887,79 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
/*
* We need to get an approximate pointer to the page's parent page. Use
* the standard search mechanism to search for the page's high key; this
* will give us a link to either the current parent or someplace to its
* left (if there are multiple equal high keys). To avoid deadlocks, we'd
* better drop the target page lock first.
* To avoid deadlocks, we'd better drop the target page lock before
* going further.
*/
_bt_relbuf(rel, buf);
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page containing this key */
stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
&lbuf, BT_READ);
/* don't need a pin on that either */
_bt_relbuf(rel, lbuf);
/*
* If we are trying to delete an interior page, _bt_search did more than
* we needed. Locate the stack item pointing to our parent level.
* We need an approximate pointer to the page's parent page. We use
* the standard search mechanism to search for the page's high key; this
* will give us a link to either the current parent or someplace to its
* left (if there are multiple equal high keys). In recursion cases,
* the caller already generated a search stack and we can just re-use
* that work.
*/
ilevel = 0;
for (;;)
if (stack == NULL)
{
if (stack == NULL)
elog(ERROR, "not enough stack items");
if (ilevel == targetlevel)
break;
stack = stack->bts_parent;
ilevel++;
if (!InRecovery)
{
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page containing this key */
stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
&lbuf, BT_READ);
/* don't need a pin on that either */
_bt_relbuf(rel, lbuf);
/*
* If we are trying to delete an interior page, _bt_search did
* more than we needed. Locate the stack item pointing to our
* parent level.
*/
ilevel = 0;
for (;;)
{
if (stack == NULL)
elog(ERROR, "not enough stack items");
if (ilevel == targetlevel)
break;
stack = stack->bts_parent;
ilevel++;
}
}
else
{
/*
* During WAL recovery, we can't use _bt_search (for one reason,
* it might invoke user-defined comparison functions that expect
* facilities not available in recovery mode). Instead, just
* set up a dummy stack pointing to the left end of the parent
* tree level, from which _bt_getstackbuf will walk right to the
* parent page. Painful, but we don't care too much about
* performance in this scenario.
*/
pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
stack = (BTStack) palloc(sizeof(BTStackData));
stack->bts_blkno = BufferGetBlockNumber(pbuf);
stack->bts_offset = InvalidOffsetNumber;
/* bts_btentry will be initialized below */
stack->bts_parent = NULL;
_bt_relbuf(rel, pbuf);
}
}
/*
* We cannot delete a page that is the rightmost child of its immediate
* parent, unless it is the only child --- in which case the parent has
* to be deleted too, and the same condition applies recursively to it.
* We have to check this condition all the way up before trying to delete.
* We don't need to re-test when deleting a non-leaf page, though.
*/
if (targetlevel == 0 &&
!_bt_parent_deletion_safe(rel, target, stack))
return 0;
/*
* We have to lock the pages we need to modify in the standard order:
* moving right, then up. Else we will deadlock against other writers.
@@ -898,15 +1035,16 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
if (pbuf == InvalidBuffer)
elog(ERROR, "failed to re-find parent key in \"%s\"",
RelationGetRelationName(rel));
elog(ERROR, "failed to re-find parent key in \"%s\" for deletion target page %u",
RelationGetRelationName(rel), target);
parent = stack->bts_blkno;
poffset = stack->bts_offset;
/*
* If the target is the rightmost child of its parent, then we can't
* delete, unless it's also the only child --- in which case the parent
* changes to half-dead status.
* changes to half-dead status. The "can't delete" case should have been
* detected by _bt_parent_deletion_safe, so complain if we see it now.
*/
page = BufferGetPage(pbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -918,14 +1056,8 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
if (poffset == P_FIRSTDATAKEY(opaque))
parent_half_dead = true;
else
{
_bt_relbuf(rel, pbuf);
_bt_relbuf(rel, rbuf);
_bt_relbuf(rel, buf);
if (BufferIsValid(lbuf))
_bt_relbuf(rel, lbuf);
return 0;
}
elog(ERROR, "failed to delete rightmost child %u of %u in \"%s\"",
target, parent, RelationGetRelationName(rel));
}
else
{
@@ -940,10 +1072,13 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
* might be possible to push the fast root even further down, but the odds
* of doing so are slim, and the locking considerations daunting.)
*
* We don't support handling this in the case where the parent is
* becoming half-dead, even though it theoretically could occur.
*
* We can safely acquire a lock on the metapage here --- see comments for
* _bt_newroot().
*/
if (leftsib == P_NONE)
if (leftsib == P_NONE && !parent_half_dead)
{
page = BufferGetPage(rbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1031,6 +1166,7 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED;
opaque->btpo.xact =
vacuum_full ? FrozenTransactionId : ReadNewTransactionId();
@@ -1085,6 +1221,8 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
nextrdata++;
xlinfo = XLOG_BTREE_DELETE_PAGE_META;
}
else if (parent_half_dead)
xlinfo = XLOG_BTREE_DELETE_PAGE_HALF;
else
xlinfo = XLOG_BTREE_DELETE_PAGE;
@@ -1138,34 +1276,52 @@ _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full)
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
/* release metapage; send out relcache inval if metapage changed */
if (BufferIsValid(metabuf))
{
CacheInvalidateRelcache(rel);
_bt_relbuf(rel, metabuf);
}
_bt_relbuf(rel, pbuf);
_bt_relbuf(rel, rbuf);
_bt_relbuf(rel, buf);
/* can always release leftsib immediately */
if (BufferIsValid(lbuf))
_bt_relbuf(rel, lbuf);
/*
* If parent became half dead, recurse to try to delete it. Otherwise, if
* If parent became half dead, recurse to delete it. Otherwise, if
* right sibling is empty and is now the last child of the parent, recurse
* to try to delete it. (These cases cannot apply at the same time,
* though the second case might itself recurse to the first.)
*
* When recursing to parent, we hold the lock on the target page until
* done. This delays any insertions into the keyspace that was just
* effectively reassigned to the parent's right sibling. If we allowed
* that, and there were enough such insertions before we finish deleting
* the parent, page splits within that keyspace could lead to inserting
* out-of-order keys into the grandparent level. It is thought that that
* wouldn't have any serious consequences, but it still seems like a
* pretty bad idea.
*/
if (parent_half_dead)
{
buf = _bt_getbuf(rel, parent, BT_READ);
return _bt_pagedel(rel, buf, vacuum_full) + 1;
/* recursive call will release pbuf */
_bt_relbuf(rel, rbuf);
result = _bt_pagedel(rel, pbuf, stack->bts_parent, vacuum_full) + 1;
_bt_relbuf(rel, buf);
}
if (parent_one_child && rightsib_empty)
else if (parent_one_child && rightsib_empty)
{
buf = _bt_getbuf(rel, rightsib, BT_READ);
return _bt_pagedel(rel, buf, vacuum_full) + 1;
_bt_relbuf(rel, pbuf);
_bt_relbuf(rel, buf);
/* recursive call will release rbuf */
result = _bt_pagedel(rel, rbuf, stack, vacuum_full) + 1;
}
else
{
_bt_relbuf(rel, pbuf);
_bt_relbuf(rel, buf);
_bt_relbuf(rel, rbuf);
result = 1;
}
return 1;
return result;
}

src/backend/access/nbtree/nbtree.c

@@ -12,7 +12,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.152 2006/10/04 00:29:49 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.153 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -804,8 +804,7 @@ restart:
if (blkno != orig_blkno)
{
if (_bt_page_recyclable(page) ||
P_ISDELETED(opaque) ||
(opaque->btpo_flags & BTP_HALF_DEAD) ||
P_IGNORE(opaque) ||
!P_ISLEAF(opaque) ||
opaque->btpo_cycleid != vstate->cycleid)
{
@@ -828,7 +827,7 @@ restart:
/* Already deleted, but can't recycle yet */
stats->pages_deleted++;
}
else if (opaque->btpo_flags & BTP_HALF_DEAD)
else if (P_ISHALFDEAD(opaque))
{
/* Half-dead, try to delete */
delete_now = true;
@@ -939,7 +938,7 @@ restart:
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
ndel = _bt_pagedel(rel, buf, info->vacuum_full);
ndel = _bt_pagedel(rel, buf, NULL, info->vacuum_full);
/* count only this page, else may double-count parent */
if (ndel)

src/backend/access/nbtree/nbtxlog.c

@@ -8,7 +8,7 @@
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.38 2006/10/04 00:29:49 momjian Exp $
* $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.39 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -22,31 +22,41 @@
* them manually if they are not seen in the WAL log during replay. This
* makes it safe for page insertion to be a multiple-WAL-action process.
*
* Similarly, deletion of an only child page and deletion of its parent page
* form multiple WAL log entries, and we have to be prepared to follow through
* with the deletion if the log ends between.
*
* The data structure is a simple linked list --- this should be good enough,
* since we don't expect a page split to remain incomplete for long.
* since we don't expect a page split or multi deletion to remain incomplete
* for long. In any case we need to respect the order of operations.
*/
typedef struct bt_incomplete_split
typedef struct bt_incomplete_action
{
RelFileNode node; /* the index */
bool is_split; /* T = pending split, F = pending delete */
/* these fields are for a split: */
bool is_root; /* we split the root */
BlockNumber leftblk; /* left half of split */
BlockNumber rightblk; /* right half of split */
bool is_root; /* we split the root */
} bt_incomplete_split;
/* these fields are for a delete: */
BlockNumber delblk; /* parent block to be deleted */
} bt_incomplete_action;
static List *incomplete_splits;
static List *incomplete_actions;
static void
log_incomplete_split(RelFileNode node, BlockNumber leftblk,
BlockNumber rightblk, bool is_root)
{
bt_incomplete_split *split = palloc(sizeof(bt_incomplete_split));
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
split->node = node;
split->leftblk = leftblk;
split->rightblk = rightblk;
split->is_root = is_root;
incomplete_splits = lappend(incomplete_splits, split);
action->node = node;
action->is_split = true;
action->is_root = is_root;
action->leftblk = leftblk;
action->rightblk = rightblk;
incomplete_actions = lappend(incomplete_actions, action);
}
static void
@@ -54,17 +64,50 @@ forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
{
ListCell *l;
foreach(l, incomplete_splits)
foreach(l, incomplete_actions)
{
bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, split->node) &&
downlink == split->rightblk)
if (RelFileNodeEquals(node, action->node) &&
action->is_split &&
downlink == action->rightblk)
{
if (is_root != split->is_root)
if (is_root != action->is_root)
elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
split->is_root, is_root);
incomplete_splits = list_delete_ptr(incomplete_splits, split);
action->is_root, is_root);
incomplete_actions = list_delete_ptr(incomplete_actions, action);
pfree(action);
break; /* need not look further */
}
}
}
static void
log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
{
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
action->node = node;
action->is_split = false;
action->delblk = delblk;
incomplete_actions = lappend(incomplete_actions, action);
}
static void
forget_matching_deletion(RelFileNode node, BlockNumber delblk)
{
ListCell *l;
foreach(l, incomplete_actions)
{
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, action->node) &&
!action->is_split &&
delblk == action->delblk)
{
incomplete_actions = list_delete_ptr(incomplete_actions, action);
pfree(action);
break; /* need not look further */
}
}
@@ -389,8 +432,7 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
}
static void
btree_xlog_delete_page(bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
Relation reln;
@@ -427,6 +469,7 @@ btree_xlog_delete_page(bool ismeta,
poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
if (poffset >= PageGetMaxOffsetNumber(page))
{
Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
Assert(poffset == P_FIRSTDATAKEY(pageop));
PageIndexTupleDelete(page, poffset);
pageop->btpo_flags |= BTP_HALF_DEAD;
@@ -437,6 +480,7 @@ btree_xlog_delete_page(bool ismeta,
IndexTuple itup;
OffsetNumber nextoffset;
Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
itemid = PageGetItemId(page, poffset);
itup = (IndexTuple) PageGetItem(page, itemid);
ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
@@ -523,7 +567,7 @@ btree_xlog_delete_page(bool ismeta,
UnlockReleaseBuffer(buffer);
/* Update metapage if needed */
if (ismeta)
if (info == XLOG_BTREE_DELETE_PAGE_META)
{
xl_btree_metadata md;
@@ -533,6 +577,13 @@ btree_xlog_delete_page(bool ismeta,
md.root, md.level,
md.fastroot, md.fastlevel);
}
/* Forget any completed deletion */
forget_matching_deletion(xlrec->target.node, target);
/* If parent became half-dead, remember it for deletion */
if (info == XLOG_BTREE_DELETE_PAGE_HALF)
log_incomplete_deletion(xlrec->target.node, parent);
}
static void
@@ -620,10 +671,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
btree_xlog_delete(lsn, record);
break;
case XLOG_BTREE_DELETE_PAGE:
btree_xlog_delete_page(false, lsn, record);
break;
case XLOG_BTREE_DELETE_PAGE_META:
btree_xlog_delete_page(true, lsn, record);
case XLOG_BTREE_DELETE_PAGE_HALF:
btree_xlog_delete_page(info, lsn, record);
break;
case XLOG_BTREE_NEWROOT:
btree_xlog_newroot(lsn, record);
@@ -724,6 +774,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
}
case XLOG_BTREE_DELETE_PAGE:
case XLOG_BTREE_DELETE_PAGE_META:
case XLOG_BTREE_DELETE_PAGE_HALF:
{
xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
@@ -752,7 +803,7 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
void
btree_xlog_startup(void)
{
incomplete_splits = NIL;
incomplete_actions = NIL;
}
void
@@ -760,45 +811,60 @@ btree_xlog_cleanup(void)
{
ListCell *l;
foreach(l, incomplete_splits)
foreach(l, incomplete_actions)
{
bt_incomplete_split *split = (bt_incomplete_split *) lfirst(l);
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
Relation reln;
Buffer lbuf,
rbuf;
Page lpage,
rpage;
BTPageOpaque lpageop,
rpageop;
bool is_only;
reln = XLogOpenRelation(split->node);
lbuf = XLogReadBuffer(reln, split->leftblk, false);
/* failure should be impossible because we wrote this page earlier */
if (!BufferIsValid(lbuf))
elog(PANIC, "btree_xlog_cleanup: left block unfound");
lpage = (Page) BufferGetPage(lbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
rbuf = XLogReadBuffer(reln, split->rightblk, false);
/* failure should be impossible because we wrote this page earlier */
if (!BufferIsValid(rbuf))
elog(PANIC, "btree_xlog_cleanup: right block unfound");
rpage = (Page) BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
reln = XLogOpenRelation(action->node);
if (action->is_split)
{
/* finish an incomplete split */
Buffer lbuf,
rbuf;
Page lpage,
rpage;
BTPageOpaque lpageop,
rpageop;
bool is_only;
/* if the two pages are all of their level, it's a only-page split */
is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
lbuf = XLogReadBuffer(reln, action->leftblk, false);
/* failure is impossible because we wrote this page earlier */
if (!BufferIsValid(lbuf))
elog(PANIC, "btree_xlog_cleanup: left block unfound");
lpage = (Page) BufferGetPage(lbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
rbuf = XLogReadBuffer(reln, action->rightblk, false);
/* failure is impossible because we wrote this page earlier */
if (!BufferIsValid(rbuf))
elog(PANIC, "btree_xlog_cleanup: right block unfound");
rpage = (Page) BufferGetPage(rbuf);
rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
_bt_insert_parent(reln, lbuf, rbuf, NULL,
split->is_root, is_only);
/* if the pages are all of their level, it's a only-page split */
is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
_bt_insert_parent(reln, lbuf, rbuf, NULL,
action->is_root, is_only);
}
else
{
/* finish an incomplete deletion (of a half-dead page) */
Buffer buf;
buf = XLogReadBuffer(reln, action->delblk, false);
if (BufferIsValid(buf))
if (_bt_pagedel(reln, buf, NULL, true) == 0)
elog(PANIC, "btree_xlog_cleanup: _bt_pagdel failed");
}
}
incomplete_splits = NIL;
incomplete_actions = NIL;
}
bool
btree_safe_restartpoint(void)
{
if (incomplete_splits)
if (incomplete_actions)
return false;
return true;
}

src/include/access/nbtree.h

@@ -7,7 +7,7 @@
* Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.105 2006/10/04 00:30:07 momjian Exp $
* $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.106 2006/11/01 19:43:17 tgl Exp $
*
*-------------------------------------------------------------------------
*/
@@ -163,6 +163,7 @@ typedef struct BTMetaPageData
#define P_ISLEAF(opaque) ((opaque)->btpo_flags & BTP_LEAF)
#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
#define P_ISDELETED(opaque) ((opaque)->btpo_flags & BTP_DELETED)
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
@@ -203,8 +204,10 @@ typedef struct BTMetaPageData
#define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuple */
#define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, plus update metapage */
#define XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */
#define XLOG_BTREE_NEWROOT 0xA0 /* new root page */
#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes
* parent half-dead */
/*
* All that we need to find changed index tuple
@@ -501,7 +504,8 @@ extern void _bt_pageinit(Page page, Size size);
extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems);
extern int _bt_pagedel(Relation rel, Buffer buf, bool vacuum_full);
extern int _bt_pagedel(Relation rel, Buffer buf,
BTStack stack, bool vacuum_full);
/*
* prototypes for functions in nbtsearch.c