Compress GIN posting lists, for smaller index size.

GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.

To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with the GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.

This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
the version number might come in handy in the future, if we want to drop
support for the uncompressed format.

Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
Heikki Linnakangas 2014-01-22 18:51:48 +02:00
parent 243ee26633
commit 36a35c550a
13 changed files with 2359 additions and 768 deletions

@ -123,6 +123,6 @@ create index test_ginidx on test using gin (b);
select * from pgstatginindex('test_ginidx'); select * from pgstatginindex('test_ginidx');
version | pending_pages | pending_tuples version | pending_pages | pending_tuples
---------+---------------+---------------- ---------+---------------+----------------
1 | 0 | 0 2 | 0 | 0
(1 row) (1 row)

@ -135,15 +135,15 @@ same category of null entry are merged into one index entry just as happens
with ordinary key entries. with ordinary key entries.
* In a key entry at the btree leaf level, at the next SHORTALIGN boundary, * In a key entry at the btree leaf level, at the next SHORTALIGN boundary,
there is an array of zero or more ItemPointers, which store the heap tuple there is a list of item pointers, in compressed format (see Posting List
TIDs for which the indexable items contain this key. This is called the Compression section), pointing to the heap tuples for which the indexable
"posting list". The TIDs in a posting list must appear in sorted order. items contain this key. This is called the "posting list".
If the list would be too big for the index tuple to fit on an index page,
the ItemPointers are pushed out to a separate posting page or pages, and If the list would be too big for the index tuple to fit on an index page, the
none appear in the key entry itself. The separate pages are called a ItemPointers are pushed out to a separate posting page or pages, and none
"posting tree"; they are organized as a btree of ItemPointer values. appear in the key entry itself. The separate pages are called a "posting
Note that in either case, the ItemPointers associated with a key can tree" (see below); Note that in either case, the ItemPointers associated with
easily be read out in sorted order; this is relied on by the scan a key can easily be read out in sorted order; this is relied on by the scan
algorithms. algorithms.
* The index tuple header fields of a leaf key entry are abused as follows: * The index tuple header fields of a leaf key entry are abused as follows:
@ -163,6 +163,11 @@ algorithms.
* The posting list can be accessed with GinGetPosting(itup) * The posting list can be accessed with GinGetPosting(itup)
* If GinItupIsCompressed(itup), the posting list is stored in compressed
format. Otherwise it is just an array of ItemPointers. New tuples are always
stored in compressed format; uncompressed items can be present if the
database was migrated from 9.3 or an earlier version.
2) Posting tree case: 2) Posting tree case:
* ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number
@ -210,6 +215,76 @@ fit on one pending-list page must have those pages to itself, even if this
results in wasting much of the space on the preceding page and the last results in wasting much of the space on the preceding page and the last
page for the tuple.) page for the tuple.)
Posting tree
------------
If a posting list is too large to store in-line in a key entry, a posting tree
is created. A posting tree is a B-tree structure, where the ItemPointer is
used as the key.
Internal posting tree pages use the standard PageHeader and the same "opaque"
struct as other GIN pages, but do not contain regular index tuples. Instead,
the content of the page is an array of PostingItem structs. Each PostingItem
consists of the block number of the child page, and the right bound of that
child page, as an ItemPointer. The right bound of the page is stored right
after the page header, before the PostingItem array.
Posting tree leaf pages also use the standard PageHeader and opaque struct,
and the right bound of the page is stored right after the page header,
but the page content consists of 0-32 compressed posting lists, and an
additional array of regular uncompressed item pointers. The compressed posting
lists are stored one after another, between the page header and pd_lower. The
uncompressed array is stored between pd_upper and pd_special. The space
between pd_lower and pd_upper is unused, which allows full-page images of
posting tree leaf pages to skip the unused space in the middle (buffer_std = true
in XLogRecData). For historical reasons, this does not apply to internal
pages, or uncompressed leaf pages migrated from earlier versions.
The item pointers are stored in a number of independent compressed posting
lists (also called segments), instead of one big one, to make random access
to a given item pointer faster: to find an item in a compressed list, you
have to read the list from the beginning, but when the items are split into
multiple lists, you can first skip over to the list containing the item you're
looking for, and read only that segment. Also, an update only needs to
re-encode the affected segment.
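The segment lookup described above can be sketched roughly as follows. This is an illustration only, not code from the patch: it assumes the first item of every segment is available as a plain 64-bit integer in an array, and find_segment is a hypothetical name. On a real page the segments are variable-length, so a reader would have to step from one segment header to the next instead of indexing an array.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * 'first' holds the first (smallest) item of each segment, in sorted
 * order. Returns the index of the only segment that can contain 'target',
 * i.e. the last segment whose first item is <= target, or -1 if the
 * target sorts before every segment.
 */
static int
find_segment(const uint64_t *first, int nsegs, uint64_t target)
{
    int     seg = -1;
    int     i;

    assert(nsegs >= 0);

    for (i = 0; i < nsegs && first[i] <= target; i++)
        seg = i;

    return seg;
}
```

Only the segment returned here needs to be decoded from its beginning; the segments before it are skipped without looking at their packed bytes.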
The uncompressed items array is used for insertions, to avoid re-encoding
a compressed list on every update. If there is room on a page, an insertion
simply inserts the new item to the right place in the uncompressed array.
When a page becomes full, it is rewritten, merging all the uncompressed items
into the compressed lists. When reading, the uncompressed array and the
compressed lists are read in tandem, and merged into one stream of sorted
item pointers.
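As a rough sketch of the read path just described, the following merges two sorted inputs into one sorted stream. It is not the code from the patch: item pointers are represented as plain 64-bit integers, the compressed lists are assumed to be already decoded into an array, and merge_sorted is a hypothetical name. Emitting an item only once when it appears in both inputs is a choice made for the sketch; whether that can happen on a real page is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Merge two sorted arrays into 'out', which must have room for na + nb
 * entries. Returns the number of entries written.
 */
static size_t
merge_sorted(const uint64_t *a, size_t na,
             const uint64_t *b, size_t nb,
             uint64_t *out)
{
    size_t  i = 0,
            j = 0,
            n = 0;

    assert(out != NULL);

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            out[n++] = a[i++];
        else if (a[i] > b[j])
            out[n++] = b[j++];
        else
        {
            /* same item in both inputs: emit it once */
            out[n++] = a[i++];
            j++;
        }
    }

    /* at most one of the inputs still has items left */
    while (i < na)
        out[n++] = a[i++];
    while (j < nb)
        out[n++] = b[j++];

    return n;
}
```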
Posting List Compression
------------------------
To fit as many item pointers on a page as possible, posting tree leaf pages
and posting lists stored inline in entry tree leaf tuples use a lightweight
form of compression. We take advantage of the fact that the item pointers
are stored in sorted order. Instead of storing the block and offset number of
each item pointer separately, we store the difference from the previous item.
That in itself doesn't do much, but it allows us to use so-called varbyte
encoding to compress them.
Varbyte encoding is a method to encode integers, allowing smaller numbers to
take less space at the cost of larger numbers. Each integer is represented by
a variable number of bytes. The high bit of each byte in varbyte encoding
determines whether the next byte is still part of this number. Therefore, to read a single
varbyte encoded number, you have to read bytes until you find a byte with the
high bit not set.
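A minimal sketch of generic varbyte encoding follows, to make the continuation-bit rule concrete. It is not the code in ginpostinglist.c and does not claim to reproduce the exact on-disk byte layout or the 6-byte cap mentioned in this README; the function names are hypothetical, and storing the least significant 7-bit group first is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch.
 *
 * Encode 'val' with 7 data bits per byte, least significant group first.
 * The high bit of a byte is set when another byte follows. Returns the
 * number of bytes written (at most 10 for a 64-bit value).
 */
static size_t
varbyte_encode(uint64_t val, unsigned char *out)
{
    size_t  n = 0;

    while (val > 0x7F)
    {
        out[n++] = (unsigned char) (0x80 | (val & 0x7F));
        val >>= 7;
    }
    out[n++] = (unsigned char) val;     /* last byte: high bit clear */

    return n;
}

/*
 * Decode one value starting at 'in'. Stores it in *val and returns the
 * number of bytes consumed.
 */
static size_t
varbyte_decode(const unsigned char *in, uint64_t *val)
{
    uint64_t    result = 0;
    int         shift = 0;
    size_t      n = 0;
    unsigned char c;

    do
    {
        assert(shift < 64);             /* guard against malformed input */
        c = in[n++];
        result |= (uint64_t) (c & 0x7F) << shift;
        shift += 7;
    } while (c & 0x80);                 /* high bit set: more bytes follow */

    *val = result;
    return n;
}
```

With this scheme a difference below 128 takes a single byte, which is the common case when consecutive item pointers are close together.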
When encoding, the block and offset number forming the item pointer are
combined into a single integer. The offset number is stored in the 11 low
bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block number
is stored in the higher bits. That requires 43 bits in total, which
conveniently fits in at most 6 bytes.
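The combination of block and offset number into one integer can be sketched like this. The 11-bit offset width is taken from the text above; the macro and function names are hypothetical stand-ins, and plain fixed-width integers are used in place of ItemPointerData to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* 11 matches the offset width quoted above; the name is a stand-in. */
#define SKETCH_OFFSET_BITS 11

/* Pack a 32-bit block number and an offset number into one integer. */
static uint64_t
itemptr_to_uint64(uint32_t block, uint16_t offset)
{
    assert(offset < (1u << SKETCH_OFFSET_BITS));

    return ((uint64_t) block << SKETCH_OFFSET_BITS) | offset;
}

/* Reverse of itemptr_to_uint64. */
static void
uint64_to_itemptr(uint64_t val, uint32_t *block, uint16_t *offset)
{
    *offset = (uint16_t) (val & ((1u << SKETCH_OFFSET_BITS) - 1));
    *block = (uint32_t) (val >> SKETCH_OFFSET_BITS);
}
```

Because the block number sits in the high bits, the packed integers sort in the same order as the item pointers, so the difference between two consecutive items is always non-negative and is small when both items are on the same or nearby heap pages.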
A compressed posting list is passed around and stored on disk in a
PackedPostingList struct. The first item in the list is stored uncompressed
as a regular ItemPointerData, followed by the length of the list in bytes,
followed by the packed items.
Concurrency Concurrency
----------- -----------
@ -260,6 +335,36 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the
deleted pages around with the right-link intact until all concurrent scans deleted pages around with the right-link intact until all concurrent scans
have finished.) have finished.)
Compatibility
-------------
Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in
uncompressed format because of pg_upgrade from 9.3 or earlier versions.
For compatibility, the old uncompressed format is also supported. The following
rules are used to handle it:
* GIN_ITUP_COMPRESSED flag marks index tuples that contain a compressed posting list.
This flag is stored in high bit of ItemPointerGetBlockNumber(&itup->t_tid).
Use GinItupIsCompressed(itup) to check the flag.
* Posting tree pages in the new format are marked with the GIN_COMPRESSED flag.
Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are used to
check and set this flag.
* All scan operations check the format of the posting list and use the
corresponding code to read its content.
* When updating an index tuple containing an uncompressed posting list, it
will be replaced with new index tuple containing a compressed list.
* When updating an uncompressed posting tree leaf page, it's compressed.
* If vacuum finds some dead TIDs in uncompressed posting lists, they are
converted into compressed posting lists. This assumes that the compressed
posting list fits in the space occupied by the uncompressed list. IOW, we
assume that the compressed version of the page, with the dead items removed,
takes less space than the old uncompressed version.
Limitations Limitations
----------- -----------

@ -325,9 +325,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
{ {
Page page = BufferGetPage(stack->buffer); Page page = BufferGetPage(stack->buffer);
XLogRecData *payloadrdata; XLogRecData *payloadrdata;
bool fit; GinPlaceToPageRC rc;
uint16 xlflags = 0; uint16 xlflags = 0;
Page childpage = NULL; Page childpage = NULL;
Page newlpage = NULL, newrpage = NULL;
if (GinPageIsData(page)) if (GinPageIsData(page))
xlflags |= GIN_INSERT_ISDATA; xlflags |= GIN_INSERT_ISDATA;
@ -345,16 +346,17 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
} }
/* /*
* Try to put the incoming tuple on the page. If it doesn't fit, * Try to put the incoming tuple on the page. placeToPage will decide
* placeToPage method will return false and leave the page unmodified, and * if the page needs to be split.
* we'll have to split the page.
*/ */
START_CRIT_SECTION(); rc = btree->placeToPage(btree, stack->buffer, stack,
fit = btree->placeToPage(btree, stack->buffer, stack->off, insertdata, updateblkno,
insertdata, updateblkno, &payloadrdata, &newlpage, &newrpage);
&payloadrdata); if (rc == UNMODIFIED)
if (fit) return true;
else if (rc == INSERTED)
{ {
/* placeToPage did START_CRIT_SECTION() */
MarkBufferDirty(stack->buffer); MarkBufferDirty(stack->buffer);
/* An insert to an internal page finishes the split of the child. */ /* An insert to an internal page finishes the split of the child. */
@ -373,7 +375,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
xlrec.node = btree->index->rd_node; xlrec.node = btree->index->rd_node;
xlrec.blkno = BufferGetBlockNumber(stack->buffer); xlrec.blkno = BufferGetBlockNumber(stack->buffer);
xlrec.offset = stack->off;
xlrec.flags = xlflags; xlrec.flags = xlflags;
rdata[0].buffer = InvalidBuffer; rdata[0].buffer = InvalidBuffer;
@ -415,20 +416,16 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
return true; return true;
} }
else else if (rc == SPLIT)
{ {
/* Didn't fit, have to split */ /* Didn't fit, have to split */
Buffer rbuffer; Buffer rbuffer;
Page newlpage;
BlockNumber savedRightLink; BlockNumber savedRightLink;
Page rpage;
XLogRecData rdata[2]; XLogRecData rdata[2];
ginxlogSplit data; ginxlogSplit data;
Buffer lbuffer = InvalidBuffer; Buffer lbuffer = InvalidBuffer;
Page newrootpg = NULL; Page newrootpg = NULL;
END_CRIT_SECTION();
rbuffer = GinNewBuffer(btree->index); rbuffer = GinNewBuffer(btree->index);
/* During index build, count the new page */ /* During index build, count the new page */
@ -443,12 +440,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
savedRightLink = GinPageGetOpaque(page)->rightlink; savedRightLink = GinPageGetOpaque(page)->rightlink;
/* /*
* newlpage is a pointer to memory page, it is not associated with a * newlpage and newrpage are pointers to memory pages, not associated
* buffer. stack->buffer is not touched yet. * with buffers. stack->buffer is not touched yet.
*/ */
newlpage = btree->splitPage(btree, stack->buffer, rbuffer, stack->off,
insertdata, updateblkno,
&payloadrdata);
data.node = btree->index->rd_node; data.node = btree->index->rd_node;
data.rblkno = BufferGetBlockNumber(rbuffer); data.rblkno = BufferGetBlockNumber(rbuffer);
@ -481,8 +475,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
else else
rdata[0].next = payloadrdata; rdata[0].next = payloadrdata;
rpage = BufferGetPage(rbuffer);
if (stack->parent == NULL) if (stack->parent == NULL)
{ {
/* /*
@ -508,7 +500,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
data.lblkno = BufferGetBlockNumber(lbuffer); data.lblkno = BufferGetBlockNumber(lbuffer);
data.flags |= GIN_SPLIT_ROOT; data.flags |= GIN_SPLIT_ROOT;
GinPageGetOpaque(rpage)->rightlink = InvalidBlockNumber; GinPageGetOpaque(newrpage)->rightlink = InvalidBlockNumber;
GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer); GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
/* /*
@ -517,12 +509,12 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
* than overwriting the original page directly, so that we can still * than overwriting the original page directly, so that we can still
* abort gracefully if this fails.) * abort gracefully if this fails.)
*/ */
newrootpg = PageGetTempPage(rpage); newrootpg = PageGetTempPage(newrpage);
GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~GIN_LEAF, BLCKSZ); GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~(GIN_LEAF | GIN_COMPRESSED), BLCKSZ);
btree->fillRoot(btree, newrootpg, btree->fillRoot(btree, newrootpg,
BufferGetBlockNumber(lbuffer), newlpage, BufferGetBlockNumber(lbuffer), newlpage,
BufferGetBlockNumber(rbuffer), rpage); BufferGetBlockNumber(rbuffer), newrpage);
} }
else else
{ {
@ -530,7 +522,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
data.rrlink = savedRightLink; data.rrlink = savedRightLink;
data.lblkno = BufferGetBlockNumber(stack->buffer); data.lblkno = BufferGetBlockNumber(stack->buffer);
GinPageGetOpaque(rpage)->rightlink = savedRightLink; GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT; GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer); GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
} }
@ -550,16 +542,24 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
START_CRIT_SECTION(); START_CRIT_SECTION();
MarkBufferDirty(rbuffer); MarkBufferDirty(rbuffer);
MarkBufferDirty(stack->buffer);
/*
* Restore the temporary copies over the real buffers. But don't free
* the temporary copies yet, WAL record data points to them.
*/
if (stack->parent == NULL) if (stack->parent == NULL)
{ {
PageRestoreTempPage(newlpage, BufferGetPage(lbuffer));
MarkBufferDirty(lbuffer); MarkBufferDirty(lbuffer);
newlpage = newrootpg; memcpy(BufferGetPage(stack->buffer), newrootpg, BLCKSZ);
memcpy(BufferGetPage(lbuffer), newlpage, BLCKSZ);
memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);
}
else
{
memcpy(BufferGetPage(stack->buffer), newlpage, BLCKSZ);
memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);
} }
PageRestoreTempPage(newlpage, BufferGetPage(stack->buffer));
MarkBufferDirty(stack->buffer);
/* write WAL record */ /* write WAL record */
if (RelationNeedsWAL(btree->index)) if (RelationNeedsWAL(btree->index))
@ -568,7 +568,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_SPLIT, rdata); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_SPLIT, rdata);
PageSetLSN(BufferGetPage(stack->buffer), recptr); PageSetLSN(BufferGetPage(stack->buffer), recptr);
PageSetLSN(rpage, recptr); PageSetLSN(BufferGetPage(rbuffer), recptr);
if (stack->parent == NULL) if (stack->parent == NULL)
PageSetLSN(BufferGetPage(lbuffer), recptr); PageSetLSN(BufferGetPage(lbuffer), recptr);
} }
@ -582,6 +582,11 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
if (stack->parent == NULL) if (stack->parent == NULL)
UnlockReleaseBuffer(lbuffer); UnlockReleaseBuffer(lbuffer);
pfree(newlpage);
pfree(newrpage);
if (newrootpg)
pfree(newrootpg);
/* /*
* If we split the root, we're done. Otherwise the split is not * If we split the root, we're done. Otherwise the split is not
* complete until the downlink for the new page has been inserted to * complete until the downlink for the new page has been inserted to
@ -592,6 +597,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
else else
return false; return false;
} }
else
elog(ERROR, "unknown return code from GIN placeToPage method: %d", rc);
} }
/* /*

File diff suppressed because it is too large

@ -1,7 +1,7 @@
/*------------------------------------------------------------------------- /*-------------------------------------------------------------------------
* *
* ginentrypage.c * ginentrypage.c
* page utilities routines for the postgres inverted index access method. * routines for handling GIN entry tree pages.
* *
* *
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
@ -15,8 +15,15 @@
#include "postgres.h" #include "postgres.h"
#include "access/gin_private.h" #include "access/gin_private.h"
#include "miscadmin.h"
#include "utils/rel.h" #include "utils/rel.h"
static void entrySplitPage(GinBtree btree, Buffer origbuf,
GinBtreeStack *stack,
void *insertPayload,
BlockNumber updateblkno, XLogRecData **prdata,
Page *newlpage, Page *newrpage);
/* /*
* Form a tuple for entry tree. * Form a tuple for entry tree.
* *
@ -27,15 +34,15 @@
* format that is being built here. We build on the assumption that we * format that is being built here. We build on the assumption that we
* are making a leaf-level key entry containing a posting list of nipd items. * are making a leaf-level key entry containing a posting list of nipd items.
* If the caller is actually trying to make a posting-tree entry, non-leaf * If the caller is actually trying to make a posting-tree entry, non-leaf
* entry, or pending-list entry, it should pass nipd = 0 and then overwrite * entry, or pending-list entry, it should pass dataSize = 0 and then overwrite
* the t_tid fields as necessary. In any case, ipd can be NULL to skip * the t_tid fields as necessary. In any case, 'data' can be NULL to skip
* copying any itempointers into the posting list; the caller is responsible * filling in the posting list; the caller is responsible for filling it
* for filling the posting list afterwards, if ipd = NULL and nipd > 0. * afterwards if data = NULL and nipd > 0.
*/ */
IndexTuple IndexTuple
GinFormTuple(GinState *ginstate, GinFormTuple(GinState *ginstate,
OffsetNumber attnum, Datum key, GinNullCategory category, OffsetNumber attnum, Datum key, GinNullCategory category,
ItemPointerData *ipd, uint32 nipd, Pointer data, Size dataSize, int nipd,
bool errorTooBig) bool errorTooBig)
{ {
Datum datums[2]; Datum datums[2];
@ -80,27 +87,25 @@ GinFormTuple(GinState *ginstate,
newsize = Max(newsize, minsize); newsize = Max(newsize, minsize);
} }
newsize = SHORTALIGN(newsize);
GinSetPostingOffset(itup, newsize); GinSetPostingOffset(itup, newsize);
GinSetNPosting(itup, nipd); GinSetNPosting(itup, nipd);
/* /*
* Add space needed for posting list, if any. Then check that the tuple * Add space needed for posting list, if any. Then check that the tuple
* won't be too big to store. * won't be too big to store.
*/ */
newsize += sizeof(ItemPointerData) * nipd; newsize += dataSize;
newsize = MAXALIGN(newsize); newsize = MAXALIGN(newsize);
if (newsize > Min(INDEX_SIZE_MASK, GinMaxItemSize))
if (newsize > GinMaxItemSize)
{ {
if (errorTooBig) if (errorTooBig)
ereport(ERROR, ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("index row size %lu exceeds maximum %lu for index \"%s\"", errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
(unsigned long) newsize, (unsigned long) newsize,
(unsigned long) Min(INDEX_SIZE_MASK, (unsigned long) GinMaxItemSize,
GinMaxItemSize),
RelationGetRelationName(ginstate->index)))); RelationGetRelationName(ginstate->index))));
pfree(itup); pfree(itup);
return NULL; return NULL;
@ -119,12 +124,20 @@ GinFormTuple(GinState *ginstate,
*/ */
memset((char *) itup + IndexTupleSize(itup), memset((char *) itup + IndexTupleSize(itup),
0, newsize - IndexTupleSize(itup)); 0, newsize - IndexTupleSize(itup));
/* set new size in tuple header */ /* set new size in tuple header */
itup->t_info &= ~INDEX_SIZE_MASK; itup->t_info &= ~INDEX_SIZE_MASK;
itup->t_info |= newsize; itup->t_info |= newsize;
} }
/*
* Copy in the posting list, if provided
*/
if (data)
{
char *ptr = GinGetPosting(itup);
memcpy(ptr, data, dataSize);
}
/* /*
* Insert category byte, if needed * Insert category byte, if needed
*/ */
@ -133,37 +146,45 @@ GinFormTuple(GinState *ginstate,
Assert(IndexTupleHasNulls(itup)); Assert(IndexTupleHasNulls(itup));
GinSetNullCategory(itup, ginstate, category); GinSetNullCategory(itup, ginstate, category);
} }
/*
* Copy in the posting list, if provided
*/
if (ipd)
memcpy(GinGetPosting(itup), ipd, sizeof(ItemPointerData) * nipd);
return itup; return itup;
} }
/* /*
* Sometimes we reduce the number of posting list items in a tuple after * Read item pointers from leaf entry tuple.
* having built it with GinFormTuple. This function adjusts the size *
* fields to match. * Returns a palloc'd array of ItemPointers. The number of items is returned
* in *nitems.
*/ */
void ItemPointer
GinShortenTuple(IndexTuple itup, uint32 nipd) ginReadTuple(GinState *ginstate, OffsetNumber attnum, IndexTuple itup,
int *nitems)
{ {
uint32 newsize; Pointer ptr = GinGetPosting(itup);
int nipd = GinGetNPosting(itup);
ItemPointer ipd;
int ndecoded;
Assert(nipd <= GinGetNPosting(itup)); if (GinItupIsCompressed(itup))
{
newsize = GinGetPostingOffset(itup) + sizeof(ItemPointerData) * nipd; if (nipd > 0)
newsize = MAXALIGN(newsize); {
ipd = ginPostingListDecode((GinPostingList *) ptr, &ndecoded);
Assert(newsize <= (itup->t_info & INDEX_SIZE_MASK)); if (nipd != ndecoded)
elog(ERROR, "number of items mismatch in GIN entry tuple, %d in tuple header, %d decoded",
itup->t_info &= ~INDEX_SIZE_MASK; nipd, ndecoded);
itup->t_info |= newsize; }
else
GinSetNPosting(itup, nipd); {
ipd = palloc(0);
}
}
else
{
ipd = (ItemPointer) palloc(sizeof(ItemPointerData) * nipd);
memcpy(ipd, ptr, sizeof(ItemPointerData) * nipd);
}
*nitems = nipd;
return ipd;
} }
/* /*
@ -492,13 +513,14 @@ entryPreparePage(GinBtree btree, Page page, OffsetNumber off,
* the downlink of the existing item at 'off' is updated to point to * the downlink of the existing item at 'off' is updated to point to
* 'updateblkno'. * 'updateblkno'.
*/ */
static bool static GinPlaceToPageRC
entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, entryPlaceToPage(GinBtree btree, Buffer buf, GinBtreeStack *stack,
void *insertPayload, BlockNumber updateblkno, void *insertPayload, BlockNumber updateblkno,
XLogRecData **prdata) XLogRecData **prdata, Page *newlpage, Page *newrpage)
{ {
GinBtreeEntryInsertData *insertData = insertPayload; GinBtreeEntryInsertData *insertData = insertPayload;
Page page = BufferGetPage(buf); Page page = BufferGetPage(buf);
OffsetNumber off = stack->off;
OffsetNumber placed; OffsetNumber placed;
int cnt = 0; int cnt = 0;
@ -508,7 +530,13 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off,
/* quick exit if it doesn't fit */ /* quick exit if it doesn't fit */
if (!entryIsEnoughSpace(btree, buf, off, insertData)) if (!entryIsEnoughSpace(btree, buf, off, insertData))
return false; {
entrySplitPage(btree, buf, stack, insertPayload, updateblkno,
prdata, newlpage, newrpage);
return SPLIT;
}
START_CRIT_SECTION();
*prdata = rdata; *prdata = rdata;
entryPreparePage(btree, page, off, insertData, updateblkno); entryPreparePage(btree, page, off, insertData, updateblkno);
@ -522,6 +550,7 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off,
RelationGetRelationName(btree->index)); RelationGetRelationName(btree->index));
data.isDelete = insertData->isDelete; data.isDelete = insertData->isDelete;
data.offset = off;
rdata[cnt].buffer = buf; rdata[cnt].buffer = buf;
rdata[cnt].buffer_std = false; rdata[cnt].buffer_std = false;
@ -536,21 +565,24 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off,
rdata[cnt].len = IndexTupleSize(insertData->entry); rdata[cnt].len = IndexTupleSize(insertData->entry);
rdata[cnt].next = NULL; rdata[cnt].next = NULL;
return true; return INSERTED;
} }
/* /*
* Place tuple and split page, original buffer(lbuf) leaves untouched, * Place tuple and split page, original buffer(lbuf) leaves untouched,
* returns shadow page of lbuf filled new data. * returns shadow pages filled with new data.
* Tuples are distributed between pages by equal size on its, not * Tuples are distributed between pages by equal size on its, not
* an equal number! * an equal number!
*/ */
static Page static void
entrySplitPage(GinBtree btree, Buffer lbuf, Buffer rbuf, OffsetNumber off, entrySplitPage(GinBtree btree, Buffer origbuf,
GinBtreeStack *stack,
void *insertPayload, void *insertPayload,
BlockNumber updateblkno, XLogRecData **prdata) BlockNumber updateblkno, XLogRecData **prdata,
Page *newlpage, Page *newrpage)
{ {
GinBtreeEntryInsertData *insertData = insertPayload; GinBtreeEntryInsertData *insertData = insertPayload;
OffsetNumber off = stack->off;
OffsetNumber i, OffsetNumber i,
maxoff, maxoff,
separator = InvalidOffsetNumber; separator = InvalidOffsetNumber;
@ -561,8 +593,8 @@ entrySplitPage(GinBtree btree, Buffer lbuf, Buffer rbuf, OffsetNumber off,
char *ptr; char *ptr;
IndexTuple itup; IndexTuple itup;
Page page; Page page;
Page lpage = PageGetTempPageCopy(BufferGetPage(lbuf)); Page lpage = PageGetTempPageCopy(BufferGetPage(origbuf));
Page rpage = BufferGetPage(rbuf); Page rpage = PageGetTempPageCopy(BufferGetPage(origbuf));
Size pageSize = PageGetPageSize(lpage); Size pageSize = PageGetPageSize(lpage);
/* these must be static so they can be returned to caller */ /* these must be static so they can be returned to caller */
@ -651,7 +683,8 @@ entrySplitPage(GinBtree btree, Buffer lbuf, Buffer rbuf, OffsetNumber off,
rdata[1].len = tupstoresize; rdata[1].len = tupstoresize;
rdata[1].next = NULL; rdata[1].next = NULL;
return lpage; *newlpage = lpage;
*newrpage = rpage;
} }
/* /*
@ -719,7 +752,6 @@ ginPrepareEntryScan(GinBtree btree, OffsetNumber attnum,
btree->findItem = entryLocateLeafEntry; btree->findItem = entryLocateLeafEntry;
btree->findChildPtr = entryFindChildPtr; btree->findChildPtr = entryFindChildPtr;
btree->placeToPage = entryPlaceToPage; btree->placeToPage = entryPlaceToPage;
btree->splitPage = entrySplitPage;
btree->fillRoot = ginEntryFillRoot; btree->fillRoot = ginEntryFillRoot;
btree->prepareDownlink = entryPrepareDownlink; btree->prepareDownlink = entryPrepareDownlink;

@ -487,7 +487,7 @@ ginHeapTupleFastCollect(GinState *ginstate,
IndexTuple itup; IndexTuple itup;
itup = GinFormTuple(ginstate, attnum, entries[i], categories[i], itup = GinFormTuple(ginstate, attnum, entries[i], categories[i],
NULL, 0, true); NULL, 0, 0, true);
itup->t_tid = *ht_ctid; itup->t_tid = *ht_ctid;
collector->tuples[collector->ntuples++] = itup; collector->tuples[collector->ntuples++] = itup;
collector->sumsize += IndexTupleSize(itup); collector->sumsize += IndexTupleSize(itup);

@ -71,24 +71,20 @@ callConsistentFn(GinState *ginstate, GinScanKey key)
* Tries to refind previously taken ItemPointer on a posting page. * Tries to refind previously taken ItemPointer on a posting page.
*/ */
static bool static bool
findItemInPostingPage(Page page, ItemPointer item, OffsetNumber *off) needToStepRight(Page page, ItemPointer item)
{ {
OffsetNumber maxoff = GinPageGetOpaque(page)->maxoff;
int res;
if (GinPageGetOpaque(page)->flags & GIN_DELETED) if (GinPageGetOpaque(page)->flags & GIN_DELETED)
/* page was deleted by concurrent vacuum */ /* page was deleted by concurrent vacuum */
return false; return true;
/* if (ginCompareItemPointers(item, GinDataPageGetRightBound(page)) > 0
* scan page to find equal or first greater value && !GinPageRightMost(page))
*/
for (*off = FirstOffsetNumber; *off <= maxoff; (*off)++)
{ {
res = ginCompareItemPointers(item, GinDataPageGetItemPointer(page, *off)); /*
* the item we're looking is > the right bound of the page, so it
if (res <= 0) * can't be on this page.
return true; */
return true;
} }
return false; return false;
@ -143,14 +139,10 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
for (;;) for (;;)
{ {
page = BufferGetPage(buffer); page = BufferGetPage(buffer);
if ((GinPageGetOpaque(page)->flags & GIN_DELETED) == 0)
if ((GinPageGetOpaque(page)->flags & GIN_DELETED) == 0 &&
GinPageGetOpaque(page)->maxoff >= FirstOffsetNumber)
{ {
tbm_add_tuples(scanEntry->matchBitmap, int n = GinDataLeafPageGetItemsToTbm(page, scanEntry->matchBitmap);
GinDataPageGetItemPointer(page, FirstOffsetNumber), scanEntry->predictNumberResult += n;
GinPageGetOpaque(page)->maxoff, false);
scanEntry->predictNumberResult += GinPageGetOpaque(page)->maxoff;
} }
if (GinPageRightMost(page)) if (GinPageRightMost(page))
@ -335,8 +327,11 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
} }
else else
{ {
tbm_add_tuples(scanEntry->matchBitmap, ItemPointer ipd;
GinGetPosting(itup), GinGetNPosting(itup), false); int nipd;
ipd = ginReadTuple(btree->ginstate, scanEntry->attnum, itup, &nipd);
tbm_add_tuples(scanEntry->matchBitmap, ipd, nipd, false);
scanEntry->predictNumberResult += GinGetNPosting(itup); scanEntry->predictNumberResult += GinGetNPosting(itup);
} }
@ -450,16 +445,14 @@ restartScanEntry:
IncrBufferRefCount(entry->buffer); IncrBufferRefCount(entry->buffer);
page = BufferGetPage(entry->buffer); page = BufferGetPage(entry->buffer);
entry->predictNumberResult = stack->predictNumber * GinPageGetOpaque(page)->maxoff;
/* /*
* Keep page content in memory to prevent durable page locking * Copy page content to memory to avoid keeping it locked for
* a long time.
*/ */
entry->list = (ItemPointerData *) palloc(BLCKSZ); entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
entry->nlist = GinPageGetOpaque(page)->maxoff;
memcpy(entry->list, entry->predictNumberResult = stack->predictNumber * entry->nlist;
GinDataPageGetItemPointer(page, FirstOffsetNumber),
GinPageGetOpaque(page)->maxoff * sizeof(ItemPointerData));
LockBuffer(entry->buffer, GIN_UNLOCK); LockBuffer(entry->buffer, GIN_UNLOCK);
freeGinBtreeStack(stack); freeGinBtreeStack(stack);
@ -467,9 +460,10 @@ restartScanEntry:
} }
else if (GinGetNPosting(itup) > 0) else if (GinGetNPosting(itup) > 0)
{ {
entry->nlist = GinGetNPosting(itup); entry->list = ginReadTuple(ginstate, entry->attnum, itup,
entry->list = (ItemPointerData *) palloc(sizeof(ItemPointerData) * entry->nlist); &entry->nlist);
memcpy(entry->list, GinGetPosting(itup), sizeof(ItemPointerData) * entry->nlist); entry->predictNumberResult = entry->nlist;
entry->isFinished = FALSE; entry->isFinished = FALSE;
} }
} }
@ -532,6 +526,7 @@ static void
entryGetNextItem(GinState *ginstate, GinScanEntry entry) entryGetNextItem(GinState *ginstate, GinScanEntry entry)
{ {
Page page; Page page;
int i;
for (;;) for (;;)
{ {
@ -564,35 +559,47 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
page = BufferGetPage(entry->buffer); page = BufferGetPage(entry->buffer);
entry->offset = InvalidOffsetNumber; entry->offset = InvalidOffsetNumber;
if (!ItemPointerIsValid(&entry->curItem) || if (entry->list)
findItemInPostingPage(page, &entry->curItem, &entry->offset))
{ {
/* pfree(entry->list);
* Found position equal to or greater than stored entry->list = NULL;
*/ }
entry->nlist = GinPageGetOpaque(page)->maxoff;
memcpy(entry->list,
GinDataPageGetItemPointer(page, FirstOffsetNumber),
GinPageGetOpaque(page)->maxoff * sizeof(ItemPointerData));
LockBuffer(entry->buffer, GIN_UNLOCK); /*
* If the page was concurrently split, we have to re-find the
* item we were stopped on. If the page was split more than once,
* the item might not be on this page, but somewhere to the right.
* Keep following the right-links until we re-find the correct
* page.
*/
if (ItemPointerIsValid(&entry->curItem) &&
needToStepRight(page, &entry->curItem))
{
continue;
}
if (!ItemPointerIsValid(&entry->curItem) || entry->list = GinDataLeafPageGetItems(page, &entry->nlist);
ginCompareItemPointers(&entry->curItem,
entry->list + entry->offset - 1) == 0) /* re-find the item we were stopped on. */
if (ItemPointerIsValid(&entry->curItem))
{
for (i = 0; i < entry->nlist; i++)
{ {
/* if (ginCompareItemPointers(&entry->curItem,
* First pages are deleted or empty, or we found exact &entry->list[i]) < 0)
* position, so break inner loop and continue outer one. {
*/ LockBuffer(entry->buffer, GIN_UNLOCK);
break; entry->offset = i + 1;
entry->curItem = entry->list[entry->offset - 1];
return;
}
} }
}
/* else
* Find greater than entry->curItem position, store it. {
*/ LockBuffer(entry->buffer, GIN_UNLOCK);
entry->offset = 1; /* scan all items on the page. */
entry->curItem = entry->list[entry->offset - 1]; entry->curItem = entry->list[entry->offset - 1];
return; return;
} }
} }


@ -53,31 +53,42 @@ addItemPointersToLeafTuple(GinState *ginstate,
Datum key; Datum key;
GinNullCategory category; GinNullCategory category;
IndexTuple res; IndexTuple res;
ItemPointerData *newItems,
*oldItems;
int oldNPosting,
newNPosting;
GinPostingList *compressedList;
Assert(!GinIsPostingTree(old)); Assert(!GinIsPostingTree(old));
attnum = gintuple_get_attrnum(ginstate, old); attnum = gintuple_get_attrnum(ginstate, old);
key = gintuple_get_key(ginstate, old, &category); key = gintuple_get_key(ginstate, old, &category);
/* try to build tuple with room for all the items */ /* merge the old and new posting lists */
res = GinFormTuple(ginstate, attnum, key, category, oldItems = ginReadTuple(ginstate, attnum, old, &oldNPosting);
NULL, nitem + GinGetNPosting(old),
false);
if (res) newNPosting = oldNPosting + nitem;
newItems = (ItemPointerData *) palloc(sizeof(ItemPointerData) * newNPosting);
newNPosting = ginMergeItemPointers(newItems,
items, nitem,
oldItems, oldNPosting);
/* Compress the posting list, and try to build a tuple with room for it */
res = NULL;
compressedList = ginCompressPostingList(newItems, newNPosting, GinMaxItemSize,
NULL);
pfree(newItems);
if (compressedList)
{ {
/* good, small enough */ res = GinFormTuple(ginstate, attnum, key, category,
uint32 newnitem; (char *) compressedList,
SizeOfGinPostingList(compressedList),
/* fill in the posting list with union of old and new TIDs */ newNPosting,
newnitem = ginMergeItemPointers(GinGetPosting(res), false);
GinGetPosting(old), pfree(compressedList);
GinGetNPosting(old),
items, nitem);
/* merge might have eliminated some duplicate items */
GinShortenTuple(res, newnitem);
} }
else if (!res)
{ {
/* posting list would be too big, convert to posting tree */ /* posting list would be too big, convert to posting tree */
BlockNumber postingRoot; BlockNumber postingRoot;
@ -88,8 +99,8 @@ addItemPointersToLeafTuple(GinState *ginstate,
* already be in order with no duplicates. * already be in order with no duplicates.
*/ */
postingRoot = createPostingTree(ginstate->index, postingRoot = createPostingTree(ginstate->index,
GinGetPosting(old), oldItems,
GinGetNPosting(old), oldNPosting,
buildStats); buildStats);
/* Now insert the TIDs-to-be-added into the posting tree */ /* Now insert the TIDs-to-be-added into the posting tree */
@ -98,9 +109,10 @@ addItemPointersToLeafTuple(GinState *ginstate,
buildStats); buildStats);
/* And build a new posting-tree-only result tuple */ /* And build a new posting-tree-only result tuple */
res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, true); res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, 0, true);
GinSetPostingTree(res, postingRoot); GinSetPostingTree(res, postingRoot);
} }
pfree(oldItems);
return res; return res;
} }
@ -119,12 +131,19 @@ buildFreshLeafTuple(GinState *ginstate,
ItemPointerData *items, uint32 nitem, ItemPointerData *items, uint32 nitem,
GinStatsData *buildStats) GinStatsData *buildStats)
{ {
IndexTuple res; IndexTuple res = NULL;
GinPostingList *compressedList;
/* try to build a posting list tuple with all the items */ /* try to build a posting list tuple with all the items */
res = GinFormTuple(ginstate, attnum, key, category, compressedList = ginCompressPostingList(items, nitem, GinMaxItemSize, NULL);
items, nitem, false); if (compressedList)
{
res = GinFormTuple(ginstate, attnum, key, category,
(char *) compressedList,
SizeOfGinPostingList(compressedList),
nitem, false);
pfree(compressedList);
}
if (!res) if (!res)
{ {
/* posting list would be too big, build posting tree */ /* posting list would be too big, build posting tree */
@ -134,7 +153,7 @@ buildFreshLeafTuple(GinState *ginstate,
* Build posting-tree-only result tuple. We do this first so as to * Build posting-tree-only result tuple. We do this first so as to
* fail quickly if the key is too big. * fail quickly if the key is too big.
*/ */
res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, true); res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, 0, true);
/* /*
* Initialize a new posting tree with the TIDs. * Initialize a new posting tree with the TIDs.


@ -16,12 +16,342 @@
#include "access/gin_private.h" #include "access/gin_private.h"
#ifdef USE_ASSERT_CHECKING
#define CHECK_ENCODING_ROUNDTRIP
#endif
/*
* For encoding purposes, item pointers are represented as 64-bit unsigned
* integers. The lowest 11 bits represent the offset number, and the next
* lowest 32 bits are the block number. That leaves 17 bits unused, ie.
* only 43 low bits are used.
*
* These 43-bit integers are encoded using varbyte encoding. In each byte,
* the 7 low bits contain data, while the highest bit is a continuation bit.
* When the continuation bit is set, the next byte is part of the same
* integer, otherwise this is the last byte of this integer. 43 bits need
* at most 7 bytes when varbyte encoded, because each byte carries only 7
* data bits and encode_varbyte() does not special-case the final byte:
*
* 0XXXXXXX
* 1XXXXXXX 0XXXXYYY
* 1XXXXXXX 1XXXXYYY 0YYYYYYY
* 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
* 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
* 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
* 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuuY
*
* X = bits used for offset number
* Y = bits used for block number
* u = unused bit
*
* The bytes are stored in little-endian order.
*
* An important property of this encoding is that removing an item from the
* list never increases the size of the resulting compressed posting list.
* Proof:
*
* Removing an item replaces two adjacent deltas with their sum. We have to
* prove that the varbyte encoding of a sum can't be longer than the combined
* varbyte encodings of its summands. The sum of two numbers is at most one
* bit wider than the larger of the summands. Widening a number by one bit
* enlarges its varbyte encoding by at most one byte. Therefore, the varbyte
* encoding of the sum is at most one byte longer than that of the larger
* summand. The lesser summand takes at least one byte, so the sum cannot
* take more space than the two summands together, Q.E.D.
*
* This property greatly simplifies VACUUM, which can assume that posting
* lists always fit on the same page after vacuuming. Note that although
* this holds for removing items from a posting list, you must still be
* careful not to cause expansion, e.g. when merging uncompressed items on
* the page into the compressed lists while vacuuming.
*/
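The removal argument in the comment above can be checked numerically. The following is only a sketch: `varbyte_len()` and `removal_never_grows()` are hypothetical helpers that are not part of the patch, and the length function simply mirrors the loop in encode_varbyte().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the patch): bytes needed to varbyte-encode val. */
static int
varbyte_len(uint64_t val)
{
	int			len = 1;

	while (val > 0x7F)
	{
		val >>= 7;
		len++;
	}
	return len;
}

/*
 * Removing an item replaces two adjacent deltas a and b with a + b.
 * Returns 1 if the merged delta needs no more bytes than the two did.
 */
static int
removal_never_grows(uint64_t a, uint64_t b)
{
	return varbyte_len(a + b) <= varbyte_len(a) + varbyte_len(b);
}
```

For example, two one-byte deltas of 0x7F sum to 0xFE, which needs two bytes: the same space the two deltas occupied before the removal.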
/*
* How many bits do you need to encode offset number? OffsetNumber is a 16-bit
* integer, but you can't fit that many items on a page. 11 ought to be more
* than enough. It's tempting to derive this from MaxHeapTuplesPerPage, and
* use the minimum number of bits, but that would require changing the on-disk
* format if MaxHeapTuplesPerPage changes. Better to leave some slack.
*/
#define MaxHeapTuplesPerPageBits 11
static inline uint64
itemptr_to_uint64(const ItemPointer iptr)
{
uint64 val;
Assert(ItemPointerIsValid(iptr));
Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits));
val = iptr->ip_blkid.bi_hi;
val <<= 16;
val |= iptr->ip_blkid.bi_lo;
val <<= MaxHeapTuplesPerPageBits;
val |= iptr->ip_posid;
return val;
}
static inline void
uint64_to_itemptr(uint64 val, ItemPointer iptr)
{
iptr->ip_posid = val & ((1 << MaxHeapTuplesPerPageBits) - 1);
val = val >> MaxHeapTuplesPerPageBits;
iptr->ip_blkid.bi_lo = val & 0xFFFF;
val = val >> 16;
iptr->ip_blkid.bi_hi = val & 0xFFFF;
Assert(ItemPointerIsValid(iptr));
}
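The packing done by itemptr_to_uint64() and uint64_to_itemptr() can be illustrated outside the backend. This is a minimal sketch under stated assumptions: `DemoItemPointer`, `demo_pack()` and `demo_unpack()` are illustrative stand-ins, not PostgreSQL definitions, and the block number is held as one 32-bit field instead of the bi_hi/bi_lo pair.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 11			/* mirrors MaxHeapTuplesPerPageBits */

/* Illustrative stand-in for ItemPointerData: 32-bit block, 16-bit offset. */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} DemoItemPointer;

/* Place the block number above the 11 offset bits; 43 low bits are used. */
static uint64_t
demo_pack(DemoItemPointer ip)
{
	assert(ip.offset < (1u << OFFSET_BITS));
	return ((uint64_t) ip.block << OFFSET_BITS) | ip.offset;
}

/* Inverse of demo_pack(). */
static DemoItemPointer
demo_unpack(uint64_t val)
{
	DemoItemPointer ip;

	ip.offset = (uint16_t) (val & ((1u << OFFSET_BITS) - 1));
	ip.block = (uint32_t) (val >> OFFSET_BITS);
	return ip;
}
```

Because the block number occupies the high bits, comparing two packed values orders them by block first and offset second, which is what lets the encoder store sorted item pointers as positive deltas.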
/*
* Varbyte-encode 'val' into *ptr. *ptr is incremented to next integer.
*/
static void
encode_varbyte(uint64 val, unsigned char **ptr)
{
unsigned char *p = *ptr;
while (val > 0x7F)
{
*(p++) = 0x80 | (val & 0x7F);
val >>= 7;
}
*(p++) = (unsigned char) val;
*ptr = p;
}
/*
* Decode varbyte-encoded integer at *ptr. *ptr is incremented to next integer.
*/
static uint64
decode_varbyte(unsigned char **ptr)
{
uint64 val;
unsigned char *p = *ptr;
uint64 c;
c = *(p++);
val = c & 0x7F;
if (c & 0x80)
{
c = *(p++);
val |= (c & 0x7F) << 7;
if (c & 0x80)
{
c = *(p++);
val |= (c & 0x7F) << 14;
if (c & 0x80)
{
c = *(p++);
val |= (c & 0x7F) << 21;
if (c & 0x80)
{
c = *(p++);
val |= (c & 0x7F) << 28;
if (c & 0x80)
{
c = *(p++);
val |= (c & 0x7F) << 35;
if (c & 0x80)
{
/* last byte, no continuation bit */
c = *(p++);
val |= c << 42;
}
}
}
}
}
}
*ptr = p;
return val;
}
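A standalone round trip helps to see how many bytes a given value takes. The sketch below assumes nothing from the backend: `demo_encode_varbyte()` follows the same loop as encode_varbyte(), while `demo_decode_varbyte()` uses a generic loop in place of the unrolled decoder above. Both names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write val at out in 7-bit groups, low bits first; returns bytes written. */
static size_t
demo_encode_varbyte(uint64_t val, unsigned char *out)
{
	size_t		n = 0;

	while (val > 0x7F)
	{
		out[n++] = (unsigned char) (0x80 | (val & 0x7F));
		val >>= 7;
	}
	out[n++] = (unsigned char) val;
	return n;
}

/* Read one integer from in; the number of bytes consumed goes to *used. */
static uint64_t
demo_decode_varbyte(const unsigned char *in, size_t *used)
{
	uint64_t	val = 0;
	int			shift = 0;
	size_t		n = 0;
	unsigned char c;

	do
	{
		c = in[n++];
		val |= (uint64_t) (c & 0x7F) << shift;
		shift += 7;
	} while ((c & 0x80) && shift < 64);

	*used = n;
	return val;
}
```

With this loop, 0x80 encodes as the two bytes 0x80 0x01, and the largest 43-bit value takes 7 bytes, matching the seven nesting levels of the decoder.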
/*
* Encode a posting list.
*
* The encoded list is returned in a palloc'd struct, which will be at most
* 'maxsize' bytes in size. The number of items in the returned segment is
* returned in *nwritten. If it's not equal to nipd, not all the items fit
* in 'maxsize', and only the first *nwritten were encoded.
*/
GinPostingList *
ginCompressPostingList(const ItemPointer ipd, int nipd, int maxsize,
int *nwritten)
{
uint64 prev;
int totalpacked = 0;
int maxbytes;
GinPostingList *result;
unsigned char *ptr;
unsigned char *endptr;
result = palloc(maxsize);
maxbytes = maxsize - offsetof(GinPostingList, bytes);
/* Store the first special item */
result->first = ipd[0];
prev = itemptr_to_uint64(&result->first);
ptr = result->bytes;
endptr = result->bytes + maxbytes;
for (totalpacked = 1; totalpacked < nipd; totalpacked++)
{
uint64 val = itemptr_to_uint64(&ipd[totalpacked]);
uint64 delta = val - prev;
Assert (val > prev);
/* A 43-bit delta can take up to 7 bytes when varbyte encoded. */
if (endptr - ptr >= 7)
encode_varbyte(delta, &ptr);
else
{
/*
* There are fewer than 7 bytes left. Have to check if the next
* item fits in that space before writing it out.
*/
unsigned char buf[7];
unsigned char *p = buf;
encode_varbyte(delta, &p);
if (p - buf > (endptr - ptr))
break; /* output is full */
memcpy(ptr, buf, p - buf);
ptr += (p - buf);
}
prev = val;
}
result->nbytes = ptr - result->bytes;
if (nwritten)
*nwritten = totalpacked;
Assert(SizeOfGinPostingList(result) <= maxsize);
/*
* Check that the encoded segment decodes back to the original items.
*/
#if defined (CHECK_ENCODING_ROUNDTRIP)
if (assert_enabled)
{
int ndecoded;
ItemPointer tmp = ginPostingListDecode(result, &ndecoded);
int i;
Assert(ndecoded == totalpacked);
for (i = 0; i < ndecoded; i++)
Assert(memcmp(&tmp[i], &ipd[i], sizeof(ItemPointerData)) == 0);
pfree(tmp);
}
#endif
return result;
}
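The bounded delta encoding in ginCompressPostingList() can be sketched on plain integers. This is an illustration under assumptions, not the patch's code: `demo_compress()` and `demo_encode_varbyte()` are hypothetical names, the input is already packed into `uint64_t` values, and every delta goes through a scratch buffer instead of taking the fast path used above when plenty of room is left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_VARBYTE 10		/* enough for any uint64_t */

/* Write val at out in 7-bit groups, low bits first; returns bytes written. */
static size_t
demo_encode_varbyte(uint64_t val, unsigned char *out)
{
	size_t		n = 0;

	while (val > 0x7F)
	{
		out[n++] = (unsigned char) (0x80 | (val & 0x7F));
		val >>= 7;
	}
	out[n++] = (unsigned char) val;
	return n;
}

/*
 * Delta-encode the strictly increasing values vals[1..n-1] into out, using
 * at most maxbytes. vals[0] is assumed to be stored uncompressed elsewhere,
 * like the 'first' field of a posting list. Returns the number of values
 * covered, including vals[0]; *nbytes receives the bytes written.
 */
static int
demo_compress(const uint64_t *vals, int n, unsigned char *out,
			  size_t maxbytes, size_t *nbytes)
{
	size_t		used = 0;
	int			i;

	assert(n >= 1);
	for (i = 1; i < n; i++)
	{
		unsigned char buf[DEMO_MAX_VARBYTE];
		size_t		len;

		assert(vals[i] > vals[i - 1]);
		len = demo_encode_varbyte(vals[i] - vals[i - 1], buf);
		if (len > maxbytes - used)
			break;				/* output is full */
		memcpy(out + used, buf, len);
		used += len;
	}
	*nbytes = used;
	return i;
}
```

A caller that gets back fewer values than it passed in knows the remainder did not fit, which is the role *nwritten plays in the real function.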
/*
* Decode a compressed posting list into an array of item pointers.
* The number of items is returned in *ndecoded.
*/
ItemPointer
ginPostingListDecode(GinPostingList *plist, int *ndecoded)
{
return ginPostingListDecodeAllSegments(plist,
SizeOfGinPostingList(plist),
ndecoded);
}
/*
* Decode multiple posting list segments into an array of item pointers.
* The number of items is returned in *ndecoded_out. The segments are stored
* one after another, with a total size of 'len' bytes.
*/
ItemPointer
ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_out)
{
ItemPointer result;
int nallocated;
uint64 val;
char *endseg = ((char *) segment) + len;
int ndecoded;
unsigned char *ptr;
unsigned char *endptr;
/*
* Guess an initial size of the array.
*/
nallocated = segment->nbytes * 2 + 1;
result = palloc(nallocated * sizeof(ItemPointerData));
ndecoded = 0;
while ((char *) segment < endseg)
{
/* enlarge output array if needed */
if (ndecoded >= nallocated)
{
nallocated *= 2;
result = repalloc(result, nallocated * sizeof(ItemPointerData));
}
/* copy the first item */
result[ndecoded] = segment->first;
ndecoded++;
Assert(OffsetNumberIsValid(ItemPointerGetOffsetNumber(&segment->first)));
val = itemptr_to_uint64(&segment->first);
ptr = segment->bytes;
endptr = segment->bytes + segment->nbytes;
while (ptr < endptr)
{
/* enlarge output array if needed */
if (ndecoded >= nallocated)
{
nallocated *= 2;
result = repalloc(result, nallocated * sizeof(ItemPointerData));
}
val += decode_varbyte(&ptr);
uint64_to_itemptr(val, &result[ndecoded]);
ndecoded++;
}
segment = GinNextPostingListSegment(segment);
}
if (ndecoded_out)
*ndecoded_out = ndecoded;
return result;
}
/*
* Add all item pointers from a bunch of posting lists to a TIDBitmap.
*/
int
ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len,
TIDBitmap *tbm)
{
int ndecoded;
ItemPointer items;
items = ginPostingListDecodeAllSegments(ptr, len, &ndecoded);
tbm_add_tuples(tbm, items, ndecoded, false);
pfree(items);
return ndecoded;
}
/* /*
* Merge two ordered arrays of itempointers, eliminating any duplicates. * Merge two ordered arrays of itempointers, eliminating any duplicates.
* Returns the number of items in the result. * Returns the number of items in the result.
* Caller is responsible that there is enough space at *dst. * Caller is responsible that there is enough space at *dst.
*
* It's OK if 'dst' overlaps with the *beginning* of one of the arguments.
*/ */
uint32 int
ginMergeItemPointers(ItemPointerData *dst, ginMergeItemPointers(ItemPointerData *dst,
ItemPointerData *a, uint32 na, ItemPointerData *a, uint32 na,
ItemPointerData *b, uint32 nb) ItemPointerData *b, uint32 nb)
@ -29,28 +359,50 @@ ginMergeItemPointers(ItemPointerData *dst,
ItemPointerData *dptr = dst; ItemPointerData *dptr = dst;
ItemPointerData *aptr = a, ItemPointerData *aptr = a,
*bptr = b; *bptr = b;
int result;
while (aptr - a < na && bptr - b < nb) /*
* If the argument arrays don't overlap, we can just append them to
* each other.
*/
if (na == 0 || nb == 0 || ginCompareItemPointers(&a[na - 1], &b[0]) < 0)
{ {
int cmp = ginCompareItemPointers(aptr, bptr); memmove(dst, a, na * sizeof(ItemPointerData));
memmove(&dst[na], b, nb * sizeof(ItemPointerData));
if (cmp > 0) result = na + nb;
*dptr++ = *bptr++; }
else if (cmp == 0) else if (ginCompareItemPointers(&b[nb - 1], &a[0]) < 0)
{
memmove(dst, b, nb * sizeof(ItemPointerData));
memmove(&dst[nb], a, na * sizeof(ItemPointerData));
result = na + nb;
}
else
{
while (aptr - a < na && bptr - b < nb)
{ {
/* we want only one copy of the identical items */ int cmp = ginCompareItemPointers(aptr, bptr);
*dptr++ = *bptr++;
aptr++; if (cmp > 0)
*dptr++ = *bptr++;
else if (cmp == 0)
{
/* only keep one copy of the identical items */
*dptr++ = *bptr++;
aptr++;
}
else
*dptr++ = *aptr++;
} }
else
while (aptr - a < na)
*dptr++ = *aptr++; *dptr++ = *aptr++;
while (bptr - b < nb)
*dptr++ = *bptr++;
result = dptr - dst;
} }
while (aptr - a < na) return result;
*dptr++ = *aptr++;
while (bptr - b < nb)
*dptr++ = *bptr++;
return dptr - dst;
} }
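The new shape of the merge (two concatenation fast paths, then the ordinary two-way merge) can be sketched on plain integers. This is a minimal illustration, assuming sorted, duplicate-free inputs and a non-overlapping destination; `demo_merge()` is a hypothetical name and plain `<` comparisons stand in for ginCompareItemPointers().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Merge sorted, duplicate-free arrays a and b into dst, keeping one copy of
 * equal values. dst must have room for na + nb entries. Returns the count.
 */
static size_t
demo_merge(uint64_t *dst, const uint64_t *a, size_t na,
		   const uint64_t *b, size_t nb)
{
	size_t		i = 0,
				j = 0,
				n = 0;

	/* Fast paths: the value ranges do not overlap, so just concatenate. */
	if (na == 0 || nb == 0 || a[na - 1] < b[0])
	{
		memmove(dst, a, na * sizeof(uint64_t));
		memmove(dst + na, b, nb * sizeof(uint64_t));
		return na + nb;
	}
	if (b[nb - 1] < a[0])
	{
		memmove(dst, b, nb * sizeof(uint64_t));
		memmove(dst + nb, a, na * sizeof(uint64_t));
		return na + nb;
	}

	/* General case: ordinary two-way merge. */
	while (i < na && j < nb)
	{
		if (a[i] < b[j])
			dst[n++] = a[i++];
		else if (a[i] > b[j])
			dst[n++] = b[j++];
		else
		{
			dst[n++] = a[i++];	/* equal: keep one copy */
			j++;
		}
	}
	while (i < na)
		dst[n++] = a[i++];
	while (j < nb)
		dst[n++] = b[j++];
	return n;
}
```

The fast paths matter for the common insert pattern where new TIDs are all larger than the existing ones, since the whole merge then reduces to two block copies.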


@ -20,8 +20,9 @@
#include "postmaster/autovacuum.h" #include "postmaster/autovacuum.h"
#include "storage/indexfsm.h" #include "storage/indexfsm.h"
#include "storage/lmgr.h" #include "storage/lmgr.h"
#include "utils/memutils.h"
typedef struct typedef struct GinVacuumState
{ {
Relation index; Relation index;
IndexBulkDeleteResult *result; IndexBulkDeleteResult *result;
@ -29,56 +30,58 @@ typedef struct
void *callback_state; void *callback_state;
GinState ginstate; GinState ginstate;
BufferAccessStrategy strategy; BufferAccessStrategy strategy;
MemoryContext tmpCxt;
} GinVacuumState; } GinVacuumState;
/* /*
* Vacuums a list of item pointers. The original size of the list is 'nitem', * Vacuums an uncompressed posting list. The size of the list must be specified
* returns the number of items remaining afterwards. * in number of items (nitem).
* *
* If *cleaned == NULL on entry, the original array is left unmodified; if * If none of the items need to be removed, returns NULL. Otherwise returns
* any items are removed, a palloc'd copy of the result is stored in *cleaned. * a new palloc'd array with the remaining items. The number of remaining
* Otherwise *cleaned should point to the original array, in which case it's * items is returned in *nremaining.
* modified directly.
*/ */
static int ItemPointer
ginVacuumPostingList(GinVacuumState *gvs, ItemPointerData *items, int nitem, ginVacuumItemPointers(GinVacuumState *gvs, ItemPointerData *items,
ItemPointerData **cleaned) int nitem, int *nremaining)
{ {
int i, int i,
j = 0; remaining = 0;
ItemPointer tmpitems = NULL;
Assert(*cleaned == NULL || *cleaned == items);
/* /*
* just scan over ItemPointer array * Iterate over TIDs array
*/ */
for (i = 0; i < nitem; i++) for (i = 0; i < nitem; i++)
{ {
if (gvs->callback(items + i, gvs->callback_state)) if (gvs->callback(items + i, gvs->callback_state))
{ {
gvs->result->tuples_removed += 1; gvs->result->tuples_removed += 1;
if (!*cleaned) if (!tmpitems)
{ {
*cleaned = (ItemPointerData *) palloc(sizeof(ItemPointerData) * nitem); /*
if (i != 0) * First TID to be deleted: allocate memory to hold the
memcpy(*cleaned, items, sizeof(ItemPointerData) * i); * remaining items.
*/
tmpitems = palloc(sizeof(ItemPointerData) * nitem);
memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
} }
} }
else else
{ {
gvs->result->num_index_tuples += 1; gvs->result->num_index_tuples += 1;
if (i != j) if (tmpitems)
(*cleaned)[j] = items[i]; tmpitems[remaining] = items[i];
j++; remaining++;
} }
} }
return j; *nremaining = remaining;
return tmpitems;
} }
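The new return convention of ginVacuumItemPointers() (NULL when nothing was removed, otherwise a freshly allocated array of survivors) can be sketched in isolation. The names below are hypothetical, `malloc` stands in for palloc, and plain integers stand in for item pointers; only the copy-on-first-removal logic is meant to match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef bool (*demo_dead_fn) (uint64_t item, void *state);

/*
 * Returns NULL if no item is dead (the input is left untouched). Otherwise
 * returns a malloc'd array holding the surviving items. In both cases the
 * number of survivors is stored in *nremaining.
 */
static uint64_t *
demo_vacuum(const uint64_t *items, int nitem, demo_dead_fn is_dead,
			void *state, int *nremaining)
{
	uint64_t   *tmp = NULL;
	int			remaining = 0;
	int			i;

	for (i = 0; i < nitem; i++)
	{
		if (is_dead(items[i], state))
		{
			if (tmp == NULL)
			{
				/* First removal: copy the i survivors seen so far. */
				tmp = malloc(sizeof(uint64_t) * nitem);
				if (tmp == NULL)
					abort();
				memcpy(tmp, items, sizeof(uint64_t) * i);
			}
		}
		else
		{
			if (tmp != NULL)
				tmp[remaining] = items[i];
			remaining++;
		}
	}
	*nremaining = remaining;
	return tmp;
}

/* Example predicate for illustration: treat odd values as dead. */
static bool
demo_is_odd(uint64_t item, void *state)
{
	(void) state;
	return (item & 1) != 0;
}
```

Returning NULL for the no-change case lets the caller skip rebuilding the tuple entirely, which is how the `if (uncompressed)` test in ginVacuumEntryPage() uses the result.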
/* /*
* fills WAL record for vacuum leaf page * Create a WAL record for vacuuming entry tree leaf page.
*/ */
static void static void
xlogVacuumPage(Relation index, Buffer buffer) xlogVacuumPage(Relation index, Buffer buffer)
@ -86,65 +89,64 @@ xlogVacuumPage(Relation index, Buffer buffer)
Page page = BufferGetPage(buffer); Page page = BufferGetPage(buffer);
XLogRecPtr recptr; XLogRecPtr recptr;
XLogRecData rdata[3]; XLogRecData rdata[3];
ginxlogVacuumPage data; ginxlogVacuumPage xlrec;
char *backup; uint16 lower;
char itups[BLCKSZ]; uint16 upper;
uint32 len = 0;
/* This is only used for entry tree leaf pages. */
Assert(!GinPageIsData(page));
Assert(GinPageIsLeaf(page)); Assert(GinPageIsLeaf(page));
if (!RelationNeedsWAL(index)) if (!RelationNeedsWAL(index))
return; return;
data.node = index->rd_node; xlrec.node = index->rd_node;
data.blkno = BufferGetBlockNumber(buffer); xlrec.blkno = BufferGetBlockNumber(buffer);
if (GinPageIsData(page)) /* Assume we can omit data between pd_lower and pd_upper */
lower = ((PageHeader) page)->pd_lower;
upper = ((PageHeader) page)->pd_upper;
Assert(lower < BLCKSZ);
Assert(upper < BLCKSZ);
if (lower >= SizeOfPageHeaderData &&
upper > lower &&
upper <= BLCKSZ)
{ {
backup = GinDataPageGetData(page); xlrec.hole_offset = lower;
data.nitem = GinPageGetOpaque(page)->maxoff; xlrec.hole_length = upper - lower;
if (data.nitem)
len = MAXALIGN(sizeof(ItemPointerData) * data.nitem);
} }
else else
{ {
char *ptr; /* No "hole" to compress out */
OffsetNumber i; xlrec.hole_offset = 0;
xlrec.hole_length = 0;
ptr = backup = itups;
for (i = FirstOffsetNumber; i <= PageGetMaxOffsetNumber(page); i++)
{
IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));
memcpy(ptr, itup, IndexTupleSize(itup));
ptr += MAXALIGN(IndexTupleSize(itup));
}
data.nitem = PageGetMaxOffsetNumber(page);
len = ptr - backup;
} }
rdata[0].buffer = buffer; rdata[0].data = (char *) &xlrec;
rdata[0].buffer_std = (GinPageIsData(page)) ? FALSE : TRUE; rdata[0].len = sizeof(ginxlogVacuumPage);
rdata[0].len = 0; rdata[0].buffer = InvalidBuffer;
rdata[0].data = NULL; rdata[0].next = &rdata[1];
rdata[0].next = rdata + 1;
rdata[1].buffer = InvalidBuffer; if (xlrec.hole_length == 0)
rdata[1].len = sizeof(ginxlogVacuumPage);
rdata[1].data = (char *) &data;
if (len == 0)
{ {
rdata[1].data = (char *) page;
rdata[1].len = BLCKSZ;
rdata[1].buffer = InvalidBuffer;
rdata[1].next = NULL; rdata[1].next = NULL;
} }
else else
{ {
rdata[1].next = rdata + 2; /* must skip the hole */
rdata[1].data = (char *) page;
rdata[1].len = xlrec.hole_offset;
rdata[1].buffer = InvalidBuffer;
rdata[1].next = &rdata[2];
rdata[2].data = (char *) page + (xlrec.hole_offset + xlrec.hole_length);
rdata[2].len = BLCKSZ - (xlrec.hole_offset + xlrec.hole_length);
rdata[2].buffer = InvalidBuffer; rdata[2].buffer = InvalidBuffer;
rdata[2].len = len;
rdata[2].data = backup;
rdata[2].next = NULL; rdata[2].next = NULL;
} }
@ -158,6 +160,7 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
Buffer buffer; Buffer buffer;
Page page; Page page;
bool hasVoidPage = FALSE; bool hasVoidPage = FALSE;
MemoryContext oldCxt;
buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno, buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
RBM_NORMAL, gvs->strategy); RBM_NORMAL, gvs->strategy);
@ -169,7 +172,6 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
* again). New scan can't start but previously started ones work * again). New scan can't start but previously started ones work
* concurrently. * concurrently.
*/ */
if (isRoot) if (isRoot)
LockBufferForCleanup(buffer); LockBufferForCleanup(buffer);
else else
@ -179,32 +181,14 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
if (GinPageIsLeaf(page)) if (GinPageIsLeaf(page))
{ {
OffsetNumber newMaxOff, oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
oldMaxOff = GinPageGetOpaque(page)->maxoff; ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
ItemPointerData *cleaned = NULL; MemoryContextSwitchTo(oldCxt);
MemoryContextReset(gvs->tmpCxt);
newMaxOff = ginVacuumPostingList(gvs, /* if root is a leaf page, we don't desire further processing */
(ItemPointer) GinDataPageGetData(page), oldMaxOff, &cleaned); if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page))
hasVoidPage = TRUE;
/* saves changes about deleted tuple ... */
if (oldMaxOff != newMaxOff)
{
START_CRIT_SECTION();
if (newMaxOff > 0)
memcpy(GinDataPageGetData(page), cleaned, sizeof(ItemPointerData) * newMaxOff);
pfree(cleaned);
GinPageGetOpaque(page)->maxoff = newMaxOff;
MarkBufferDirty(buffer);
xlogVacuumPage(gvs->index, buffer);
END_CRIT_SECTION();
/* if root is a leaf page, we don't desire further processing */
if (!isRoot && GinPageGetOpaque(page)->maxoff < FirstOffsetNumber)
hasVoidPage = TRUE;
}
} }
else else
{ {
@ -224,7 +208,7 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
} }
/* /*
* if we have root and theres void pages in tree, then we don't release * if we have root and there are empty pages in tree, then we don't release
* lock to go further processing and guarantee that tree is unused * lock to go further processing and guarantee that tree is unused
*/ */
if (!(isRoot && hasVoidPage)) if (!(isRoot && hasVoidPage))
@ -391,6 +375,7 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
Buffer buffer; Buffer buffer;
Page page; Page page;
bool meDelete = FALSE; bool meDelete = FALSE;
bool isempty;
if (isRoot) if (isRoot)
{ {
@ -429,7 +414,12 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
} }
} }
if (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber) if (GinPageIsLeaf(page))
isempty = GinDataLeafPageIsEmpty(page);
else
isempty = GinPageGetOpaque(page)->maxoff < FirstOffsetNumber;
if (isempty)
{ {
/* we never delete the left- or rightmost branch */ /* we never delete the left- or rightmost branch */
if (me->leftBlkno != InvalidBlockNumber && !GinPageRightMost(page)) if (me->leftBlkno != InvalidBlockNumber && !GinPageRightMost(page))
@ -513,22 +503,47 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
} }
else if (GinGetNPosting(itup) > 0) else if (GinGetNPosting(itup) > 0)
{ {
int nitems;
ItemPointer uncompressed;
/* /*
* if we already created a temporary page, make changes in place * Vacuum posting list with proper function for compressed and
* uncompressed format.
*/ */
ItemPointerData *cleaned = (tmppage == origpage) ? NULL : GinGetPosting(itup); if (GinItupIsCompressed(itup))
int newN; uncompressed = ginPostingListDecode((GinPostingList *) GinGetPosting(itup), &nitems);
else
newN = ginVacuumPostingList(gvs, GinGetPosting(itup), GinGetNPosting(itup), &cleaned);
if (GinGetNPosting(itup) != newN)
{ {
uncompressed = (ItemPointer) GinGetPosting(itup);
nitems = GinGetNPosting(itup);
}
uncompressed = ginVacuumItemPointers(gvs, uncompressed, nitems,
&nitems);
if (uncompressed)
{
/*
* Some ItemPointers were deleted, recreate tuple.
*/
OffsetNumber attnum; OffsetNumber attnum;
Datum key; Datum key;
GinNullCategory category; GinNullCategory category;
GinPostingList *plist;
int plistsize;
if (nitems > 0)
{
plist = ginCompressPostingList(uncompressed, nitems, GinMaxItemSize, NULL);
plistsize = SizeOfGinPostingList(plist);
}
else
{
plist = NULL;
plistsize = 0;
}
/* /*
* Some ItemPointers were deleted, recreate tuple. * if we already created a temporary page, make changes in place
*/ */
if (tmppage == origpage) if (tmppage == origpage)
{ {
@ -538,15 +553,6 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
*/ */
tmppage = PageGetTempPageCopy(origpage); tmppage = PageGetTempPageCopy(origpage);
if (newN > 0)
{
Size pos = ((char *) GinGetPosting(itup)) - ((char *) origpage);
memcpy(tmppage + pos, cleaned, sizeof(ItemPointerData) * newN);
}
pfree(cleaned);
/* set itup pointer to new page */ /* set itup pointer to new page */
itup = (IndexTuple) PageGetItem(tmppage, PageGetItemId(tmppage, i)); itup = (IndexTuple) PageGetItem(tmppage, PageGetItemId(tmppage, i));
} }
@ -554,7 +560,10 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
attnum = gintuple_get_attrnum(&gvs->ginstate, itup); attnum = gintuple_get_attrnum(&gvs->ginstate, itup);
key = gintuple_get_key(&gvs->ginstate, itup, &category); key = gintuple_get_key(&gvs->ginstate, itup, &category);
itup = GinFormTuple(&gvs->ginstate, attnum, key, category, itup = GinFormTuple(&gvs->ginstate, attnum, key, category,
GinGetPosting(itup), newN, true); (char *) plist, plistsize,
nitems, true);
if (plist)
pfree(plist);
PageIndexTupleDelete(tmppage, i); PageIndexTupleDelete(tmppage, i);
if (PageAddItem(tmppage, (Item) itup, IndexTupleSize(itup), i, false, false) != i) if (PageAddItem(tmppage, (Item) itup, IndexTupleSize(itup), i, false, false) != i)
@ -583,6 +592,11 @@ ginbulkdelete(PG_FUNCTION_ARGS)
BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))]; BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
uint32 nRoot; uint32 nRoot;
gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
"Gin vacuum temporary context",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
ALLOCSET_DEFAULT_MAXSIZE);
gvs.index = index; gvs.index = index;
gvs.callback = callback; gvs.callback = callback;
gvs.callback_state = callback_state; gvs.callback_state = callback_state;
@ -683,6 +697,8 @@ ginbulkdelete(PG_FUNCTION_ARGS)
LockBuffer(buffer, GIN_EXCLUSIVE); LockBuffer(buffer, GIN_EXCLUSIVE);
} }
MemoryContextDelete(gvs.tmpCxt);
PG_RETURN_POINTER(gvs.result); PG_RETURN_POINTER(gvs.result);
} }


@ -78,7 +78,7 @@ static void
ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record) ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
{ {
ginxlogCreatePostingTree *data = (ginxlogCreatePostingTree *) XLogRecGetData(record); ginxlogCreatePostingTree *data = (ginxlogCreatePostingTree *) XLogRecGetData(record);
ItemPointerData *items = (ItemPointerData *) (XLogRecGetData(record) + sizeof(ginxlogCreatePostingTree)); char *ptr;
Buffer buffer; Buffer buffer;
Page page; Page page;
@ -89,9 +89,14 @@ ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
Assert(BufferIsValid(buffer)); Assert(BufferIsValid(buffer));
page = (Page) BufferGetPage(buffer); page = (Page) BufferGetPage(buffer);
GinInitBuffer(buffer, GIN_DATA | GIN_LEAF); GinInitBuffer(buffer, GIN_DATA | GIN_LEAF | GIN_COMPRESSED);
memcpy(GinDataPageGetData(page), items, sizeof(ItemPointerData) * data->nitem);
GinPageGetOpaque(page)->maxoff = data->nitem; ptr = XLogRecGetData(record) + sizeof(ginxlogCreatePostingTree);
/* Place page data */
memcpy(GinDataLeafPageGetPostingList(page), ptr, data->size);
GinDataLeafPageSetPostingListSize(page, data->size);
PageSetLSN(page, lsn); PageSetLSN(page, lsn);
@ -100,11 +105,11 @@ ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
} }
static void static void
ginRedoInsertEntry(Buffer buffer, OffsetNumber offset, BlockNumber rightblkno, ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rdata)
void *rdata)
{ {
Page page = BufferGetPage(buffer); Page page = BufferGetPage(buffer);
ginxlogInsertEntry *data = (ginxlogInsertEntry *) rdata; ginxlogInsertEntry *data = (ginxlogInsertEntry *) rdata;
OffsetNumber offset = data->offset;
IndexTuple itup; IndexTuple itup;
if (rightblkno != InvalidBlockNumber) if (rightblkno != InvalidBlockNumber)
@ -138,30 +143,43 @@ ginRedoInsertEntry(Buffer buffer, OffsetNumber offset, BlockNumber rightblkno,
} }
static void static void
ginRedoInsertData(Buffer buffer, OffsetNumber offset, BlockNumber rightblkno, ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
void *rdata) {
Pointer segment;
/* Copy the new data to the right place */
segment = ((Pointer) GinDataLeafPageGetPostingList(page))
+ data->unmodifiedsize;
memcpy(segment, data->newdata, data->length - data->unmodifiedsize);
GinDataLeafPageSetPostingListSize(page, data->length);
GinPageSetCompressed(page);
}
static void
ginRedoInsertData(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rdata)
{ {
Page page = BufferGetPage(buffer); Page page = BufferGetPage(buffer);
if (GinPageIsLeaf(page)) if (isLeaf)
{ {
ginxlogInsertDataLeaf *data = (ginxlogInsertDataLeaf *) rdata; ginxlogRecompressDataLeaf *data = (ginxlogRecompressDataLeaf *) rdata;
ItemPointerData *items = data->items;
OffsetNumber i;
for (i = 0; i < data->nitem; i++) Assert(GinPageIsLeaf(page));
GinDataPageAddItemPointer(page, &items[i], offset + i);
ginRedoRecompress(page, data);
} }
else else
{ {
PostingItem *pitem = (PostingItem *) rdata; ginxlogInsertDataInternal *data = (ginxlogInsertDataInternal *) rdata;
PostingItem *oldpitem; PostingItem *oldpitem;
Assert(!GinPageIsLeaf(page));
/* update link to right page after split */ /* update link to right page after split */
oldpitem = GinDataPageGetPostingItem(page, offset); oldpitem = GinDataPageGetPostingItem(page, data->offset);
PostingItemSetBlockNumber(oldpitem, rightblkno); PostingItemSetBlockNumber(oldpitem, rightblkno);
GinDataPageAddPostingItem(page, pitem, offset); GinDataPageAddPostingItem(page, &data->newitem, data->offset);
} }
} }
@ -213,12 +231,12 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
if (data->flags & GIN_INSERT_ISDATA) if (data->flags & GIN_INSERT_ISDATA)
{ {
Assert(GinPageIsData(page)); Assert(GinPageIsData(page));
ginRedoInsertData(buffer, data->offset, rightChildBlkno, payload); ginRedoInsertData(buffer, isLeaf, rightChildBlkno, payload);
} }
else else
{ {
Assert(!GinPageIsData(page)); Assert(!GinPageIsData(page));
ginRedoInsertEntry(buffer, data->offset, rightChildBlkno, payload); ginRedoInsertEntry(buffer, isLeaf, rightChildBlkno, payload);
} }
PageSetLSN(page, lsn); PageSetLSN(page, lsn);
@ -253,38 +271,42 @@ ginRedoSplitEntry(Page lpage, Page rpage, void *rdata)
static void static void
ginRedoSplitData(Page lpage, Page rpage, void *rdata) ginRedoSplitData(Page lpage, Page rpage, void *rdata)
{ {
ginxlogSplitData *data = (ginxlogSplitData *) rdata;
bool isleaf = GinPageIsLeaf(lpage); bool isleaf = GinPageIsLeaf(lpage);
char *ptr = (char *) rdata + sizeof(ginxlogSplitData);
OffsetNumber i;
ItemPointer bound;
if (isleaf) if (isleaf)
{ {
ItemPointer items = (ItemPointer) ptr; ginxlogSplitDataLeaf *data = (ginxlogSplitDataLeaf *) rdata;
for (i = 0; i < data->separator; i++) Pointer lptr = (Pointer) rdata + sizeof(ginxlogSplitDataLeaf);
GinDataPageAddItemPointer(lpage, &items[i], InvalidOffsetNumber); Pointer rptr = lptr + data->lsize;
for (i = data->separator; i < data->nitem; i++)
GinDataPageAddItemPointer(rpage, &items[i], InvalidOffsetNumber); Assert(data->lsize > 0 && data->lsize <= GinDataLeafMaxContentSize);
Assert(data->rsize > 0 && data->rsize <= GinDataLeafMaxContentSize);
memcpy(GinDataLeafPageGetPostingList(lpage), lptr, data->lsize);
memcpy(GinDataLeafPageGetPostingList(rpage), rptr, data->rsize);
GinDataLeafPageSetPostingListSize(lpage, data->lsize);
GinDataLeafPageSetPostingListSize(rpage, data->rsize);
*GinDataPageGetRightBound(lpage) = data->lrightbound;
*GinDataPageGetRightBound(rpage) = data->rrightbound;
} }
else else
{ {
PostingItem *items = (PostingItem *) ptr; ginxlogSplitDataInternal *data = (ginxlogSplitDataInternal *) rdata;
PostingItem *items = (PostingItem *) ((char *) rdata + sizeof(ginxlogSplitDataInternal));
OffsetNumber i;
OffsetNumber maxoff;
for (i = 0; i < data->separator; i++) for (i = 0; i < data->separator; i++)
GinDataPageAddPostingItem(lpage, &items[i], InvalidOffsetNumber); GinDataPageAddPostingItem(lpage, &items[i], InvalidOffsetNumber);
for (i = data->separator; i < data->nitem; i++) for (i = data->separator; i < data->nitem; i++)
GinDataPageAddPostingItem(rpage, &items[i], InvalidOffsetNumber); GinDataPageAddPostingItem(rpage, &items[i], InvalidOffsetNumber);
/* set up right key */
maxoff = GinPageGetOpaque(lpage)->maxoff;
*GinDataPageGetRightBound(lpage) = GinDataPageGetPostingItem(lpage, maxoff)->key;
*GinDataPageGetRightBound(rpage) = data->rightbound;
} }
/* set up right key */
bound = GinDataPageGetRightBound(lpage);
if (isleaf)
*bound = *GinDataPageGetItemPointer(lpage, GinPageGetOpaque(lpage)->maxoff);
else
*bound = GinDataPageGetPostingItem(lpage, GinPageGetOpaque(lpage)->maxoff)->key;
bound = GinDataPageGetRightBound(rpage);
*bound = data->rightbound;
} }
static void static void
@ -317,9 +339,10 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
if (isLeaf) if (isLeaf)
flags |= GIN_LEAF; flags |= GIN_LEAF;
if (isData) if (isData)
flags |= GIN_DATA; flags |= GIN_DATA;
if (isLeaf && isData)
flags |= GIN_COMPRESSED;
lbuffer = XLogReadBuffer(data->node, data->lblkno, true); lbuffer = XLogReadBuffer(data->node, data->lblkno, true);
Assert(BufferIsValid(lbuffer)); Assert(BufferIsValid(lbuffer));
@ -352,7 +375,7 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
Buffer rootBuf = XLogReadBuffer(data->node, rootBlkno, true); Buffer rootBuf = XLogReadBuffer(data->node, rootBlkno, true);
Page rootPage = BufferGetPage(rootBuf); Page rootPage = BufferGetPage(rootBuf);
GinInitBuffer(rootBuf, flags & ~GIN_LEAF); GinInitBuffer(rootBuf, flags & ~GIN_LEAF & ~GIN_COMPRESSED);
if (isData) if (isData)
{ {
@ -383,10 +406,56 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
UnlockReleaseBuffer(lbuffer); UnlockReleaseBuffer(lbuffer);
} }
/*
* This is functionally the same as heap_xlog_newpage.
*/
static void static void
ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record) ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record)
{ {
ginxlogVacuumPage *data = (ginxlogVacuumPage *) XLogRecGetData(record); ginxlogVacuumPage *xlrec = (ginxlogVacuumPage *) XLogRecGetData(record);
char *blk = ((char *) xlrec) + sizeof(ginxlogVacuumPage);
Buffer buffer;
Page page;
Assert(xlrec->hole_offset < BLCKSZ);
Assert(xlrec->hole_length < BLCKSZ);
/* If we have a full-page image, restore it and we're done */
if (record->xl_info & XLR_BKP_BLOCK(0))
{
(void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
}
buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
if (!BufferIsValid(buffer))
return;
page = (Page) BufferGetPage(buffer);
if (xlrec->hole_length == 0)
{
memcpy((char *) page, blk, BLCKSZ);
}
else
{
memcpy((char *) page, blk, xlrec->hole_offset);
/* must zero-fill the hole */
MemSet((char *) page + xlrec->hole_offset, 0, xlrec->hole_length);
memcpy((char *) page + (xlrec->hole_offset + xlrec->hole_length),
blk + xlrec->hole_offset,
BLCKSZ - (xlrec->hole_offset + xlrec->hole_length));
}
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
}
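
The page image in this record omits the unused "hole" in the middle of the page, so redo has to copy the bytes before the hole, zero-fill the hole, and copy the bytes after it. The sketch below shows the same three steps on a plain buffer. The function name and the explicit `pagesize` parameter are assumptions made for illustration; the patch itself uses `BLCKSZ` and the fields of `ginxlogVacuumPage`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Rebuild a page of 'pagesize' bytes from an image that leaves out a run of
 * 'hole_length' bytes starting at 'hole_offset'. The hole is zero-filled.
 * (Hypothetical helper, for illustration only.)
 */
static void
restore_page_sketch(char *page, const char *image, size_t pagesize,
					size_t hole_offset, size_t hole_length)
{
	assert(hole_offset + hole_length <= pagesize);

	/* bytes before the hole */
	memcpy(page, image, hole_offset);
	/* the hole itself is not in the image, so clear it */
	memset(page + hole_offset, 0, hole_length);
	/* bytes after the hole follow directly in the image */
	memcpy(page + hole_offset + hole_length,
		   image + hole_offset,
		   pagesize - (hole_offset + hole_length));
}
```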
static void
ginRedoVacuumDataLeafPage(XLogRecPtr lsn, XLogRecord *record)
{
ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) XLogRecGetData(record);
Buffer buffer; Buffer buffer;
Page page; Page page;
@ -397,41 +466,17 @@ ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record)
return; return;
} }
buffer = XLogReadBuffer(data->node, data->blkno, false); buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, false);
if (!BufferIsValid(buffer)) if (!BufferIsValid(buffer))
return; return;
page = (Page) BufferGetPage(buffer); page = (Page) BufferGetPage(buffer);
Assert(GinPageIsLeaf(page));
Assert(GinPageIsData(page));
if (lsn > PageGetLSN(page)) if (lsn > PageGetLSN(page))
{ {
if (GinPageIsData(page)) ginRedoRecompress(page, &xlrec->data);
{
memcpy(GinDataPageGetData(page),
XLogRecGetData(record) + sizeof(ginxlogVacuumPage),
data->nitem * GinSizeOfDataPageItem(page));
GinPageGetOpaque(page)->maxoff = data->nitem;
}
else
{
OffsetNumber i,
*tod;
IndexTuple itup = (IndexTuple) (XLogRecGetData(record) + sizeof(ginxlogVacuumPage));
tod = (OffsetNumber *) palloc(sizeof(OffsetNumber) * PageGetMaxOffsetNumber(page));
for (i = FirstOffsetNumber; i <= PageGetMaxOffsetNumber(page); i++)
tod[i - 1] = i;
PageIndexMultiDelete(page, tod, PageGetMaxOffsetNumber(page));
for (i = 0; i < data->nitem; i++)
{
if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), InvalidOffsetNumber, false, false) == InvalidOffsetNumber)
elog(ERROR, "failed to add item to index page in %u/%u/%u",
data->node.spcNode, data->node.dbNode, data->node.relNode);
itup = (IndexTuple) (((char *) itup) + MAXALIGN(IndexTupleSize(itup)));
}
}
PageSetLSN(page, lsn); PageSetLSN(page, lsn);
MarkBufferDirty(buffer); MarkBufferDirty(buffer);
} }
@ -747,6 +792,9 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_GIN_VACUUM_PAGE: case XLOG_GIN_VACUUM_PAGE:
ginRedoVacuumPage(lsn, record); ginRedoVacuumPage(lsn, record);
break; break;
case XLOG_GIN_VACUUM_DATA_LEAF_PAGE:
ginRedoVacuumDataLeafPage(lsn, record);
break;
case XLOG_GIN_DELETE_PAGE: case XLOG_GIN_DELETE_PAGE:
ginRedoDeletePage(lsn, record); ginRedoDeletePage(lsn, record);
break; break;


@ -47,8 +47,7 @@ gin_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfoString(buf, "Insert item, "); appendStringInfoString(buf, "Insert item, ");
desc_node(buf, xlrec->node, xlrec->blkno); desc_node(buf, xlrec->node, xlrec->blkno);
appendStringInfo(buf, " offset: %u isdata: %c isleaf: %c", appendStringInfo(buf, " isdata: %c isleaf: %c",
xlrec->offset,
(xlrec->flags & GIN_INSERT_ISDATA) ? 'T' : 'F', (xlrec->flags & GIN_INSERT_ISDATA) ? 'T' : 'F',
(xlrec->flags & GIN_INSERT_ISLEAF) ? 'T' : 'F'); (xlrec->flags & GIN_INSERT_ISLEAF) ? 'T' : 'F');
if (!(xlrec->flags & GIN_INSERT_ISLEAF)) if (!(xlrec->flags & GIN_INSERT_ISLEAF))
@ -67,24 +66,50 @@ gin_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, " isdelete: %c", appendStringInfo(buf, " isdelete: %c",
(((ginxlogInsertEntry *) payload)->isDelete) ? 'T' : 'F'); (((ginxlogInsertEntry *) payload)->isDelete) ? 'T' : 'F');
else if (xlrec->flags & GIN_INSERT_ISLEAF) else if (xlrec->flags & GIN_INSERT_ISLEAF)
appendStringInfo(buf, " nitem: %u", {
(((ginxlogInsertDataLeaf *) payload)->nitem)); ginxlogRecompressDataLeaf *insertData =
(ginxlogRecompressDataLeaf *) payload;
appendStringInfo(buf, " unmodified: %u length: %u (compressed)",
insertData->unmodifiedsize,
insertData->length);
}
else else
{
ginxlogInsertDataInternal *insertData = (ginxlogInsertDataInternal *) payload;
appendStringInfo(buf, " pitem: %u-%u/%u", appendStringInfo(buf, " pitem: %u-%u/%u",
PostingItemGetBlockNumber((PostingItem *) payload), PostingItemGetBlockNumber(&insertData->newitem),
ItemPointerGetBlockNumber(&((PostingItem *) payload)->key), ItemPointerGetBlockNumber(&insertData->newitem.key),
ItemPointerGetOffsetNumber(&((PostingItem *) payload)->key)); ItemPointerGetOffsetNumber(&insertData->newitem.key));
}
} }
break; break;
case XLOG_GIN_SPLIT: case XLOG_GIN_SPLIT:
appendStringInfoString(buf, "Page split, "); {
desc_node(buf, ((ginxlogSplit *) rec)->node, ((ginxlogSplit *) rec)->lblkno); ginxlogSplit *xlrec = (ginxlogSplit *) rec;
appendStringInfo(buf, " isrootsplit: %c", (((ginxlogSplit *) rec)->flags & GIN_SPLIT_ROOT) ? 'T' : 'F');
appendStringInfoString(buf, "Page split, ");
desc_node(buf, ((ginxlogSplit *) rec)->node, ((ginxlogSplit *) rec)->lblkno);
appendStringInfo(buf, " isrootsplit: %c", (((ginxlogSplit *) rec)->flags & GIN_SPLIT_ROOT) ? 'T' : 'F');
appendStringInfo(buf, " isdata: %c isleaf: %c",
(xlrec->flags & GIN_INSERT_ISDATA) ? 'T' : 'F',
(xlrec->flags & GIN_INSERT_ISLEAF) ? 'T' : 'F');
}
break; break;
case XLOG_GIN_VACUUM_PAGE: case XLOG_GIN_VACUUM_PAGE:
appendStringInfoString(buf, "Vacuum page, "); appendStringInfoString(buf, "Vacuum page, ");
desc_node(buf, ((ginxlogVacuumPage *) rec)->node, ((ginxlogVacuumPage *) rec)->blkno); desc_node(buf, ((ginxlogVacuumPage *) rec)->node, ((ginxlogVacuumPage *) rec)->blkno);
break; break;
case XLOG_GIN_VACUUM_DATA_LEAF_PAGE:
{
ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) rec;
appendStringInfoString(buf, "Vacuum data leaf page, ");
desc_node(buf, xlrec->node, xlrec->blkno);
appendStringInfo(buf, " unmodified: %u length: %u",
xlrec->data.unmodifiedsize,
xlrec->data.length);
}
break;
case XLOG_GIN_DELETE_PAGE: case XLOG_GIN_DELETE_PAGE:
appendStringInfoString(buf, "Delete page, "); appendStringInfoString(buf, "Delete page, ");
desc_node(buf, ((ginxlogDeletePage *) rec)->node, ((ginxlogDeletePage *) rec)->blkno); desc_node(buf, ((ginxlogDeletePage *) rec)->node, ((ginxlogDeletePage *) rec)->blkno);

View File

@ -32,11 +32,8 @@
typedef struct GinPageOpaqueData typedef struct GinPageOpaqueData
{ {
BlockNumber rightlink; /* next page if any */ BlockNumber rightlink; /* next page if any */
OffsetNumber maxoff; /* number entries on GIN_DATA page: number of OffsetNumber maxoff; /* number of PostingItems on GIN_DATA & ~GIN_LEAF page.
* heap ItemPointers on GIN_DATA|GIN_LEAF page * On GIN_LIST page, number of heap tuples. */
* or number of PostingItems on GIN_DATA &
* ~GIN_LEAF page. On GIN_LIST page, number of
* heap tuples. */
uint16 flags; /* see bit definitions below */ uint16 flags; /* see bit definitions below */
} GinPageOpaqueData; } GinPageOpaqueData;
@ -49,6 +46,7 @@ typedef GinPageOpaqueData *GinPageOpaque;
#define GIN_LIST (1 << 4) #define GIN_LIST (1 << 4)
#define GIN_LIST_FULLROW (1 << 5) /* makes sense only on GIN_LIST page */ #define GIN_LIST_FULLROW (1 << 5) /* makes sense only on GIN_LIST page */
#define GIN_INCOMPLETE_SPLIT (1 << 6) /* page was split, but parent not updated */ #define GIN_INCOMPLETE_SPLIT (1 << 6) /* page was split, but parent not updated */
#define GIN_COMPRESSED (1 << 7)
/* Page numbers of fixed-location pages */ /* Page numbers of fixed-location pages */
#define GIN_METAPAGE_BLKNO (0) #define GIN_METAPAGE_BLKNO (0)
@ -88,7 +86,12 @@ typedef struct GinMetaPageData
* GIN version number (ideally this should have been at the front, but too * GIN version number (ideally this should have been at the front, but too
* late now. Don't move it!) * late now. Don't move it!)
* *
* Currently 1 (for indexes initialized in 9.1 or later) * Currently 2 (for indexes initialized in 9.4 or later)
*
* Version 1 (indexes initialized in version 9.1, 9.2 or 9.3) is
* compatible, but may contain uncompressed posting tree (leaf) pages and
* posting lists. They will be converted to compressed format when
* modified.
* *
* Version 0 (indexes initialized in 9.0 or before) is compatible but may * Version 0 (indexes initialized in 9.0 or before) is compatible but may
* be missing null entries, including both null keys and placeholders. * be missing null entries, including both null keys and placeholders.
@ -97,7 +100,7 @@ typedef struct GinMetaPageData
int32 ginVersion; int32 ginVersion;
} GinMetaPageData; } GinMetaPageData;
#define GIN_CURRENT_VERSION 1 #define GIN_CURRENT_VERSION 2
#define GinPageGetMeta(p) \ #define GinPageGetMeta(p) \
((GinMetaPageData *) PageGetContents(p)) ((GinMetaPageData *) PageGetContents(p))
@ -116,6 +119,8 @@ typedef struct GinMetaPageData
#define GinPageSetList(page) ( GinPageGetOpaque(page)->flags |= GIN_LIST ) #define GinPageSetList(page) ( GinPageGetOpaque(page)->flags |= GIN_LIST )
#define GinPageHasFullRow(page) ( GinPageGetOpaque(page)->flags & GIN_LIST_FULLROW ) #define GinPageHasFullRow(page) ( GinPageGetOpaque(page)->flags & GIN_LIST_FULLROW )
#define GinPageSetFullRow(page) ( GinPageGetOpaque(page)->flags |= GIN_LIST_FULLROW ) #define GinPageSetFullRow(page) ( GinPageGetOpaque(page)->flags |= GIN_LIST_FULLROW )
#define GinPageIsCompressed(page) ( GinPageGetOpaque(page)->flags & GIN_COMPRESSED )
#define GinPageSetCompressed(page) ( GinPageGetOpaque(page)->flags |= GIN_COMPRESSED )
#define GinPageIsDeleted(page) ( GinPageGetOpaque(page)->flags & GIN_DELETED) #define GinPageIsDeleted(page) ( GinPageGetOpaque(page)->flags & GIN_DELETED)
#define GinPageSetDeleted(page) ( GinPageGetOpaque(page)->flags |= GIN_DELETED) #define GinPageSetDeleted(page) ( GinPageGetOpaque(page)->flags |= GIN_DELETED)
@ -213,13 +218,16 @@ typedef signed char GinNullCategory;
#define GinSetPostingTree(itup, blkno) ( GinSetNPosting((itup),GIN_TREE_POSTING), ItemPointerSetBlockNumber(&(itup)->t_tid, blkno) ) #define GinSetPostingTree(itup, blkno) ( GinSetNPosting((itup),GIN_TREE_POSTING), ItemPointerSetBlockNumber(&(itup)->t_tid, blkno) )
#define GinGetPostingTree(itup) GinItemPointerGetBlockNumber(&(itup)->t_tid) #define GinGetPostingTree(itup) GinItemPointerGetBlockNumber(&(itup)->t_tid)
#define GinGetPostingOffset(itup) GinItemPointerGetBlockNumber(&(itup)->t_tid) #define GIN_ITUP_COMPRESSED (1 << 31)
#define GinSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,n) #define GinGetPostingOffset(itup) (GinItemPointerGetBlockNumber(&(itup)->t_tid) & (~GIN_ITUP_COMPRESSED))
#define GinGetPosting(itup) ((ItemPointer) ((char*)(itup) + GinGetPostingOffset(itup))) #define GinSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|GIN_ITUP_COMPRESSED)
#define GinGetPosting(itup) ((Pointer) ((char*)(itup) + GinGetPostingOffset(itup)))
#define GinItupIsCompressed(itup) (GinItemPointerGetBlockNumber(&(itup)->t_tid) & GIN_ITUP_COMPRESSED)
#define GinMaxItemSize \ #define GinMaxItemSize \
MAXALIGN_DOWN(((BLCKSZ - SizeOfPageHeaderData - \ Min(INDEX_SIZE_MASK, \
MAXALIGN(sizeof(GinPageOpaqueData))) / 3 - sizeof(ItemIdData))) MAXALIGN_DOWN(((BLCKSZ - SizeOfPageHeaderData - \
MAXALIGN(sizeof(GinPageOpaqueData))) / 6 - sizeof(ItemIdData))))
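
The `GIN_ITUP_COMPRESSED` macros above store a format flag in the top bit of a 32-bit field that otherwise holds the posting list offset. A minimal sketch of that pack-and-mask pattern follows. All names are hypothetical, and the sketch uses an unsigned constant for the flag, which sidesteps shifting a signed `1` into the sign bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag in the top bit of a 32-bit field. */
#define ITUP_COMPRESSED_SKETCH (UINT32_C(1) << 31)

/* Store an offset together with the "compressed" flag. */
static uint32_t
set_posting_offset_sketch(uint32_t offset)
{
	assert(offset < ITUP_COMPRESSED_SKETCH);	/* the top bit must be free */
	return offset | ITUP_COMPRESSED_SKETCH;
}

/* Recover the offset by masking the flag away. */
static uint32_t
get_posting_offset_sketch(uint32_t stored)
{
	return stored & ~ITUP_COMPRESSED_SKETCH;
}

/* Test the flag without disturbing the offset. */
static int
is_compressed_sketch(uint32_t stored)
{
	return (stored & ITUP_COMPRESSED_SKETCH) != 0;
}
```

A value written without the flag still reads back as a plain offset, which is what lets old-format and new-format tuples share the field.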
/* /*
* Access macros for non-leaf entry tuples * Access macros for non-leaf entry tuples
@ -230,30 +238,59 @@ typedef signed char GinNullCategory;
/* /*
* Data (posting tree) pages * Data (posting tree) pages
*
* Posting tree pages don't store regular tuples. Non-leaf pages contain
* PostingItems, which are pairs of ItemPointers and child block numbers.
* Leaf pages contain GinPostingLists and an uncompressed array of item
* pointers.
*
* In a leaf page, the compressed posting lists are stored after the regular
* page header, one after each other. Although we don't store regular tuples,
* pd_lower is used to indicate the end of the posting lists. After that, free
* space follows. This layout is compatible with the "standard" heap and
* index page layout described in bufpage.h, so that we can e.g. set buffer_std
* when writing WAL records.
*
* In the special space is the GinPageOpaque struct.
*/ */
#define GinDataLeafPageGetPostingList(page) \
(GinPostingList *) ((PageGetContents(page) + MAXALIGN(sizeof(ItemPointerData))))
#define GinDataLeafPageGetPostingListSize(page) \
(((PageHeader) page)->pd_lower - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(ItemPointerData)))
#define GinDataLeafPageSetPostingListSize(page, size) \
{ \
Assert(size <= GinDataLeafMaxContentSize); \
((PageHeader) page)->pd_lower = (size) + MAXALIGN(SizeOfPageHeaderData) + MAXALIGN(sizeof(ItemPointerData)); \
}
#define GinDataLeafPageIsEmpty(page) \
(GinPageIsCompressed(page) ? (GinDataLeafPageGetPostingListSize(page) == 0) : (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber))
#define GinDataLeafPageGetFreeSpace(page) PageGetExactFreeSpace(page)
#define GinDataPageGetRightBound(page) ((ItemPointer) PageGetContents(page)) #define GinDataPageGetRightBound(page) ((ItemPointer) PageGetContents(page))
/*
* Pointer to the data portion of a posting tree page. For internal pages,
* that's the beginning of the array of PostingItems. For compressed leaf
* pages, the first compressed posting list. For uncompressed (pre-9.4) leaf
* pages, it's the beginning of the ItemPointer array.
*/
#define GinDataPageGetData(page) \ #define GinDataPageGetData(page) \
(PageGetContents(page) + MAXALIGN(sizeof(ItemPointerData))) (PageGetContents(page) + MAXALIGN(sizeof(ItemPointerData)))
/* non-leaf pages contain PostingItems */ /* non-leaf pages contain PostingItems */
#define GinDataPageGetPostingItem(page, i) \ #define GinDataPageGetPostingItem(page, i) \
((PostingItem *) (GinDataPageGetData(page) + ((i)-1) * sizeof(PostingItem))) ((PostingItem *) (GinDataPageGetData(page) + ((i)-1) * sizeof(PostingItem)))
/* leaf pages contain ItemPointers */
#define GinDataPageGetItemPointer(page, i) \
((ItemPointer) (GinDataPageGetData(page) + ((i)-1) * sizeof(ItemPointerData)))
#define GinSizeOfDataPageItem(page) \
(GinPageIsLeaf(page) ? sizeof(ItemPointerData) : sizeof(PostingItem))
#define GinDataPageGetFreeSpace(page) \ #define GinNonLeafDataPageGetFreeSpace(page) \
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
- MAXALIGN(sizeof(ItemPointerData)) \ - MAXALIGN(sizeof(ItemPointerData)) \
- GinPageGetOpaque(page)->maxoff * GinSizeOfDataPageItem(page) \ - GinPageGetOpaque(page)->maxoff * sizeof(PostingItem) \
- MAXALIGN(sizeof(GinPageOpaqueData))) - MAXALIGN(sizeof(GinPageOpaqueData)))
#define GinMaxLeafDataItems \ #define GinDataLeafMaxContentSize \
((BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
MAXALIGN(sizeof(ItemPointerData)) - \ - MAXALIGN(sizeof(ItemPointerData)) \
MAXALIGN(sizeof(GinPageOpaqueData))) \ - MAXALIGN(sizeof(GinPageOpaqueData)))
/ sizeof(ItemPointerData))
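
`GinDataLeafMaxContentSize` is what remains of a block after the page header, the right bound and the opaque area, each rounded up to the maximum alignment. The sketch below redoes that arithmetic with explicit inputs. The 8-byte alignment and the sizes used in the usage note (8192-byte block, 24-byte page header, 6-byte item pointer, 8-byte opaque struct) are assumptions about a typical 64-bit build, not values taken from this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Round up to a multiple of 8; assumes an 8-byte maximum alignment. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/*
 * Space left for compressed posting lists on a leaf page, given the block
 * size and the unaligned sizes of the three fixed areas.
 * (Hypothetical helper, for illustration only.)
 */
static size_t
leaf_max_content_sketch(size_t blcksz, size_t page_header,
						size_t item_pointer, size_t opaque)
{
	size_t		overhead = SKETCH_MAXALIGN(page_header)
		+ SKETCH_MAXALIGN(item_pointer)
		+ SKETCH_MAXALIGN(opaque);

	assert(overhead <= blcksz);
	return blcksz - overhead;
}
```

Under the assumed sizes, `leaf_max_content_sketch(8192, 24, 6, 8)` gives 8192 - 24 - 8 - 8 = 8152 bytes.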
/* /*
* List pages * List pages
@ -318,6 +355,23 @@ typedef struct GinState
Oid supportCollation[INDEX_MAX_KEYS]; Oid supportCollation[INDEX_MAX_KEYS];
} GinState; } GinState;
/*
* A compressed posting list.
*
* Note: This requires 2-byte alignment.
*/
typedef struct
{
ItemPointerData first; /* first item in this posting list (unpacked) */
uint16 nbytes; /* number of bytes that follow */
unsigned char bytes[1]; /* varbyte encoded items (variable length) */
} GinPostingList;
#define SizeOfGinPostingList(plist) (offsetof(GinPostingList, bytes) + SHORTALIGN((plist)->nbytes) )
#define GinNextPostingListSegment(cur) ((GinPostingList *) (((char *) (cur)) + SizeOfGinPostingList((cur))))
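
The `bytes` array of a `GinPostingList` holds varbyte-encoded items. The exact layout used by ginpostinglist.c is not part of this section, so the following is only a generic sketch of varbyte encoding: 7 payload bits per byte, low-order bits first, with the high bit set on every byte except the last. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode 'val' into 'out' and return the number of bytes written.
 * A 64-bit value needs at most 10 bytes.
 */
static size_t
varbyte_encode_sketch(uint64_t val, unsigned char *out)
{
	size_t		n = 0;

	while (val > 0x7F)
	{
		/* low 7 bits, with the continuation bit set */
		out[n++] = (unsigned char) (0x80 | (val & 0x7F));
		val >>= 7;
	}
	out[n++] = (unsigned char) val;	/* final byte, continuation bit clear */
	return n;
}

/* Decode one value from 'in' into '*val'; return the bytes consumed. */
static size_t
varbyte_decode_sketch(const unsigned char *in, uint64_t *val)
{
	uint64_t	result = 0;
	int			shift = 0;
	size_t		n = 0;
	unsigned char c;

	do
	{
		assert(shift < 64);		/* reject over-long input */
		c = in[n++];
		result |= (uint64_t) (c & 0x7F) << shift;
		shift += 7;
	} while (c & 0x80);

	*val = result;
	return n;
}
```

Small values take a single byte, which is why storing the differences between consecutive sorted item pointers tends to compress well.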
/* XLog stuff */ /* XLog stuff */
#define XLOG_GIN_CREATE_INDEX 0x00 #define XLOG_GIN_CREATE_INDEX 0x00
@ -328,18 +382,21 @@ typedef struct ginxlogCreatePostingTree
{ {
RelFileNode node; RelFileNode node;
BlockNumber blkno; BlockNumber blkno;
uint32 nitem; uint32 size;
/* follows list of heap's ItemPointer */ /* A compressed posting list follows */
} ginxlogCreatePostingTree; } ginxlogCreatePostingTree;
#define XLOG_GIN_INSERT 0x20 #define XLOG_GIN_INSERT 0x20
typedef struct ginxlogInsert /*
* The format of the insertion record varies depending on the page type.
* ginxlogInsert is the common part between all variants.
*/
typedef struct
{ {
RelFileNode node; RelFileNode node;
BlockNumber blkno; BlockNumber blkno;
uint16 flags; /* GIN_SPLIT_ISLEAF and/or GIN_SPLIT_ISDATA */ uint16 flags; /* GIN_SPLIT_ISLEAF and/or GIN_SPLIT_ISDATA */
OffsetNumber offset;
/* /*
* FOLLOWS: * FOLLOWS:
@ -358,17 +415,25 @@ typedef struct ginxlogInsert
typedef struct typedef struct
{ {
OffsetNumber offset;
bool isDelete; bool isDelete;
IndexTupleData tuple; /* variable length */ IndexTupleData tuple; /* variable length */
} ginxlogInsertEntry; } ginxlogInsertEntry;
typedef struct typedef struct
{ {
OffsetNumber nitem; uint16 length;
ItemPointerData items[1]; /* variable length */ uint16 unmodifiedsize;
} ginxlogInsertDataLeaf;
/* In an insert to an internal data page, the payload is a PostingItem */ /* compressed segments, variable length */
char newdata[1];
} ginxlogRecompressDataLeaf;
typedef struct
{
OffsetNumber offset;
PostingItem newitem;
} ginxlogInsertDataInternal;
#define XLOG_GIN_SPLIT 0x30 #define XLOG_GIN_SPLIT 0x30
@ -401,25 +466,58 @@ typedef struct
/* FOLLOWS: IndexTuples */ /* FOLLOWS: IndexTuples */
} ginxlogSplitEntry; } ginxlogSplitEntry;
typedef struct
{
uint16 lsize;
uint16 rsize;
ItemPointerData lrightbound; /* new right bound of left page */
ItemPointerData rrightbound; /* new right bound of right page */
/* FOLLOWS: new compressed posting lists of left and right page */
char newdata[1];
} ginxlogSplitDataLeaf;
typedef struct typedef struct
{ {
OffsetNumber separator; OffsetNumber separator;
OffsetNumber nitem; OffsetNumber nitem;
ItemPointerData rightbound; ItemPointerData rightbound;
/* FOLLOWS: array of ItemPointers (for leaf) or PostingItems (non-leaf) */ /* FOLLOWS: array of PostingItems */
} ginxlogSplitData; } ginxlogSplitDataInternal;
/*
* Vacuum simply WAL-logs the whole page when anything is modified. This is
* functionally identical to heap_newpage records, but is kept separate for
* debugging purposes. (When inspecting the WAL stream, it's easier to see
* what's going on when GIN vacuum records are marked as such, not as heap
* records.) This is currently only used for entry tree leaf pages.
*/
#define XLOG_GIN_VACUUM_PAGE 0x40 #define XLOG_GIN_VACUUM_PAGE 0x40
typedef struct ginxlogVacuumPage typedef struct ginxlogVacuumPage
{ {
RelFileNode node; RelFileNode node;
BlockNumber blkno; BlockNumber blkno;
OffsetNumber nitem; uint16 hole_offset; /* number of bytes before "hole" */
/* follows content of page */ uint16 hole_length; /* number of bytes in "hole" */
/* entire page contents (minus the hole) follow at end of record */
} ginxlogVacuumPage; } ginxlogVacuumPage;
/*
* Vacuuming a posting tree leaf page is WAL-logged like the recompression caused
* by insertion.
*/
#define XLOG_GIN_VACUUM_DATA_LEAF_PAGE 0x90
typedef struct ginxlogVacuumDataLeafPage
{
RelFileNode node;
BlockNumber blkno;
ginxlogRecompressDataLeaf data;
} ginxlogVacuumDataLeafPage;
#define XLOG_GIN_DELETE_PAGE 0x50 #define XLOG_GIN_DELETE_PAGE 0x50
typedef struct ginxlogDeletePage typedef struct ginxlogDeletePage
@ -506,6 +604,7 @@ typedef struct GinBtreeStack
BlockNumber blkno; BlockNumber blkno;
Buffer buffer; Buffer buffer;
OffsetNumber off; OffsetNumber off;
ItemPointerData iptr;
/* predictNumber contains predicted number of pages on current level */ /* predictNumber contains predicted number of pages on current level */
uint32 predictNumber; uint32 predictNumber;
struct GinBtreeStack *parent; struct GinBtreeStack *parent;
@ -513,6 +612,14 @@ typedef struct GinBtreeStack
typedef struct GinBtreeData *GinBtree; typedef struct GinBtreeData *GinBtree;
/* Return codes for GinBtreeData.placeToPage method */
typedef enum
{
UNMODIFIED,
INSERTED,
SPLIT
} GinPlaceToPageRC;
typedef struct GinBtreeData typedef struct GinBtreeData
{ {
/* search methods */ /* search methods */
@ -523,8 +630,7 @@ typedef struct GinBtreeData
/* insert methods */ /* insert methods */
OffsetNumber (*findChildPtr) (GinBtree, Page, BlockNumber, OffsetNumber); OffsetNumber (*findChildPtr) (GinBtree, Page, BlockNumber, OffsetNumber);
bool (*placeToPage) (GinBtree, Buffer, OffsetNumber, void *, BlockNumber, XLogRecData **); GinPlaceToPageRC (*placeToPage) (GinBtree, Buffer, GinBtreeStack *, void *, BlockNumber, XLogRecData **, Page *, Page *);
Page (*splitPage) (GinBtree, Buffer, Buffer, OffsetNumber, void *, BlockNumber, XLogRecData **);
void *(*prepareDownlink) (GinBtree, Buffer); void *(*prepareDownlink) (GinBtree, Buffer);
void (*fillRoot) (GinBtree, Page, BlockNumber, Page, BlockNumber, Page); void (*fillRoot) (GinBtree, Page, BlockNumber, Page, BlockNumber, Page);
@ -577,14 +683,17 @@ extern void ginInsertValue(GinBtree btree, GinBtreeStack *stack,
/* ginentrypage.c */ /* ginentrypage.c */
extern IndexTuple GinFormTuple(GinState *ginstate, extern IndexTuple GinFormTuple(GinState *ginstate,
OffsetNumber attnum, Datum key, GinNullCategory category, OffsetNumber attnum, Datum key, GinNullCategory category,
ItemPointerData *ipd, uint32 nipd, bool errorTooBig); Pointer data, Size dataSize, int nipd, bool errorTooBig);
extern void GinShortenTuple(IndexTuple itup, uint32 nipd);
extern void ginPrepareEntryScan(GinBtree btree, OffsetNumber attnum, extern void ginPrepareEntryScan(GinBtree btree, OffsetNumber attnum,
Datum key, GinNullCategory category, Datum key, GinNullCategory category,
GinState *ginstate); GinState *ginstate);
extern void ginEntryFillRoot(GinBtree btree, Page root, BlockNumber lblkno, Page lpage, BlockNumber rblkno, Page rpage); extern void ginEntryFillRoot(GinBtree btree, Page root, BlockNumber lblkno, Page lpage, BlockNumber rblkno, Page rpage);
extern ItemPointer ginReadTuple(GinState *ginstate, OffsetNumber attnum,
IndexTuple itup, int *nitems);
/* gindatapage.c */ /* gindatapage.c */
extern ItemPointer GinDataLeafPageGetItems(Page page, int *nitems);
extern int GinDataLeafPageGetItemsToTbm(Page page, TIDBitmap *tbm);
extern BlockNumber createPostingTree(Relation index, extern BlockNumber createPostingTree(Relation index,
ItemPointerData *items, uint32 nitems, ItemPointerData *items, uint32 nitems,
GinStatsData *buildStats); GinStatsData *buildStats);
@ -598,6 +707,15 @@ extern GinBtreeStack *ginScanBeginPostingTree(Relation index, BlockNumber rootBl
extern void ginDataFillRoot(GinBtree btree, Page root, BlockNumber lblkno, Page lpage, BlockNumber rblkno, Page rpage); extern void ginDataFillRoot(GinBtree btree, Page root, BlockNumber lblkno, Page lpage, BlockNumber rblkno, Page rpage);
extern void ginPrepareDataScan(GinBtree btree, Relation index, BlockNumber rootBlkno); extern void ginPrepareDataScan(GinBtree btree, Relation index, BlockNumber rootBlkno);
/*
* This is declared in ginvacuum.c, but is passed between ginVacuumItemPointers
* and ginVacuumPostingTreeLeaf as an opaque struct, so we need a forward
* declaration for it.
*/
typedef struct GinVacuumState GinVacuumState;
extern void ginVacuumPostingTreeLeaf(Relation rel, Buffer buf, GinVacuumState *gvs);
/* ginscan.c */ /* ginscan.c */
/* /*
@ -679,7 +797,7 @@ typedef struct GinScanEntryData
/* used for Posting list and one page in Posting tree */ /* used for Posting list and one page in Posting tree */
ItemPointerData *list; ItemPointerData *list;
uint32 nlist; int nlist;
OffsetNumber offset; OffsetNumber offset;
bool isFinished; bool isFinished;
@ -717,6 +835,8 @@ extern Datum gingetbitmap(PG_FUNCTION_ARGS);
/* ginvacuum.c */ /* ginvacuum.c */
extern Datum ginbulkdelete(PG_FUNCTION_ARGS); extern Datum ginbulkdelete(PG_FUNCTION_ARGS);
extern Datum ginvacuumcleanup(PG_FUNCTION_ARGS); extern Datum ginvacuumcleanup(PG_FUNCTION_ARGS);
extern ItemPointer ginVacuumItemPointers(GinVacuumState *gvs,
ItemPointerData *items, int nitem, int *nremaining);
/* ginbulk.c */ /* ginbulk.c */
typedef struct GinEntryAccumulator typedef struct GinEntryAccumulator
@ -770,11 +890,17 @@ extern void ginInsertCleanup(GinState *ginstate,
bool vac_delay, IndexBulkDeleteResult *stats); bool vac_delay, IndexBulkDeleteResult *stats);
/* ginpostinglist.c */ /* ginpostinglist.c */
extern uint32 ginMergeItemPointers(ItemPointerData *dst,
extern GinPostingList *ginCompressPostingList(const ItemPointer ptrs, int nptrs,
int maxsize, int *nwritten);
extern int ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int totalsize, TIDBitmap *tbm);
extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *ptr, int len, int *ndecoded);
extern ItemPointer ginPostingListDecode(GinPostingList *ptr, int *ndecoded);
extern int ginMergeItemPointers(ItemPointerData *dst,
ItemPointerData *a, uint32 na, ItemPointerData *a, uint32 na,
ItemPointerData *b, uint32 nb); ItemPointerData *b, uint32 nb);
/* /*
* Merging the results of several gin scans compares item pointers a lot, * Merging the results of several gin scans compares item pointers a lot,
* so we want this to be inlined. But if the compiler doesn't support that, * so we want this to be inlined. But if the compiler doesn't support that,