BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
/*
|
|
|
|
* brin_page.h
|
|
|
|
* Prototypes and definitions for BRIN page layouts
|
|
|
|
*
|
2021-01-02 19:06:25 +01:00
|
|
|
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
|
|
|
* src/include/access/brin_page.h
|
|
|
|
*
|
|
|
|
* NOTES
|
|
|
|
*
|
|
|
|
* These structs should really be private to specific BRIN files, but it's
|
|
|
|
* useful to have them here so that they can be used by pageinspect and similar
|
|
|
|
* tools.
|
|
|
|
*/
|
|
|
|
#ifndef BRIN_PAGE_H
|
|
|
|
#define BRIN_PAGE_H
|
|
|
|
|
|
|
|
#include "storage/block.h"
|
|
|
|
#include "storage/itemptr.h"
|
|
|
|
|
2015-03-10 16:26:34 +01:00
|
|
|
/*
|
|
|
|
* Special area of BRIN pages.
|
|
|
|
*
|
|
|
|
* We define it in this odd way so that it always occupies the last
|
|
|
|
* MAXALIGN-sized element of each page.
|
|
|
|
*/
|
|
|
|
typedef struct BrinSpecialSpace
|
|
|
|
{
|
|
|
|
uint16 vector[MAXALIGN(1) / sizeof(uint16)];
|
|
|
|
} BrinSpecialSpace;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make the page type be the last half-word in the page, for consumption by
|
|
|
|
* pg_filedump and similar utilities. We don't really care much about the
|
|
|
|
* position of the "flags" half-word, but it's simpler to apply a consistent
|
|
|
|
* rule to both.
|
|
|
|
*
|
|
|
|
* See comments above GinPageOpaqueData.
|
|
|
|
*/
|
|
|
|
#define BrinPageType(page) \
|
|
|
|
(((BrinSpecialSpace *) \
|
|
|
|
PageGetSpecialPointer(page))->vector[MAXALIGN(1) / sizeof(uint16) - 1])
|
|
|
|
|
|
|
|
#define BrinPageFlags(page) \
|
|
|
|
(((BrinSpecialSpace *) \
|
|
|
|
PageGetSpecialPointer(page))->vector[MAXALIGN(1) / sizeof(uint16) - 2])
|
|
|
|
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
/* special space on all BRIN pages stores a "type" identifier */
|
|
|
|
#define BRIN_PAGETYPE_META 0xF091
|
|
|
|
#define BRIN_PAGETYPE_REVMAP 0xF092
|
|
|
|
#define BRIN_PAGETYPE_REGULAR 0xF093
|
|
|
|
|
Close some holes in BRIN page assignment
In some corner cases, it is possible for the BRIN index relation to be
extended by brin_getinsertbuffer but the new page not be used
immediately for anything by its callers; when this happens, the page is
initialized and the FSM is updated (by brin_getinsertbuffer) with the
info about that page, but these actions are not WAL-logged. A later
index insert/update can use the page, but since the page is already
initialized, the initialization itself is not WAL-logged then either.
Replay of this sequence of events causes recovery to fail altogether.
There is a related corner case within brin_getinsertbuffer itself, in
which we extend the relation to put a new index tuple there, but later
find out that we cannot do so, and do not return the buffer; the page
obtained from extension is not even initialized. The resulting page is
lost forever.
To fix, shuffle the code so that initialization is not the
responsibility of brin_getinsertbuffer anymore, in normal cases;
instead, the initialization is done by its callers (brin_doinsert and
brin_doupdate) once they're certain that the page is going to be used.
When either those functions determine that the new page cannot be used,
before bailing out they initialize the page as an empty regular page,
enter it in FSM and WAL-log all this. This way, the page is usable for
future index insertions, and WAL replay doesn't find trying to insert
tuples in pages whose initialization didn't make it to the WAL. The
same strategy is used in brin_getinsertbuffer when it cannot return the
new page.
Additionally, add a new step to vacuuming so that all pages of the index
are scanned; whenever an uninitialized page is found, it is initialized
as empty and WAL-logged. This closes the hole that the relation is
extended but the system crashes before anything is WAL-logged about it.
We also take this opportunity to update the FSM, in case it has gotten
out of date.
Thanks to Heikki Linnakangas for finding the problem that kicked some
additional analysis of BRIN page assignment code.
Backpatch to 9.5, where BRIN was introduced.
Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org
2015-08-12 19:20:38 +02:00
|
|
|
#define BRIN_IS_META_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_META)
|
2015-03-10 16:26:34 +01:00
|
|
|
#define BRIN_IS_REVMAP_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_REVMAP)
|
|
|
|
#define BRIN_IS_REGULAR_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_REGULAR)
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
|
|
|
|
/* flags for BrinSpecialSpace */
|
|
|
|
#define BRIN_EVACUATE_PAGE (1 << 0)
|
|
|
|
|
|
|
|
|
|
|
|
/* Metapage definitions */
|
|
|
|
typedef struct BrinMetaPageData
|
|
|
|
{
|
|
|
|
uint32 brinMagic;
|
|
|
|
uint32 brinVersion;
|
|
|
|
BlockNumber pagesPerRange;
|
|
|
|
BlockNumber lastRevmapPage;
|
|
|
|
} BrinMetaPageData;
|
|
|
|
|
|
|
|
#define BRIN_CURRENT_VERSION 1
|
|
|
|
#define BRIN_META_MAGIC 0xA8109CFA
|
|
|
|
|
|
|
|
#define BRIN_METAPAGE_BLKNO 0
|
|
|
|
|
|
|
|
/* Definitions for revmap pages */
|
|
|
|
typedef struct RevmapContents
|
|
|
|
{
|
2015-02-21 07:46:43 +01:00
|
|
|
/*
|
|
|
|
* This array will fill all available space on the page. It should be
|
|
|
|
* declared [FLEXIBLE_ARRAY_MEMBER], but for some reason you can't do that
|
|
|
|
* in an otherwise-empty struct.
|
|
|
|
*/
|
|
|
|
ItemPointerData rm_tids[1];
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
} RevmapContents;
|
|
|
|
|
|
|
|
#define REVMAP_CONTENT_SIZE \
|
|
|
|
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
|
|
|
|
offsetof(RevmapContents, rm_tids) - \
|
|
|
|
MAXALIGN(sizeof(BrinSpecialSpace)))
|
|
|
|
/* max num of items in the array */
|
|
|
|
#define REVMAP_PAGE_MAXITEMS \
|
|
|
|
(REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
|
|
|
|
|
|
|
|
#endif /* BRIN_PAGE_H */
|