BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
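The "Minmax" summaries described in the commit message above can be illustrated with a small standalone sketch. This is not the PostgreSQL implementation: the RangeSummary type and the functions summarize_range, add_value and range_is_consistent are hypothetical names, and the indexed column is assumed to be a plain int. It shows one min/max pair per block range, how an insert widens an existing summary, and how a range qual is checked against the summary to decide whether the whole block range must be returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary for one block range: min and max of an int column. */
typedef struct RangeSummary
{
	int			min;
	int			max;
} RangeSummary;

/* Build the summary from all values stored in one block range. */
static RangeSummary
summarize_range(const int *values, size_t nvalues)
{
	RangeSummary s;
	size_t		i;

	assert(nvalues > 0);
	s.min = s.max = values[0];
	for (i = 1; i < nvalues; i++)
	{
		if (values[i] < s.min)
			s.min = values[i];
		if (values[i] > s.max)
			s.max = values[i];
	}
	return s;
}

/* Widen an existing summary when a new value lands in a summarized range. */
static void
add_value(RangeSummary *s, int value)
{
	if (value < s->min)
		s->min = value;
	if (value > s->max)
		s->max = value;
}

/*
 * Could any row in this range satisfy "lo <= col AND col <= hi"?
 * A true result is lossy: every page in the range still has to be rechecked.
 */
static bool
range_is_consistent(const RangeSummary *s, int lo, int hi)
{
	return s->max >= lo && s->min <= hi;
}
```

A scan would call range_is_consistent once per summary and add every page of each matching range to the lossy bitmap; ranges whose summary cannot match are skipped without reading the heap.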
/*
 * brin_tuple.c
 *		Method implementations for tuples in BRIN indexes.
 *
 * Intended usage is that code outside this file only deals with
 * BrinMemTuples, and converts to and from the on-disk representation through
 * functions in this file.
 *
 * NOTES
 *
 * A BRIN tuple is similar to a heap tuple, with a few key differences.  The
 * first interesting difference is that the tuple header is much simpler, only
 * containing its total length and a small area for flags.  Also, the stored
 * data does not match the relation tuple descriptor exactly: for each
 * attribute in the descriptor, the index tuple carries an arbitrary number
 * of values, depending on the opclass.
 *
 * Also, for each column of the index relation there are two null bits: one
 * (hasnulls) stores whether any tuple within the page range has that column
 * set to null; the other one (allnulls) stores whether the column values are
 * all null.  If allnulls is true, then the tuple data area does not contain
 * values for that column at all; whereas it does if hasnulls is set.
 * Note the size of the null bitmask may not be the same as that of the
 * datum array.
 *
 * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/access/brin/brin_tuple.c
 */
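The two null bits per column described in the header comment can be pictured with the following standalone sketch. It is an illustration under stated assumptions, not the file's actual code: the NullMask type, the SKETCH_MAX_COLS limit and all function names are hypothetical, and the bit order (all allnulls bits first, then all hasnulls bits, with 1 meaning "set") is assumed here for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_COLS 32		/* hypothetical limit, for illustration only */

/* Two null bits per indexed column, packed into one byte array. */
typedef struct NullMask
{
	int			ncols;
	uint8_t		bits[(2 * SKETCH_MAX_COLS + 7) / 8];
} NullMask;

static void
nullmask_init(NullMask *m, int ncols)
{
	assert(ncols > 0 && ncols <= SKETCH_MAX_COLS);
	m->ncols = ncols;
	memset(m->bits, 0, sizeof(m->bits));
}

static void
mask_set_bit(NullMask *m, int bitno)
{
	m->bits[bitno / 8] |= (uint8_t) (1u << (bitno % 8));
}

static bool
mask_get_bit(const NullMask *m, int bitno)
{
	return ((m->bits[bitno / 8] >> (bitno % 8)) & 1) != 0;
}

/* Assumed layout: bits [0, ncols) are allnulls, [ncols, 2 * ncols) are hasnulls. */
static void
mark_allnulls(NullMask *m, int col)
{
	mask_set_bit(m, col);
}

static void
mark_hasnulls(NullMask *m, int col)
{
	mask_set_bit(m, m->ncols + col);
}

static bool
is_allnulls(const NullMask *m, int col)
{
	return mask_get_bit(m, col);
}

static bool
is_hasnulls(const NullMask *m, int col)
{
	return mask_get_bit(m, m->ncols + col);
}

/* Values for a column are stored unless every row in the range is null. */
static bool
column_stores_values(const NullMask *m, int col)
{
	return !is_allnulls(m, col);
}
```

The mask always holds two bits per indexed column, while the number of stored datums depends on how many values each opclass keeps per column, which is why the two sizes need not match.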
#include "postgres.h"

#include "access/brin_tuple.h"
#include "access/detoast.h"
#include "access/heaptoast.h"
#include "access/htup_details.h"
#include "access/toast_internals.h"
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "utils/datum.h"
#include "utils/memutils.h"


/*
 * This enables de-toasting of index entries.  Needed until VACUUM is
 * smart enough to rebuild indexes from scratch.
 */
#define TOAST_INDEX_HACK


static inline void brin_deconstruct_tuple(BrinDesc *brdesc,
										  char *tp, bits8 *nullbits, bool nulls,
										  Datum *values, bool *allnulls, bool *hasnulls);


/*
 * Return a tuple descriptor used for on-disk storage of BRIN tuples.
 */
static TupleDesc
brtuple_disk_tupdesc(BrinDesc *brdesc)
{
	/* We cache these in the BrinDesc */
	if (brdesc->bd_disktdesc == NULL)
	{
		int			i;
		int			j;
		AttrNumber	attno = 1;
		TupleDesc	tupdesc;
		MemoryContext oldcxt;

		/* make sure it's in the bdesc's context */
		oldcxt = MemoryContextSwitchTo(brdesc->bd_context);

Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special-case code to support oid columns. That
was already painful for the existing code, but upcoming work aiming to make
table storage pluggable would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load a binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot of code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to merge this
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
		tupdesc = CreateTemplateTupleDesc(brdesc->bd_totalstored);

		for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
		{
			for (j = 0; j < brdesc->bd_info[i]->oi_nstored; j++)
				TupleDescInitEntry(tupdesc, attno++, NULL,
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
							   brdesc->bd_info[i]->oi_typcache[j]->type_id,
							   -1, 0);
		}

		MemoryContextSwitchTo(oldcxt);

		brdesc->bd_disktdesc = tupdesc;
	}

	return brdesc->bd_disktdesc;
}

/*
 * Generate a new on-disk tuple to be inserted in a BRIN index.
 *
 * See brin_form_placeholder_tuple if you touch this.
 */
BrinTuple *
brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
				Size *size)
{
	Datum	   *values;
	bool	   *nulls;
	bool		anynulls = false;
	BrinTuple  *rettuple;
	int			keyno;
	int			idxattno;
	uint16		phony_infomask = 0;
	bits8	   *phony_nullbitmap;
	Size		len,
				hoff,
				data_len;
	int			i;

#ifdef TOAST_INDEX_HACK
	Datum	   *untoasted_values;
	int			nuntoasted = 0;
#endif

	Assert(brdesc->bd_totalstored > 0);

Fix valgrind's "unaddressable bytes" whining about BRIN code.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much. Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert. If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.
AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory. Valgrind,
however, is picky to the byte level not the MAXALIGN level.
Fix it by pushing the MAXALIGN step back to brin_form_tuple. (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)
In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
2015-05-26 03:56:19 +02:00
	values = (Datum *) palloc(sizeof(Datum) * brdesc->bd_totalstored);
	nulls = (bool *) palloc0(sizeof(bool) * brdesc->bd_totalstored);
	phony_nullbitmap = (bits8 *)
		palloc(sizeof(bits8) * BITMAPLEN(brdesc->bd_totalstored));

BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
2020-11-07 00:39:19 +01:00
#ifdef TOAST_INDEX_HACK
	untoasted_values = (Datum *) palloc(sizeof(Datum) * brdesc->bd_totalstored);
#endif

2014-11-07 20:38:14 +01:00
	/*
	 * Set up the values/nulls arrays for heap_fill_tuple
	 */
	idxattno = 0;
	for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
	{
		int datumno;

		/*
		 * "allnulls" is set when there's no nonnull value in any row in the
		 * column; when this happens, there is no data to store. Thus set the
		 * nullable bits for all data elements of this column and we're done.
		 */
		if (tuple->bt_columns[keyno].bv_allnulls)
		{
			for (datumno = 0;
				 datumno < brdesc->bd_info[keyno]->oi_nstored;
				 datumno++)
				nulls[idxattno++] = true;
			anynulls = true;
			continue;
		}

		/*
		 * The "hasnulls" bit is set when there are some null values in the
		 * data. We still need to store a real value, but the presence of
		 * this means we need a null bitmap.
		 */
		if (tuple->bt_columns[keyno].bv_hasnulls)
			anynulls = true;

2020-11-07 00:39:19 +01:00
		/*
		 * Now obtain the values of each stored datum. Note that some values
		 * might be toasted, and we cannot rely on the original heap values
		 * sticking around forever, so we must detoast them. Also try to
		 * compress them.
		 */
2014-11-07 20:38:14 +01:00
		for (datumno = 0;
			 datumno < brdesc->bd_info[keyno]->oi_nstored;
			 datumno++)
2020-11-07 00:39:19 +01:00
		{
			Datum value = tuple->bt_columns[keyno].bv_values[datumno];

#ifdef TOAST_INDEX_HACK

			/* We must look at the stored type, not at the index descriptor. */
			TypeCacheEntry *atttype = brdesc->bd_info[keyno]->oi_typcache[datumno];

			/* Do we need to free the value at the end? */
			bool free_value = false;

			/* For non-varlena types we don't need to do anything special */
			if (atttype->typlen != -1)
			{
				values[idxattno++] = value;
				continue;
			}

			/*
			 * Do nothing if value is not of varlena type. We don't need to
			 * care about NULL values here, thanks to bv_allnulls above.
			 *
			 * If value is stored EXTERNAL, must fetch it so we are not
			 * depending on outside storage.
			 *
			 * XXX Is this actually true? Could it be that the summary is
			 * NULL even for range with non-NULL data? E.g. degenerate bloom
			 * filter may be thrown away, etc.
			 */
			if (VARATT_IS_EXTERNAL(DatumGetPointer(value)))
			{
				value = PointerGetDatum(detoast_external_attr((struct varlena *)
															  DatumGetPointer(value)));
				free_value = true;
			}

			/*
			 * If value is above size target, and is of a compressible datatype,
			 * try to compress it in-line.
			 */
			if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
				VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
				(atttype->typstorage == TYPSTORAGE_EXTENDED ||
				 atttype->typstorage == TYPSTORAGE_MAIN))
			{
2021-03-21 00:28:13 +01:00
				Datum cvalue;
				char compression;
Allow configurable LZ4 TOAST compression.
There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.
In general, TOAST compression means compression of individual column
values, not the whole tuple, and those values can either be compressed
inline within the tuple or compressed and then stored externally in
the TOAST table, so those properties also apply to this feature.
Prior to this commit, a TOAST pointer has two unused bits as part of
the va_extsize field, and a compressed datum has two unused bits as
part of the va_rawsize field. These bits are unused because the length
of a varlena is limited to 1GB; we now use them to indicate the
compression type that was used. This means we only have bit space for
2 more built-in compression types, but we could work around that
problem, if necessary, by introducing a new vartag_external value for
any further types we end up wanting to add. Hopefully, it won't be
too important to offer a wide selection of algorithms here, since
each one we add not only takes more coding but also adds a build
dependency for every packager. Nevertheless, it seems worth doing
at least this much, because LZ4 gets better compression than PGLZ
with less CPU usage.
It's possible for LZ4-compressed datums to leak into composite type
values stored on disk, just as it is for PGLZ. It's also possible for
LZ4-compressed attributes to be copied into a different table via SQL
commands such as CREATE TABLE AS or INSERT .. SELECT. It would be
expensive to force such values to be decompressed, so PostgreSQL has
never done so. For the same reasons, we also don't force recompression
of already-compressed values even if the target table prefers a
different compression method than was used for the source data. These
architectural decisions are perhaps arguable but revisiting them is
well beyond the scope of what seemed possible to do as part of this
project. However, it's relatively cheap to recompress as part of
VACUUM FULL or CLUSTER, so this commit adjusts those commands to do
so, if the configured compression method of the table happens not to
match what was used for some column value stored therein.
Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allowing a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored. More recently, Justin Pryzby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.
Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
2021-03-19 20:10:38 +01:00
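The bit-reuse idea described in the commit message above can be sketched as follows. This is only an illustration under stated assumptions: the names sketch_pack, sketch_unpack_size, sketch_unpack_method, SKETCH_SIZE_BITS and SKETCH_SIZE_MASK are hypothetical and are not PostgreSQL's actual macros or field accessors. It shows how a 32-bit field can carry a size of up to 30 bits (enough for the 1GB varlena limit) plus a 2-bit compression method ID.

```c
#include <stdint.h>

/* Hypothetical constants: 30 bits hold any size below 1GB, 2 bits remain. */
#define SKETCH_SIZE_BITS 30
#define SKETCH_SIZE_MASK ((1U << SKETCH_SIZE_BITS) - 1)

/* Pack a size (low 30 bits) and a compression method ID (high 2 bits). */
static uint32_t
sketch_pack(uint32_t size, uint32_t method)
{
	return (size & SKETCH_SIZE_MASK) | ((method & 0x3U) << SKETCH_SIZE_BITS);
}

/* Recover the size from the low 30 bits. */
static uint32_t
sketch_unpack_size(uint32_t packed)
{
	return packed & SKETCH_SIZE_MASK;
}

/* Recover the compression method ID from the high 2 bits. */
static uint32_t
sketch_unpack_method(uint32_t packed)
{
	return packed >> SKETCH_SIZE_BITS;
}
```

Two bits give four possible method IDs, which is consistent with the message's remark that, with pglz and lz4 taking two of them, only two more built-in methods fit.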
				Form_pg_attribute att = TupleDescAttr(brdesc->bd_tupdesc,
													  keyno);
2021-03-21 00:28:13 +01:00

				/*
				 * If the BRIN summary and indexed attribute use the same data
2021-03-26 00:55:32 +01:00
				 * type and it has a valid compression method, we can use the
				 * same compression method. Otherwise we have to use the
				 * default method.
2021-03-21 00:28:13 +01:00
				 */
2021-03-26 00:55:32 +01:00
				if (att->atttypid == atttype->type_id &&
					CompressionMethodIsValid(att->attcompression))
2021-03-21 00:28:13 +01:00
					compression = att->attcompression;
				else
					compression = GetDefaultToastCompression();

				cvalue = toast_compress_datum(value, compression);
2020-11-07 00:39:19 +01:00

				if (DatumGetPointer(cvalue) != NULL)
				{
					/* successful compression */
					if (free_value)
						pfree(DatumGetPointer(value));

					value = cvalue;
					free_value = true;
				}
			}

			/*
			 * If we untoasted / compressed the value, we need to free it
			 * after forming the index tuple.
			 */
			if (free_value)
				untoasted_values[nuntoasted++] = value;

#endif

			values[idxattno++] = value;
		}
2014-11-07 20:38:14 +01:00
	}

Fix valgrind's "unaddressable bytes" whining about BRIN code.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much. Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert. If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.
AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory. Valgrind,
however, is picky to the byte level not the MAXALIGN level.
Fix it by pushing the MAXALIGN step back to brin_form_tuple. (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)
In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
2015-05-26 03:56:19 +02:00
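The size mismatch described in the commit message above can be sketched with a little arithmetic. This is a minimal illustration, not PostgreSQL code: sketch_maxalign and sketch_overread are hypothetical names, and the 8-byte maximum alignment is an assumption (the real value is platform-dependent). It shows how many bytes a copy of the MAXALIGN'd length would read past an allocation that was requested at the exact, unaligned length.

```c
#include <stddef.h>

/* Assumed maximum alignment; the real value depends on the platform. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of the assumed maximum alignment. */
static size_t
sketch_maxalign(size_t len)
{
	return (len + (SKETCH_MAXIMUM_ALIGNOF - 1)) &
		~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1));
}

/*
 * Bytes that a copy of sketch_maxalign(exact_len) bytes would read beyond
 * a buffer that was requested with exactly exact_len bytes.
 */
static size_t
sketch_overread(size_t exact_len)
{
	return sketch_maxalign(exact_len) - exact_len;
}
```

Under these assumptions, a 13-byte request copied as 16 bytes reads 3 bytes past the requested size, which is the kind of access a byte-precise checker reports. Aligning the length before allocating, as the fix does, makes the two sizes agree.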
	/* Assert we did not overrun temp arrays */
	Assert(idxattno <= brdesc->bd_totalstored);

2014-11-07 20:38:14 +01:00
	/* compute total space needed */
	len = SizeOfBrinTuple;
	if (anynulls)
	{
		/*
		 * We need a double-length bitmap on an on-disk BRIN index tuple; the
		 * first half stores the "allnulls" bits, the second stores
		 * "hasnulls".
		 */
		len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
	}

	len = hoff = MAXALIGN(len);

	data_len = heap_compute_data_size(brtuple_disk_tupdesc(brdesc),
									  values, nulls);
	len += data_len;

2015-05-26 03:56:19 +02:00
	len = MAXALIGN(len);

2014-11-07 20:38:14 +01:00
	rettuple = palloc0(len);
	rettuple->bt_blkno = blkno;
	rettuple->bt_info = hoff;
2015-05-26 03:56:19 +02:00

	/* Assert that hoff fits in the space available */
2014-11-07 20:38:14 +01:00
	Assert((rettuple->bt_info & BRIN_OFFSET_MASK) == hoff);

	/*
	 * The infomask and null bitmap as computed by heap_fill_tuple are useless
	 * to us. However, that function will not accept a null infomask; and we
	 * need to pass a valid null bitmap so that it will correctly skip
	 * outputting null attributes in the data area.
	 */
	heap_fill_tuple(brtuple_disk_tupdesc(brdesc),
					values,
					nulls,
					(char *) rettuple + hoff,
					data_len,
					&phony_infomask,
					phony_nullbitmap);

	/* done with these */
	pfree(values);
	pfree(nulls);
	pfree(phony_nullbitmap);

2020-11-07 00:39:19 +01:00
#ifdef TOAST_INDEX_HACK
	for (i = 0; i < nuntoasted; i++)
		pfree(DatumGetPointer(untoasted_values[i]));
#endif

2014-11-07 20:38:14 +01:00
	/*
	 * Now fill in the real null bitmasks. allnulls first.
	 */
	if (anynulls)
	{
		bits8 *bitP;
		int bitmask;

		rettuple->bt_info |= BRIN_NULLS_MASK;

		/*
		 * Note that we reverse the sense of null bits in this module: we
		 * store a 1 for a null attribute rather than a 0. So we must reverse
2019-07-01 03:00:23 +02:00
		 * the sense of the att_isnull test in brin_deconstruct_tuple as well.
2014-11-07 20:38:14 +01:00
		 */
		bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
		bitmask = HIGHBIT;
		for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
		{
			if (bitmask != HIGHBIT)
				bitmask <<= 1;
			else
			{
				bitP += 1;
				*bitP = 0x0;
				bitmask = 1;
			}

			if (!tuple->bt_columns[keyno].bv_allnulls)
				continue;

			*bitP |= bitmask;
		}
		/* hasnulls bits follow */
		for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
		{
			if (bitmask != HIGHBIT)
				bitmask <<= 1;
			else
			{
				bitP += 1;
				*bitP = 0x0;
				bitmask = 1;
			}

			if (!tuple->bt_columns[keyno].bv_hasnulls)
				continue;

			*bitP |= bitmask;
		}
	}

	if (tuple->bt_placeholder)
		rettuple->bt_info |= BRIN_PLACEHOLDER_MASK;

	*size = len;
	return rettuple;
}
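
The null-bitmap layout that brin_form_tuple writes (all "allnulls" bits first, then all "hasnulls" bits, with a 1 meaning "null") can be sketched in isolation. This is a simplified illustration, not the function's actual code: sketch_bitmaplen and sketch_fill_null_bitmap are hypothetical names, and plain index arithmetic replaces the running bitP/bitmask walk, on the assumption that bit k lives in byte k / 8 at mask 1 << (k % 8).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: bytes needed to hold nbits bits. */
static size_t
sketch_bitmaplen(int nbits)
{
	return (size_t) (nbits + 7) / 8;
}

/*
 * Fill a double-length bitmap: bit i is column i's "allnulls" flag and
 * bit natts + i is its "hasnulls" flag.  A set bit means "null", which is
 * the reverse of the usual heap-tuple convention.
 */
static void
sketch_fill_null_bitmap(uint8_t *bitmap, int natts,
						const bool *allnulls, const bool *hasnulls)
{
	memset(bitmap, 0, sketch_bitmaplen(natts * 2));

	for (int i = 0; i < natts; i++)
	{
		int hasbit = natts + i;

		if (allnulls[i])
			bitmap[i / 8] |= (uint8_t) (1 << (i % 8));
		if (hasnulls[i])
			bitmap[hasbit / 8] |= (uint8_t) (1 << (hasbit % 8));
	}
}
```

With three columns, for example, the "hasnulls" flag of the second column lands on bit 4 of the first byte, because the three "allnulls" bits come first and the two halves are not byte-aligned.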

/*
 * Generate a new on-disk tuple with no data values, marked as placeholder.
 *
 * This is a cut-down version of brin_form_tuple.
 */
BrinTuple *
brin_form_placeholder_tuple(BrinDesc *brdesc, BlockNumber blkno, Size *size)
{
	Size len;
	Size hoff;
	BrinTuple *rettuple;
	int keyno;
	bits8 *bitP;
	int bitmask;

	/* compute total space needed: always add nulls */
	len = SizeOfBrinTuple;
	len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
	len = hoff = MAXALIGN(len);

	rettuple = palloc0(len);
	rettuple->bt_blkno = blkno;
	rettuple->bt_info = hoff;
	rettuple->bt_info |= BRIN_NULLS_MASK | BRIN_PLACEHOLDER_MASK;

	bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
	bitmask = HIGHBIT;
	/* set allnulls true for all attributes */
	for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
	{
		if (bitmask != HIGHBIT)
			bitmask <<= 1;
		else
		{
			bitP += 1;
			*bitP = 0x0;
			bitmask = 1;
		}

		*bitP |= bitmask;
	}
	/* no need to set hasnulls */

	*size = len;
	return rettuple;
}

/*
 * Free a tuple created by brin_form_tuple
 */
void
brin_free_tuple(BrinTuple *tuple)
{
	pfree(tuple);
}

/*
2017-04-07 23:54:26 +02:00
 * Given a brin tuple of size len, create a copy of it. If 'dest' is not
 * NULL, its size is destsz, and can be used as output buffer; if the tuple
 * to be copied does not fit, it is enlarged by repalloc, and the size is
 * updated to match. This avoids palloc/free cycles when many brin tuples
 * are being processed in loops.
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
*/
|
|
|
|
BrinTuple *
|
2017-04-07 23:54:26 +02:00
|
|
|
brin_copy_tuple(BrinTuple *tuple, Size len, BrinTuple *dest, Size *destsz)
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
{
|
2017-04-07 23:54:26 +02:00
|
|
|
if (!destsz || *destsz == 0)
|
|
|
|
dest = palloc(len);
|
|
|
|
else if (len > *destsz)
|
|
|
|
{
|
|
|
|
dest = repalloc(dest, len);
|
|
|
|
*destsz = len;
|
|
|
|
}
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
|
2017-04-07 23:54:26 +02:00
|
|
|
memcpy(dest, tuple, len);
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
|
2017-04-07 23:54:26 +02:00
|
|
|
return dest;
|
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
}
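
/*
 * Illustration only, not part of brin_tuple.c: a minimal standalone sketch
 * of the grow-on-demand output buffer described above, using malloc and
 * realloc as stand-ins for palloc and repalloc.  The name copy_into_buffer
 * is hypothetical.  The NULL checks are an addition for the sketch, since
 * malloc and realloc report failure by returning NULL.
 */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes from src into dest, whose capacity is *destsz.  The
 * branches mirror the function above: allocate when no usable buffer is
 * given, enlarge (and record the new capacity) when the buffer is too small.
 */
static void *
copy_into_buffer(const void *src, size_t len, void *dest, size_t *destsz)
{
	assert(src != NULL);

	if (destsz == NULL || *destsz == 0)
		dest = malloc(len);		/* no usable buffer: allocate a fresh one */
	else if (len > *destsz)
	{
		void	   *bigger = realloc(dest, len);

		if (bigger == NULL)
			return NULL;		/* the old buffer is still owned by the caller */
		dest = bigger;
		*destsz = len;
	}

	if (dest == NULL)
		return NULL;

	memcpy(dest, src, len);
	return dest;
}
```

/*
 * A caller looping over many items keeps one buffer and its capacity across
 * iterations, so the buffer is reallocated only when a larger item shows up.
 */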

/*
 * Return whether two BrinTuples are bitwise identical.
 */
bool
brin_tuples_equal(const BrinTuple *a, Size alen, const BrinTuple *b, Size blen)
{
	if (alen != blen)
		return false;
	if (memcmp(a, b, alen) != 0)
		return false;
	return true;
}

/*
 * Create a new BrinMemTuple from scratch, and initialize it to an empty
 * state.
 *
 * Note: we don't provide any means to free a deformed tuple, so make sure to
 * use a temporary memory context.
 */
BrinMemTuple *
brin_new_memtuple(BrinDesc *brdesc)
{
	BrinMemTuple *dtup;
	long		basesize;

	basesize = MAXALIGN(sizeof(BrinMemTuple) +
						sizeof(BrinValues) * brdesc->bd_tupdesc->natts);
	dtup = palloc0(basesize + sizeof(Datum) * brdesc->bd_totalstored);

	dtup->bt_values = palloc(sizeof(Datum) * brdesc->bd_totalstored);
	dtup->bt_allnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);
	dtup->bt_hasnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);

	dtup->bt_context = AllocSetContextCreate(CurrentMemoryContext,
											 "brin dtuple",
											 ALLOCSET_DEFAULT_SIZES);

	brin_memtuple_initialize(dtup, brdesc);

	return dtup;
}

/*
 * Reset a BrinMemTuple to initial state. We return the same tuple, for
 * notational convenience.
 */
BrinMemTuple *
brin_memtuple_initialize(BrinMemTuple *dtuple, BrinDesc *brdesc)
{
	int			i;
	char	   *currdatum;

	MemoryContextReset(dtuple->bt_context);

	currdatum = (char *) dtuple +
		MAXALIGN(sizeof(BrinMemTuple) +
				 sizeof(BrinValues) * brdesc->bd_tupdesc->natts);
	for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
	{
		dtuple->bt_columns[i].bv_attno = i + 1;
		dtuple->bt_columns[i].bv_allnulls = true;
		dtuple->bt_columns[i].bv_hasnulls = false;
		dtuple->bt_columns[i].bv_values = (Datum *) currdatum;
		currdatum += sizeof(Datum) * brdesc->bd_info[i]->oi_nstored;
	}

	return dtuple;
}
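
/*
 * Illustration only, not part of brin_tuple.c: a minimal standalone sketch
 * of the single-allocation layout used by brin_new_memtuple and
 * brin_memtuple_initialize, in which a header, a per-column array, and one
 * trailing array of values share a block and each column gets a slice of the
 * trailing array.  All names here (MemTuple, Column, ALIGN_UP, new_memtuple)
 * are hypothetical; Datum is a stand-in typedef, ALIGN_UP assumes 8-byte
 * maximum alignment, and calloc stands in for palloc0.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t Datum;		/* stand-in for PostgreSQL's Datum */

/* Stand-in for MAXALIGN, assuming 8-byte maximum alignment. */
#define ALIGN_UP(x) (((size_t) (x) + 7) & ~(size_t) 7)

typedef struct Column
{
	int			attno;			/* 1-based column number */
	int			allnulls;		/* no non-null value seen yet */
	int			hasnulls;		/* at least one null seen */
	Datum	   *values;			/* slice of the trailing values array */
} Column;

typedef struct MemTuple
{
	int			ncols;
	Column		cols[];			/* flexible array member (C99) */
} MemTuple;

/*
 * Allocate header, column array and values array as one block, then point
 * each column at its slice.  nstored[i] is the number of values column i
 * keeps.
 */
static MemTuple *
new_memtuple(int ncols, const int *nstored)
{
	size_t		total = 0;
	size_t		base;
	MemTuple   *tup;
	char	   *curr;

	assert(ncols > 0);

	for (int i = 0; i < ncols; i++)
		total += (size_t) nstored[i];

	base = ALIGN_UP(sizeof(MemTuple) + sizeof(Column) * (size_t) ncols);
	tup = calloc(1, base + sizeof(Datum) * total);
	if (tup == NULL)
		return NULL;

	tup->ncols = ncols;
	curr = (char *) tup + base;	/* values start right after the aligned base */
	for (int i = 0; i < ncols; i++)
	{
		tup->cols[i].attno = i + 1;
		tup->cols[i].allnulls = 1;
		tup->cols[i].hasnulls = 0;
		tup->cols[i].values = (Datum *) curr;
		curr += sizeof(Datum) * (size_t) nstored[i];
	}

	return tup;
}
```

/*
 * Because the value slices live inside the same block as the header, a
 * single free() releases the whole sketch structure, and resetting it only
 * requires recomputing the slice pointers.
 */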
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Convert a BrinTuple back to a BrinMemTuple. This is the reverse of
|
|
|
|
* brin_form_tuple.
|
|
|
|
*
|
2017-04-07 23:54:26 +02:00
|
|
|
* As an optimization, the caller can pass a previously allocated 'dMemtuple'.
|
|
|
|
* This avoids having to allocate it here, which can be useful when this
|
|
|
|
* function is called many times in a loop. It is caller's responsibility
|
|
|
|
* that the given BrinMemTuple matches what we need here.
|
|
|
|
*
 * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
 * deconstruct the tuple from the on-disk format.
 */
BrinMemTuple *
brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple, BrinMemTuple *dMemtuple)
{
	BrinMemTuple *dtup;
	Datum *values;
	bool *allnulls;
	bool *hasnulls;
	char *tp;
	bits8 *nullbits;
	int keyno;
	int valueno;
	MemoryContext oldcxt;

	dtup = dMemtuple ? brin_memtuple_initialize(dMemtuple, brdesc) :
		brin_new_memtuple(brdesc);

	if (BrinTupleIsPlaceholder(tuple))
		dtup->bt_placeholder = true;
	dtup->bt_blkno = tuple->bt_blkno;

	values = dtup->bt_values;
	allnulls = dtup->bt_allnulls;
	hasnulls = dtup->bt_hasnulls;

	tp = (char *) tuple + BrinTupleDataOffset(tuple);

	if (BrinTupleHasNulls(tuple))
		nullbits = (bits8 *) ((char *) tuple + SizeOfBrinTuple);
	else
		nullbits = NULL;
	brin_deconstruct_tuple(brdesc,
						   tp, nullbits, BrinTupleHasNulls(tuple),
						   values, allnulls, hasnulls);

	/*
	 * Iterate to assign each of the values to the corresponding item in the
	 * values array of each column.  The copies occur in the tuple's context.
	 */
	oldcxt = MemoryContextSwitchTo(dtup->bt_context);
	for (valueno = 0, keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
	{
		int i;

		if (allnulls[keyno])
		{
			valueno += brdesc->bd_info[keyno]->oi_nstored;
			continue;
		}

		/*
		 * We would like to skip datumCopy'ing the values datum in some cases,
		 * caller permitting ...
		 */
		for (i = 0; i < brdesc->bd_info[keyno]->oi_nstored; i++)
			dtup->bt_columns[keyno].bv_values[i] =
				datumCopy(values[valueno++],
						  brdesc->bd_info[keyno]->oi_typcache[i]->typbyval,
						  brdesc->bd_info[keyno]->oi_typcache[i]->typlen);

		dtup->bt_columns[keyno].bv_hasnulls = hasnulls[keyno];
		dtup->bt_columns[keyno].bv_allnulls = false;
	}

	MemoryContextSwitchTo(oldcxt);

	return dtup;
}
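
/*
 * Illustrative sketch only, not part of brin_tuple.c.  It models how the loop
 * in brin_deform_tuple walks the flat values array: each column owns a fixed
 * run of slots, and an all-nulls column skips its run without copying.  The
 * names SketchColumn, MAX_STORED and sketch_distribute are hypothetical, and
 * plain long values stand in for Datums (so there is no datumCopy and no
 * memory context here).
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_STORED 4			/* hypothetical cap, for this sketch only */

/* Hypothetical stand-in for one column of a BrinMemTuple. */
typedef struct SketchColumn
{
	bool		allnulls;
	long		values[MAX_STORED];
} SketchColumn;

/*
 * Distribute a flat array of stored values into per-column slots.
 * nstored[k] plays the role of bd_info[k]->oi_nstored.  Returns the number
 * of flat slots consumed, which equals the sum of nstored[].
 */
int
sketch_distribute(const long *flat, const int *nstored, const bool *allnulls,
				  int natts, SketchColumn *cols)
{
	int			valueno = 0;

	assert(natts >= 0);
	for (int keyno = 0; keyno < natts; keyno++)
	{
		assert(nstored[keyno] <= MAX_STORED);
		cols[keyno].allnulls = allnulls[keyno];
		if (allnulls[keyno])
		{
			/* No data to copy, but the column's slots are still skipped. */
			valueno += nstored[keyno];
			continue;
		}
		for (int i = 0; i < nstored[keyno]; i++)
			cols[keyno].values[i] = flat[valueno++];
	}
	return valueno;
}
```

/*
 * With nstored = {2, 1, 2} and the middle column all nulls, the flat slots
 * 0-1 go to column 0, slot 2 is skipped, and slots 3-4 go to column 2.
 */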

/*
 * brin_deconstruct_tuple
 *		Guts of attribute extraction from an on-disk BRIN tuple.
 *
 * Its arguments are:
 *	brdesc		BRIN descriptor for the stored tuple
 *	tp			pointer to the tuple data area
 *	nullbits	pointer to the tuple nulls bitmask
 *	nulls		"has nulls" bit in tuple infomask
 *	values		output values, array of size brdesc->bd_totalstored
 *	allnulls	output "allnulls", size brdesc->bd_tupdesc->natts
 *	hasnulls	output "hasnulls", size brdesc->bd_tupdesc->natts
 *
 * Output arrays must have been allocated by caller.
 */
static inline void
brin_deconstruct_tuple(BrinDesc *brdesc,
					   char *tp, bits8 *nullbits, bool nulls,
					   Datum *values, bool *allnulls, bool *hasnulls)
{
	int attnum;
	int stored;
	TupleDesc diskdsc;
	long off;

	/*
	 * First iterate to natts to obtain both null flags for each attribute.
	 * Note that we reverse the sense of the att_isnull test, because we store
	 * 1 for a null value (rather than 1 for a not-null value, as is the
	 * att_isnull convention used elsewhere).  See brin_form_tuple.
	 */
	for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
	{
		/*
		 * The "all nulls" bit means that all values in the page range for
		 * this column are nulls.  Therefore there are no values in the tuple
		 * data area.
		 */
		allnulls[attnum] = nulls && !att_isnull(attnum, nullbits);

		/*
		 * The "has nulls" bit means that some tuples have nulls, but others
		 * have not-null values.  Therefore we know the tuple contains data
		 * for this column.
		 *
		 * The hasnulls bits follow the allnulls bits in the same bitmask.
		 */
		hasnulls[attnum] =
			nulls && !att_isnull(brdesc->bd_tupdesc->natts + attnum, nullbits);
	}

	/*
	 * Iterate to obtain each attribute's stored values.  Note that since we
	 * may reuse attribute entries for more than one column, we cannot cache
	 * offsets here.
	 */
	diskdsc = brtuple_disk_tupdesc(brdesc);
	stored = 0;
	off = 0;
	for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
	{
		int datumno;

		if (allnulls[attnum])
		{
			stored += brdesc->bd_info[attnum]->oi_nstored;
			continue;
		}

		for (datumno = 0;
			 datumno < brdesc->bd_info[attnum]->oi_nstored;
			 datumno++)
		{
			Form_pg_attribute thisatt = TupleDescAttr(diskdsc, stored);

			if (thisatt->attlen == -1)
			{
				off = att_align_pointer(off, thisatt->attalign, -1,
										tp + off);
			}
			else
			{
				/* not varlena, so safe to use att_align_nominal */
				off = att_align_nominal(off, thisatt->attalign);
			}

			values[stored++] = fetchatt(thisatt, tp + off);

			off = att_addlength_pointer(off, thisatt->attlen, tp + off);
		}
	}
}
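
/*
 * Illustrative sketch only, not part of brin_tuple.c.  It models the null
 * bitmask layout that brin_deconstruct_tuple decodes: bits [0, natts) are the
 * "all nulls" flags, bits [natts, 2 * natts) are the "has nulls" flags, and a
 * set bit means the flag is true.  sketch_bit_is_clear is a hypothetical
 * stand-in written to behave like att_isnull (true when the bit is clear);
 * the other names are likewise invented for this sketch.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when bit 'att' is clear, mirroring the sense of att_isnull. */
static bool
sketch_bit_is_clear(int att, const uint8_t *bits)
{
	return !(bits[att >> 3] & (1 << (att & 0x07)));
}

/* Set bit 'att', which in this layout means "the flag is true". */
void
sketch_set_flag(int att, uint8_t *bits)
{
	bits[att >> 3] |= (uint8_t) (1 << (att & 0x07));
}

/*
 * Decode both flag arrays.  When the tuple carries no bitmask at all
 * (nulls == false), every flag is false and 'bits' is never dereferenced,
 * thanks to the short-circuit evaluation of &&.
 */
void
sketch_decode_flags(const uint8_t *bits, bool nulls, int natts,
					bool *allnulls, bool *hasnulls)
{
	assert(natts > 0);
	for (int attnum = 0; attnum < natts; attnum++)
	{
		/* Reversed test: a set bit (not "clear") means the flag is on. */
		allnulls[attnum] = nulls && !sketch_bit_is_clear(attnum, bits);
		hasnulls[attnum] = nulls && !sketch_bit_is_clear(natts + attnum, bits);
	}
}
```

/*
 * For three columns the bitmask needs six bits: setting bit 1 marks column 1
 * as all nulls, and setting bit 3 + 2 marks column 2 as having some nulls.
 */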