BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. BRIN indexes work by maintaining "summary" data
about block ranges. Bitmap index scans work by reading each summary
tuple and comparing it with the query quals; all pages in the range are
returned in a lossy TID bitmap if the quals are consistent with the
values in the summary tuple, otherwise not. Normal index scans are not
supported because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
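The minmax idea described in the commit message above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: RangeSummary, summary_add_value and summary_may_match are hypothetical names, the column is a plain int, and none of this is the PostgreSQL API. It shows how a per-range minimum and maximum are maintained, and how a range is kept or skipped for a qual of the form lo <= col <= hi. Treating an unsummarized range as a possible match is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block-range summary for one int column (not a PostgreSQL struct). */
typedef struct RangeSummary
{
	int			min;
	int			max;
	bool		summarized;		/* false until the first value is folded in */
} RangeSummary;

/* Fold one newly inserted value into the summary of its block range. */
static void
summary_add_value(RangeSummary *s, int value)
{
	assert(s != NULL);
	if (!s->summarized)
	{
		s->min = s->max = value;
		s->summarized = true;
		return;
	}
	if (value < s->min)
		s->min = value;
	if (value > s->max)
		s->max = value;
}

/*
 * Could this range hold a row with lo <= col <= hi?  The answer is lossy:
 * "true" only means the pages of the range have to be rechecked.
 * Assumption: a range without a summary cannot be excluded.
 */
static bool
summary_may_match(const RangeSummary *s, int lo, int hi)
{
	assert(s != NULL);
	if (!s->summarized)
		return true;
	return s->max >= lo && s->min <= hi;
}
```

With values 10 and 20 folded into a range, a qual for 15..30 keeps the range while a qual for 21..30 skips it, which is the sense in which the resulting bitmap is lossy.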
/*
 * brin_tuple.c
 *		Method implementations for tuples in BRIN indexes.
 *
 * Intended usage is that code outside this file only deals with
 * BrinMemTuples, and converts to and from the on-disk representation through
 * functions in this file.
 *
 * NOTES
 *
 * A BRIN tuple is similar to a heap tuple, with a few key differences. The
 * first interesting difference is that the tuple header is much simpler, only
 * containing its total length and a small area for flags. Also, the stored
 * data does not match the relation tuple descriptor exactly: for each
 * attribute in the descriptor, the index tuple carries an arbitrary number
 * of values, depending on the opclass.
 *
 * Also, for each column of the index relation there are two null bits: one
 * (hasnulls) stores whether any tuple within the page range has that column
 * set to null; the other one (allnulls) stores whether the column values are
 * all null. If allnulls is true, then the tuple data area does not contain
 * values for that column at all; whereas it does if hasnulls is set.
 * Note the size of the null bitmask may not be the same as that of the
 * datum array.
 *
2019-01-02 18:44:25 +01:00
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/access/brin/brin_tuple.c
 */
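The two null bits per column described in the NOTES above can be illustrated with a small standalone sketch. Everything here is hypothetical (the sketch_ names and the SKETCH_BITMAPLEN macro are not PostgreSQL identifiers), and two layout details are assumptions of the sketch rather than facts stated in the header: the bits are packed low bit first within each byte, and the hasnulls bits follow the allnulls bits without padding. A set bit means "null", matching the reversed sense this file uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed to hold nbits bits (hypothetical stand-in for a bitmap length macro). */
#define SKETCH_BITMAPLEN(nbits) (((nbits) + 7) / 8)

/* Set bit number bitno, counting from the low bit of byte 0 (assumed order). */
static void
sketch_set_bit(uint8_t *bitmap, int bitno)
{
	bitmap[bitno / 8] |= (uint8_t) (1 << (bitno % 8));
}

/* Read bit number bitno using the same ordering. */
static bool
sketch_get_bit(const uint8_t *bitmap, int bitno)
{
	return (bitmap[bitno / 8] >> (bitno % 8)) & 1;
}

/*
 * Fill a double-length null bitmap for natts columns: bits 0..natts-1 hold
 * "allnulls", bits natts..2*natts-1 hold "hasnulls".  The caller provides at
 * least SKETCH_BITMAPLEN(natts * 2) bytes.
 */
static void
sketch_fill_nullbits(uint8_t *bitmap, int natts,
					 const bool *allnulls, const bool *hasnulls)
{
	assert(natts > 0);
	memset(bitmap, 0, SKETCH_BITMAPLEN(natts * 2));
	for (int k = 0; k < natts; k++)
	{
		if (allnulls[k])
			sketch_set_bit(bitmap, k);
		if (hasnulls[k])
			sketch_set_bit(bitmap, natts + k);
	}
}
```

For three columns the bitmap needs six bits and therefore one byte, regardless of how many datums each column stores. That is why the size of the null bitmask can differ from that of the datum array.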
#include "postgres.h"

#include "access/htup_details.h"
#include "access/brin_tuple.h"
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "utils/datum.h"
#include "utils/memutils.h"


static inline void brin_deconstruct_tuple(BrinDesc *brdesc,
										  char *tp, bits8 *nullbits, bool nulls,
										  Datum *values, bool *allnulls, bool *hasnulls);


/*
 * Return a tuple descriptor used for on-disk storage of BRIN tuples.
 */
static TupleDesc
brtuple_disk_tupdesc(BrinDesc *brdesc)
{
	/* We cache these in the BrinDesc */
	if (brdesc->bd_disktdesc == NULL)
	{
		int			i;
		int			j;
		AttrNumber	attno = 1;
		TupleDesc	tupdesc;
		MemoryContext oldcxt;

		/* make sure it's in the bdesc's context */
		oldcxt = MemoryContextSwitchTo(brdesc->bd_context);
Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
was already painful for the existing code, but upcoming work aiming to
make table storage pluggable would have required expanding and
duplicating that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
  WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
  issue a warning when dumping one (and ignore the oid column).
- restoring a pg_dump archive with pg_restore will warn when
  restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load a binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
  OIDS; they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
  plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot of code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used); only oids assigned later will be above
FirstBootstrapObjectId. As the oid column now is a normal column, the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore; all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for,
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide the oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get this merged
now. It's painful to maintain externally, too complicated to commit
after the code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-21 00:36:57 +01:00
		tupdesc = CreateTemplateTupleDesc(brdesc->bd_totalstored);
		for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
		{
			for (j = 0; j < brdesc->bd_info[i]->oi_nstored; j++)
				TupleDescInitEntry(tupdesc, attno++, NULL,
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
								   brdesc->bd_info[i]->oi_typcache[j]->type_id,
								   -1, 0);
		}

		MemoryContextSwitchTo(oldcxt);

		brdesc->bd_disktdesc = tupdesc;
	}

	return brdesc->bd_disktdesc;
}
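The nested loop in brtuple_disk_tupdesc flattens "columns times stored datums" into one run of on-disk attributes. The standalone sketch below mirrors only that numbering. The sketch_ functions and the plain int array standing in for the per-column oi_nstored counts are hypothetical and not part of the PostgreSQL API.

```c
#include <assert.h>

/* Total number of on-disk attributes: the sum of the per-column datum counts. */
static int
sketch_total_stored(const int *nstored, int natts)
{
	int			total = 0;

	assert(natts >= 0);
	for (int i = 0; i < natts; i++)
		total += nstored[i];
	return total;
}

/*
 * 1-based on-disk attribute number of datum "datum" of index column "col",
 * following the same order as the nested loop: all datums of column 0 first,
 * then all datums of column 1, and so on.
 */
static int
sketch_disk_attno(const int *nstored, int col, int datum)
{
	int			attno = 1;

	assert(col >= 0);
	assert(datum >= 0 && datum < nstored[col]);
	for (int i = 0; i < col; i++)
		attno += nstored[i];
	return attno + datum;
}
```

For a three-column index whose opclasses store 2, 2 and 3 datums, the descriptor has 7 attributes and the second datum of the third column is attribute 6.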

/*
 * Generate a new on-disk tuple to be inserted in a BRIN index.
 *
 * See brin_form_placeholder_tuple if you touch this.
 */
BrinTuple *
brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
				Size *size)
{
	Datum	   *values;
	bool	   *nulls;
	bool		anynulls = false;
	BrinTuple  *rettuple;
	int			keyno;
	int			idxattno;
2014-11-10 19:56:08 +01:00
	uint16		phony_infomask = 0;
	bits8	   *phony_nullbitmap;
	Size		len,
				hoff,
				data_len;

	Assert(brdesc->bd_totalstored > 0);
Fix valgrind's "unaddressable bytes" whining about BRIN code.
brin_form_tuple calculated an exact tuple size, then palloc'd and
filled just that much. Later, brin_doinsert or brin_doupdate would
MAXALIGN the tuple size and tell PageAddItem that that was the size
of the tuple to insert. If the original tuple size wasn't a multiple
of MAXALIGN, the net result would be that PageAddItem would memcpy
a few more bytes than the palloc request had been for.
AFAICS, this is totally harmless in the real world: the error is a
read overrun not a write overrun, and palloc would certainly have
rounded the request up to a MAXALIGN multiple internally, so there's
no chance of the memcpy fetching off the end of memory. Valgrind,
however, is picky to the byte level not the MAXALIGN level.
Fix it by pushing the MAXALIGN step back to brin_form_tuple. (The other
possible source of tuples in this code, brin_form_placeholder_tuple,
was already producing a MAXALIGN'd result.)
In passing, be a bit more paranoid about internal allocations in
brin_form_tuple.
2015-05-26 03:56:19 +02:00
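The size arithmetic that this fix moves into brin_form_tuple can be shown with a standalone sketch. The names are hypothetical, and the alignment of 8 bytes is an assumption (the real MAXALIGN boundary is platform-dependent). The header length is rounded up first to give the data offset, the data length is added, and the total is rounded up again so that the allocation already covers the size later passed to PageAddItem.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum alignment; PostgreSQL derives the real value per platform. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of SKETCH_MAXIMUM_ALIGNOF. */
static size_t
sketch_maxalign(size_t len)
{
	return (len + (SKETCH_MAXIMUM_ALIGNOF - 1)) &
		~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1));
}

/*
 * Allocation size for a tuple with header_len bytes of header (including any
 * null bitmap) and data_len bytes of data: align the header to get the data
 * offset, add the data, then align the total.
 */
static size_t
sketch_tuple_alloc_size(size_t header_len, size_t data_len)
{
	size_t		hoff = sketch_maxalign(header_len);

	assert(hoff >= header_len);
	return sketch_maxalign(hoff + data_len);
}
```

With a 5-byte header and 6 bytes of data the exact size would be 14 bytes, but the sketch allocates 16, so a later copy of the aligned size never reads past the request.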
	values = (Datum *) palloc(sizeof(Datum) * brdesc->bd_totalstored);
	nulls = (bool *) palloc0(sizeof(bool) * brdesc->bd_totalstored);
	phony_nullbitmap = (bits8 *)
		palloc(sizeof(bits8) * BITMAPLEN(brdesc->bd_totalstored));

	/*
	 * Set up the values/nulls arrays for heap_fill_tuple
	 */
	idxattno = 0;
	for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
	{
		int			datumno;

		/*
		 * "allnulls" is set when there's no nonnull value in any row in the
		 * column; when this happens, there is no data to store. Thus set the
		 * nullable bits for all data elements of this column and we're done.
		 */
		if (tuple->bt_columns[keyno].bv_allnulls)
		{
			for (datumno = 0;
				 datumno < brdesc->bd_info[keyno]->oi_nstored;
				 datumno++)
				nulls[idxattno++] = true;
			anynulls = true;
			continue;
		}

		/*
		 * The "hasnulls" bit is set when there are some null values in the
		 * data. We still need to store a real value, but the presence of
		 * this means we need a null bitmap.
		 */
		if (tuple->bt_columns[keyno].bv_hasnulls)
			anynulls = true;

		for (datumno = 0;
			 datumno < brdesc->bd_info[keyno]->oi_nstored;
			 datumno++)
			values[idxattno++] = tuple->bt_columns[keyno].bv_values[datumno];
	}
	/* Assert we did not overrun temp arrays */
	Assert(idxattno <= brdesc->bd_totalstored);
	/* compute total space needed */
	len = SizeOfBrinTuple;
	if (anynulls)
	{
		/*
		 * We need a double-length bitmap on an on-disk BRIN index tuple; the
		 * first half stores the "allnulls" bits, the second stores
		 * "hasnulls".
		 */
		len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
	}

	len = hoff = MAXALIGN(len);

	data_len = heap_compute_data_size(brtuple_disk_tupdesc(brdesc),
									  values, nulls);
	len += data_len;
	len = MAXALIGN(len);
	rettuple = palloc0(len);
	rettuple->bt_blkno = blkno;
	rettuple->bt_info = hoff;
	/* Assert that hoff fits in the space available */
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
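The minmax idea in the commit message above can be sketched as follows. This is an illustrative toy over `int` values, not the BRIN opclass API; `SketchRangeSummary`, `sketch_summarize` and `sketch_range_may_contain` are hypothetical names, and treating an unsummarized range as "must be scanned" is stated here as an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Summary of one block range: smallest and largest value seen in it. */
typedef struct SketchRangeSummary
{
	int			min;
	int			max;
	bool		summarized;		/* false until at least one value is folded in */
} SketchRangeSummary;

/* Build the summary for the values stored in one block range. */
static SketchRangeSummary
sketch_summarize(const int *values, size_t n)
{
	SketchRangeSummary s = {0, 0, false};
	size_t		i;

	assert(values != NULL || n == 0);
	for (i = 0; i < n; i++)
	{
		if (!s.summarized)
		{
			s.min = s.max = values[i];
			s.summarized = true;
		}
		else
		{
			if (values[i] < s.min)
				s.min = values[i];
			if (values[i] > s.max)
				s.max = values[i];
		}
	}
	return s;
}

/*
 * Is "column = key" consistent with the summary?  A true result only means
 * the range has to be scanned (the answer is lossy); false means the whole
 * range can be skipped.  Assumption: unsummarized ranges are always scanned.
 */
static bool
sketch_range_may_contain(const SketchRangeSummary *s, int key)
{
	if (!s->summarized)
		return true;
	return key >= s->min && key <= s->max;
}
```

A range holding 5, 9 and 7 is summarized as [5, 9]; a lookup for 6 still has to visit the range even though no row matches, while a lookup for 10 can skip it entirely.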
	Assert((rettuple->bt_info & BRIN_OFFSET_MASK) == hoff);

	/*
	 * The infomask and null bitmap as computed by heap_fill_tuple are useless
	 * to us. However, that function will not accept a null infomask; and we
	 * need to pass a valid null bitmap so that it will correctly skip
	 * outputting null attributes in the data area.
	 */
	heap_fill_tuple(brtuple_disk_tupdesc(brdesc),
					values,
					nulls,
					(char *) rettuple + hoff,
					data_len,
					&phony_infomask,
					phony_nullbitmap);

	/* done with these */
	pfree(values);
	pfree(nulls);
	pfree(phony_nullbitmap);

	/*
	 * Now fill in the real null bitmasks. allnulls first.
	 */
	if (anynulls)
	{
		bits8	   *bitP;
		int			bitmask;

		rettuple->bt_info |= BRIN_NULLS_MASK;

		/*
		 * Note that we reverse the sense of null bits in this module: we
		 * store a 1 for a null attribute rather than a 0. So we must reverse
		 * the sense of the att_isnull test in brin_deconstruct_tuple as well.
		 */
		bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
		bitmask = HIGHBIT;
		for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
		{
			if (bitmask != HIGHBIT)
				bitmask <<= 1;
			else
			{
				bitP += 1;
				*bitP = 0x0;
				bitmask = 1;
			}

			if (!tuple->bt_columns[keyno].bv_allnulls)
				continue;

			*bitP |= bitmask;
		}
		/* hasnulls bits follow */
		for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
		{
			if (bitmask != HIGHBIT)
				bitmask <<= 1;
			else
			{
				bitP += 1;
				*bitP = 0x0;
				bitmask = 1;
			}

			if (!tuple->bt_columns[keyno].bv_hasnulls)
				continue;

			*bitP |= bitmask;
		}
		bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
	}

	if (tuple->bt_placeholder)
		rettuple->bt_info |= BRIN_PLACEHOLDER_MASK;

	*size = len;
	return rettuple;
}
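The null-bitmap layout written by the two loops above (one "allnulls" bit per column, followed by one "hasnulls" bit per column, with 1 meaning set) can be sketched in isolation. This is a standalone illustration using plain `uint8_t` buffers; the `sketch_` names are hypothetical and nothing here touches the real BrinTuple layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes needed to hold nbits bits. */
static size_t
sketch_bitmaplen(int nbits)
{
	assert(nbits >= 0);
	return ((size_t) nbits + 7) / 8;
}

/* Set bit number bitno, counting from the least significant bit of byte 0. */
static void
sketch_set_bit(uint8_t *bitmap, int bitno)
{
	bitmap[bitno / 8] |= (uint8_t) (1u << (bitno % 8));
}

/*
 * Fill a bitmap of 2 * natts bits: bits [0, natts) carry the allnulls
 * flags, bits [natts, 2 * natts) carry the hasnulls flags.  A set bit
 * means "true", mirroring the reversed null-bit sense noted above.
 */
static void
sketch_fill_null_bitmap(uint8_t *bitmap, int natts,
						const bool *allnulls, const bool *hasnulls)
{
	int			i;

	memset(bitmap, 0, sketch_bitmaplen(natts * 2));
	for (i = 0; i < natts; i++)
		if (allnulls[i])
			sketch_set_bit(bitmap, i);
	for (i = 0; i < natts; i++)
		if (hasnulls[i])
			sketch_set_bit(bitmap, natts + i);
}
```

The sketch indexes bits directly instead of carrying a running `bitP`/`bitmask` pair across both loops, but it produces the same bit positions: with three columns, the hasnulls flag of the second column lands in bit 4 of the first byte.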

/*
 * Generate a new on-disk tuple with no data values, marked as placeholder.
 *
 * This is a cut-down version of brin_form_tuple.
 */
BrinTuple *
brin_form_placeholder_tuple(BrinDesc *brdesc, BlockNumber blkno, Size *size)
{
	Size		len;
	Size		hoff;
	BrinTuple  *rettuple;
	int			keyno;
	bits8	   *bitP;
	int			bitmask;

	/* compute total space needed: always add nulls */
	len = SizeOfBrinTuple;
	len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
	len = hoff = MAXALIGN(len);

	rettuple = palloc0(len);
	rettuple->bt_blkno = blkno;
	rettuple->bt_info = hoff;
	rettuple->bt_info |= BRIN_NULLS_MASK | BRIN_PLACEHOLDER_MASK;

	bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
	bitmask = HIGHBIT;
	/* set allnulls true for all attributes */
	for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
	{
		if (bitmask != HIGHBIT)
			bitmask <<= 1;
		else
		{
			bitP += 1;
			*bitP = 0x0;
			bitmask = 1;
		}

		*bitP |= bitmask;
	}
	/* no need to set hasnulls */

	*size = len;
	return rettuple;
}

/*
 * Free a tuple created by brin_form_tuple
 */
void
brin_free_tuple(BrinTuple *tuple)
{
	pfree(tuple);
}

/*
2017-04-07 23:54:26 +02:00
 * Given a brin tuple of size len, create a copy of it. If 'dest' is not
 * NULL, its size is destsz, and can be used as output buffer; if the tuple
 * to be copied does not fit, it is enlarged by repalloc, and the size is
 * updated to match. This avoids palloc/free cycles when many brin tuples
 * are being processed in loops.
 */
BrinTuple *
brin_copy_tuple(BrinTuple *tuple, Size len, BrinTuple *dest, Size *destsz)
{
	if (!destsz || *destsz == 0)
		dest = palloc(len);
	else if (len > *destsz)
	{
		dest = repalloc(dest, len);
		*destsz = len;
	}

	memcpy(dest, tuple, len);

	return dest;
}
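The buffer-reuse pattern that the comment on brin_copy_tuple describes can be sketched with the standard allocator. This is a hedged illustration, not the PostgreSQL function: `sketch_copy_into` is a hypothetical name, it uses `malloc`/`realloc` in place of `palloc`/`repalloc`, it reports allocation failure by returning NULL, and, unlike the code above, it also records the size when it allocates a fresh buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes from src into a reusable buffer.
 *
 * If dest is NULL (or no size is known) a new buffer is allocated; if the
 * existing buffer is too small it is grown.  *destsz, when provided, is
 * updated to the buffer's size.  Returns the buffer, or NULL on allocation
 * failure (in which case the old buffer, if any, is left untouched).
 */
static void *
sketch_copy_into(const void *src, size_t len, void *dest, size_t *destsz)
{
	assert(src != NULL);

	if (dest == NULL || destsz == NULL || *destsz == 0)
	{
		dest = malloc(len);
		if (dest == NULL)
			return NULL;
		if (destsz != NULL)
			*destsz = len;
	}
	else if (len > *destsz)
	{
		void	   *grown = realloc(dest, len);

		if (grown == NULL)
			return NULL;
		dest = grown;
		*destsz = len;
	}

	memcpy(dest, src, len);
	return dest;
}
```

A caller that processes many items in a loop keeps one `(buffer, size)` pair and passes it to every call; the allocator is only involved when an item is larger than anything seen so far.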

/*
 * Return whether two BrinTuples are bitwise identical.
 */
bool
brin_tuples_equal(const BrinTuple *a, Size alen, const BrinTuple *b, Size blen)
{
	if (alen != blen)
		return false;
	if (memcmp(a, b, alen) != 0)
		return false;
	return true;
}

/*
 * Create a new BrinMemTuple from scratch, and initialize it to an empty
 * state.
 *
 * Note: we don't provide any means to free a deformed tuple, so make sure to
 * use a temporary memory context.
 */
BrinMemTuple *
brin_new_memtuple(BrinDesc *brdesc)
{
	BrinMemTuple *dtup;
	long		basesize;

	basesize = MAXALIGN(sizeof(BrinMemTuple) +
						sizeof(BrinValues) * brdesc->bd_tupdesc->natts);
	dtup = palloc0(basesize + sizeof(Datum) * brdesc->bd_totalstored);

	dtup->bt_values = palloc(sizeof(Datum) * brdesc->bd_totalstored);
	dtup->bt_allnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);
	dtup->bt_hasnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);

	dtup->bt_context = AllocSetContextCreate(CurrentMemoryContext,
											 "brin dtuple",
Add macros to make AllocSetContextCreate() calls simpler and safer.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
2016-08-27 23:50:38 +02:00
											 ALLOCSET_DEFAULT_SIZES);

	brin_memtuple_initialize(dtup, brdesc);

	return dtup;
}
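brin_new_memtuple sizes one allocation as an aligned header-plus-columns part followed by a flat array of datums, and brin_memtuple_initialize then points each column into that array. A minimal standalone sketch of that layout follows; every `Sketch`/`sketch_` name is hypothetical, `uintptr_t` stands in for Datum, `calloc` stands in for `palloc0`, and the 8-byte alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAXALIGN 8
#define SKETCH_ALIGN_UP(len) \
	(((size_t) (len) + (SKETCH_MAXALIGN - 1)) & ~((size_t) (SKETCH_MAXALIGN - 1)))

typedef uintptr_t SketchDatum;	/* stand-in for a Datum */

typedef struct SketchColumn
{
	int			attno;			/* 1-based column number */
	SketchDatum *values;		/* points into the trailing datum area */
} SketchColumn;

typedef struct SketchMemTuple
{
	int			natts;
	SketchColumn columns[];		/* natts entries, then the datum area */
} SketchMemTuple;

/*
 * Allocate header, per-column array and datum area in one zeroed chunk,
 * then give each column its slice of the datum area.  nstored[i] is the
 * number of datums column i keeps.  Returns NULL on allocation failure.
 */
static SketchMemTuple *
sketch_new_memtuple(int natts, const int *nstored)
{
	size_t		basesize;
	size_t		total = 0;
	SketchMemTuple *tup;
	char	   *curr;
	int			i;

	assert(natts > 0 && nstored != NULL);
	basesize = SKETCH_ALIGN_UP(sizeof(SketchMemTuple) +
							   sizeof(SketchColumn) * (size_t) natts);
	for (i = 0; i < natts; i++)
		total += (size_t) nstored[i];

	tup = calloc(1, basesize + sizeof(SketchDatum) * total);
	if (tup == NULL)
		return NULL;

	tup->natts = natts;
	curr = (char *) tup + basesize;
	for (i = 0; i < natts; i++)
	{
		tup->columns[i].attno = i + 1;
		tup->columns[i].values = (SketchDatum *) curr;
		curr += sizeof(SketchDatum) * (size_t) nstored[i];
	}
	return tup;
}
```

Because everything lives in one chunk, a single `free` releases the whole structure, and resetting the tuple only requires recomputing the per-column pointers rather than reallocating.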

/*
 * Reset a BrinMemTuple to initial state. We return the same tuple, for
 * notational convenience.
 */
BrinMemTuple *
brin_memtuple_initialize(BrinMemTuple *dtuple, BrinDesc *brdesc)
{
	int			i;
	char	   *currdatum;

	MemoryContextReset(dtuple->bt_context);

	currdatum = (char *) dtuple +
		MAXALIGN(sizeof(BrinMemTuple) +
				 sizeof(BrinValues) * brdesc->bd_tupdesc->natts);
	for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
	{
		dtuple->bt_columns[i].bv_attno = i + 1;
		dtuple->bt_columns[i].bv_allnulls = true;
		dtuple->bt_columns[i].bv_hasnulls = false;
		dtuple->bt_columns[i].bv_values = (Datum *) currdatum;
		currdatum += sizeof(Datum) * brdesc->bd_info[i]->oi_nstored;
	}

	return dtuple;
BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes. They work by maintaining "summary" data about
block ranges. Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not. Normal index scans are not supported
because these indexes do not store TIDs.
As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.
For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range. This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results. In this commit I only include minmax.
Catalog version bumped due to new builtin catalog entries.
There's more that could be done here, but this is a good step forwards.
Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.
Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.
PS:
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633.
2014-11-07 20:38:14 +01:00
|
|
|
}

/*
 * Convert a BrinTuple back to a BrinMemTuple.  This is the reverse of
 * brin_form_tuple.
 *
 * As an optimization, the caller can pass a previously allocated 'dMemtuple'.
 * This avoids having to allocate it here, which can be useful when this
 * function is called many times in a loop.  It is the caller's responsibility
 * to ensure that the given BrinMemTuple matches what we need here.
 *
 * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
 * deconstruct the tuple from the on-disk format.
 */
BrinMemTuple *
brin_deform_tuple(BrinDesc *brdesc, BrinTuple *tuple, BrinMemTuple *dMemtuple)
{
    BrinMemTuple *dtup;
    Datum      *values;
    bool       *allnulls;
    bool       *hasnulls;
    char       *tp;
    bits8      *nullbits;
    int         keyno;
    int         valueno;
    MemoryContext oldcxt;

    dtup = dMemtuple ? brin_memtuple_initialize(dMemtuple, brdesc) :
        brin_new_memtuple(brdesc);

    if (BrinTupleIsPlaceholder(tuple))
        dtup->bt_placeholder = true;
    dtup->bt_blkno = tuple->bt_blkno;

    values = dtup->bt_values;
    allnulls = dtup->bt_allnulls;
    hasnulls = dtup->bt_hasnulls;

    tp = (char *) tuple + BrinTupleDataOffset(tuple);

    if (BrinTupleHasNulls(tuple))
        nullbits = (bits8 *) ((char *) tuple + SizeOfBrinTuple);
    else
        nullbits = NULL;
    brin_deconstruct_tuple(brdesc,
                           tp, nullbits, BrinTupleHasNulls(tuple),
                           values, allnulls, hasnulls);

    /*
     * Iterate to assign each of the values to the corresponding item in the
     * values array of each column.  The copies occur in the tuple's context.
     */
    oldcxt = MemoryContextSwitchTo(dtup->bt_context);
    for (valueno = 0, keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
    {
        int         i;

        if (allnulls[keyno])
        {
            valueno += brdesc->bd_info[keyno]->oi_nstored;
            continue;
        }

        /*
         * We would like to skip datumCopy'ing the values datum in some cases,
         * caller permitting ...
         */
        for (i = 0; i < brdesc->bd_info[keyno]->oi_nstored; i++)
            dtup->bt_columns[keyno].bv_values[i] =
                datumCopy(values[valueno++],
                          brdesc->bd_info[keyno]->oi_typcache[i]->typbyval,
                          brdesc->bd_info[keyno]->oi_typcache[i]->typlen);

        dtup->bt_columns[keyno].bv_hasnulls = hasnulls[keyno];
        dtup->bt_columns[keyno].bv_allnulls = false;
    }

    MemoryContextSwitchTo(oldcxt);

    return dtup;
}
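
The copy loop in brin_deform_tuple walks a flat values array while stepping over the slots that belong to all-null columns. The following is a minimal standalone sketch of that indexing pattern only. It does not use PostgreSQL headers: `SketchColumn` and `sketch_scatter_values` are hypothetical names, plain `long` values stand in for `Datum`, and a simple assignment stands in for `datumCopy`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one column of an in-memory BRIN tuple. */
typedef struct SketchColumn
{
    int     nstored;    /* number of summary values (role of oi_nstored) */
    bool    allnulls;   /* true if the column has no stored values */
    long   *values;     /* caller-allocated array of nstored entries */
} SketchColumn;

/*
 * Scatter a flat array of summary values into per-column arrays.
 *
 * The flat array reserves nstored slots for every column, including
 * all-null ones, so the read position must advance past those slots
 * even though nothing is copied for them.
 *
 * Returns the number of flat slots consumed.
 */
int
sketch_scatter_values(const long *flat, SketchColumn *cols, int ncols)
{
    int     valueno = 0;

    assert(flat != NULL && cols != NULL && ncols >= 0);

    for (int keyno = 0; keyno < ncols; keyno++)
    {
        if (cols[keyno].allnulls)
        {
            /* Skip the reserved slots; leave the column's array untouched. */
            valueno += cols[keyno].nstored;
            continue;
        }

        for (int i = 0; i < cols[keyno].nstored; i++)
            cols[keyno].values[i] = flat[valueno++];
    }

    return valueno;
}
```

Keeping a single running index (`valueno` here) rather than computing a per-column offset mirrors the loop above and works even when columns store different numbers of values.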

/*
 * brin_deconstruct_tuple
 *      Guts of attribute extraction from an on-disk BRIN tuple.
 *
 * Its arguments are:
 *  brdesc      BRIN descriptor for the stored tuple
 *  tp          pointer to the tuple data area
 *  nullbits    pointer to the tuple nulls bitmask
 *  nulls       "has nulls" bit in tuple infomask
 *  values      output values, array of size brdesc->bd_totalstored
 *  allnulls    output "allnulls", size brdesc->bd_tupdesc->natts
 *  hasnulls    output "hasnulls", size brdesc->bd_tupdesc->natts
 *
 * Output arrays must have been allocated by caller.
 */
static inline void
brin_deconstruct_tuple(BrinDesc *brdesc,
                       char *tp, bits8 *nullbits, bool nulls,
                       Datum *values, bool *allnulls, bool *hasnulls)
{
    int         attnum;
    int         stored;
    TupleDesc   diskdsc;
    long        off;

    /*
     * First iterate to natts to obtain both null flags for each attribute.
     * Note that we reverse the sense of the att_isnull test, because we store
     * 1 for a null value (rather than a 1 for a not null value as is the
     * att_isnull convention used elsewhere.)  See brin_form_tuple.
     */
    for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
    {
        /*
         * the "all nulls" bit means that all values in the page range for
         * this column are nulls.  Therefore there are no values in the tuple
         * data area.
         */
        allnulls[attnum] = nulls && !att_isnull(attnum, nullbits);

        /*
         * the "has nulls" bit means that some tuples have nulls, but others
         * have not-null values.  Therefore we know the tuple contains data
         * for this column.
         *
         * The hasnulls bits follow the allnulls bits in the same bitmask.
         */
        hasnulls[attnum] =
            nulls && !att_isnull(brdesc->bd_tupdesc->natts + attnum, nullbits);
    }

    /*
     * Iterate to obtain each attribute's stored values.  Note that since we
     * may reuse attribute entries for more than one column, we cannot cache
     * offsets here.
     */
    diskdsc = brtuple_disk_tupdesc(brdesc);
    stored = 0;
    off = 0;
    for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
    {
        int         datumno;

        if (allnulls[attnum])
        {
            stored += brdesc->bd_info[attnum]->oi_nstored;
            continue;
        }

        for (datumno = 0;
             datumno < brdesc->bd_info[attnum]->oi_nstored;
             datumno++)
        {
            Form_pg_attribute thisatt = TupleDescAttr(diskdsc, stored);

            if (thisatt->attlen == -1)
            {
                off = att_align_pointer(off, thisatt->attalign, -1,
                                        tp + off);
            }
            else
            {
                /* not varlena, so safe to use att_align_nominal */
                off = att_align_nominal(off, thisatt->attalign);
            }

            values[stored++] = fetchatt(thisatt, tp + off);

            off = att_addlength_pointer(off, thisatt->attlen, tp + off);
        }
    }
}