postgresql/src/include/access/brin_pageops.h

/*
 * brin_pageops.h
 *		Prototypes for operating on BRIN pages.
 *
 * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/include/access/brin_pageops.h
 */
#ifndef BRIN_PAGEOPS_H
#define BRIN_PAGEOPS_H

#include "access/brin_revmap.h"

extern bool brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
						  BrinRevmap *revmap, BlockNumber heapBlk,
						  Buffer oldbuf, OffsetNumber oldoff,
						  const BrinTuple *origtup, Size origsz,
						  const BrinTuple *newtup, Size newsz,
						  bool samepage);
extern bool brin_can_do_samepage_update(Buffer buffer, Size origsz,
										Size newsz);
extern OffsetNumber brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
								  BrinRevmap *revmap, Buffer *buffer, BlockNumber heapBlk,
								  BrinTuple *tup, Size itemsz);

extern void brin_page_init(Page page, uint16 type);
extern void brin_metapage_init(Page page, BlockNumber pagesPerRange,
							   uint16 version);

extern bool brin_start_evacuating_page(Relation idxRel, Buffer buf);
extern void brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
							   BrinRevmap *revmap, Buffer buf);

extern void brin_page_cleanup(Relation idxrel, Buffer buf);

#endif							/* BRIN_PAGEOPS_H */

BRIN: Block Range Indexes

BRIN is a new index access method intended to accelerate scans of very large tables, without the maintenance overhead of btrees or other traditional indexes. They work by maintaining "summary" data about block ranges. Bitmap index scans work by reading each summary tuple and comparing them with the query quals; all pages in the range are returned in a lossy TID bitmap if the quals are consistent with the values in the summary tuple, otherwise not. Normal index scans are not supported because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is updated (if the block range in which the tuple is added is already summarized) or not; in the latter case, a subsequent pass of VACUUM or the brin_summarize_new_values() function will create the summary information.

For data types with natural 1-D sort orders, the summary info consists of the maximum and the minimum values of each indexed column within each page range. This type of operator class we call "Minmax", and we supply a bunch of them for most data types with B-tree opclasses. Since the BRIN code is generalized, other approaches are possible for things such as arrays, geometric types, ranges, etc; even for things such as enum types we could do something different than minmax with better results. In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera, with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

2014-11-07 20:38:14 +01:00
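
The minmax idea in the commit message above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not the PostgreSQL implementation: RangeSummary, summary_add_value, summary_consistent and collect_ranges are hypothetical names, the indexed column is assumed to be a plain int, and the qual is assumed to be a closed interval lo <= col <= hi. Unsummarized ranges are treated as always matching, since nothing is known about their contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary for one block range: min and max of an int column. */
typedef struct RangeSummary
{
	bool		summarized;	/* false until a summarization pass fills it in */
	int			min;
	int			max;
} RangeSummary;

/*
 * Widen an existing summary so it covers a newly added value.  A range that
 * is not summarized yet is left alone for a later summarization pass.
 */
void
summary_add_value(RangeSummary *s, int value)
{
	if (!s->summarized)
		return;
	if (value < s->min)
		s->min = value;
	if (value > s->max)
		s->max = value;
}

/* Could any row in this range satisfy lo <= col <= hi? */
bool
summary_consistent(const RangeSummary *s, int lo, int hi)
{
	if (!s->summarized)
		return true;		/* nothing known: the range must be scanned */
	return s->max >= lo && s->min <= hi;
}

/*
 * Lossy "bitmap": store the numbers of all ranges that may contain matches
 * in out[] (which must have room for nranges entries) and return the count.
 */
size_t
collect_ranges(const RangeSummary *sums, size_t nranges,
			   int lo, int hi, size_t *out)
{
	size_t		n = 0;

	assert(lo <= hi);
	for (size_t i = 0; i < nranges; i++)
	{
		if (summary_consistent(&sums[i], lo, hi))
			out[n++] = i;
	}
	return n;
}
```

The result is lossy in the same sense as the commit message describes: a returned range only may contain matching rows, so every page in it still has to be rechecked against the qual.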

Update copyright for 2021

Backpatch-through: 9.5

2021-01-02 19:06:25 +01:00

Improve FSM management for BRIN indexes.

BRIN indexes like to propagate additions of free space into the upper pages of their free space maps as soon as the new space is known, even when it's just on one individual index page. Previously this required calling FreeSpaceMapVacuum, which is quite an expensive thing if the map is large. Use the FreeSpaceMapVacuumRange function recently added by commit c79f6df75 to reduce the amount of work done for this purpose.

Fix a couple of places that neglected to do the upper-page vacuuming at all after recording new free space. If the policy is to be that BRIN should do that, it should do it everywhere.

Do RecordPageWithFreeSpace unconditionally in brin_page_cleanup, and do FreeSpaceMapVacuum unconditionally in brin_vacuum_scan. Because of the FSM's imprecise storage of free space, the old complications here seldom bought anything, they just slowed things down. This approach also provides a predictable path for FSM corruption to be repaired.

Remove premature RecordPageWithFreeSpace call in brin_getinsertbuffer where it's about to return an extended page to the caller. The caller should do that, instead, after it's inserted its new tuple. Fix the one caller that forgot to do so.

Simplify logic in brin_doupdate's same-page-update case by postponing brin_initialize_empty_new_buffer to after the critical section; I see little point in doing it before.

Avoid repeat calls of RelationGetNumberOfBlocks in brin_vacuum_scan. Avoid duplicate BufferGetBlockNumber and BufferGetPage calls in a couple of places where we already had the right values.

Move a BRIN_elog debug logging call out of a critical section; that's pretty unsafe and I don't think it buys us anything to not wait till after the critical section.

Move the "extended = false" step in brin_getinsertbuffer into the routine's main loop. There's no actual bug there, since the loop can't iterate with extended still true, but it doesn't seem very future-proof as coded; and it's certainly not documented as a loop invariant.

This is all from follow-on investigation inspired by commit c79f6df75.

Discussion: https://postgr.es/m/5801.1522429460@sss.pgh.pa.us

2018-04-04 20:26:04 +02:00
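
Two points from the commit message above, the FSM's imprecise storage of free space and the saving from vacuuming only a range of the map, can be illustrated with a toy model. Everything here is an assumption made for the sketch: ToyFsm, space_to_category, record_free_space and vacuum_range are hypothetical names, the map has only one upper level with a fan-out of 4, and free space is quantized to one byte per page with a step of block size / 256. The real free space map is laid out differently; this is not its API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE		8192				/* assumed block size */
#define CAT_STEP		(BLOCK_SIZE / 256)	/* bytes per free-space category */
#define LEAVES_PER_SLOT	4					/* hypothetical fan-out */
#define NLEAVES			16
#define NSLOTS			(NLEAVES / LEAVES_PER_SLOT)

/* Toy two-level map: one byte per page, plus the max of each group of pages. */
typedef struct ToyFsm
{
	uint8_t		leaf[NLEAVES];
	uint8_t		upper[NSLOTS];
} ToyFsm;

/* Quantize a byte count into a one-byte category; precision is lost here. */
uint8_t
space_to_category(size_t avail)
{
	size_t		cat = avail / CAT_STEP;

	return (uint8_t) (cat > 255 ? 255 : cat);
}

/* Record free space for one page; upper slots are NOT updated here. */
void
record_free_space(ToyFsm *fsm, size_t page, size_t avail)
{
	assert(page < NLEAVES);
	fsm->leaf[page] = space_to_category(avail);
}

/*
 * Recompute only the upper slots covering pages [start, end).  Returns the
 * number of slots recomputed, to show how much work a narrow range avoids.
 */
size_t
vacuum_range(ToyFsm *fsm, size_t start, size_t end)
{
	size_t		first;
	size_t		last;

	assert(start < end && end <= NLEAVES);
	first = start / LEAVES_PER_SLOT;
	last = (end - 1) / LEAVES_PER_SLOT;

	for (size_t s = first; s <= last; s++)
	{
		uint8_t		max = 0;

		for (size_t i = 0; i < LEAVES_PER_SLOT; i++)
		{
			uint8_t		c = fsm->leaf[s * LEAVES_PER_SLOT + i];

			if (c > max)
				max = c;
		}
		fsm->upper[s] = max;
	}
	return last - first + 1;
}
```

In this model, recording 100 or 127 free bytes yields the same category, which is one way to see why logic that tries to skip "unchanged" updates gains little. Vacuuming the single-page range [5, 6) touches one upper slot, while vacuuming the whole map touches all four.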

Close some holes in BRIN page assignment

In some corner cases, it is possible for the BRIN index relation to be extended by brin_getinsertbuffer but the new page not be used immediately for anything by its callers; when this happens, the page is initialized and the FSM is updated (by brin_getinsertbuffer) with the info about that page, but these actions are not WAL-logged. A later index insert/update can use the page, but since the page is already initialized, the initialization itself is not WAL-logged then either. Replay of this sequence of events causes recovery to fail altogether.

There is a related corner case within brin_getinsertbuffer itself, in which we extend the relation to put a new index tuple there, but later find out that we cannot do so, and do not return the buffer; the page obtained from extension is not even initialized. The resulting page is lost forever.

To fix, shuffle the code so that initialization is not the responsibility of brin_getinsertbuffer anymore, in normal cases; instead, the initialization is done by its callers (brin_doinsert and brin_doupdate) once they're certain that the page is going to be used.

When either those functions determine that the new page cannot be used, before bailing out they initialize the page as an empty regular page, enter it in FSM and WAL-log all this. This way, the page is usable for future index insertions, and WAL replay doesn't find trying to insert tuples in pages whose initialization didn't make it to the WAL. The same strategy is used in brin_getinsertbuffer when it cannot return the new page.

Additionally, add a new step to vacuuming so that all pages of the index are scanned; whenever an uninitialized page is found, it is initialized as empty and WAL-logged. This closes the hole that the relation is extended but the system crashes before anything is WAL-logged about it. We also take this opportunity to update the FSM, in case it has gotten out of date.

Thanks to Heikki Linnakangas for finding the problem that kicked some additional analysis of BRIN page assignment code.

Backpatch to 9.5, where BRIN was introduced.

Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org

2015-08-12 19:20:38 +02:00
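
The vacuum step described in the commit message above (scan every index page and initialize any page that was never initialized) can be modeled roughly as follows. This is a simplified sketch, not the code behind brin_page_cleanup: ToyPage, ToyPageState, toy_page_cleanup, toy_vacuum_scan and TOY_EMPTY_PAGE_SPACE are all hypothetical, and buffers, locking, real WAL records and the FSM are reduced to plain struct fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical amount of usable space on a freshly initialized page. */
#define TOY_EMPTY_PAGE_SPACE 8000

typedef enum ToyPageState
{
	TOY_PAGE_NEW,			/* relation was extended, page never initialized */
	TOY_PAGE_REGULAR		/* initialized as a regular page */
} ToyPageState;

typedef struct ToyPage
{
	ToyPageState state;
	bool		wal_logged;		/* did the initialization reach the WAL? */
	size_t		free_space;		/* what would be reported to the FSM */
} ToyPage;

/*
 * Clean up one page: if it is still uninitialized, make it an empty regular
 * page, mark the initialization as WAL-logged, and record its free space.
 * Returns true if the page had to be fixed.
 */
bool
toy_page_cleanup(ToyPage *page)
{
	assert(page != NULL);
	if (page->state != TOY_PAGE_NEW)
		return false;

	page->state = TOY_PAGE_REGULAR;
	page->wal_logged = true;
	page->free_space = TOY_EMPTY_PAGE_SPACE;
	return true;
}

/* Visit every page of the index; returns how many pages were repaired. */
size_t
toy_vacuum_scan(ToyPage *pages, size_t npages)
{
	size_t		fixed = 0;

	for (size_t i = 0; i < npages; i++)
	{
		if (toy_page_cleanup(&pages[i]))
			fixed++;
	}
	return fixed;
}
```

Running the scan a second time repairs nothing, which is the property the commit relies on: once an orphaned page has been initialized and logged, later inserts can use it without replay ever seeing a tuple land on a page whose initialization is missing from the WAL.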