/*-------------------------------------------------------------------------
 *
 * nbtsort.c
 *		Build a btree from sorted input by loading leaf pages sequentially.
 *
 * NOTES
 *
 * We use tuplesort.c to sort the given index tuples into order.
 * Then we scan the index tuples in order and build the btree pages
 * for each level.  We load source tuples into leaf-level pages.
 * Whenever we fill a page at one level, we add a link to it to its
 * parent level (starting a new parent level if necessary).  When
 * done, we write out each final page on each level, adding it to
 * its parent level.  When we have only one page on a level, it must be
 * the root -- it can be attached to the btree metapage and we are done.
 *
 * It is not wise to pack the pages entirely full, since then *any*
 * insertion would cause a split (and not only of the leaf page; the need
 * for a split would cascade right up the tree).  The steady-state load
 * factor for btrees is usually estimated at 70%.  We choose to pack leaf
 * pages to the user-controllable fill factor (default 90%) while upper pages
 * are always packed to 70%.  This gives us reasonable density (there aren't
 * many upper pages if the keys are reasonable-size) without risking a lot of
 * cascading splits during early insertions.
 *
 * Formerly the index pages being built were kept in shared buffers, but
 * that is of no value (since other backends have no interest in them yet)
 * and it created locking problems for CHECKPOINT, because the upper-level
 * pages were held exclusive-locked for long periods.  Now we just build
 * the pages in local memory and smgrwrite or smgrextend them as we finish
 * them.  They will need to be re-read into shared buffers on first use after
 * the build finishes.
 *
 * Since the index will never be used unless it is completely built,
 * from a crash-recovery point of view there is no need to WAL-log the
 * steps of the build.  After completing the index build, we can just sync
 * the whole file to disk using smgrimmedsync() before exiting this module.
 * This can be seen to be sufficient for crash recovery by considering that
 * it's effectively equivalent to what would happen if a CHECKPOINT occurred
 * just after the index build.  However, it is clearly not sufficient if the
 * DBA is using the WAL log for PITR or replication purposes, since another
 * machine would not be able to reconstruct the index from WAL.  Therefore,
 * we log the completed index pages to WAL if and only if WAL archiving is
 * active.
 *
 * This code isn't concerned about the FSM at all. The caller is responsible
 * for initializing that.
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/access/nbtree/nbtsort.c
 *
 *-------------------------------------------------------------------------
 */
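
/*
 * To make the fill-factor arithmetic above concrete, here is a minimal
 * sketch (illustrative only; the variable names are hypothetical, and the
 * build code below derives the equivalent thresholds when it sets up each
 * level):
 *
 *		Size	leaf_full = BLCKSZ * (100 - 90) / 100;		(819 bytes free)
 *		Size	upper_full = BLCKSZ * (100 - 70) / 100;		(2457 bytes free)
 *
 * assuming the default BLCKSZ of 8192.  A page is treated as "full" once
 * its remaining free space drops below the threshold for its level, at
 * which point it is written out and a new page is started.
 */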

#include "postgres.h"

#include "access/nbtree.h"
#include "access/parallel.h"
#include "access/relscan.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "catalog/index.h"
#include "commands/progress.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/smgr.h"
#include "tcop/tcopprot.h"		/* pgrminclude ignore */
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"

/* Magic numbers for parallel state sharing */
#define PARALLEL_KEY_BTREE_SHARED		UINT64CONST(0xA000000000000001)
#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2	UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA000000000000004)
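
/*
 * A brief sketch of how these keys are used (illustrative; pcxt and toc
 * stand in for the parallel context and table-of-contents set up by the
 * parallel-build routines below).  The leader publishes each piece of
 * shared state under a key, and workers look the same state up by key:
 *
 *		shm_toc_insert(pcxt->toc, PARALLEL_KEY_BTREE_SHARED, btshared);
 *		...
 *		btshared = shm_toc_lookup(toc, PARALLEL_KEY_BTREE_SHARED, false);
 */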

/*
 * DISABLE_LEADER_PARTICIPATION disables the leader's participation in
 * parallel index builds.  This may be useful as a debugging aid.
#undef DISABLE_LEADER_PARTICIPATION
 */

/*
 * Status record for spooling/sorting phase.  (Note we may have two of
 * these due to the special requirements for uniqueness-checking with
 * dead tuples.)
 */
typedef struct BTSpool
{
	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
	Relation	heap;
	Relation	index;
	bool		isunique;
} BTSpool;

/*
 * Status for index builds performed in parallel.  This is allocated in a
 * dynamic shared memory segment.  Note that there is a separate tuplesort TOC
 * entry, private to tuplesort.c but allocated by this module on its behalf.
 */
typedef struct BTShared
{
	/*
	 * These fields are not modified during the sort.  They primarily exist
	 * for the benefit of worker processes that need to create BTSpool state
	 * corresponding to that used by the leader.
	 */
	Oid			heaprelid;
	Oid			indexrelid;
	bool		isunique;
	bool		isconcurrent;
	int			scantuplesortstates;

	/*
	 * workersdonecv is used to monitor the progress of workers.  All parallel
	 * participants must indicate that they are done before leader can use
	 * mutable state that workers maintain during scan (and before leader can
	 * proceed to tuplesort_performsort()).
	 */
	ConditionVariable workersdonecv;

	/*
	 * mutex protects all following fields.
	 *
	 * These fields contain status information of interest to B-Tree index
	 * builds that must work just the same when an index is built in parallel.
	 */
	slock_t		mutex;

	/*
	 * Mutable state that is maintained by workers, and reported back to
	 * leader at end of parallel scan.
	 *
	 * nparticipantsdone is number of worker processes finished.
	 *
	 * reltuples is the total number of input heap tuples.
	 *
	 * havedead indicates if RECENTLY_DEAD tuples were encountered during
	 * build.
	 *
	 * indtuples is the total number of tuples that made it into the index.
	 *
	 * brokenhotchain indicates if any worker detected a broken HOT chain
	 * during build.
	 */
	int			nparticipantsdone;
	double		reltuples;
	bool		havedead;
	double		indtuples;
	bool		brokenhotchain;

	/*
	 * ParallelTableScanDescData data follows. Can't directly embed here, as
	 * implementations of the parallel table scan desc interface might need
	 * stronger alignment.
	 */
} BTShared;

/*
 * Return pointer to a BTShared's parallel table scan.
 *
 * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
 * MAXALIGN.
 */
#define ParallelTableScanFromBTShared(shared) \
	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(BTShared)))
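
/*
 * Typical use of the macro, sketched (the actual call sites appear in the
 * parallel-build routines later in this file): the leader hands the scan
 * descriptor that trails BTShared to the table AM like so:
 *
 *		table_parallelscan_initialize(btspool->heap,
 *									  ParallelTableScanFromBTShared(btshared),
 *									  snapshot);
 */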

/*
 * Status for leader in parallel index build.
 */
typedef struct BTLeader
{
	/* parallel context itself */
	ParallelContext *pcxt;

	/*
	 * nparticipanttuplesorts is the exact number of worker processes
	 * successfully launched, plus one leader process if it participates as a
	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
	 * participating as a worker).
	 */
	int			nparticipanttuplesorts;

	/*
	 * Leader process convenience pointers to shared state (leader avoids TOC
	 * lookups).
	 *
	 * btshared is the shared state for entire build.  sharedsort is the
	 * shared, tuplesort-managed state passed to each process tuplesort.
	 * sharedsort2 is the corresponding btspool2 shared state, used only when
	 * building unique indexes.  snapshot is the snapshot used by the scan iff
	 * an MVCC snapshot is required.
	 */
	BTShared   *btshared;
	Sharedsort *sharedsort;
	Sharedsort *sharedsort2;
	Snapshot	snapshot;
} BTLeader;

/*
 * Working state for btbuild and its callback.
 *
 * When parallel CREATE INDEX is used, there is a BTBuildState for each
 * participant.
 */
typedef struct BTBuildState
{
	bool		isunique;
	bool		havedead;
	Relation	heap;
	BTSpool    *spool;

	/*
	 * spool2 is needed only when the index is a unique index.  Dead tuples
	 * are put into spool2 instead of spool in order to avoid the uniqueness
	 * check.
	 */
	BTSpool    *spool2;
	double		indtuples;

	/*
	 * btleader is only present when a parallel index build is performed, and
	 * only in the leader process.  (Actually, only the leader has a
	 * BTBuildState.  Workers have their own spool and spool2, though.)
	 */
	BTLeader   *btleader;
} BTBuildState;
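
/*
 * The spool2 arrangement reduces to a simple routing rule in the build
 * callback, sketched here (illustrative; tid, values, isnull, and
 * tupleIsAlive are the scan callback's arguments):
 *
 *		if (tupleIsAlive || buildstate->spool2 == NULL)
 *			_bt_spool(buildstate->spool, tid, values, isnull);
 *		else
 *			_bt_spool(buildstate->spool2, tid, values, isnull);
 */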

/*
 * Status record for a btree page being built.  We have one of these
 * for each active tree level.
 */
typedef struct BTPageState
{
	Page		btps_page;		/* workspace for page building */
	BlockNumber btps_blkno;		/* block # to write this page at */
	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
	OffsetNumber btps_lastoff;	/* last item offset loaded */
	Size		btps_lastextra; /* last item's extra posting list space */
	uint32		btps_level;		/* tree level (0 = leaf) */
	Size		btps_full;		/* "full" if less than this much free space */
	struct BTPageState *btps_next;	/* link to parent level, if any */
} BTPageState;
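
/*
 * A rough sketch of how btps_next drives the cascade described in the file
 * header (illustrative pseudocode, not the actual control flow): when a
 * page at one level fills up, a pivot tuple for it is added one level up,
 * creating that level on demand.
 *
 *		if (page at *state is full)
 *		{
 *			if (state->btps_next == NULL)
 *				state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 *			_bt_buildadd(wstate, state->btps_next, pivot, 0);
 *		}
 *
 * Here "pivot" stands for the finished page's separator key, which the real
 * code derives from btps_lowkey.
 */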

/*
 * Overall status record for index writing phase.
 */
typedef struct BTWriteState
{
	Relation	heap;
	Relation	index;
	BTScanInsert inskey;		/* generic insertion scankey */
	bool		btws_use_wal;	/* dump pages to WAL? */
	BlockNumber btws_pages_alloced; /* # pages allocated */
	BlockNumber btws_pages_written; /* # pages written out */
	Page		btws_zeropage;	/* workspace for filling zeroes */
} BTWriteState;
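
/*
 * The alloced/written split lets pages receive block numbers before they
 * are finished, so they can be written out of order.  A condensed sketch of
 * the gap-filling idea (illustrative; the elided smgr arguments identify
 * the index relation and fork):
 *
 *		while (wstate->btws_pages_written < blkno)
 *			smgrextend(..., wstate->btws_pages_written++,
 *					   wstate->btws_zeropage, true);
 *
 * i.e. any gap up to a completed page's target block is filled with zeroed
 * pages from btws_zeropage before that page itself is written.
 */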

static double _bt_spools_heapscan(Relation heap, Relation index,
								  BTBuildState *buildstate, IndexInfo *indexInfo);
static void _bt_spooldestroy(BTSpool *btspool);
static void _bt_spool(BTSpool *btspool, ItemPointer self,
					  Datum *values, bool *isnull);
static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
							   bool *isnull, bool tupleIsAlive, void *state);
static Page _bt_blnewpage(uint32 level);
static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
						   IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
						 IndexTuple itup, Size truncextra);
static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
										  BTPageState *state,
										  BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
					 BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
							   int request);
static void _bt_end_parallel(BTLeader *btleader);
static Size _bt_parallel_estimate_shared(Relation heap, Snapshot snapshot);
static double _bt_parallel_heapscan(BTBuildState *buildstate,
									bool *brokenhotchain);
static void _bt_leader_participate_as_worker(BTBuildState *buildstate);
static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
									   BTShared *btshared, Sharedsort *sharedsort,
									   Sharedsort *sharedsort2, int sortmem,
									   bool progress);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
|
|
|
|
/*
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
|
|
|
* btbuild() -- build a new btree index.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
IndexBuildResult *
btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
{
    IndexBuildResult *result;
    BTBuildState buildstate;
    double      reltuples;

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
        ResetUsage();
#endif                          /* BTREE_BUILD_STATS */

    buildstate.isunique = indexInfo->ii_Unique;
    buildstate.havedead = false;
    buildstate.heap = heap;
    buildstate.spool = NULL;
    buildstate.spool2 = NULL;
    buildstate.indtuples = 0;
    buildstate.btleader = NULL;

    /*
     * We expect to be called exactly once for any index relation. If that's
     * not the case, big trouble's what we have.
     */
    if (RelationGetNumberOfBlocks(index) != 0)
        elog(ERROR, "index \"%s\" already contains data",
             RelationGetRelationName(index));

    reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);

    /*
     * Finish the build by (1) completing the sort of the spool file, (2)
     * inserting the sorted tuples into btree pages and (3) building the upper
     * levels.  Finally, it may also be necessary to end use of parallelism.
     */
    _bt_leafbuild(buildstate.spool, buildstate.spool2);
    _bt_spooldestroy(buildstate.spool);
    if (buildstate.spool2)
        _bt_spooldestroy(buildstate.spool2);
    if (buildstate.btleader)
        _bt_end_parallel(buildstate.btleader);

    result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));

    result->heap_tuples = reltuples;
    result->index_tuples = buildstate.indtuples;

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
    {
        ShowUsage("BTREE BUILD STATS");
        ResetUsage();
    }
#endif                          /* BTREE_BUILD_STATS */

    return result;
}
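
/*
 * Editorial sketch (an assumption, not part of the original file): btbuild()
 * is the btree implementation of the ambuild callback, wired up by the AM
 * handler along the lines of
 *
 *     amroutine->ambuild = btbuild;
 *
 * so a SQL-level CREATE INDEX ... USING btree reaches this function through
 * the index AM interface.
 */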

/*
 * Create and initialize one or two spool structures, and save them in caller's
 * buildstate argument.  May also fill in fields within indexInfo used by index
 * builds.
 *
 * Scans the heap, possibly in parallel, filling spools with IndexTuples.  This
 * routine encapsulates all aspects of managing parallelism.  Caller need only
 * call _bt_end_parallel() in parallel case after it is done with spool/spool2.
 *
 * Returns the total number of heap tuples scanned.
 */
static double
_bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
                    IndexInfo *indexInfo)
{
    BTSpool    *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
    SortCoordinate coordinate = NULL;
    double      reltuples = 0;

    /*
     * We size the sort area as maintenance_work_mem rather than work_mem to
     * speed index creation.  This should be OK since a single backend can't
     * run multiple index creations in parallel (see also: notes on
     * parallelism and maintenance_work_mem below).
     */
    btspool->heap = heap;
    btspool->index = index;
    btspool->isunique = indexInfo->ii_Unique;

    /* Save as primary spool */
    buildstate->spool = btspool;

    /* Report table scan phase started */
    pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                 PROGRESS_BTREE_PHASE_INDEXBUILD_TABLESCAN);

    /* Attempt to launch parallel worker scan when required */
    if (indexInfo->ii_ParallelWorkers > 0)
        _bt_begin_parallel(buildstate, indexInfo->ii_Concurrent,
                           indexInfo->ii_ParallelWorkers);

    /*
     * If parallel build requested and at least one worker process was
     * successfully launched, set up coordination state
     */
    if (buildstate->btleader)
    {
        coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
        coordinate->isWorker = false;
        coordinate->nParticipants =
            buildstate->btleader->nparticipanttuplesorts;
        coordinate->sharedsort = buildstate->btleader->sharedsort;
    }

    /*
     * Begin serial/leader tuplesort.
     *
     * In cases where parallelism is involved, the leader receives the same
     * share of maintenance_work_mem as a serial sort (it is generally treated
     * in the same way as a serial sort once we return).  Parallel worker
     * Tuplesortstates will have received only a fraction of
     * maintenance_work_mem, though.
     *
     * We rely on the lifetime of the Leader Tuplesortstate almost not
     * overlapping with any worker Tuplesortstate's lifetime.  There may be
     * some small overlap, but that's okay because we rely on leader
     * Tuplesortstate only allocating a small, fixed amount of memory here.
     * When its tuplesort_performsort() is called (by our caller), and
     * significant amounts of memory are likely to be used, all workers must
     * have already freed almost all memory held by their Tuplesortstates
     * (they are about to go away completely, too).  The overall effect is
     * that maintenance_work_mem always represents an absolute high watermark
     * on the amount of memory used by a CREATE INDEX operation, regardless of
     * the use of parallelism or any other factor.
     */
    buildstate->spool->sortstate =
        tuplesort_begin_index_btree(heap, index, buildstate->isunique,
                                    maintenance_work_mem, coordinate,
                                    false);
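
    /*
     * Editorial note (an assumption, not original text): in the parallel
     * case each worker Tuplesortstate is started with only a share of the
     * budget, roughly
     *
     *     sortmem = maintenance_work_mem / btleader->nparticipanttuplesorts;
     *
     * (the sortmem argument of _bt_parallel_scan_and_sort), which is what
     * keeps maintenance_work_mem a high watermark for the whole build.
     */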

    /*
     * If building a unique index, put dead tuples in a second spool to keep
     * them out of the uniqueness check.  We expect that the second spool (for
     * dead tuples) won't get very full, so we give it only work_mem.
     */
    if (indexInfo->ii_Unique)
    {
        BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
        SortCoordinate coordinate2 = NULL;

        /* Initialize secondary spool */
        btspool2->heap = heap;
        btspool2->index = index;
        btspool2->isunique = false;
        /* Save as secondary spool */
        buildstate->spool2 = btspool2;

        if (buildstate->btleader)
        {
            /*
             * Set up non-private state that is passed to
             * tuplesort_begin_index_btree() about the basic high level
             * coordination of a parallel sort.
             */
            coordinate2 = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
            coordinate2->isWorker = false;
            coordinate2->nParticipants =
                buildstate->btleader->nparticipanttuplesorts;
            coordinate2->sharedsort = buildstate->btleader->sharedsort2;
        }

        /*
         * We expect that the second one (for dead tuples) won't get very
         * full, so we give it only work_mem
         */
        buildstate->spool2->sortstate =
            tuplesort_begin_index_btree(heap, index, false, work_mem,
                                        coordinate2, false);
    }

    /* Fill spool using either serial or parallel heap scan */
    if (!buildstate->btleader)
        reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
                                           _bt_build_callback, (void *) buildstate,
                                           NULL);
    else
        reltuples = _bt_parallel_heapscan(buildstate,
                                          &indexInfo->ii_BrokenHotChain);

    /*
     * Set the progress target for the next phase.  Reset the block number
     * values set by table_index_build_scan
     */
    {
        const int   index[] = {
            PROGRESS_CREATEIDX_TUPLES_TOTAL,
            PROGRESS_SCAN_BLOCKS_TOTAL,
            PROGRESS_SCAN_BLOCKS_DONE
        };
        const int64 val[] = {
            buildstate->indtuples,
            0, 0
        };

        pgstat_progress_update_multi_param(3, index, val);
    }
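
    /*
     * Editorial aside (an assumption): the parameters updated here are the
     * ones surfaced by the pg_stat_progress_create_index view, e.g. its
     * tuples_total, blocks_total and blocks_done columns.
     */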

    /* okay, all heap tuples are spooled */
    if (buildstate->spool2 && !buildstate->havedead)
    {
        /* spool2 turns out to be unnecessary */
        _bt_spooldestroy(buildstate->spool2);
        buildstate->spool2 = NULL;
    }

    return reltuples;
}

/*
 * clean up a spool structure and its substructures.
 */
static void
_bt_spooldestroy(BTSpool *btspool)
{
    tuplesort_end(btspool->sortstate);
    pfree(btspool);
}

/*
 * spool an index entry into the sort file.
 */
static void
_bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
{
    tuplesort_putindextuplevalues(btspool->sortstate, btspool->index,
                                  self, values, isnull);
}
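
/*
 * Editorial note (an assumption): values/isnull describe one would-be index
 * tuple in index-column order; tuplesort_putindextuplevalues() forms the
 * IndexTuple inside the tuplesort, which buffers it in memory and spills to
 * temporary files once its memory budget is exhausted.
 */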

/*
 * given a spool loaded by successive calls to _bt_spool,
 * create an entire btree.
 */
static void
_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
{
    BTWriteState wstate;

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
    {
        ShowUsage("BTREE BUILD (Spool) STATISTICS");
        ResetUsage();
    }
#endif                          /* BTREE_BUILD_STATS */

    pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                 PROGRESS_BTREE_PHASE_PERFORMSORT_1);
    tuplesort_performsort(btspool->sortstate);
    if (btspool2)
    {
        pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                     PROGRESS_BTREE_PHASE_PERFORMSORT_2);
        tuplesort_performsort(btspool2->sortstate);
    }

    wstate.heap = btspool->heap;
    wstate.index = btspool->index;
    wstate.inskey = _bt_mkscankey(wstate.index, NULL);
    /* _bt_mkscankey() won't set allequalimage without metapage */
    wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);

    /*
     * We need to log index creation in WAL iff WAL archiving/streaming is
     * enabled UNLESS the index isn't WAL-logged anyway.
     */
    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
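
    /*
     * Editorial note (an assumption): with wal_level at replica or higher,
     * XLogIsNeeded() is true, so a permanent index gets btws_use_wal = true
     * here, while an unlogged or temporary index (RelationNeedsWAL() false)
     * skips WAL entirely and relies on the final sync of the written file.
     */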

    /* reserve the metapage */
    wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
    wstate.btws_pages_written = 0;
    wstate.btws_zeropage = NULL;    /* until needed */

    pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                 PROGRESS_BTREE_PHASE_LEAF_LOAD);
    _bt_load(&wstate, btspool, btspool2);
}

/*
 * Per-tuple callback for table_index_build_scan
 */
static void
_bt_build_callback(Relation index,
                   ItemPointer tid,
                   Datum *values,
                   bool *isnull,
                   bool tupleIsAlive,
                   void *state)
{
    BTBuildState *buildstate = (BTBuildState *) state;

    /*
     * insert the index tuple into the appropriate spool file for subsequent
     * processing
     */
    if (tupleIsAlive || buildstate->spool2 == NULL)
        _bt_spool(buildstate->spool, tid, values, isnull);
    else
    {
        /* dead tuples are put into spool2 */
        buildstate->havedead = true;
        _bt_spool(buildstate->spool2, tid, values, isnull);
    }

    buildstate->indtuples += 1;
}
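
/*
 * Editorial aside (an assumption): routing recently-dead tuples to spool2
 * keeps them out of the unique sort in the primary spool, so a dead
 * duplicate cannot raise a spurious uniqueness error; _bt_load() later
 * merges both sorted streams when building the leaf level.
 */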

/*
 * allocate workspace for a new, clean btree page, not linked to any siblings.
 */
static Page
_bt_blnewpage(uint32 level)
{
    Page        page;
    BTPageOpaque opaque;

    page = (Page) palloc(BLCKSZ);

    /* Zero the page and set up standard page header info */
    _bt_pageinit(page, BLCKSZ);

    /* Initialize BT opaque state */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    opaque->btpo_prev = opaque->btpo_next = P_NONE;
    opaque->btpo.level = level;
    opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
    opaque->btpo_cycleid = 0;

    /* Make the P_HIKEY line pointer appear allocated */
    ((PageHeader) page)->pd_lower += sizeof(ItemIdData);

    return page;
}
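
/*
 * Editorial note (an assumption): bumping pd_lower by sizeof(ItemIdData)
 * reserves line pointer 1 (P_HIKEY) without storing a tuple yet; the real
 * high key is only installed when the page is finished and handed off.
 */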

/*
 * emit a completed btree page, and release the working storage.
 */
static void
_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
    /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
    RelationOpenSmgr(wstate->index);

    /* XLOG stuff */
    if (wstate->btws_use_wal)
    {
        /* We use the XLOG_FPI record type for this */
        log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
    }

    /*
     * If we have to write pages nonsequentially, fill in the space with
     * zeroes until we come back and overwrite.  This is not logically
     * necessary on standard Unix filesystems (unwritten space will read as
     * zeroes anyway), but it should help to avoid fragmentation.  The dummy
     * pages aren't WAL-logged though.
     */
    while (blkno > wstate->btws_pages_written)
    {
        if (!wstate->btws_zeropage)
            wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
        /* don't set checksum for all-zero page */
        smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM,
                   wstate->btws_pages_written++,
                   (char *) wstate->btws_zeropage,
                   true);
    }

    PageSetChecksumInplace(page, blkno);

    /*
     * Now write the page.  There's no need for smgr to schedule an fsync for
     * this write; we'll do it ourselves before ending the build.
     */
    if (blkno == wstate->btws_pages_written)
    {
        /* extending the file... */
        smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
                   (char *) page, true);
        wstate->btws_pages_written++;
    }
    else
    {
        /* overwriting a block we zero-filled before */
        smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
                  (char *) page, true);
    }

    pfree(page);
}
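
/*
 * Worked example (editorial): suppose btws_pages_written is 3 and we are
 * asked to write blkno 5.  The loop above zero-fills blocks 3 and 4 via
 * smgrextend(), leaving btws_pages_written == 5; blkno then matches, so
 * block 5 is written with smgrextend() and the counter advances to 6.  A
 * later request for block 4 takes the smgrwrite() branch instead.
 */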

/*
 * allocate and initialize a new BTPageState.  the returned structure
 * is suitable for immediate use by _bt_buildadd.
 */
static BTPageState *
_bt_pagestate(BTWriteState *wstate, uint32 level)
{
    BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState));

    /* create initial page for level */
    state->btps_page = _bt_blnewpage(level);

    /* and assign it a page position */
    state->btps_blkno = wstate->btws_pages_alloced++;

    state->btps_lowkey = NULL;
    /* initialize lastoff so first item goes into P_FIRSTKEY */
    state->btps_lastoff = P_HIKEY;
    state->btps_lastextra = 0;
    state->btps_level = level;
    /* set "full" threshold based on level.  See notes at head of file. */
    if (level > 0)
        state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
    else
        state->btps_full = BTGetTargetPageFreeSpace(wstate->index);

    /* no parent level, yet */
    state->btps_next = NULL;

    return state;
}
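
/*
 * Worked arithmetic for the thresholds above, assuming the default BLCKSZ
 * of 8192: with BTREE_NONLEAF_FILLFACTOR = 70, an upper page is treated as
 * full once its free space falls below 8192 * (100 - 70) / 100 = 2457
 * bytes; with the default fillfactor of 90, BTGetTargetPageFreeSpace()
 * yields 8192 * (100 - 90) / 100 = 819 bytes for a leaf page.
 */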

/*
 * slide an array of ItemIds back one slot (from P_FIRSTKEY to
 * P_HIKEY, overwriting P_HIKEY).  we need to do this when we discover
 * that we have built an ItemId array in what has turned out to be a
 * P_RIGHTMOST page.
 */
static void
_bt_slideleft(Page page)
{
    OffsetNumber off;
    OffsetNumber maxoff;
    ItemId      previi;
    ItemId      thisii;

    if (!PageIsEmpty(page))
    {
        maxoff = PageGetMaxOffsetNumber(page);
        previi = PageGetItemId(page, P_HIKEY);
        for (off = P_FIRSTKEY; off <= maxoff; off = OffsetNumberNext(off))
        {
            thisii = PageGetItemId(page, off);
            *previi = *thisii;
            previi = thisii;
        }
        ((PageHeader) page)->pd_lower -= sizeof(ItemIdData);
    }
}
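
/*
 * Schematic view of the slide (illustration only): the item data at the
 * end of the page never moves; only the line pointer array shifts one
 * slot left, and pd_lower is pulled back by one ItemIdData, releasing the
 * placeholder slot at P_HIKEY:
 *
 *      before:  [linp0: unused] [linp1: item1] ... [linpN: itemN]
 *      after:   [linp0: item1] [linp1: item2] ... [linpN-1: itemN]
 */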

/*
 * Add an item to a page being built.
 *
 * The main difference between this routine and a bare PageAddItem call
 * is that this code knows that the leftmost data item on a non-leaf btree
 * page has a key that must be treated as minus infinity.  Therefore, it
 * truncates away all attributes.
 *
 * This is almost like nbtinsert.c's _bt_pgaddtup(), but we can't use
 * that because it assumes that P_RIGHTMOST() will return the correct
 * answer for the page.  Here, we don't know yet if the page will be
 * rightmost.  Offset P_FIRSTKEY is always the first data key.
 */
static void
_bt_sortaddtup(Page page,
               Size itemsize,
               IndexTuple itup,
               OffsetNumber itup_off)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    IndexTupleData trunctuple;

    if (!P_ISLEAF(opaque) && itup_off == P_FIRSTKEY)
    {
        trunctuple = *itup;
        trunctuple.t_info = sizeof(IndexTupleData);
        BTreeTupleSetNAtts(&trunctuple, 0);
        itup = &trunctuple;
        itemsize = sizeof(IndexTupleData);
    }

    if (PageAddItem(page, (Item) itup, itemsize, itup_off,
                    false, false) == InvalidOffsetNumber)
        elog(ERROR, "failed to add item to the index page");
}
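
/*
 * Sketch of the truncation path above (illustration only): for the first
 * data item on an internal page, only the IndexTupleData header survives.
 * The copied header keeps the downlink in t_tid, t_info is reset so the
 * tuple is just the bare header, and BTreeTupleSetNAtts() records zero
 * key attributes.  Searches treat such a pivot tuple as minus infinity,
 * so its key bytes would never be examined anyway.
 */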

/*----------
 * Add an item to a disk page from the sort output (or add a posting list
 * item formed from the sort output).
 *
 * We must be careful to observe the page layout conventions of nbtsearch.c:
 * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
 * - on non-leaf pages, the key portion of the first item need not be
 *   stored, we should store only the link.
 *
 * A leaf page being built looks like:
 *
 * +----------------+---------------------------------+
 * | PageHeaderData | linp0 linp1 linp2 ...           |
 * +-----------+----+---------------------------------+
 * | ... linpN |                                      |
 * +-----------+--------------------------------------+
 * |     ^ last                                       |
 * |                                                  |
 * +-------------+------------------------------------+
 * |             | itemN ...                          |
 * +-------------+------------------+-----------------+
 * |          ... item3 item2 item1 | "special space" |
 * +--------------------------------+-----------------+
 *
 * Contrast this with the diagram in bufpage.h; note the mismatch
 * between linps and items.  This is because we reserve linp0 as a
 * placeholder for the pointer to the "high key" item; when we have
 * filled up the page, we will set linp0 to point to itemN and clear
 * linpN.  On the other hand, if we find this is the last (rightmost)
 * page, we leave the items alone and slide the linp array over.  If
 * the high key is to be truncated, offset 1 is deleted, and we insert
 * the truncated high key at offset 1.
 *
 * 'last' pointer indicates the last offset added to the page.
 *
 * 'truncextra' is the size of the posting list in itup, if any.  This
 * information is stashed for the next call here, when we may benefit
 * from considering the impact of truncating away the posting list on
 * the page before deciding to finish the page off.  Posting lists are
 * often relatively large, so it is worth going to the trouble of
 * accounting for the saving from truncating away the posting list of
 * the tuple that becomes the high key (that may be the only way to
 * get close to target free space on the page).  Note that this is
 * only used for the soft fillfactor-wise limit, not the critical hard
 * limit.
 *----------
 */
static void
_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
             Size truncextra)
{
    Page        npage;
    BlockNumber nblkno;
    OffsetNumber last_off;
    Size        last_truncextra;
    Size        pgspc;
    Size        itupsz;
    bool        isleaf;

    /*
     * This is a handy place to check for cancel interrupts during the btree
     * load phase of index creation.
     */
    CHECK_FOR_INTERRUPTS();

    npage = state->btps_page;
    nblkno = state->btps_blkno;
    last_off = state->btps_lastoff;
    last_truncextra = state->btps_lastextra;
    state->btps_lastextra = truncextra;

    pgspc = PageGetFreeSpace(npage);
    itupsz = IndexTupleSize(itup);
    itupsz = MAXALIGN(itupsz);
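
    /*
     * Example of the size accounting (illustration only): MAXALIGN() rounds
     * up to a multiple of MAXIMUM_ALIGNOF, typically 8, so a 20-byte index
     * tuple is counted as 24 bytes here, matching the tuple storage that
     * PageAddItem() will actually use for it.
     */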

    /* Leaf case has slightly different rules due to suffix truncation */
    isleaf = (state->btps_level == 0);

    /*
     * Check whether the new item can fit on a btree page on current level at
     * all.
     *
     * Every newly built index will treat heap TID as part of the keyspace,
     * which imposes the requirement that new high keys must occasionally have
     * a heap TID appended within _bt_truncate().  That may leave a new pivot
     * tuple one or two MAXALIGN() quantums larger than the original first
     * right tuple it's derived from.  v4 deals with the problem by decreasing
     * the limit on the size of tuples inserted on the leaf level by the same
     * small amount.  Enforce the new v4+ limit on the leaf level, and the old
     * limit on internal levels, since pivot tuples may need to make use of
     * the reserved space.  This should never fail on internal pages.
     */
    if (unlikely(itupsz > BTMaxItemSize(npage)))
|
Fix nbtsort.c's page space accounting.
Commit dd299df8189, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe. In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page. This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.
Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on current page. This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for firstright
tuple, which was just short of what was needed following "truncation").
Several conditions needed to be met all at once for CREATE INDEX to
fail. The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit. The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.
To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.
Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
2019-05-02 21:33:35 +02:00
|
|
|
_bt_check_third_page(wstate->index, wstate->heap, isleaf, npage,
|
|
|
|
itup);
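	/*
	 * Illustrative arithmetic (a sketch, not part of the original source),
	 * assuming the default BLCKSZ of 8192 and 8-byte MAXALIGN:
	 *
	 *   usable space   = 8192 - MAXALIGN(24 + 3 * 4) - MAXALIGN(16) = 8136
	 *   internal limit = MAXALIGN_DOWN(8136 / 3)                    = 2712
	 *   leaf v4+ limit = 2712 - MAXALIGN(6)                         = 2704
	 *
	 * That is, the hard limit reserves room for three tuples plus their line
	 * pointers and the btree special area, and the leaf-level limit gives
	 * back one MAXALIGN()'d heap TID so that a high key that grows during
	 * truncation can never overflow.  The exact macros live in nbtree.h;
	 * these numbers only hold for the default build configuration.
	 */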
	/*
	 * Check to see if current page will fit new item, with space left over to
	 * append a heap TID during suffix truncation when page is a leaf page.
	 *
	 * It is guaranteed that we can fit at least 2 non-pivot tuples plus a
	 * high key with heap TID when finishing off a leaf page, since we rely on
	 * _bt_check_third_page() rejecting oversized non-pivot tuples.  On
	 * internal pages we can always fit 3 pivot tuples with the larger
	 * internal page tuple limit (which includes the page high key).
	 *
	 * Most of the time, a page is only "full" in the sense that the soft
	 * fillfactor-wise limit has been exceeded.  However, we must always leave
	 * at least two items plus a high key on each page before starting a new
	 * page.  Disregard fillfactor and insert on the "full" current page if we
	 * don't have the minimum number of items yet.  (Note that we deliberately
	 * assume that suffix truncation neither enlarges nor shrinks the new high
	 * key when applying the soft limit, except when the last tuple has a
	 * posting list.)
	 */
	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
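	/*
	 * Worked example of the hard limit (hypothetical numbers, for
	 * illustration only): on a leaf page with pgspc == 40 bytes free, an
	 * incoming itupsz of 36 bytes fails the test, since 36 +
	 * MAXALIGN(sizeof(ItemPointerData)) == 44 > 40.  The page is finished
	 * off even though the bare tuple would fit; without the 8 reserved
	 * bytes, finishing off a leaf page whose high key must carry an
	 * explicit heap TID could fail with "failed to add item to the index
	 * page".  The second line of the test is the soft fillfactor-wise
	 * limit, which the last_truncextra term adjusts when the last tuple has
	 * a posting list, and which is ignored outright until the page holds
	 * the minimum number of items.
	 */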
	{
		/*
		 * Finish off the page and write it out.
		 */
		Page		opage = npage;
		BlockNumber oblkno = nblkno;
		ItemId		ii;
		ItemId		hii;
		IndexTuple	oitup;

		/* Create new page of same level */
		npage = _bt_blnewpage(state->btps_level);

		/* and assign it a page position */
		nblkno = wstate->btws_pages_alloced++;

		/*
		 * We copy the last item on the page into the new page, and then
		 * rearrange the old page so that the 'last item' becomes its high
		 * key rather than a true data item.  There had better be at least
		 * two items on the page already, else the page would be empty of
		 * useful data.
		 */
		Assert(last_off > P_FIRSTKEY);
		ii = PageGetItemId(opage, last_off);
		oitup = (IndexTuple) PageGetItem(opage, ii);
		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
		/*
		 * Move 'last' into the high key position on opage.  _bt_blnewpage()
		 * allocated empty space for a line pointer when opage was first
		 * created, so this is a matter of rearranging already-allocated
		 * space on the page, and initializing the high key line pointer.
		 * (Actually, leaf pages must also swap oitup with a truncated
		 * version of oitup, which is sometimes larger than oitup, though
		 * never by more than the space needed to append a heap TID.)
		 */
		hii = PageGetItemId(opage, P_HIKEY);
		*hii = *ii;
		ItemIdSetUnused(ii);	/* redundant */
		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
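		/*
		 * Mechanics note (an illustration, not original commentary):
		 * copying the 4-byte ItemIdData with "*hii = *ii" repoints the
		 * P_HIKEY slot at the tuple storage that last_off already owns, so
		 * no tuple bytes move at all.  Shrinking pd_lower by
		 * sizeof(ItemIdData) then abandons the now-unused last line
		 * pointer, which is why the ItemIdSetUnused() call above is marked
		 * redundant.
		 */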
		if (isleaf)
		{
			IndexTuple	lastleft;
			IndexTuple	truncated;
			/*
			 * Truncate away any unneeded attributes from the high key on
			 * the leaf level.  This is only done at the leaf level because
			 * downlinks in internal pages are either negative infinity
			 * items, or get their contents by copying from one level down.
			 * See also: _bt_split().
			 *
			 * We don't try to bias our choice of split point to make it
			 * more likely that _bt_truncate() can truncate away more
			 * attributes, whereas the split point used within _bt_split()
			 * is chosen much more delicately.  Even still, the lastleft and
			 * firstright tuples passed to _bt_truncate() here are at least
			 * not fully equal to each other when deduplication is used,
			 * unless there is a large group of duplicates (also, unique
			 * index builds usually have few or no spool2 duplicates).  When
			 * the split point is between two unequal tuples, _bt_truncate()
			 * will avoid including a heap TID in the new high key, which is
			 * the most important benefit of suffix truncation.
			 *
			 * Overwrite the old item with the new truncated high key
			 * directly.  oitup is already located at the physical beginning
			 * of tuple space, so this should directly reuse the existing
			 * tuple space.
			 */
			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
			lastleft = (IndexTuple) PageGetItem(opage, ii);

			truncated = _bt_truncate(wstate->index, lastleft, oitup,
									 wstate->inskey);
			if (!PageIndexTupleOverwrite(opage, P_HIKEY, (Item) truncated,
										 IndexTupleSize(truncated)))
				elog(ERROR, "failed to add high key to the index page");
			pfree(truncated);

			/* oitup should continue to point to the page's high key */
			hii = PageGetItemId(opage, P_HIKEY);
			oitup = (IndexTuple) PageGetItem(opage, hii);
		}
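		/*
		 * Suffix truncation sketch (hypothetical data, for illustration):
		 * in an index on (country, city), with lastleft = ('FR', 'Paris')
		 * and oitup = ('US', 'Austin'), the first attribute alone
		 * distinguishes the tuples, so _bt_truncate() returns a pivot of
		 * just ('US') with the city attribute truncated away.  Only when
		 * every key attribute compares equal must the new high key carry a
		 * heap TID tiebreaker, which is what the space reserved by the fit
		 * test above pays for.
		 */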
		/*
		 * Link the old page into its parent, using its low key.  If we
		 * don't have a parent, we have to create one; this adds a new btree
		 * level.
		 */
		if (state->btps_next == NULL)
			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);

		Assert((BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) <=
				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
				BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) > 0) ||
			   P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
		pfree(state->btps_lowkey);
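		/*
		 * Recursion sketch (illustrative): finishing the very first leaf
		 * page of a build creates level 1 via _bt_pagestate(), and the
		 * recursive _bt_buildadd() call above adds the old page's low key
		 * there as a downlink.  If that level-1 page fills up in turn, the
		 * same code path finishes it and recurses into level 2, so the tree
		 * grows one level at a time, always along its rightmost edge.
		 */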
		/*
		 * Save a copy of the high key from the old page.  It is also the
		 * low key for the new page.
		 */
		state->btps_lowkey = CopyIndexTuple(oitup);
		/*
		 * Set the sibling links for both pages.
		 */
		{
			BTPageOpaque oopaque = (BTPageOpaque) PageGetSpecialPointer(opage);
			BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(npage);

			oopaque->btpo_next = nblkno;
			nopaque->btpo_prev = oblkno;
			nopaque->btpo_next = P_NONE;	/* redundant */
		}
|
|
|
|
|
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Write out the old page. We never need to touch it again, so we can
|
2005-10-15 04:49:52 +02:00
|
|
|
* free the opage workspace too.
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2004-06-02 19:28:18 +02:00
|
|
|
_bt_blwritepage(wstate, opage, oblkno);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
/*
|
2000-07-22 00:14:09 +02:00
|
|
|
* Reset last_off to point to new page
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2000-07-22 00:14:09 +02:00
|
|
|
last_off = P_FIRSTKEY;
|
What looks like some *major* improvements to btree indexing...
Patches from: aoki@CS.Berkeley.EDU (Paul M. Aoki)
i gave jolly my btree bulkload code a long, long time ago but never
gave him a bunch of my bugfixes. here's a diff against the 6.0
baseline.
for some reason, this code has slowed down somewhat relative to the
insertion-build code on very small tables. don't know why -- it used
to be within about 10%. anyway, here are some (highly unscientific!)
timings on a dec 3000/300 for synthetic tables with 10k, 100k and
1000k tuples (basically, 1mb, 10mb and 100mb heaps). 'c' means
clustered (pre-sorted) inputs and 'u' means unclustered (randomly
ordered) inputs. the 10k table basically fits in the buffer pool, but
the 100k and 1000k tables don't. as you can see, insertion build is
fine if you've sorted your heaps on your index key or if your heap
fits in core, but is absolutely horrible on unordered data (yes,
that's 7.5 hours to index 100mb of data...) because of the zillions of
random i/os.
if it doesn't work for you for whatever reason, you can always turn it
back off by flipping the FastBuild flag in nbtree.c. i don't have
time to maintain it.
good luck!
baseline code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 8.6
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 9.1
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.2
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 652.4
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.1
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 26772.9
bulkloading code:
time psql -c 'create index c10 on k10 using btree (c int4_ops)' bttest
real 11.3
time psql -c 'create index u10 on k10 using btree (b int4_ops)' bttest
real 10.4
time psql -c 'create index c100 on k100 using btree (c int4_ops)' bttest
real 59.5
time psql -c 'create index u100 on k100 using btree (b int4_ops)' bttest
real 63.5
time psql -c 'create index c1000 on k1000 using btree (c int4_ops)' bttest
real 636.9
time psql -c 'create index u1000 on k1000 using btree (b int4_ops)' bttest
real 701.0
1997-02-12 06:04:52 +01:00
|
|
|
}
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
/*
|
Fix nbtsort.c's page space accounting.
Commit dd299df8189, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe. In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page. This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.
Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on the current page. This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for firstright
tuple, which was just short of what was needed following "truncation").
Several conditions needed to be met all at once for CREATE INDEX to
fail. The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit. The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.
To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.
Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
2019-05-02 21:33:35 +02:00
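
The rule this fix installs can be sketched as a standalone check (a simplified sketch under stated assumptions, not the exact PostgreSQL code: demo_leaf_fits and DEMO_MAXALIGN are invented names, 8-byte alignment is assumed, and 6 bytes stands in for sizeof(ItemPointerData)): a leaf-page fit test must reserve room for one extra MAXALIGN'd heap TID that suffix truncation may be forced to keep in the new high key.

#include <stddef.h>
#include <stdint.h>

#define DEMO_MAXALIGN(LEN)  (((uintptr_t) (LEN) + 7) & ~(uintptr_t) 7)
#define DEMO_TID_SIZE       6   /* like sizeof(ItemPointerData) */

/*
 * Pessimistic leaf-page fit check: beyond the new tuple itself, keep
 * enough free space for a high key that retains an explicit heap TID.
 */
static int
demo_leaf_fits(size_t freespace, size_t itupsz)
{
    return freespace >= itupsz + DEMO_MAXALIGN(DEMO_TID_SIZE);
}

int
main(void)
{
    /* 16 free bytes are not enough for a 12-byte tuple plus reserved TID */
    return demo_leaf_fits(16, 12) ? 1 : 0;
}
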
|
|
|
* By here, either the original page is still the current page, or a new page
|
|
|
|
* was created that became the current page. Either way, the current page
|
|
|
|
* definitely has space for the new item.
|
|
|
|
*
|
2019-11-08 02:12:09 +01:00
|
|
|
* If the new item is the first for its page, it must also be the first
|
|
|
|
* item on its entire level. On later same-level pages, a low key for a
|
|
|
|
* page will be copied from the prior page in the code above. Generate a
|
|
|
|
* minus infinity low key here instead.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-07-22 00:14:09 +02:00
|
|
|
if (last_off == P_HIKEY)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2019-11-08 02:12:09 +01:00
|
|
|
Assert(state->btps_lowkey == NULL);
|
|
|
|
state->btps_lowkey = palloc0(sizeof(IndexTupleData));
|
|
|
|
state->btps_lowkey->t_info = sizeof(IndexTupleData);
|
|
|
|
BTreeTupleSetNAtts(state->btps_lowkey, 0);
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
2000-07-22 00:14:09 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the new item into the current page.
|
|
|
|
*/
|
|
|
|
last_off = OffsetNumberNext(last_off);
|
2006-01-26 00:04:21 +01:00
|
|
|
_bt_sortaddtup(npage, itupsz, itup, last_off);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
state->btps_page = npage;
|
2004-06-02 19:28:18 +02:00
|
|
|
state->btps_blkno = nblkno;
|
1997-09-07 07:04:48 +02:00
|
|
|
state->btps_lastoff = last_off;
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
}
|
|
|
|
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
/*
|
|
|
|
* Finalize pending posting list tuple, and add it to the index. Final tuple
|
|
|
|
* is based on saved base tuple, and saved list of heap TIDs.
|
|
|
|
*
|
|
|
|
* This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
|
|
|
|
* using _bt_buildadd().
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
|
|
|
|
BTDedupState dstate)
|
|
|
|
{
|
|
|
|
Assert(dstate->nitems > 0);
|
|
|
|
|
|
|
|
if (dstate->nitems == 1)
|
|
|
|
_bt_buildadd(wstate, state, dstate->base, 0);
|
|
|
|
else
|
|
|
|
{
|
|
|
|
IndexTuple postingtuple;
|
|
|
|
Size truncextra;
|
|
|
|
|
|
|
|
/* form a tuple with a posting list */
|
|
|
|
postingtuple = _bt_form_posting(dstate->base,
|
|
|
|
dstate->htids,
|
|
|
|
dstate->nhtids);
|
|
|
|
/* Calculate posting list overhead */
|
|
|
|
truncextra = IndexTupleSize(postingtuple) -
|
|
|
|
BTreeTupleGetPostingOffset(postingtuple);
|
|
|
|
|
|
|
|
_bt_buildadd(wstate, state, postingtuple, truncextra);
|
|
|
|
pfree(postingtuple);
|
|
|
|
}
|
|
|
|
|
|
|
|
dstate->nhtids = 0;
|
|
|
|
dstate->nitems = 0;
|
|
|
|
dstate->phystupsize = 0;
|
|
|
|
}
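
As a reading aid for the truncextra computation in the function above (an assumption-flagged gloss, not PostgreSQL code; demo_posting_overhead is an invented name): the posting list is an array of heap TIDs stored after the posting offset, so the subtraction yields the byte length of that array, roughly six bytes per TID before alignment.

#include <stddef.h>

/* Hypothetical mirror of truncextra: nhtids heap TIDs at 6 bytes each. */
static size_t
demo_posting_overhead(int nhtids)
{
    return (size_t) nhtids * 6;
}
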
|
|
|
|
|
2000-07-21 08:42:39 +02:00
|
|
|
/*
|
|
|
|
* Finish writing out the completed btree.
|
|
|
|
*/
|
1997-08-19 23:40:56 +02:00
|
|
|
static void
|
2004-06-02 19:28:18 +02:00
|
|
|
_bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
{
|
1997-09-08 04:41:22 +02:00
|
|
|
BTPageState *s;
|
2004-08-29 07:07:03 +02:00
|
|
|
BlockNumber rootblkno = P_NONE;
|
2003-09-30 01:40:26 +02:00
|
|
|
uint32 rootlevel = 0;
|
2004-06-02 19:28:18 +02:00
|
|
|
Page metapage;
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
|
2000-07-21 08:42:39 +02:00
|
|
|
/*
|
|
|
|
* Each iteration of this loop completes one more level of the tree.
|
|
|
|
*/
|
2004-01-07 19:56:30 +01:00
|
|
|
for (s = state; s != NULL; s = s->btps_next)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2000-07-22 00:14:09 +02:00
|
|
|
BlockNumber blkno;
|
|
|
|
BTPageOpaque opaque;
|
|
|
|
|
2004-06-02 19:28:18 +02:00
|
|
|
blkno = s->btps_blkno;
|
1997-09-07 07:04:48 +02:00
|
|
|
opaque = (BTPageOpaque) PageGetSpecialPointer(s->btps_page);
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2000-07-21 08:42:39 +02:00
|
|
|
* We have to link the last page on this level to somewhere.
|
|
|
|
*
|
|
|
|
* If we're at the top, it's the root, so attach it to the metapage.
|
2019-11-08 02:12:09 +01:00
|
|
|
* Otherwise, add an entry for it to its parent using its low key.
|
2005-10-15 04:49:52 +02:00
|
|
|
* This may cause the last page of the parent level to split, but
|
|
|
|
* that's not a problem -- we haven't gotten to it yet.
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2004-01-07 19:56:30 +01:00
|
|
|
if (s->btps_next == NULL)
|
1997-09-07 07:04:48 +02:00
|
|
|
{
|
2000-07-21 08:42:39 +02:00
|
|
|
opaque->btpo_flags |= BTP_ROOT;
|
2003-09-30 01:40:26 +02:00
|
|
|
rootblkno = blkno;
|
|
|
|
rootlevel = s->btps_level;
|
2000-07-21 08:42:39 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2019-11-08 02:12:09 +01:00
|
|
|
Assert((BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) <=
|
Make heap TID a tiebreaker nbtree index column.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique. This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".
Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially in a multi-column index. Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now. A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved). Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3. contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes. These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.
A later patch will enhance the logic used by nbtree to pick a split
point. This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at. Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.
The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits. The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised. However, there should be a compatibility note in the v12
release notes.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2019-03-20 18:04:01 +01:00
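
Attribute-granularity truncation, as described above, can be sketched in a few lines (a hypothetical demo under stated assumptions, not the nbtree implementation; demo_natts_needed is an invented name and attributes are modeled as strings): a separator key keeps only as many leading attributes as are needed to tell the last left-page tuple apart from the first right-page tuple, and keeps the heap TID only when every user attribute compares equal.

#include <string.h>

/*
 * Return how many leading attributes a separator key must keep so that
 * it distinguishes lastleft from firstright; natts + 1 means even the
 * heap TID must be retained because all user attributes are equal.
 */
static int
demo_natts_needed(char *const *lastleft, char *const *firstright, int natts)
{
    for (int i = 0; i < natts; i++)
    {
        if (strcmp(lastleft[i], firstright[i]) != 0)
            return i + 1;   /* first distinguishing attribute */
    }
    return natts + 1;       /* all equal: keep the heap TID too */
}
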
|
|
|
IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
|
2019-11-08 02:12:09 +01:00
|
|
|
BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) > 0) ||
|
2018-05-04 11:38:23 +02:00
|
|
|
P_LEFTMOST(opaque));
|
2019-11-08 02:12:09 +01:00
|
|
|
Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
|
2018-05-04 11:38:23 +02:00
|
|
|
!P_LEFTMOST(opaque));
|
2019-12-17 02:49:45 +01:00
|
|
|
BTreeTupleSetDownLink(s->btps_lowkey, blkno);
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
|
2019-11-08 02:12:09 +01:00
|
|
|
pfree(s->btps_lowkey);
|
|
|
|
s->btps_lowkey = NULL;
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* This is the rightmost page, so the ItemId array needs to be slid
|
|
|
|
* back one slot. Then we can dump out the page.
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2004-06-02 19:28:18 +02:00
|
|
|
_bt_slideleft(s->btps_page);
|
|
|
|
_bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
|
|
|
|
s->btps_page = NULL; /* writepage freed the workspace */
|
1997-09-07 07:04:48 +02:00
|
|
|
}
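
The slide-left step applied to each rightmost page above amounts to shifting the line-pointer array one slot toward the front, since a rightmost page has no high key occupying offset 1 (a conceptual sketch with an invented name, demo_slideleft, standing in for the real routine):

/*
 * Drop the unused first slot (the would-be high key) by shifting every
 * remaining line pointer one position toward the start of the array.
 */
static void
demo_slideleft(int *lineptrs, int nitems)
{
    for (int i = 0; i < nitems - 1; i++)
        lineptrs[i] = lineptrs[i + 1];
}
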
|
2003-09-30 01:40:26 +02:00
|
|
|
|
|
|
|
/*
|
2004-06-02 19:28:18 +02:00
|
|
|
* As the last step in the process, construct the metapage and make it
|
2005-10-15 04:49:52 +02:00
|
|
|
* point to the new root (unless we had no data at all, in which case it's
|
|
|
|
* set to point to "P_NONE"). This changes the index to the "valid" state
|
|
|
|
* by filling in a valid magic number in the metapage.
|
2003-09-30 01:40:26 +02:00
|
|
|
*/
|
2004-06-02 19:28:18 +02:00
|
|
|
metapage = (Page) palloc(BLCKSZ);
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
_bt_initmetapage(metapage, rootblkno, rootlevel,
|
|
|
|
wstate->inskey->allequalimage);
|
2004-06-02 19:28:18 +02:00
|
|
|
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
1999-10-18 00:15:09 +02:00
|
|
|
* Read tuples in correct sort order from tuplesort, and load them into
|
|
|
|
* btree leaves.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
What looks like some *major* improvements to btree indexing...
1997-02-12 06:04:52 +01:00
|
|
|
static void
|
2004-06-02 19:28:18 +02:00
|
|
|
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-07-21 08:42:39 +02:00
|
|
|
BTPageState *state = NULL;
|
2000-08-10 04:33:20 +02:00
|
|
|
bool merge = (btspool2 != NULL);
|
2006-01-26 00:04:21 +01:00
|
|
|
IndexTuple itup,
|
|
|
|
itup2 = NULL;
|
2016-12-12 21:57:35 +01:00
|
|
|
bool load1;
|
2004-06-02 19:28:18 +02:00
|
|
|
TupleDesc tupdes = RelationGetDescr(wstate->index);
|
2001-03-22 05:01:46 +01:00
|
|
|
int i,
|
2018-04-07 22:00:39 +02:00
|
|
|
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
|
2014-11-07 21:50:09 +01:00
|
|
|
SortSupport sortKeys;
|
2019-04-29 20:15:19 +02:00
|
|
|
int64 tuples_done = 0;
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
bool deduplicate;
|
|
|
|
|
|
|
|
deduplicate = wstate->inskey->allequalimage &&
|
|
|
|
BTGetDeduplicateItems(wstate->index);
|
2000-08-10 04:33:20 +02:00
|
|
|
|
|
|
|
if (merge)
|
1999-10-18 00:15:09 +02:00
|
|
|
{
|
2000-08-10 04:33:20 +02:00
|
|
|
/*
|
2001-03-22 05:01:46 +01:00
|
|
|
* Another BTSpool for dead tuples exists. Now we have to merge
|
|
|
|
* btspool and btspool2.
|
|
|
|
*/
|
2000-08-10 04:33:20 +02:00
|
|
|
|
|
|
|
/* prepare for the merge */
|
2016-12-12 21:57:35 +01:00
|
|
|
itup = tuplesort_getindextuple(btspool->sortstate, true);
|
|
|
|
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
|
2003-11-12 22:15:59 +01:00
|
|
|
|
2014-11-07 21:50:09 +01:00
|
|
|
/* Prepare SortSupport data for each column */
|
|
|
|
sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
|
|
|
|
|
|
|
|
for (i = 0; i < keysz; i++)
|
|
|
|
{
|
|
|
|
SortSupport sortKey = sortKeys + i;
|
2019-03-20 17:30:57 +01:00
|
|
|
ScanKey scanKey = wstate->inskey->scankeys + i;
|
2014-11-07 21:50:09 +01:00
|
|
|
int16 strategy;
|
|
|
|
|
|
|
|
sortKey->ssup_cxt = CurrentMemoryContext;
|
|
|
|
sortKey->ssup_collation = scanKey->sk_collation;
|
|
|
|
sortKey->ssup_nulls_first =
|
|
|
|
(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
|
|
|
|
sortKey->ssup_attno = scanKey->sk_attno;
|
2015-01-19 21:20:31 +01:00
|
|
|
/* Abbreviation is not supported here */
|
|
|
|
sortKey->abbreviate = false;
|
2014-11-07 21:50:09 +01:00
|
|
|
|
|
|
|
AssertState(sortKey->ssup_attno != 0);
|
|
|
|
|
|
|
|
strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
|
|
|
|
BTGreaterStrategyNumber : BTLessStrategyNumber;
|
|
|
|
|
|
|
|
PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
|
|
|
|
}
|
|
|
|
|
2000-08-10 04:33:20 +02:00
|
|
|
for (;;)
|
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
load1 = true; /* load from btspool next? */
|
2006-01-26 00:04:21 +01:00
|
|
|
if (itup2 == NULL)
|
2000-08-10 04:33:20 +02:00
|
|
|
{
|
2006-01-26 00:04:21 +01:00
|
|
|
if (itup == NULL)
|
2000-08-10 04:33:20 +02:00
|
|
|
break;
|
|
|
|
}
|
2006-01-26 00:04:21 +01:00
|
|
|
else if (itup != NULL)
|
2000-08-10 04:33:20 +02:00
|
|
|
{
|
Make heap TID a tiebreaker nbtree index column.
2019-03-20 18:04:01 +01:00
|
|
|
int32 compare = 0;
|
|
|
|
|
2000-08-10 04:33:20 +02:00
|
|
|
for (i = 1; i <= keysz; i++)
|
|
|
|
{
|
2015-05-24 03:35:49 +02:00
|
|
|
SortSupport entry;
|
2003-11-12 22:15:59 +01:00
|
|
|
Datum attrDatum1,
|
|
|
|
attrDatum2;
|
2007-01-09 03:14:16 +01:00
|
|
|
bool isNull1,
|
|
|
|
isNull2;
|
2003-11-12 22:15:59 +01:00
|
|
|
|
2014-11-07 21:50:09 +01:00
|
|
|
entry = sortKeys + i - 1;
|
2007-01-09 03:14:16 +01:00
|
|
|
attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
|
|
|
|
attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
|
2014-11-07 21:50:09 +01:00
|
|
|
|
|
|
|
compare = ApplySortComparator(attrDatum1, isNull1,
|
|
|
|
attrDatum2, isNull2,
|
|
|
|
entry);
|
2007-01-09 03:14:16 +01:00
|
|
|
if (compare > 0)
|
|
|
|
{
|
|
|
|
load1 = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
else if (compare < 0)
|
|
|
|
break;
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
Make heap TID a tiebreaker nbtree index column.
2019-03-20 18:04:01 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If key values are equal, we sort on ItemPointer. This is
|
|
|
|
* required for btree indexes, since heap TID is treated as an
|
|
|
|
* implicit last key attribute in order to ensure that all
|
|
|
|
* keys in the index are physically unique.
|
|
|
|
*/
|
|
|
|
if (compare == 0)
|
|
|
|
{
|
|
|
|
compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
|
|
|
|
Assert(compare != 0);
|
|
|
|
if (compare > 0)
|
|
|
|
load1 = false;
|
|
|
|
}
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
load1 = false;
|
|
|
|
|
|
|
|
/* When we see first tuple, create first index page */
|
|
|
|
if (state == NULL)
|
2004-06-02 19:28:18 +02:00
|
|
|
state = _bt_pagestate(wstate, 0);
|
2000-08-10 04:33:20 +02:00
|
|
|
|
|
|
|
if (load1)
|
|
|
|
{
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
_bt_buildadd(wstate, state, itup, 0);
|
2016-12-12 21:57:35 +01:00
|
|
|
itup = tuplesort_getindextuple(btspool->sortstate, true);
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
Add deduplication to nbtree.
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method. The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs. Deduplication is only applied at
the point where a leaf page split would otherwise be required. New
posting list tuples are formed by merging together existing duplicate
tuples. The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed. Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.
The lazy approach taken by nbtree has significant advantages over a
GIN-style eager approach. Most individual inserts of index tuples have
exactly the same overhead as before. The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits. The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).
Testing has shown that, for indexes with about 10 or 15 tuples per
distinct key value, nbtree deduplication can generally make the index
2.5X-4X smaller, even with single-column integer indexes (e.g., an
index on a referencing column that accompanies a foreign key). The
final size of single-column nbtree indexes comes close to that of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective. This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.
A new index storage parameter (deduplicate_items) controls the use of
deduplication. The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible. This decision will be
reviewed at the end of the Postgres 13 beta period.
There is a regression of approximately 2% in transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values. The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely. Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").
Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.
No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12). However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from. This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.
Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
https://postgr.es/m/55E4051B.7020209@postgrespro.ru
https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
2020-02-26 22:05:30 +01:00
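To picture the mechanism described above: a posting list tuple stores the
duplicated key once, followed by a sorted array of heap TIDs. The struct
below is a simplified, hypothetical sketch for illustration only; the real
on-disk representation is defined by the posting-list macros in
access/nbtree.h.
/*
 * Illustration-only sketch of a posting list tuple: the key columns
 * appear once, and each heap TID that would otherwise need its own
 * IndexTuple is appended to a sorted TID array instead.
 */
typedef struct SketchPostingTuple
{
	IndexTupleData key;			/* key columns, stored once */
	ItemPointerData htids[FLEXIBLE_ARRAY_MEMBER];	/* sorted heap TIDs */
} SketchPostingTuple;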
|
|
|
_bt_buildadd(wstate, state, itup2, 0);
|
2016-12-12 21:57:35 +01:00
|
|
|
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
Report progress of CREATE INDEX operations
This uses the progress reporting infrastructure added by c16dc1aca5e0,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.
There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific. The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.
The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan. (The index validation index scan requires
patching each AM, which has not been included here.)
Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
2019-04-02 20:18:08 +02:00
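To make the AM-specific half concrete, phase transitions are reported with
the same pgstat_progress_update_param() call used for the tuples-done
counter below; this one-line sketch assumes the btree subphase constants
from the progress-reporting infrastructure (PROGRESS_CREATEIDX_SUBPHASE,
PROGRESS_BTREE_PHASE_LEAF_LOAD):
/* sketch: mark the start of the btree leaf-load subphase */
pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
							 PROGRESS_BTREE_PHASE_LEAF_LOAD);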
|
|
|
|
|
|
|
/* Report progress */
|
|
|
|
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
|
|
|
|
++tuples_done);
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
2014-11-07 21:50:09 +01:00
|
|
|
pfree(sortKeys);
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
else if (deduplicate)
|
|
|
|
{
|
|
|
|
/* merge is unnecessary, deduplicate into posting lists */
|
|
|
|
BTDedupState dstate;
|
|
|
|
|
|
|
|
dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
|
|
|
|
dstate->deduplicate = true; /* unused */
|
|
|
|
dstate->maxpostingsize = 0; /* set later */
|
|
|
|
/* Metadata about base tuple of current pending posting list */
|
|
|
|
dstate->base = NULL;
|
|
|
|
dstate->baseoff = InvalidOffsetNumber; /* unused */
|
|
|
|
dstate->basetupsize = 0;
|
|
|
|
/* Metadata about current pending posting list TIDs */
|
|
|
|
dstate->htids = NULL;
|
|
|
|
dstate->nhtids = 0;
|
|
|
|
dstate->nitems = 0;
|
|
|
|
dstate->phystupsize = 0; /* unused */
|
|
|
|
dstate->nintervals = 0; /* unused */
|
|
|
|
|
|
|
|
while ((itup = tuplesort_getindextuple(btspool->sortstate,
|
|
|
|
true)) != NULL)
|
|
|
|
{
|
|
|
|
/* When we see first tuple, create first index page */
|
|
|
|
if (state == NULL)
|
|
|
|
{
|
|
|
|
state = _bt_pagestate(wstate, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Limit size of posting list tuples to 1/10 space we want to
|
|
|
|
* leave behind on the page, plus space for final item's line
|
|
|
|
* pointer. This is equal to the space that we'd like to
|
|
|
|
* leave behind on each leaf page when fillfactor is 90,
|
|
|
|
* allowing us to get close to fillfactor% space utilization
|
|
|
|
* when there happen to be a great many duplicates. (This
|
|
|
|
* makes higher leaf fillfactor settings ineffective when
|
|
|
|
* building indexes that have many duplicates, but packing
|
|
|
|
* leaf pages full with few very large tuples doesn't seem
|
|
|
|
* like a useful goal.)
|
|
|
|
*/
|
|
|
|
dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
|
|
|
|
sizeof(ItemIdData);
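/*
 * Worked example, assuming the default BLCKSZ of 8192 and an 8-byte
 * MAXALIGN: 8192 * 10 / 100 = 819, MAXALIGN_DOWN(819) = 816, and
 * subtracting sizeof(ItemIdData) (4 bytes) caps each posting list
 * tuple at 812 bytes.
 */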
|
|
|
|
Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
|
|
|
|
dstate->maxpostingsize <= INDEX_SIZE_MASK);
|
|
|
|
dstate->htids = palloc(dstate->maxpostingsize);
|
|
|
|
|
|
|
|
/* start new pending posting list with itup copy */
|
|
|
|
_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
|
|
|
|
InvalidOffsetNumber);
|
|
|
|
}
|
|
|
|
else if (_bt_keep_natts_fast(wstate->index, dstate->base,
|
|
|
|
itup) > keysz &&
|
|
|
|
_bt_dedup_save_htid(dstate, itup))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Tuple is equal to base tuple of pending posting list. Heap
|
|
|
|
* TID from itup has been saved in state.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Tuple is not equal to pending posting list tuple, or
|
|
|
|
* _bt_dedup_save_htid() opted to not merge current item into
|
|
|
|
* pending posting list.
|
|
|
|
*/
|
|
|
|
_bt_sort_dedup_finish_pending(wstate, state, dstate);
|
|
|
|
pfree(dstate->base);
|
|
|
|
|
|
|
|
/* start new pending posting list with itup copy */
|
|
|
|
_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
|
|
|
|
InvalidOffsetNumber);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Report progress */
|
|
|
|
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
|
|
|
|
++tuples_done);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (state)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Handle the last item (there must be a last item when the
|
|
|
|
* tuplesort returned one or more tuples)
|
|
|
|
*/
|
|
|
|
_bt_sort_dedup_finish_pending(wstate, state, dstate);
|
|
|
|
pfree(dstate->base);
|
|
|
|
pfree(dstate->htids);
|
|
|
|
}
|
|
|
|
|
|
|
|
pfree(dstate);
|
|
|
|
}
|
2001-03-22 05:01:46 +01:00
|
|
|
else
|
2000-08-10 04:33:20 +02:00
|
|
|
{
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
/* merging and deduplication are both unnecessary */
|
2006-01-26 00:04:21 +01:00
|
|
|
while ((itup = tuplesort_getindextuple(btspool->sortstate,
|
2016-12-12 21:57:35 +01:00
|
|
|
true)) != NULL)
|
2000-08-10 04:33:20 +02:00
|
|
|
{
|
|
|
|
/* When we see first tuple, create first index page */
|
|
|
|
if (state == NULL)
|
2004-06-02 19:28:18 +02:00
|
|
|
state = _bt_pagestate(wstate, 0);
|
2000-07-22 00:14:09 +02:00
|
|
|
|
Add deduplication to nbtree.
2020-02-26 22:05:30 +01:00
|
|
|
_bt_buildadd(wstate, state, itup, 0);
|
Report progress of CREATE INDEX operations
2019-04-02 20:18:08 +02:00
|
|
|
|
|
|
|
/* Report progress */
|
|
|
|
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
|
|
|
|
++tuples_done);
|
2000-08-10 04:33:20 +02:00
|
|
|
}
|
1999-10-18 00:15:09 +02:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2004-06-02 19:28:18 +02:00
|
|
|
/* Close down final pages and write the metapage */
|
|
|
|
_bt_uppershutdown(wstate, state);
|
|
|
|
|
|
|
|
/*
|
2010-12-13 18:34:26 +01:00
|
|
|
* If the index is WAL-logged, we must fsync it down to disk before it's
|
2014-05-06 18:12:18 +02:00
|
|
|
* safe to commit the transaction. (For a non-WAL-logged index we don't
|
2010-12-13 18:34:26 +01:00
|
|
|
* care since the index will be uninteresting after a crash anyway.)
|
2004-08-16 01:44:46 +02:00
|
|
|
*
|
2004-08-29 07:07:03 +02:00
|
|
|
* It's obvious that we must do this when not WAL-logging the build. It's
|
|
|
|
* less obvious that we have to do it even if we did WAL-log the index
|
2005-10-15 04:49:52 +02:00
|
|
|
* pages. The reason is that since we're building outside shared buffers,
|
|
|
|
* a CHECKPOINT occurring during the build has no way to flush the
|
|
|
|
* previously written data to disk (indeed it won't know the index even
|
|
|
|
* exists). A crash later on would replay WAL from the checkpoint,
|
|
|
|
* therefore it wouldn't replay our earlier WAL entries. If we do not
|
|
|
|
* fsync those pages here, they might still not be on disk when the crash
|
|
|
|
* occurs.
|
2004-06-02 19:28:18 +02:00
|
|
|
*/
|
2010-12-13 18:34:26 +01:00
|
|
|
if (RelationNeedsWAL(wstate->index))
|
2006-01-07 23:45:41 +01:00
|
|
|
{
|
|
|
|
RelationOpenSmgr(wstate->index);
|
2008-08-11 13:05:11 +02:00
|
|
|
smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
|
2006-01-07 23:45:41 +01:00
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
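Per the contract documented below, _bt_begin_parallel() sets
buildstate->btleader only when at least one worker launches, and the
caller falls back to a serial build otherwise. A minimal sketch of that
calling pattern follows; sketch_caller is hypothetical and not code from
this file.
static void
sketch_caller(BTBuildState *buildstate, bool isconcurrent, int request)
{
	/* try to set up a parallel build (may quietly fail to launch) */
	_bt_begin_parallel(buildstate, isconcurrent, request);
	if (buildstate->btleader == NULL)
	{
		/* no DSM segment or no workers: proceed with a serial build */
	}
	else
	{
		/* ... perform the parallel build ... */
		/* shut down parallel mode at the very end of the build */
		_bt_end_parallel(buildstate->btleader);
	}
}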
|
|
|
|
|
|
|
/*
|
|
|
|
* Create parallel context, and launch workers for leader.
|
|
|
|
*
|
|
|
|
* buildstate argument should be initialized (with the exception of the
|
|
|
|
* tuplesort state in spools, which may later be created based on shared
|
|
|
|
* state initially set up here).
|
|
|
|
*
|
|
|
|
* isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
|
|
|
|
*
|
|
|
|
* request is the target number of parallel worker processes to launch.
|
|
|
|
*
|
|
|
|
* Sets buildstate's BTLeader, which caller must use to shut down parallel
|
|
|
|
* mode by passing it to _bt_end_parallel() at the very end of its index
|
|
|
|
* build. If not even a single worker process can be launched, this is
|
|
|
|
* never set, and caller should proceed with a serial index build.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
_bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
|
|
|
|
{
|
|
|
|
ParallelContext *pcxt;
|
|
|
|
int scantuplesortstates;
|
|
|
|
Snapshot snapshot;
|
|
|
|
Size estbtshared;
|
|
|
|
Size estsort;
|
|
|
|
BTShared *btshared;
|
|
|
|
Sharedsort *sharedsort;
|
|
|
|
Sharedsort *sharedsort2;
|
|
|
|
BTSpool *btspool = buildstate->spool;
|
|
|
|
BTLeader *btleader = (BTLeader *) palloc0(sizeof(BTLeader));
|
|
|
|
bool leaderparticipates = true;
|
2018-03-22 18:15:03 +01:00
|
|
|
char *sharedquery;
|
|
|
|
int querylen;
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
|
|
|
|
#ifdef DISABLE_LEADER_PARTICIPATION
|
|
|
|
leaderparticipates = false;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Enter parallel mode, and create context for parallel build of btree
|
|
|
|
* index
|
|
|
|
*/
|
|
|
|
EnterParallelMode();
|
|
|
|
Assert(request > 0);
|
|
|
|
pcxt = CreateParallelContext("postgres", "_bt_parallel_build_main",
|
Enable parallel query with SERIALIZABLE isolation.
Previously, the SERIALIZABLE isolation level prevented parallel query
from being used. Allow the two features to be used together by
sharing the leader's SERIALIZABLEXACT with parallel workers.
An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to
share, and new logic is introduced to coordinate the early release
of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE
optimization, as follows:
The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by
some other transaction) will 'partially release' the SERIALIZABLEXACT,
meaning that the conflicts and locks it holds are released, but the
SERIALIZABLEXACT itself will remain active because other backends
might still have a pointer to it.
Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears
its own MySerializableXact variable and frees local resources so that
it can skip SSI checks for the rest of the transaction. In the
special case of the leader process, it transfers the SERIALIZABLEXACT
to a new variable SavedSerializableXact, so that it can be completely
released at the end of the transaction after all workers have exited.
Remove the serializable_okay flag added to CreateParallelContext() by
commit 9da0cc35, because it's now redundant.
Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner
Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com
2019-03-15 04:23:46 +01:00
|
|
|
request);
|
2020-01-30 22:25:34 +01:00
|
|
|
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
scantuplesortstates = leaderparticipates ? request + 1 : request;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prepare for scan of the base relation. In a normal index build, we use
|
|
|
|
* SnapshotAny because we must retrieve all tuples and do our own time
|
|
|
|
* qual checks (because we have to index RECENTLY_DEAD tuples). In a
|
|
|
|
* concurrent build, we take a regular MVCC snapshot and index whatever's
|
|
|
|
* live according to that.
|
|
|
|
*/
|
|
|
|
if (!isconcurrent)
|
|
|
|
snapshot = SnapshotAny;
|
|
|
|
else
|
|
|
|
snapshot = RegisterSnapshot(GetTransactionSnapshot());
|
|
|
|
|
|
|
|
/*
|
2018-03-22 18:15:03 +01:00
|
|
|
* Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
|
|
|
|
* PARALLEL_KEY_TUPLESORT tuplesort workspace
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
*/
|
tableam: Add and use scan APIs.
To allow table accesses to not depend directly on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends needs
to be set up without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
would also have needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will do so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
|
|
|
estbtshared = _bt_parallel_estimate_shared(btspool->heap, snapshot);
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
shm_toc_estimate_chunk(&pcxt->estimator, estbtshared);
|
|
|
|
estsort = tuplesort_estimate_shared(scantuplesortstates);
|
|
|
|
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unique case requires a second spool, and so we may have to account for
|
2018-03-22 18:15:03 +01:00
|
|
|
* another shared workspace for that -- PARALLEL_KEY_TUPLESORT_SPOOL2
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
*/
|
|
|
|
if (!btspool->isunique)
|
|
|
|
shm_toc_estimate_keys(&pcxt->estimator, 2);
|
|
|
|
else
|
|
|
|
{
|
|
|
|
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
|
|
|
|
shm_toc_estimate_keys(&pcxt->estimator, 3);
|
|
|
|
}
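/*
 * To spell out the key arithmetic above: the non-unique case reserves
 * two TOC keys (PARALLEL_KEY_BTREE_SHARED and PARALLEL_KEY_TUPLESORT),
 * while the unique case reserves one more chunk and key for the second
 * spool's tuplesort (PARALLEL_KEY_TUPLESORT_SPOOL2).
 */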
|
|
|
|
|
2018-03-22 18:15:03 +01:00
|
|
|
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
|
|
|
|
querylen = strlen(debug_query_string);
|
|
|
|
shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
|
|
|
|
shm_toc_estimate_keys(&pcxt->estimator, 1);
|
|
|
|
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
/* Everyone's had a chance to ask for space, so now create the DSM */
|
|
|
|
InitializeParallelDSM(pcxt);
|
|
|
|
|
2020-02-05 00:21:03 +01:00
|
|
|
/* If no DSM segment was available, back out (do serial build) */
|
|
|
|
if (pcxt->seg == NULL)
|
|
|
|
{
|
|
|
|
if (IsMVCCSnapshot(snapshot))
|
|
|
|
UnregisterSnapshot(snapshot);
|
|
|
|
DestroyParallelContext(pcxt);
|
|
|
|
ExitParallelMode();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
/* Store shared build state, for which we reserved space */
|
|
|
|
btshared = (BTShared *) shm_toc_allocate(pcxt->toc, estbtshared);
|
|
|
|
/* Initialize immutable state */
|
|
|
|
btshared->heaprelid = RelationGetRelid(btspool->heap);
|
|
|
|
btshared->indexrelid = RelationGetRelid(btspool->index);
|
|
|
|
btshared->isunique = btspool->isunique;
|
|
|
|
btshared->isconcurrent = isconcurrent;
|
|
|
|
btshared->scantuplesortstates = scantuplesortstates;
|
|
|
|
ConditionVariableInit(&btshared->workersdonecv);
|
|
|
|
SpinLockInit(&btshared->mutex);
|
|
|
|
/* Initialize mutable state */
|
|
|
|
btshared->nparticipantsdone = 0;
|
|
|
|
btshared->reltuples = 0.0;
|
|
|
|
btshared->havedead = false;
|
|
|
|
btshared->indtuples = 0.0;
|
|
|
|
btshared->brokenhotchain = false;
|
2019-03-11 22:26:43 +01:00
|
|
|
table_parallelscan_initialize(btspool->heap,
|
|
|
|
ParallelTableScanFromBTShared(btshared),
|
tableam: Add and use scan APIs.
2019-03-11 20:46:41 +01:00
|
|
|
snapshot);
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Store shared tuplesort-private state, for which we reserved space.
|
|
|
|
* Then, initialize opaque state using tuplesort routine.
|
|
|
|
*/
|
|
|
|
sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
|
|
|
|
tuplesort_initialize_shared(sharedsort, scantuplesortstates,
|
|
|
|
pcxt->seg);
|
|
|
|
|
|
|
|
shm_toc_insert(pcxt->toc, PARALLEL_KEY_BTREE_SHARED, btshared);
|
|
|
|
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
|
|
|
|
|
|
|
|
/* Unique case requires a second spool, and associated shared state */
|
|
|
|
if (!btspool->isunique)
|
|
|
|
sharedsort2 = NULL;
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Store additional shared tuplesort-private state, for which we
|
|
|
|
* reserved space. Then, initialize opaque state using tuplesort
|
|
|
|
* routine.
|
|
|
|
*/
|
|
|
|
sharedsort2 = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
|
|
|
|
tuplesort_initialize_shared(sharedsort2, scantuplesortstates,
|
|
|
|
pcxt->seg);
|
|
|
|
|
|
|
|
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT_SPOOL2, sharedsort2);
|
|
|
|
}
|
|
|
|
|
2018-03-22 18:15:03 +01:00
|
|
|
/* Store query string for workers */
|
|
|
|
sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
|
|
|
|
memcpy(sharedquery, debug_query_string, querylen + 1);
|
|
|
|
shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
|
|
|
|
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
/* Launch workers, saving status for leader/caller */
|
|
|
|
LaunchParallelWorkers(pcxt);
|
|
|
|
btleader->pcxt = pcxt;
|
|
|
|
btleader->nparticipanttuplesorts = pcxt->nworkers_launched;
|
|
|
|
if (leaderparticipates)
|
|
|
|
btleader->nparticipanttuplesorts++;
|
|
|
|
btleader->btshared = btshared;
|
|
|
|
btleader->sharedsort = sharedsort;
|
|
|
|
btleader->sharedsort2 = sharedsort2;
|
|
|
|
btleader->snapshot = snapshot;
|
|
|
|
|
|
|
|
/* If no workers were successfully launched, back out (do serial build) */
|
|
|
|
if (pcxt->nworkers_launched == 0)
|
|
|
|
{
|
|
|
|
_bt_end_parallel(btleader);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Save leader state now that it's clear build will be parallel */
|
|
|
|
buildstate->btleader = btleader;
|
|
|
|
|
|
|
|
/* Join heap scan ourselves */
|
|
|
|
if (leaderparticipates)
|
|
|
|
_bt_leader_participate_as_worker(buildstate);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Caller needs to wait for all launched workers when we return. Make
|
|
|
|
* sure that the failure-to-start case will not hang forever.
|
|
|
|
*/
|
|
|
|
WaitForParallelWorkersToAttach(pcxt);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Shut down workers, destroy parallel context, and end parallel mode.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
_bt_end_parallel(BTLeader *btleader)
|
|
|
|
{
|
|
|
|
/* Shutdown worker processes */
|
|
|
|
WaitForParallelWorkersToFinish(btleader->pcxt);
|
|
|
|
/* Free last reference to MVCC snapshot, if one was used */
|
|
|
|
if (IsMVCCSnapshot(btleader->snapshot))
|
|
|
|
UnregisterSnapshot(btleader->snapshot);
|
|
|
|
DestroyParallelContext(btleader->pcxt);
|
|
|
|
ExitParallelMode();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns size of shared memory required to store state for a parallel
|
|
|
|
* btree index build based on the snapshot its parallel scan will use.
|
|
|
|
*/
|
|
|
|
static Size
|
tableam: Add and use scan APIs.
2019-03-11 20:46:41 +01:00
|
|
|
_bt_parallel_estimate_shared(Relation heap, Snapshot snapshot)
|
Support parallel btree index builds.
2018-02-02 19:25:55 +01:00
|
|
|
{
|
2019-03-11 22:26:43 +01:00
|
|
|
/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
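/*
 * (shm_toc_allocate rounds every allocation up to a BUFFERALIGN
 * boundary, so the estimate must round up the same way to avoid
 * under-reserving space in the DSM segment.)
 */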
|
|
|
|
return add_size(BUFFERALIGN(sizeof(BTShared)),
|
tableam: Add and use scan APIs.
2019-03-11 20:46:41 +01:00
|
|
|
table_parallelscan_estimate(heap, snapshot));
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
|
|
|
}
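
/*
 * Illustrative sketch (an assumption about the caller, _bt_begin_parallel(),
 * not code from this function): the size computed above feeds the parallel
 * context's shared-memory estimator before the DSM segment is created,
 * along the lines of:
 *
 *      estbtshared = _bt_parallel_estimate_shared(btspool->heap, snapshot);
 *      shm_toc_estimate_chunk(&pcxt->estimator, estbtshared);
 *      shm_toc_estimate_keys(&pcxt->estimator, 1);
 *
 * The BUFFERALIGN in the return value keeps the BTShared-plus-parallel-scan
 * chunk aligned the same way shm_toc_allocate() aligns its allocations.
 */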

/*
 * Within leader, wait for end of heap scan.
 *
 * When called, parallel heap scan started by _bt_begin_parallel() will
 * already be underway within worker processes (when leader participates
 * as a worker, we should end up here just as workers are finishing).
 *
 * Fills in fields needed for ambuild statistics, and lets caller set
 * field indicating that some worker encountered a broken HOT chain.
 *
 * Returns the total number of heap tuples scanned.
 */
static double
_bt_parallel_heapscan(BTBuildState *buildstate, bool *brokenhotchain)
{
    BTShared   *btshared = buildstate->btleader->btshared;
    int         nparticipanttuplesorts;
    double      reltuples;

    nparticipanttuplesorts = buildstate->btleader->nparticipanttuplesorts;
    for (;;)
    {
        SpinLockAcquire(&btshared->mutex);
        if (btshared->nparticipantsdone == nparticipanttuplesorts)
        {
            buildstate->havedead = btshared->havedead;
            buildstate->indtuples = btshared->indtuples;
            *brokenhotchain = btshared->brokenhotchain;
            reltuples = btshared->reltuples;
            SpinLockRelease(&btshared->mutex);
            break;
        }
        SpinLockRelease(&btshared->mutex);

        ConditionVariableSleep(&btshared->workersdonecv,
                               WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
    }

    ConditionVariableCancelSleep();

    return reltuples;
}
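
/*
 * Illustrative note: the loop above is one half of a standard
 * condition-variable handshake.  The other half is executed by each
 * participant in _bt_parallel_scan_and_sort() below, which updates the
 * shared counters under the same spinlock and then signals the leader:
 *
 *      SpinLockAcquire(&btshared->mutex);
 *      btshared->nparticipantsdone++;
 *      ...
 *      SpinLockRelease(&btshared->mutex);
 *      ConditionVariableSignal(&btshared->workersdonecv);
 *
 * The leader's loop re-checks the shared state each time it wakes, so a
 * spurious wakeup merely costs one extra iteration.
 */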

/*
 * Within leader, participate as a parallel worker.
 */
static void
_bt_leader_participate_as_worker(BTBuildState *buildstate)
{
    BTLeader   *btleader = buildstate->btleader;
    BTSpool    *leaderworker;
    BTSpool    *leaderworker2;
    int         sortmem;

    /* Allocate memory and initialize private spool */
    leaderworker = (BTSpool *) palloc0(sizeof(BTSpool));
    leaderworker->heap = buildstate->spool->heap;
    leaderworker->index = buildstate->spool->index;
    leaderworker->isunique = buildstate->spool->isunique;

    /* Initialize second spool, if required */
    if (!btleader->btshared->isunique)
        leaderworker2 = NULL;
    else
    {
        /* Allocate memory for worker's own private secondary spool */
        leaderworker2 = (BTSpool *) palloc0(sizeof(BTSpool));

        /* Initialize worker's own secondary spool */
        leaderworker2->heap = leaderworker->heap;
        leaderworker2->index = leaderworker->index;
        leaderworker2->isunique = false;
    }

    /*
     * Might as well use a reliable figure when doling out
     * maintenance_work_mem (when the requested number of workers was not
     * launched, this will be somewhat higher than it is for other workers).
     */
    sortmem = maintenance_work_mem / btleader->nparticipanttuplesorts;

    /* Perform work common to all participants */
    _bt_parallel_scan_and_sort(leaderworker, leaderworker2, btleader->btshared,
                               btleader->sharedsort, btleader->sharedsort2,
                               sortmem, true);

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
    {
        ShowUsage("BTREE BUILD (Leader Partial Spool) STATISTICS");
        ResetUsage();
    }
#endif   /* BTREE_BUILD_STATS */
}
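
/*
 * Worked example of the memory split above (illustrative numbers only):
 * with maintenance_work_mem = 65536 kB and three actual participants (two
 * launched workers plus the participating leader), the leader's sortmem is
 * 65536 / 3 = 21845 kB.  Workers instead divide by
 * btshared->scantuplesortstates (see _bt_parallel_build_main() below),
 * which was fixed at the requested participant count before launch; if a
 * third worker was requested but never launched, each worker gets
 * 65536 / 4 = 16384 kB, which is why the leader's share can be somewhat
 * higher.
 */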

/*
 * Perform work within a launched parallel process.
 */
void
_bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
{
    char       *sharedquery;
    BTSpool    *btspool;
    BTSpool    *btspool2;
    BTShared   *btshared;
    Sharedsort *sharedsort;
    Sharedsort *sharedsort2;
    Relation    heapRel;
    Relation    indexRel;
    LOCKMODE    heapLockmode;
    LOCKMODE    indexLockmode;
    int         sortmem;

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
        ResetUsage();
#endif   /* BTREE_BUILD_STATS */

    /* Set debug_query_string for individual workers first */
    sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, false);
    debug_query_string = sharedquery;

    /* Report the query string from leader */
    pgstat_report_activity(STATE_RUNNING, debug_query_string);

    /* Look up nbtree shared state */
    btshared = shm_toc_lookup(toc, PARALLEL_KEY_BTREE_SHARED, false);

    /* Open relations using lock modes known to be obtained by index.c */
    if (!btshared->isconcurrent)
    {
        heapLockmode = ShareLock;
        indexLockmode = AccessExclusiveLock;
    }
    else
    {
        heapLockmode = ShareUpdateExclusiveLock;
        indexLockmode = RowExclusiveLock;
    }

    /* Open relations within worker */
    heapRel = table_open(btshared->heaprelid, heapLockmode);
    indexRel = index_open(btshared->indexrelid, indexLockmode);

    /* Initialize worker's own spool */
    btspool = (BTSpool *) palloc0(sizeof(BTSpool));
    btspool->heap = heapRel;
    btspool->index = indexRel;
    btspool->isunique = btshared->isunique;

    /* Look up shared state private to tuplesort.c */
    sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
    tuplesort_attach_shared(sharedsort, seg);
    if (!btshared->isunique)
    {
        btspool2 = NULL;
        sharedsort2 = NULL;
    }
    else
    {
        /* Allocate memory for worker's own private secondary spool */
        btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));

        /* Initialize worker's own secondary spool */
        btspool2->heap = btspool->heap;
        btspool2->index = btspool->index;
        btspool2->isunique = false;
        /* Look up shared state private to tuplesort.c */
        sharedsort2 = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT_SPOOL2, false);
        tuplesort_attach_shared(sharedsort2, seg);
    }

    /* Perform sorting of spool, and possibly a spool2 */
    sortmem = maintenance_work_mem / btshared->scantuplesortstates;
    _bt_parallel_scan_and_sort(btspool, btspool2, btshared, sharedsort,
                               sharedsort2, sortmem, false);

#ifdef BTREE_BUILD_STATS
    if (log_btree_build_stats)
    {
        ShowUsage("BTREE BUILD (Worker Partial Spool) STATISTICS");
        ResetUsage();
    }
#endif   /* BTREE_BUILD_STATS */

    index_close(indexRel, indexLockmode);
    table_close(heapRel, heapLockmode);
}
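
/*
 * Illustrative sketch (an assumption about the leader-side setup in
 * _bt_begin_parallel(), not code in this function): the shm_toc_lookup()
 * calls above can only succeed because the leader previously published
 * each object under the same key, along the lines of:
 *
 *      shm_toc_insert(pcxt->toc, PARALLEL_KEY_BTREE_SHARED, btshared);
 *      shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
 *      shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
 *
 * With noError = false, shm_toc_lookup() raises an error rather than
 * returning NULL when a key is missing, so any key mismatch fails loudly.
 */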

/*
 * Perform a worker's portion of a parallel sort.
 *
 * This generates a tuplesort for the passed btspool, and a second tuplesort
 * state if a second btspool is needed (i.e. for unique index builds).  All
 * other spool fields should already be set when this is called.
 *
 * sortmem is the amount of working memory to use within each worker,
 * expressed in KBs.
 *
 * When this returns, workers are done, and need only release resources.
 */
static void
_bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
                           BTShared *btshared, Sharedsort *sharedsort,
                           Sharedsort *sharedsort2, int sortmem, bool progress)
{
    SortCoordinate coordinate;
    BTBuildState buildstate;
    TableScanDesc scan;
    double      reltuples;
    IndexInfo  *indexInfo;

    /* Initialize local tuplesort coordination state */
    coordinate = palloc0(sizeof(SortCoordinateData));
    coordinate->isWorker = true;
    coordinate->nParticipants = -1;
    coordinate->sharedsort = sharedsort;

    /* Begin "partial" tuplesort */
    btspool->sortstate = tuplesort_begin_index_btree(btspool->heap,
                                                     btspool->index,
                                                     btspool->isunique,
                                                     sortmem, coordinate,
                                                     false);

    /*
     * Just as in the serial case, there may be a second spool.  If so, a
     * second, dedicated spool2 partial tuplesort is required.
     */
    if (btspool2)
    {
        SortCoordinate coordinate2;

        /*
         * We expect that the second one (for dead tuples) won't get very
         * full, so we give it only work_mem (unless sortmem is less for the
         * worker).  Worker processes are generally permitted to allocate
         * work_mem independently.
         */
        coordinate2 = palloc0(sizeof(SortCoordinateData));
        coordinate2->isWorker = true;
        coordinate2->nParticipants = -1;
        coordinate2->sharedsort = sharedsort2;
        btspool2->sortstate =
            tuplesort_begin_index_btree(btspool->heap, btspool->index, false,
                                        Min(sortmem, work_mem), coordinate2,
                                        false);
    }

    /* Fill in buildstate for _bt_build_callback() */
    buildstate.isunique = btshared->isunique;
    buildstate.havedead = false;
    buildstate.heap = btspool->heap;
    buildstate.spool = btspool;
    buildstate.spool2 = btspool2;
    buildstate.indtuples = 0;
    buildstate.btleader = NULL;

    /* Join parallel scan */
    indexInfo = BuildIndexInfo(btspool->index);
    indexInfo->ii_Concurrent = btshared->isconcurrent;
    scan = table_beginscan_parallel(btspool->heap,
                                    ParallelTableScanFromBTShared(btshared));
    reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
                                       true, progress, _bt_build_callback,
                                       (void *) &buildstate, scan);

    /*
     * Execute this worker's part of the sort.
     *
     * Unlike leader and serial cases, we cannot avoid calling
     * tuplesort_performsort() for spool2 if it ends up containing no dead
     * tuples (this is disallowed for workers by tuplesort).
     */
    tuplesort_performsort(btspool->sortstate);
    if (btspool2)
        tuplesort_performsort(btspool2->sortstate);

    /*
     * Done.  Record ambuild statistics, and whether we encountered a broken
     * HOT chain.
     */
    SpinLockAcquire(&btshared->mutex);
    btshared->nparticipantsdone++;
    btshared->reltuples += reltuples;
    if (buildstate.havedead)
        btshared->havedead = true;
    btshared->indtuples += buildstate.indtuples;
    if (indexInfo->ii_BrokenHotChain)
        btshared->brokenhotchain = true;
    SpinLockRelease(&btshared->mutex);

    /* Notify leader */
    ConditionVariableSignal(&btshared->workersdonecv);

    /* We can end tuplesorts immediately */
    tuplesort_end(btspool->sortstate);
    if (btspool2)
        tuplesort_end(btspool2->sortstate);
}
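
/*
 * Illustrative contrast (an assumption drawn from tuplesort's coordination
 * API rather than code in this function): the leader's own merge-side sort
 * state uses the same SortCoordinate mechanism from the other direction,
 * with isWorker = false and nParticipants set to the number of participant
 * tuplesorts.  For example:
 *
 *      coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
 *      coordinate->isWorker = false;
 *      coordinate->nParticipants = btleader->nparticipanttuplesorts;
 *      coordinate->sharedsort = btleader->sharedsort;
 *
 * Passed to tuplesort_begin_index_btree(), this yields a sort state that
 * merges the workers' completed runs instead of producing its own input.
 */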