tableam: introduce table AM infrastructure.

This introduces the concept of table access methods, i.e. CREATE
ACCESS METHOD ... TYPE TABLE and CREATE TABLE ... USING (storage-engine).

No table access functionality is delegated to table AMs as of this
commit; that will be done in subsequent commits, which will
incrementally abstract table access functionality so it is routed
through table access methods. That change is too large to be reviewed
and committed at once, so it will be done incrementally.

Docs will be updated at the end, as adding them incrementally would
likely make them less coherent, and is definitely a lot more work
without much benefit.

Table access methods are specified similarly to index access methods:
pg_am.amhandler returns, as INTERNAL, a pointer to a struct with
callbacks. In contrast to index AMs, that struct needs to live as long
as the backend; typically that is achieved by just returning a pointer
to a constant struct.

psql's \d+ now displays a table's access method. That can be disabled
with HIDE_TABLEAM=true, which is mainly useful so regression tests can
be run against different AMs. It's quite possible that this behaviour
still needs to be fine-tuned.

For now it is not allowed to set a table AM for a partitioned table,
as we have not resolved how partitions would inherit it. Disallowing
this lets us introduce such behaviour later, if we decide that's the
way forward, without a compatibility break.

Catversion bumped, to add the heap table AM and references to it.

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
    https://postgr.es/m/20190107235616.6lur25ph22u5u5av@alap3.anarazel.de
    https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de
2019-03-06 18:54:38 +01:00
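
As a concrete illustration of the handler contract described above, here is a
minimal sketch of an out-of-core table AM module. All names (mem, mem_methods,
mem_tableam_handler) are hypothetical, and a real AM would have to fill in the
callbacks that the subsequent commits make mandatory; this only shows the
"return a pointer to a constant struct" pattern.

/* A hypothetical, minimal table AM module (sketch only; names invented). */
#include "postgres.h"

#include "access/tableam.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(mem_tableam_handler);

/*
 * The returned struct is constant, so it trivially satisfies the
 * requirement that it live as long as the backend.
 */
static const TableAmRoutine mem_methods = {
	.type = T_TableAmRoutine,
	/* the required scan/fetch/modify callbacks would be filled in here */
};

Datum
mem_tableam_handler(PG_FUNCTION_ARGS)
{
	PG_RETURN_POINTER(&mem_methods);
}

/*
 * Wired up with SQL along these lines:
 *
 *	CREATE FUNCTION mem_tableam_handler(internal)
 *		RETURNS table_am_handler AS 'MODULE_PATHNAME' LANGUAGE C;
 *	CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler;
 *	CREATE TABLE t (...) USING mem;
 */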

/*----------------------------------------------------------------------
 *
 * tableam.c
 *		Table access method routines too big to be inline functions.
 *
 * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *		src/backend/access/table/tableam.c
 *
 * NOTES
 *		Note that most functions in here are documented in tableam.h, rather
 *		than here. That's because there are a lot of inline functions in
 *		tableam.h and it'd be harder to understand if one constantly had to
 *		switch between files.
 *
 *----------------------------------------------------------------------
 */
#include "postgres.h"

#include "access/heapam.h"		/* for ss_* */
#include "access/tableam.h"
#include "access/xact.h"
#include "storage/bufmgr.h"
#include "storage/shmem.h"


/* GUC variables */
char	   *default_table_access_method = DEFAULT_TABLE_ACCESS_METHOD;
bool		synchronize_seqscans = true;


tableam: Add and use scan APIs.

To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:

1) Heap scans need to be generalized into table scans. Do this by
   introducing TableScanDesc, which will be the "base class" for
   individual AMs. This contains the AM-independent fields from
   HeapScanDesc.

   The previous heap_{beginscan,rescan,endscan} et al. have been
   replaced with table_ versions.

   There's no direct replacement for heap_getnext(), as that returned
   a HeapTuple, which is undesirable for other AMs. Instead there's
   table_scan_getnextslot(). But note that heap_getnext() lives on:
   it's still widely used to access catalog tables.

   This is achieved by the new scan_begin, scan_end, scan_rescan and
   scan_getnextslot callbacks.

2) The portion of a parallel scan that's shared between backends needs
   to be set up without the user doing per-AM work. To achieve that,
   new parallelscan_{estimate, initialize, reinitialize} callbacks are
   introduced, which operate on a new ParallelTableScanDesc, which
   again can be subclassed by AMs.

   As it is likely that several AMs are going to be block oriented,
   block-oriented callbacks that can be shared between such AMs are
   provided and used by heap: table_block_parallelscan_{estimate,
   initialize, reinitialize} as callbacks, and
   table_block_parallelscan_{nextpage, startblock_init} for use in
   AMs. These operate on a ParallelBlockTableScanDesc.

3) Index scans need to be able to access tables to return a tuple, and
   there needs to be state across individual accesses to the heap to
   store state like buffers. That's now handled by introducing a
   sort-of-scan, IndexFetchTable, which again is intended to be
   subclassed by individual AMs (for heap, IndexFetchHeap).

   The relevant callbacks for an AM are index_fetch_{begin, reset,
   end} to manage the necessary state, and index_fetch_tuple to
   retrieve an indexed tuple. Note that index_fetch_tuple
   implementations need to be smarter than just blindly fetching
   tuples for AMs that have optimizations similar to heap's HOT - the
   currently alive tuple in the update chain needs to be fetched if
   appropriate.

   Similar to table_scan_getnextslot(), it's undesirable to continue
   to return HeapTuples. Thus index_fetch_heap (might want to rename
   that later) now accepts a slot as an argument. Core code doesn't
   have a lot of call sites performing index scans without going
   through the systable_* API (in contrast to loads of heap_getnext
   calls working directly with HeapTuples).

   Index scans now store the result of a search in
   IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
   target is not generally a HeapTuple anymore, that seems cleaner.

To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:

a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
   slots capable of holding a tuple of the AM's type.
   table_slot_callbacks() and table_slot_create() are based upon
   that, but have additional logic to deal with views, foreign
   tables, etc.

   While this change could have been done separately, nearly all the
   call sites that needed to be adapted for the rest of this commit
   would also have needed to be adapted for table_slot_callbacks(),
   making separation not worthwhile.

b) tuple_satisfies_snapshot checks whether the tuple in a slot is
   currently visible according to a snapshot. That's required as a
   few places now don't have a buffer + HeapTuple around, but a slot
   (which in heap's case internally has that information).

Additionally a few infrastructure changes were needed:

I) SysScanDesc, as used by systable_{beginscan, getnext} et al., now
   internally uses a slot to keep track of tuples. While
   systable_getnext() still returns HeapTuples, and will do so for
   the foreseeable future, the index API (see 3) above) now only
   deals with slots.

The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.

Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
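
/*
 * Illustrative only (a hypothetical helper, not part of this file's API):
 * a minimal sketch of the scan pattern that replaces the old
 * heap_beginscan()/heap_getnext() loops described above. Everything here
 * is AM-independent; the AM's scan_begin/scan_getnextslot callbacks do
 * the actual work underneath.
 */
static uint64
example_count_tuples(Relation rel, Snapshot snapshot)
{
	TableScanDesc scan = table_beginscan(rel, snapshot, 0, NULL);
	TupleTableSlot *slot = table_slot_create(rel, NULL);
	uint64		ntuples = 0;

	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
		ntuples++;

	ExecDropSingleTupleTableSlot(slot);
	table_endscan(scan);

	return ntuples;
}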

/* ----------------------------------------------------------------------------
 * Slot functions.
 * ----------------------------------------------------------------------------
 */

const TupleTableSlotOps *
table_slot_callbacks(Relation relation)
{
	const TupleTableSlotOps *tts_cb;

	if (relation->rd_tableam)
		tts_cb = relation->rd_tableam->slot_callbacks(relation);
	else if (relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
	{
		/*
		 * Historically FDWs expect to store heap tuples in slots. Continue
		 * handing them one, to make it less painful to adapt FDWs to new
		 * versions. The cost of a heap slot over a virtual slot is pretty
		 * small.
		 */
		tts_cb = &TTSOpsHeapTuple;
	}
	else
	{
		/*
		 * These need to be supported, as some parts of the code (like COPY)
		 * need to create slots for such relations too. It seems better to
		 * centralize the knowledge that a heap slot is the right thing in
		 * that case here.
		 */
		Assert(relation->rd_rel->relkind == RELKIND_VIEW ||
			   relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
		tts_cb = &TTSOpsVirtual;
	}

	return tts_cb;
}

TupleTableSlot *
table_slot_create(Relation relation, List **reglist)
{
	const TupleTableSlotOps *tts_cb;
	TupleTableSlot *slot;

	tts_cb = table_slot_callbacks(relation);
	slot = MakeSingleTupleTableSlot(RelationGetDescr(relation), tts_cb);

	if (reglist)
		*reglist = lappend(*reglist, slot);

	return slot;
}

/* ----------------------------------------------------------------------------
 * Table scan functions.
 * ----------------------------------------------------------------------------
 */

TableScanDesc
table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
{
	Oid			relid = RelationGetRelid(relation);
	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));

	/*
	 * The trailing booleans allow a buffer access strategy, syncscan and
	 * pagemode, mark this as neither a bitmap nor a sample scan, and flag
	 * the snapshot as temporary, i.e. registered here and to be
	 * unregistered again at endscan time.
	 */
	return relation->rd_tableam->scan_begin(relation, snapshot, nkeys, key,
											NULL, true, true, true, false,
											false, true);
}

void
table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
{
	Assert(IsMVCCSnapshot(snapshot));

	RegisterSnapshot(snapshot);
	scan->rs_snapshot = snapshot;
	scan->rs_temp_snap = true;
}

/* ----------------------------------------------------------------------------
 * Parallel table scan related functions.
 * ----------------------------------------------------------------------------
 */

Size
table_parallelscan_estimate(Relation rel, Snapshot snapshot)
{
	Size		sz = 0;

	if (IsMVCCSnapshot(snapshot))
		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
	else
		Assert(snapshot == SnapshotAny);

	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));

	return sz;
}

void
table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
							  Snapshot snapshot)
{
	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

	pscan->phs_snapshot_off = snapshot_off;

	if (IsMVCCSnapshot(snapshot))
	{
		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
		pscan->phs_snapshot_any = false;
	}
	else
	{
		Assert(snapshot == SnapshotAny);
		pscan->phs_snapshot_any = true;
	}
}

TableScanDesc
table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
{
	Snapshot	snapshot;

	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);

	if (!parallel_scan->phs_snapshot_any)
	{
		/* Snapshot was serialized -- restore it */
		snapshot = RestoreSnapshot((char *) parallel_scan +
								   parallel_scan->phs_snapshot_off);
		RegisterSnapshot(snapshot);
	}
	else
	{
		/* SnapshotAny passed by caller (not serialized) */
		snapshot = SnapshotAny;
	}

	/*
	 * The trailing booleans allow a buffer access strategy, syncscan and
	 * pagemode, mark this as neither a bitmap nor a sample scan, and flag
	 * the snapshot as temporary only if it was restored (and thus
	 * registered) above.
	 */
	return relation->rd_tableam->scan_begin(relation, snapshot, 0, NULL,
											parallel_scan, true, true, true,
											false, false,
											!parallel_scan->phs_snapshot_any);
}
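
/*
 * Illustrative only (hypothetical helpers): the intended division of labor
 * for the three functions above. The leader sizes and fills the shared
 * descriptor (real callers allocate it in dynamic shared memory, see
 * execParallel.c, rather than with palloc0 as here); each worker then
 * attaches to it.
 */
static ParallelTableScanDesc
example_leader_setup(Relation rel, Snapshot snapshot)
{
	/* size covers AM-specific state plus a serialized MVCC snapshot */
	Size		sz = table_parallelscan_estimate(rel, snapshot);
	ParallelTableScanDesc pscan = (ParallelTableScanDesc) palloc0(sz);

	table_parallelscan_initialize(rel, pscan, snapshot);
	return pscan;
}

static TableScanDesc
example_worker_attach(Relation rel, ParallelTableScanDesc pscan)
{
	/* restores the serialized snapshot, then calls the AM's scan_begin */
	return table_beginscan_parallel(rel, pscan);
}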

/* ----------------------------------------------------------------------------
 * Index scan related functions.
 * ----------------------------------------------------------------------------
 */

/*
 * To perform that check, simply start an index scan, create the necessary
 * slot, do the heap lookup, and shut everything down again. This could be
 * optimized, but is unlikely to matter from a performance POV. If there
 * frequently are live index pointers also matching a unique index key, the
 * CPU overhead of this routine is unlikely to matter.
 */
bool
table_index_fetch_tuple_check(Relation rel,
							  ItemPointer tid,
							  Snapshot snapshot,
							  bool *all_dead)
{
	IndexFetchTableData *scan;
	TupleTableSlot *slot;
	bool		call_again = false;
	bool		found;

	slot = table_slot_create(rel, NULL);
	scan = table_index_fetch_begin(rel);
	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
									all_dead);
	table_index_fetch_end(scan);
	ExecDropSingleTupleTableSlot(slot);

	return found;
}

/* ------------------------------------------------------------------------
 * Functions for non-modifying operations on individual tuples
 * ------------------------------------------------------------------------
 */

void
table_get_latest_tid(TableScanDesc scan, ItemPointer tid)
{
	Relation	rel = scan->rs_rd;
	const TableAmRoutine *tableam = rel->rd_tableam;

	/*
	 * Since this can be called with user-supplied TID, don't trust the input
	 * too much.
	 */
	if (!tableam->tuple_tid_valid(scan, tid))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("tid (%u, %u) is not valid for relation \"%s\"",
						ItemPointerGetBlockNumberNoCheck(tid),
						ItemPointerGetOffsetNumberNoCheck(tid),
						RelationGetRelationName(rel))));

	tableam->tuple_get_latest_tid(scan, tid);
}

tableam: Add tuple_{insert, delete, update, lock} and use.

This adds new, required, table AM callbacks for insert/delete/update
and lock_tuple. To be able to reasonably use those, the EvalPlanQual
mechanism had to be adapted, moving more logic into the AM.

Previously both delete/update/lock call sites and the EPQ mechanism
had to be aware of the specific tuple format to be able to fetch the
latest version of a tuple. Obviously that needs to be abstracted
away. To do so, move the logic that finds the latest row version into
the AM. lock_tuple has a new flag argument,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last
version, rather than the current one. It'd have been possible to do
so via a separate callback as well, but finding the last version
usually also necessitates locking the newest version, making it
sensible to combine the two. This replaces the previous use of
EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously
signaled either a concurrent update or delete, is now split in two,
to avoid callers needing AM-specific knowledge to differentiate.

The move of finding the latest row version into tuple_lock means that
encountering a row concurrently moved into another partition will now
raise an error about "tuple to be locked" rather than "tuple to be
updated/deleted" - which is accurate, as that always happens when
locking rows. While possibly slightly less helpful for users, it
seems like an acceptable trade-off.

As part of this commit HTSU_Result has been renamed to TM_Result, and
its members expanded to differentiate between updating and deleting.
HeapUpdateFailureData has been renamed to TM_FailureData.

The interface to speculative insertion is changed so nodeModifyTable.c
does not have to set the speculative token itself anymore. Instead
there's a version of tuple_insert, tuple_insert_speculative, that
performs the speculative insertion (without requiring a flag to signal
that fact), and the speculative insertion is either made permanent
with table_complete_speculative(succeeded = true) or aborted with
table_complete_speculative(succeeded = false).

Note that multi_insert is not yet routed through tableam, nor is
COPY. Changing multi_insert requires changes to copy.c that are large
enough to better be done separately.

Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED
VIEW are also only going to be adjusted in a later commit.

Author: Andres Freund and Haribabu Kommi
Discussion:
    https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
    https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de
    https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-24 03:55:57 +01:00
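
/*
 * Illustrative only (hypothetical helper): the intended call pattern for
 * the simple_* wrappers below. The caller materializes a tuple in a slot
 * created for the relation's AM and lets the wrapper route the insertion
 * through the table AM; error handling and resource ownership are omitted.
 */
static void
example_insert_row(Relation rel, Datum *values, bool *isnull)
{
	TupleTableSlot *slot = table_slot_create(rel, NULL);
	int			natts = RelationGetDescr(rel)->natts;

	/* form a virtual tuple in the slot from the caller's arrays */
	ExecClearTuple(slot);
	memcpy(slot->tts_values, values, sizeof(Datum) * natts);
	memcpy(slot->tts_isnull, isnull, sizeof(bool) * natts);
	ExecStoreVirtualTuple(slot);

	simple_table_insert(rel, slot);

	ExecDropSingleTupleTableSlot(slot);
}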

/* ----------------------------------------------------------------------------
 * Functions to make modifications a bit simpler.
 * ----------------------------------------------------------------------------
 */

/*
 * simple_table_insert - insert a tuple
 *
 * Currently, this routine differs from table_insert only in supplying a
 * default command ID and not allowing access to the speedup options.
 */
void
simple_table_insert(Relation rel, TupleTableSlot *slot)
{
	table_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
}

/*
 * simple_table_delete - delete a tuple
 *
 * This routine may be used to delete a tuple when concurrent updates of
 * the target tuple are not expected (for example, because we have a lock
 * on the relation associated with the tuple). Any failure is reported
 * via ereport().
 */
void
simple_table_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
{
	TM_Result	result;
	TM_FailureData tmfd;

	result = table_delete(rel, tid,
						  GetCurrentCommandId(true),
						  snapshot, InvalidSnapshot,
						  true /* wait for commit */ ,
						  &tmfd, false /* changingPart */ );

	switch (result)
	{
		case TM_SelfModified:
			/* Tuple was already updated in current command? */
			elog(ERROR, "tuple already updated by self");
			break;

		case TM_Ok:
			/* done successfully */
			break;

		case TM_Updated:
			elog(ERROR, "tuple concurrently updated");
			break;

		case TM_Deleted:
			elog(ERROR, "tuple concurrently deleted");
			break;

		default:
			elog(ERROR, "unrecognized table_delete status: %u", result);
			break;
	}
}

/*
 * simple_table_update - replace a tuple
 *
 * This routine may be used to update a tuple when concurrent updates of
 * the target tuple are not expected (for example, because we have a lock
 * on the relation associated with the tuple). Any failure is reported
 * via ereport().
 */
void
simple_table_update(Relation rel, ItemPointer otid,
					TupleTableSlot *slot,
					Snapshot snapshot,
					bool *update_indexes)
{
	TM_Result	result;
	TM_FailureData tmfd;
	LockTupleMode lockmode;

	result = table_update(rel, otid, slot,
						  GetCurrentCommandId(true),
						  snapshot, InvalidSnapshot,
						  true /* wait for commit */ ,
						  &tmfd, &lockmode, update_indexes);

	switch (result)
	{
		case TM_SelfModified:
			/* Tuple was already updated in current command? */
			elog(ERROR, "tuple already updated by self");
			break;

		case TM_Ok:
			/* done successfully */
			break;

		case TM_Updated:
			elog(ERROR, "tuple concurrently updated");
			break;

		case TM_Deleted:
			elog(ERROR, "tuple concurrently deleted");
			break;

		default:
			elog(ERROR, "unrecognized table_update status: %u", result);
			break;
	}
}

/* ----------------------------------------------------------------------------
 * Helper functions to implement parallel scans for block oriented AMs.
 * ----------------------------------------------------------------------------
 */

Size
table_block_parallelscan_estimate(Relation rel)
{
	return sizeof(ParallelBlockTableScanDescData);
}

Size
table_block_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan)
{
	ParallelBlockTableScanDesc bpscan = (ParallelBlockTableScanDesc) pscan;

	bpscan->base.phs_relid = RelationGetRelid(rel);
	bpscan->phs_nblocks = RelationGetNumberOfBlocks(rel);
	/* compare phs_syncscan initialization to similar logic in initscan */
	bpscan->base.phs_syncscan = synchronize_seqscans &&
		!RelationUsesLocalBuffers(rel) &&
		bpscan->phs_nblocks > NBuffers / 4;
	SpinLockInit(&bpscan->phs_mutex);
	bpscan->phs_startblock = InvalidBlockNumber;
	pg_atomic_init_u64(&bpscan->phs_nallocated, 0);

	return sizeof(ParallelBlockTableScanDescData);
}

void
table_block_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan)
{
	ParallelBlockTableScanDesc bpscan = (ParallelBlockTableScanDesc) pscan;

	pg_atomic_write_u64(&bpscan->phs_nallocated, 0);
}

/*
 * find and set the scan's startblock
 *
 * Determine where the parallel seq scan should start. This function may be
 * called many times, once by each parallel worker. We must be careful only
 * to set the startblock once.
 */
void
table_block_parallelscan_startblock_init(Relation rel, ParallelBlockTableScanDesc pbscan)
{
	BlockNumber sync_startpage = InvalidBlockNumber;

retry:
	/* Grab the spinlock. */
	SpinLockAcquire(&pbscan->phs_mutex);

	/*
	 * If the scan's startblock has not yet been initialized, we must do so
	 * now. If this is not a synchronized scan, we just start at block 0, but
	 * if it is a synchronized scan, we must get the starting position from
	 * the synchronized scan machinery. We can't hold the spinlock while
	 * doing that, though, so release the spinlock, get the information we
	 * need, and retry. If nobody else has initialized the scan in the
	 * meantime, we'll fill in the value we fetched on the second time
	 * through.
	 */
	if (pbscan->phs_startblock == InvalidBlockNumber)
	{
		if (!pbscan->base.phs_syncscan)
			pbscan->phs_startblock = 0;
		else if (sync_startpage != InvalidBlockNumber)
			pbscan->phs_startblock = sync_startpage;
		else
		{
			SpinLockRelease(&pbscan->phs_mutex);
			sync_startpage = ss_get_location(rel, pbscan->phs_nblocks);
			goto retry;
		}
	}
	SpinLockRelease(&pbscan->phs_mutex);
}

/*
 * get the next page to scan
 *
 * Get the next page to scan. Even if there are no pages left to scan,
 * another backend could have grabbed a page to scan and not yet finished
 * looking at it, so it doesn't follow that the scan is done when the first
 * backend gets an InvalidBlockNumber return.
 */
BlockNumber
table_block_parallelscan_nextpage(Relation rel, ParallelBlockTableScanDesc pbscan)
{
	BlockNumber page;
	uint64		nallocated;

	/*
	 * phs_nallocated tracks how many pages have been allocated to workers
	 * already. When phs_nallocated >= rs_nblocks, all blocks have been
	 * allocated.
	 *
	 * Because we use an atomic fetch-and-add to fetch the current value, the
	 * phs_nallocated counter will exceed rs_nblocks, because workers will
	 * still increment the value, when they try to allocate the next block
	 * but all blocks have been allocated already. The counter must be 64
	 * bits wide because of that, to avoid wrapping around when rs_nblocks is
	 * close to 2^32.
	 *
	 * The actual page to return is calculated by adding the counter to the
	 * starting block number, modulo nblocks.
	 */
	nallocated = pg_atomic_fetch_add_u64(&pbscan->phs_nallocated, 1);
	if (nallocated >= pbscan->phs_nblocks)
		page = InvalidBlockNumber;	/* all blocks have been allocated */
	else
		page = (nallocated + pbscan->phs_startblock) % pbscan->phs_nblocks;

	/*
	 * Report scan location. Normally, we report the current page number.
	 * When we reach the end of the scan, though, we report the starting
	 * page, not the ending page, just so the starting positions for later
	 * scans don't slew backwards. We only report the position at the end of
	 * the scan once, though: subsequent callers will report nothing.
	 */
	if (pbscan->base.phs_syncscan)
	{
		if (page != InvalidBlockNumber)
			ss_report_location(rel, page);
		else if (nallocated == pbscan->phs_nblocks)
			ss_report_location(rel, pbscan->phs_startblock);
	}

	return page;
}
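
/*
 * Illustrative only (hypothetical helper): how a block oriented AM would
 * consume the two helpers above from within its scan implementation. Each
 * worker initializes the shared start block once, then pulls pages until
 * InvalidBlockNumber indicates that every block has been handed out.
 */
static void
example_block_am_scan(Relation rel, ParallelBlockTableScanDesc pbscan)
{
	BlockNumber page;

	table_block_parallelscan_startblock_init(rel, pbscan);

	while ((page = table_block_parallelscan_nextpage(rel, pbscan)) !=
		   InvalidBlockNumber)
	{
		/* read the block via the buffer manager and process its tuples */
	}
}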