2009-05-12 02:56:05 +02:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
|
|
|
* pg_inherits.c
|
|
|
|
* routines to support manipulation of the pg_inherits relation
|
|
|
|
*
|
Disallow converting an inheritance child table to a view.
Generally, members of inheritance trees must be plain tables (or,
in more recent versions, foreign tables). ALTER TABLE INHERIT
rejects creating an inheritance relationship that has a view at
either end. When DefineQueryRewrite attempts to convert a relation
to a view, it already had checks prohibiting doing so for partitioning
parents or children as well as traditional-inheritance parents ...
but it neglected to check that a traditional-inheritance child wasn't
being converted. Since the planner assumes that any inheritance
child is a table, this led to making plans that tried to do a physical
scan on a view, causing failures (or even crashes, in recent versions).
One could imagine trying to support such a case by expanding the view
normally, but since the rewriter runs before the planner does
inheritance expansion, it would take some very fundamental refactoring
to make that possible. There are probably a lot of other parts of the
system that don't cope well with such a situation, too. For now,
just forbid it.
Per bug #16856 from Yang Lin. Back-patch to all supported branches.
(In versions before v10, this includes back-patching the portion of
commit 501ed02cf that added has_superclass(). Perhaps the lack of
that infrastructure partially explains the missing check.)
Discussion: https://postgr.es/m/16856-0363e05c6e1612fd@postgresql.org
2021-02-06 21:17:01 +01:00
|
|
|
* Note: currently, this module mostly contains inquiry functions; actual
|
|
|
|
* creation and deletion of pg_inherits entries is mostly done in tablecmds.c.
|
2009-05-12 02:56:05 +02:00
|
|
|
* Perhaps someday that code should be moved here, but it'd have to be
|
|
|
|
* disentangled from other stuff such as pg_depend updates.
|
|
|
|
*
|
2021-01-02 19:06:25 +01:00
|
|
|
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
|
2009-05-12 02:56:05 +02:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/catalog/pg_inherits.c
|
2009-05-12 02:56:05 +02:00
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
|
|
|
#include "postgres.h"
|
|
|
|
|
2019-12-27 00:09:00 +01:00
|
|
|
#include "access/genam.h"
|
2012-08-30 22:15:44 +02:00
|
|
|
#include "access/htup_details.h"
|
2019-01-21 19:18:20 +01:00
|
|
|
#include "access/table.h"
|
2009-12-29 23:00:14 +01:00
|
|
|
#include "catalog/indexing.h"
|
2009-05-12 02:56:05 +02:00
|
|
|
#include "catalog/pg_inherits.h"
|
|
|
|
#include "parser/parse_type.h"
|
2009-05-12 05:11:02 +02:00
|
|
|
#include "storage/lmgr.h"
|
2017-03-01 17:55:28 +01:00
|
|
|
#include "utils/builtins.h"
|
2009-05-12 02:56:05 +02:00
|
|
|
#include "utils/fmgroids.h"
|
2017-05-17 01:33:31 +02:00
|
|
|
#include "utils/memutils.h"
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
#include "utils/snapmgr.h"
|
2009-05-12 02:56:05 +02:00
|
|
|
#include "utils/syscache.h"
|
|
|
|
|
2017-03-27 18:07:48 +02:00
|
|
|
/*
|
|
|
|
* Entry of a hash table used in find_all_inheritors. See below.
|
|
|
|
*/
|
|
|
|
typedef struct SeenRelsEntry
|
|
|
|
{
|
|
|
|
Oid rel_id; /* relation oid */
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
int list_index; /* its position in output list(s) */
|
2017-03-27 18:07:48 +02:00
|
|
|
} SeenRelsEntry;
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* find_inheritance_children
|
|
|
|
*
|
|
|
|
* Returns a list containing the OIDs of all relations which
|
|
|
|
* inherit *directly* from the relation with OID 'parentrelId'.
|
2009-05-12 05:11:02 +02:00
|
|
|
*
|
|
|
|
* The specified lock type is acquired on each child relation (but not on the
|
|
|
|
* given rel; caller should already have locked it). If lockmode is NoLock
|
|
|
|
* then no locks are acquired, but caller must beware of race conditions
|
|
|
|
* against possible DROPs of child relations.
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
*
|
2021-04-28 21:44:35 +02:00
|
|
|
* Partitions marked as being detached are omitted; see
|
|
|
|
* find_inheritance_children_extended for details.
|
|
|
|
*/
|
|
|
|
List *
|
|
|
|
find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
|
|
|
|
{
|
|
|
|
return find_inheritance_children_extended(parentrelId, true, lockmode,
|
|
|
|
NULL, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* find_inheritance_children_extended
|
|
|
|
*
|
|
|
|
* As find_inheritance_children, with more options regarding detached
|
|
|
|
* partitions.
|
|
|
|
*
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
* If a partition's pg_inherits row is marked "detach pending",
|
2021-04-22 22:04:48 +02:00
|
|
|
* *detached_exist (if not null) is set true.
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
*
|
|
|
|
* If omit_detached is true and there is an active snapshot (not the same as
|
|
|
|
* the catalog snapshot used to scan pg_inherits!) and a pg_inherits tuple
|
|
|
|
* marked "detach pending" is visible to that snapshot, then that partition is
|
|
|
|
* omitted from the output list. This makes partitions invisible depending on
|
|
|
|
* whether the transaction that marked those partitions as detached appears
|
2021-04-28 21:44:35 +02:00
|
|
|
* committed to the active snapshot. In addition, *detached_xmin (if not null)
|
|
|
|
* is set to the xmin of the row of the detached partition.
|
2009-05-12 02:56:05 +02:00
|
|
|
*/
|
|
|
|
List *
|
2021-04-28 21:44:35 +02:00
|
|
|
find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
|
|
|
|
LOCKMODE lockmode, bool *detached_exist,
|
|
|
|
TransactionId *detached_xmin)
|
2009-05-12 02:56:05 +02:00
|
|
|
{
|
|
|
|
List *list = NIL;
|
|
|
|
Relation relation;
|
2009-12-29 23:00:14 +01:00
|
|
|
SysScanDesc scan;
|
2009-05-12 02:56:05 +02:00
|
|
|
ScanKeyData key[1];
|
|
|
|
HeapTuple inheritsTuple;
|
|
|
|
Oid inhrelid;
|
2009-12-29 23:00:14 +01:00
|
|
|
Oid *oidarr;
|
|
|
|
int maxoids,
|
|
|
|
numoids,
|
|
|
|
i;
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Can skip the scan if pg_class shows the relation has never had a
|
|
|
|
* subclass.
|
|
|
|
*/
|
|
|
|
if (!has_subclass(parentrelId))
|
|
|
|
return NIL;
|
|
|
|
|
|
|
|
/*
|
2009-12-29 23:00:14 +01:00
|
|
|
* Scan pg_inherits and build a working array of subclass OIDs.
|
2009-05-12 02:56:05 +02:00
|
|
|
*/
|
2009-12-29 23:00:14 +01:00
|
|
|
maxoids = 32;
|
|
|
|
oidarr = (Oid *) palloc(maxoids * sizeof(Oid));
|
|
|
|
numoids = 0;
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
relation = table_open(InheritsRelationId, AccessShareLock);
|
2009-12-29 23:00:14 +01:00
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
ScanKeyInit(&key[0],
|
|
|
|
Anum_pg_inherits_inhparent,
|
|
|
|
BTEqualStrategyNumber, F_OIDEQ,
|
|
|
|
ObjectIdGetDatum(parentrelId));
|
2009-05-12 05:11:02 +02:00
|
|
|
|
2009-12-29 23:00:14 +01:00
|
|
|
scan = systable_beginscan(relation, InheritsParentIndexId, true,
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
NULL, 1, key);
|
2009-12-29 23:00:14 +01:00
|
|
|
|
|
|
|
while ((inheritsTuple = systable_getnext(scan)) != NULL)
|
2009-05-12 02:56:05 +02:00
|
|
|
{
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
/*
|
|
|
|
* Cope with partitions concurrently being detached. When we see a
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
* partition marked "detach pending", we omit it from the returned set
|
|
|
|
* of visible partitions if caller requested that and the tuple's xmin
|
|
|
|
* does not appear in progress to the active snapshot. (If there's no
|
|
|
|
* active snapshot set, that means we're not running a user query, so
|
|
|
|
* it's OK to always include detached partitions in that case; if the
|
|
|
|
* xmin is still running to the active snapshot, then the partition
|
|
|
|
* has not been detached yet and so we include it.)
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
*
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
* The reason for this hack is that we want to avoid seeing the
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
* partition as alive in RI queries during REPEATABLE READ or
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
* SERIALIZABLE transactions: such queries use a different snapshot
|
|
|
|
* than the one used by regular (user) queries.
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
*/
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
if (((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending)
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
{
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
if (detached_exist)
|
|
|
|
*detached_exist = true;
|
|
|
|
|
|
|
|
if (omit_detached && ActiveSnapshotSet())
|
|
|
|
{
|
|
|
|
TransactionId xmin;
|
|
|
|
Snapshot snap;
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
|
|
|
|
snap = GetActiveSnapshot();
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
if (!XidInMVCCSnapshot(xmin, snap))
|
2021-04-28 21:44:35 +02:00
|
|
|
{
|
|
|
|
if (detached_xmin)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Two detached partitions should not occur (see
|
|
|
|
* checks in MarkInheritDetached), but if they do,
|
|
|
|
* track the newer of the two. Make sure to warn the
|
|
|
|
* user, so that they can clean up. Since this is
|
|
|
|
* just a cross-check against potentially corrupt
|
|
|
|
* catalogs, we don't make it a full-fledged error
|
|
|
|
* message.
|
|
|
|
*/
|
|
|
|
if (*detached_xmin != InvalidTransactionId)
|
|
|
|
{
|
|
|
|
elog(WARNING, "more than one partition pending detach found for table with OID %u",
|
|
|
|
parentrelId);
|
|
|
|
if (TransactionIdFollows(xmin, *detached_xmin))
|
|
|
|
*detached_xmin = xmin;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
*detached_xmin = xmin;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Don't add the partition to the output list */
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
continue;
|
2021-04-28 21:44:35 +02:00
|
|
|
}
|
Fix relcache inconsistency hazard in partition detach
During queries coming from ri_triggers.c, we need to omit partitions
that are marked pending detach -- otherwise, the RI query is tricked
into allowing a row into the referencing table whose corresponding row
is in the detached partition. Which is bogus: once the detach operation
completes, the row becomes an orphan.
However, the code was not doing that in repeatable-read transactions,
because relcache kept a copy of the partition descriptor that included
the partition, and used it in the RI query. This commit changes the
partdesc cache code to only keep descriptors that aren't dependent on
a snapshot (namely: those where no detached partition exist, and those
where detached partitions are included). When a partdesc-without-
detached-partitions is requested, we create one afresh each time; also,
those partdescs are stored in PortalContext instead of
CacheMemoryContext.
find_inheritance_children gets a new output *detached_exist boolean,
which indicates whether any partition marked pending-detach is found.
Its "include_detached" input flag is changed to "omit_detached", because
that name captures desired the semantics more naturally.
CreatePartitionDirectory() and RelationGetPartitionDesc() arguments are
identically renamed.
This was noticed because a buildfarm member that runs with relcache
clobbering, which would not keep the improperly cached partdesc, broke
one test, which led us to realize that the expected output of that test
was bogus. This commit also corrects that expected output.
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3269784.1617215412@sss.pgh.pa.us
2021-04-22 21:13:25 +02:00
|
|
|
}
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
}
|
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
|
2009-12-29 23:00:14 +01:00
|
|
|
if (numoids >= maxoids)
|
|
|
|
{
|
|
|
|
maxoids *= 2;
|
|
|
|
oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid));
|
|
|
|
}
|
|
|
|
oidarr[numoids++] = inhrelid;
|
|
|
|
}
|
|
|
|
|
|
|
|
systable_endscan(scan);
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(relation, AccessShareLock);
|
2009-12-29 23:00:14 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we found more than one child, sort them by OID. This ensures
|
|
|
|
* reasonably consistent behavior regardless of the vagaries of an
|
|
|
|
* indexscan. This is important since we need to be sure all backends
|
|
|
|
* lock children in the same order to avoid needless deadlocks.
|
|
|
|
*/
|
|
|
|
if (numoids > 1)
|
|
|
|
qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Acquire locks and build the result list.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < numoids; i++)
|
|
|
|
{
|
|
|
|
inhrelid = oidarr[i];
|
2009-05-12 05:11:02 +02:00
|
|
|
|
|
|
|
if (lockmode != NoLock)
|
|
|
|
{
|
|
|
|
/* Get the lock to synchronize against concurrent drop */
|
|
|
|
LockRelationOid(inhrelid, lockmode);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we have the lock, double-check to see if the relation
|
|
|
|
* really exists or not. If not, assume it was dropped while we
|
|
|
|
* waited to acquire lock, and ignore it.
|
|
|
|
*/
|
2010-02-14 19:42:19 +01:00
|
|
|
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(inhrelid)))
|
2009-05-12 05:11:02 +02:00
|
|
|
{
|
|
|
|
/* Release useless lock */
|
|
|
|
UnlockRelationOid(inhrelid, lockmode);
|
|
|
|
/* And ignore this relation */
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
list = lappend_oid(list, inhrelid);
|
|
|
|
}
|
2009-05-12 05:11:02 +02:00
|
|
|
|
2009-12-29 23:00:14 +01:00
|
|
|
pfree(oidarr);
|
2009-05-12 05:11:02 +02:00
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
return list;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* find_all_inheritors -
|
|
|
|
* Returns a list of relation OIDs including the given rel plus
|
|
|
|
* all relations that inherit from it, directly or indirectly.
|
2010-02-01 20:28:56 +01:00
|
|
|
* Optionally, it also returns the number of parents found for
|
|
|
|
* each such relation within the inheritance tree rooted at the
|
|
|
|
* given rel.
|
2009-05-12 05:11:02 +02:00
|
|
|
*
|
|
|
|
* The specified lock type is acquired on all child relations (but not on the
|
|
|
|
* given rel; caller should already have locked it). If lockmode is NoLock
|
|
|
|
* then no locks are acquired, but caller must beware of race conditions
|
|
|
|
* against possible DROPs of child relations.
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
*
|
|
|
|
* NB - No current callers of this routine are interested in children being
|
|
|
|
* concurrently detached, so there's no provision to include them.
|
2009-05-12 02:56:05 +02:00
|
|
|
*/
|
|
|
|
List *
|
2010-02-01 20:28:56 +01:00
|
|
|
find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
|
2009-05-12 02:56:05 +02:00
|
|
|
{
|
2017-03-27 18:07:48 +02:00
|
|
|
/* hash table for O(1) rel_oid -> rel_numparents cell lookup */
|
|
|
|
HTAB *seen_rels;
|
|
|
|
HASHCTL ctl;
|
2010-02-01 20:28:56 +01:00
|
|
|
List *rels_list,
|
|
|
|
*rel_numparents;
|
2009-05-12 02:56:05 +02:00
|
|
|
ListCell *l;
|
|
|
|
|
2017-03-27 18:07:48 +02:00
|
|
|
ctl.keysize = sizeof(Oid);
|
|
|
|
ctl.entrysize = sizeof(SeenRelsEntry);
|
2017-05-17 01:33:31 +02:00
|
|
|
ctl.hcxt = CurrentMemoryContext;
|
2017-03-27 18:07:48 +02:00
|
|
|
|
2017-05-17 01:33:31 +02:00
|
|
|
seen_rels = hash_create("find_all_inheritors temporary table",
|
|
|
|
32, /* start small and extend */
|
|
|
|
&ctl,
|
|
|
|
HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
|
2017-03-27 18:07:48 +02:00
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
/*
|
|
|
|
* We build a list starting with the given rel and adding all direct and
|
|
|
|
* indirect children. We can use a single list as both the record of
|
|
|
|
* already-found rels and the agenda of rels yet to be scanned for more
|
|
|
|
* children. This is a bit tricky but works because the foreach() macro
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
* doesn't fetch the next list element until the bottom of the loop. Note
|
|
|
|
* that we can't keep pointers into the output lists; but an index is
|
|
|
|
* sufficient.
|
2009-05-12 02:56:05 +02:00
|
|
|
*/
|
|
|
|
rels_list = list_make1_oid(parentrelId);
|
2010-02-01 20:28:56 +01:00
|
|
|
rel_numparents = list_make1_int(0);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
foreach(l, rels_list)
|
|
|
|
{
|
|
|
|
Oid currentrel = lfirst_oid(l);
|
|
|
|
List *currentchildren;
|
2010-02-01 20:28:56 +01:00
|
|
|
ListCell *lc;
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/* Get the direct children of this rel */
|
2021-04-28 21:44:35 +02:00
|
|
|
currentchildren = find_inheritance_children(currentrel, lockmode);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add to the queue only those children not already seen. This avoids
|
|
|
|
* making duplicate entries in case of multiple inheritance paths from
|
|
|
|
* the same parent. (It'll also keep us from getting into an infinite
|
|
|
|
* loop, though theoretically there can't be any cycles in the
|
|
|
|
* inheritance graph anyway.)
|
|
|
|
*/
|
2010-02-01 20:28:56 +01:00
|
|
|
foreach(lc, currentchildren)
|
|
|
|
{
|
|
|
|
Oid child_oid = lfirst_oid(lc);
|
2017-03-27 18:07:48 +02:00
|
|
|
bool found;
|
|
|
|
SeenRelsEntry *hash_entry;
|
2010-02-01 20:28:56 +01:00
|
|
|
|
2017-03-27 18:07:48 +02:00
|
|
|
hash_entry = hash_search(seen_rels, &child_oid, HASH_ENTER, &found);
|
|
|
|
if (found)
|
2010-02-01 20:28:56 +01:00
|
|
|
{
|
2017-03-27 18:07:48 +02:00
|
|
|
/* if the rel is already there, bump number-of-parents counter */
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
ListCell *numparents_cell;
|
|
|
|
|
|
|
|
numparents_cell = list_nth_cell(rel_numparents,
|
|
|
|
hash_entry->list_index);
|
|
|
|
lfirst_int(numparents_cell)++;
|
2010-02-01 20:28:56 +01:00
|
|
|
}
|
2017-03-27 18:07:48 +02:00
|
|
|
else
|
2010-02-01 20:28:56 +01:00
|
|
|
{
|
2017-03-27 18:07:48 +02:00
|
|
|
/* if it's not there, add it. expect 1 parent, initially. */
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
hash_entry->list_index = list_length(rels_list);
|
2010-02-01 20:28:56 +01:00
|
|
|
rels_list = lappend_oid(rels_list, child_oid);
|
|
|
|
rel_numparents = lappend_int(rel_numparents, 1);
|
|
|
|
}
|
|
|
|
}
|
2009-05-12 02:56:05 +02:00
|
|
|
}
|
|
|
|
|
2010-02-01 20:28:56 +01:00
|
|
|
if (numparents)
|
|
|
|
*numparents = rel_numparents;
|
|
|
|
else
|
|
|
|
list_free(rel_numparents);
|
2017-03-27 18:07:48 +02:00
|
|
|
|
|
|
|
hash_destroy(seen_rels);
|
|
|
|
|
2009-05-12 02:56:05 +02:00
|
|
|
return rels_list;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* has_subclass - does this relation have any children?
|
|
|
|
*
|
|
|
|
* In the current implementation, has_subclass returns whether a
|
|
|
|
* particular class *might* have a subclass. It will not return the
|
|
|
|
* correct result if a class had a subclass which was later dropped.
|
2011-09-02 20:29:31 +02:00
|
|
|
* This is because relhassubclass in pg_class is not updated immediately
|
|
|
|
* when a subclass is dropped, primarily because of concurrency concerns.
|
2009-05-12 02:56:05 +02:00
|
|
|
*
|
|
|
|
* Currently has_subclass is only used as an efficiency hack to skip
|
2011-09-02 20:29:31 +02:00
|
|
|
* unnecessary inheritance searches, so this is OK. Note that ANALYZE
|
|
|
|
* on a childless table will clean up the obsolete relhassubclass flag.
|
2009-05-12 02:56:05 +02:00
|
|
|
*
|
|
|
|
* Although this doesn't actually touch pg_inherits, it seems reasonable
|
|
|
|
* to keep it here since it's normally used with the other routines here.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
has_subclass(Oid relationId)
|
|
|
|
{
|
|
|
|
HeapTuple tuple;
|
|
|
|
bool result;
|
|
|
|
|
2010-02-14 19:42:19 +01:00
|
|
|
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relationId));
|
2009-05-12 02:56:05 +02:00
|
|
|
if (!HeapTupleIsValid(tuple))
|
|
|
|
elog(ERROR, "cache lookup failed for relation %u", relationId);
|
|
|
|
|
|
|
|
result = ((Form_pg_class) GETSTRUCT(tuple))->relhassubclass;
|
|
|
|
ReleaseSysCache(tuple);
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
2017-06-28 19:55:03 +02:00
|
|
|
/*
|
Disallow converting an inheritance child table to a view.
Generally, members of inheritance trees must be plain tables (or,
in more recent versions, foreign tables). ALTER TABLE INHERIT
rejects creating an inheritance relationship that has a view at
either end. When DefineQueryRewrite attempts to convert a relation
to a view, it already had checks prohibiting doing so for partitioning
parents or children as well as traditional-inheritance parents ...
but it neglected to check that a traditional-inheritance child wasn't
being converted. Since the planner assumes that any inheritance
child is a table, this led to making plans that tried to do a physical
scan on a view, causing failures (or even crashes, in recent versions).
One could imagine trying to support such a case by expanding the view
normally, but since the rewriter runs before the planner does
inheritance expansion, it would take some very fundamental refactoring
to make that possible. There are probably a lot of other parts of the
system that don't cope well with such a situation, too. For now,
just forbid it.
Per bug #16856 from Yang Lin. Back-patch to all supported branches.
(In versions before v10, this includes back-patching the portion of
commit 501ed02cf that added has_superclass(). Perhaps the lack of
that infrastructure partially explains the missing check.)
Discussion: https://postgr.es/m/16856-0363e05c6e1612fd@postgresql.org
2021-02-06 21:17:01 +01:00
|
|
|
* has_superclass - does this relation inherit from another?
|
|
|
|
*
|
|
|
|
* Unlike has_subclass, this can be relied on to give an accurate answer.
|
|
|
|
* However, the caller must hold a lock on the given relation so that it
|
|
|
|
* can't be concurrently added to or removed from an inheritance hierarchy.
|
2017-06-28 19:55:03 +02:00
|
|
|
*/
|
|
|
|
bool
|
|
|
|
has_superclass(Oid relationId)
|
|
|
|
{
|
|
|
|
Relation catalog;
|
|
|
|
SysScanDesc scan;
|
|
|
|
ScanKeyData skey;
|
|
|
|
bool result;
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
catalog = table_open(InheritsRelationId, AccessShareLock);
|
2017-06-28 19:55:03 +02:00
|
|
|
ScanKeyInit(&skey, Anum_pg_inherits_inhrelid, BTEqualStrategyNumber,
|
|
|
|
F_OIDEQ, ObjectIdGetDatum(relationId));
|
|
|
|
scan = systable_beginscan(catalog, InheritsRelidSeqnoIndexId, true,
|
|
|
|
NULL, 1, &skey);
|
|
|
|
result = HeapTupleIsValid(systable_getnext(scan));
|
|
|
|
systable_endscan(scan);
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(catalog, AccessShareLock);
|
2017-06-28 19:55:03 +02:00
|
|
|
|
|
|
|
return result;
|
|
|
|
}
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Given two type OIDs, determine whether the first is a complex type
|
|
|
|
* (class type) that inherits from the second.
|
2017-10-26 19:47:45 +02:00
|
|
|
*
|
|
|
|
* This essentially asks whether the first type is guaranteed to be coercible
|
|
|
|
* to the second. Therefore, we allow the first type to be a domain over a
|
|
|
|
* complex type that inherits from the second; that creates no difficulties.
|
|
|
|
* But the second type cannot be a domain.
|
2009-05-12 02:56:05 +02:00
|
|
|
*/
|
|
|
|
bool
|
|
|
|
typeInheritsFrom(Oid subclassTypeId, Oid superclassTypeId)
|
|
|
|
{
|
|
|
|
bool result = false;
|
|
|
|
Oid subclassRelid;
|
|
|
|
Oid superclassRelid;
|
|
|
|
Relation inhrel;
|
|
|
|
List *visited,
|
|
|
|
*queue;
|
|
|
|
ListCell *queue_item;
|
|
|
|
|
|
|
|
/* We need to work with the associated relation OIDs */
|
2017-10-26 19:47:45 +02:00
|
|
|
subclassRelid = typeOrDomainTypeRelid(subclassTypeId);
|
2009-05-12 02:56:05 +02:00
|
|
|
if (subclassRelid == InvalidOid)
|
2017-10-26 19:47:45 +02:00
|
|
|
return false; /* not a complex type or domain over one */
|
2009-05-12 02:56:05 +02:00
|
|
|
superclassRelid = typeidTypeRelid(superclassTypeId);
|
|
|
|
if (superclassRelid == InvalidOid)
|
|
|
|
return false; /* not a complex type */
|
|
|
|
|
|
|
|
/* No point in searching if the superclass has no subclasses */
|
|
|
|
if (!has_subclass(superclassRelid))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Begin the search at the relation itself, so add its relid to the queue.
|
|
|
|
*/
|
|
|
|
queue = list_make1_oid(subclassRelid);
|
|
|
|
visited = NIL;
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
inhrel = table_open(InheritsRelationId, AccessShareLock);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Use queue to do a breadth-first traversal of the inheritance graph from
|
|
|
|
* the relid supplied up to the root. Notice that we append to the queue
|
|
|
|
* inside the loop --- this is okay because the foreach() macro doesn't
|
|
|
|
* advance queue_item until the next loop iteration begins.
|
|
|
|
*/
|
|
|
|
foreach(queue_item, queue)
|
|
|
|
{
|
|
|
|
Oid this_relid = lfirst_oid(queue_item);
|
|
|
|
ScanKeyData skey;
|
2009-12-29 23:00:14 +01:00
|
|
|
SysScanDesc inhscan;
|
2009-05-12 02:56:05 +02:00
|
|
|
HeapTuple inhtup;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've seen this relid already, skip it. This avoids extra work
|
|
|
|
* in multiple-inheritance scenarios, and also protects us from an
|
|
|
|
* infinite loop in case there is a cycle in pg_inherits (though
|
|
|
|
* theoretically that shouldn't happen).
|
|
|
|
*/
|
|
|
|
if (list_member_oid(visited, this_relid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Okay, this is a not-yet-seen relid. Add it to the list of
|
|
|
|
* already-visited OIDs, then find all the types this relid inherits
|
|
|
|
* from and add them to the queue.
|
|
|
|
*/
|
|
|
|
visited = lappend_oid(visited, this_relid);
|
|
|
|
|
|
|
|
ScanKeyInit(&skey,
|
|
|
|
Anum_pg_inherits_inhrelid,
|
|
|
|
BTEqualStrategyNumber, F_OIDEQ,
|
|
|
|
ObjectIdGetDatum(this_relid));
|
|
|
|
|
2009-12-29 23:00:14 +01:00
|
|
|
inhscan = systable_beginscan(inhrel, InheritsRelidSeqnoIndexId, true,
|
Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row. In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result. This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.
The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow. However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads. To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed. The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all. Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.
Patch by me. Review by Michael Paquier and Andres Freund.
2013-07-02 15:47:01 +02:00
|
|
|
NULL, 1, &skey);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
2009-12-29 23:00:14 +01:00
|
|
|
while ((inhtup = systable_getnext(inhscan)) != NULL)
|
2009-05-12 02:56:05 +02:00
|
|
|
{
|
|
|
|
Form_pg_inherits inh = (Form_pg_inherits) GETSTRUCT(inhtup);
|
|
|
|
Oid inhparent = inh->inhparent;
|
|
|
|
|
|
|
|
/* If this is the target superclass, we're done */
|
|
|
|
if (inhparent == superclassRelid)
|
|
|
|
{
|
|
|
|
result = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Else add to queue */
|
|
|
|
queue = lappend_oid(queue, inhparent);
|
|
|
|
}
|
|
|
|
|
2009-12-29 23:00:14 +01:00
|
|
|
systable_endscan(inhscan);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
if (result)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* clean up ... */
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(inhrel, AccessShareLock);
|
2009-05-12 02:56:05 +02:00
|
|
|
|
|
|
|
list_free(visited);
|
|
|
|
list_free(queue);
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a single pg_inherits row with the given data
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
StoreSingleInheritance(Oid relationId, Oid parentOid, int32 seqNumber)
|
|
|
|
{
|
|
|
|
Datum values[Natts_pg_inherits];
|
|
|
|
bool nulls[Natts_pg_inherits];
|
|
|
|
HeapTuple tuple;
|
|
|
|
Relation inhRelation;
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
inhRelation = table_open(InheritsRelationId, RowExclusiveLock);
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make the pg_inherits entry
|
|
|
|
*/
|
|
|
|
values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(relationId);
|
|
|
|
values[Anum_pg_inherits_inhparent - 1] = ObjectIdGetDatum(parentOid);
|
|
|
|
values[Anum_pg_inherits_inhseqno - 1] = Int32GetDatum(seqNumber);
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
values[Anum_pg_inherits_inhdetachpending - 1] = BoolGetDatum(false);
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
|
|
|
|
memset(nulls, 0, sizeof(nulls));
|
|
|
|
|
|
|
|
tuple = heap_form_tuple(RelationGetDescr(inhRelation), values, nulls);
|
|
|
|
|
|
|
|
CatalogTupleInsert(inhRelation, tuple);
|
|
|
|
|
|
|
|
heap_freetuple(tuple);
|
|
|
|
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(inhRelation, RowExclusiveLock);
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* DeleteInheritsTuple
|
|
|
|
*
|
|
|
|
* Delete pg_inherits tuples with the given inhrelid. inhparent may be given
|
|
|
|
* as InvalidOid, in which case all tuples matching inhrelid are deleted;
|
|
|
|
* otherwise only delete tuples with the specified inhparent.
|
|
|
|
*
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
* expect_detach_pending is the expected state of the inhdetachpending flag.
|
|
|
|
* If the catalog row does not match that state, an error is raised.
|
|
|
|
*
|
|
|
|
* childname is the partition name, if a table; pass NULL for regular
|
|
|
|
* inheritance or when working with other relation kinds.
|
|
|
|
*
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
* Returns whether at least one row was deleted.
|
|
|
|
*/
|
|
|
|
bool
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
DeleteInheritsTuple(Oid inhrelid, Oid inhparent, bool expect_detach_pending,
|
|
|
|
const char *childname)
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
{
|
|
|
|
bool found = false;
|
|
|
|
Relation catalogRelation;
|
|
|
|
ScanKeyData key;
|
|
|
|
SysScanDesc scan;
|
|
|
|
HeapTuple inheritsTuple;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find pg_inherits entries by inhrelid.
|
|
|
|
*/
|
2019-01-21 19:32:19 +01:00
|
|
|
catalogRelation = table_open(InheritsRelationId, RowExclusiveLock);
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
ScanKeyInit(&key,
|
|
|
|
Anum_pg_inherits_inhrelid,
|
|
|
|
BTEqualStrategyNumber, F_OIDEQ,
|
|
|
|
ObjectIdGetDatum(inhrelid));
|
|
|
|
scan = systable_beginscan(catalogRelation, InheritsRelidSeqnoIndexId,
|
|
|
|
true, NULL, 1, &key);
|
|
|
|
|
|
|
|
while (HeapTupleIsValid(inheritsTuple = systable_getnext(scan)))
|
|
|
|
{
|
|
|
|
Oid parent;
|
|
|
|
|
|
|
|
/* Compare inhparent if it was given, and do the actual deletion. */
|
|
|
|
parent = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhparent;
|
|
|
|
if (!OidIsValid(inhparent) || parent == inhparent)
|
|
|
|
{
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
bool detach_pending;
|
|
|
|
|
|
|
|
detach_pending =
|
|
|
|
((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Raise error depending on state. This should only happen for
|
|
|
|
* partitions, but we have no way to cross-check.
|
|
|
|
*/
|
|
|
|
if (detach_pending && !expect_detach_pending)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
|
|
|
|
errmsg("cannot detach partition \"%s\"",
|
|
|
|
childname ? childname : "unknown relation"),
|
|
|
|
errdetail("The partition is being detached concurrently or has an unfinished detach."),
|
|
|
|
errhint("Use ALTER TABLE ... DETACH PARTITION ... FINALIZE to complete the pending detach operation")));
|
|
|
|
if (!detach_pending && expect_detach_pending)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
|
|
|
|
errmsg("cannot complete detaching partition \"%s\"",
|
|
|
|
childname ? childname : "unknown relation"),
|
|
|
|
errdetail("There's no pending concurrent detach.")));
|
|
|
|
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
CatalogTupleDelete(catalogRelation, &inheritsTuple->t_self);
|
|
|
|
found = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Done */
|
|
|
|
systable_endscan(scan);
|
2019-01-21 19:32:19 +01:00
|
|
|
table_close(catalogRelation, RowExclusiveLock);
|
Local partitioned indexes
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
2018-01-19 15:49:22 +01:00
|
|
|
|
|
|
|
return found;
|
|
|
|
}
|
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.
Because it runs in two transactions, it cannot be used in a transaction
block. This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it. But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)
In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.
The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command. Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included. As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.
A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 22:00:28 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Return whether the pg_inherits tuple for a partition has the "detach
|
|
|
|
* pending" flag set.
|
|
|
|
*/
|
|
|
|
bool
|
|
|
|
PartitionHasPendingDetach(Oid partoid)
|
|
|
|
{
|
|
|
|
Relation catalogRelation;
|
|
|
|
ScanKeyData key;
|
|
|
|
SysScanDesc scan;
|
|
|
|
HeapTuple inheritsTuple;
|
|
|
|
|
|
|
|
/* We don't have a good way to verify it is in fact a partition */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the pg_inherits entry by inhrelid. (There should only be one.)
|
|
|
|
*/
|
|
|
|
catalogRelation = table_open(InheritsRelationId, RowExclusiveLock);
|
|
|
|
ScanKeyInit(&key,
|
|
|
|
Anum_pg_inherits_inhrelid,
|
|
|
|
BTEqualStrategyNumber, F_OIDEQ,
|
|
|
|
ObjectIdGetDatum(partoid));
|
|
|
|
scan = systable_beginscan(catalogRelation, InheritsRelidSeqnoIndexId,
|
|
|
|
true, NULL, 1, &key);
|
|
|
|
|
|
|
|
while (HeapTupleIsValid(inheritsTuple = systable_getnext(scan)))
|
|
|
|
{
|
|
|
|
bool detached;
|
|
|
|
|
|
|
|
detached =
|
|
|
|
((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending;
|
|
|
|
|
|
|
|
/* Done */
|
|
|
|
systable_endscan(scan);
|
|
|
|
table_close(catalogRelation, RowExclusiveLock);
|
|
|
|
|
|
|
|
return detached;
|
|
|
|
}
|
|
|
|
|
|
|
|
elog(ERROR, "relation %u is not a partition", partoid);
|
|
|
|
return false; /* keep compiler quiet */
|
|
|
|
}
|