After a tuple is routed to a partition, it has been converted from the
root table's row type to the partition's row type. ExecConstraints
needs to report the failure using the original tuple and the parent's
tuple descriptor rather than the ones for the selected partition.
Amit Langote
Commit 2ac3ef7a0 added a query with an underdetermined output row order;
it has failed multiple times in the buildfarm since then. Add an ORDER BY
to fix. Also, don't rely on a DROP CASCADE to drop in a well-determined
order; that hasn't failed yet but I don't trust it much, and we're not
saving any typing by using CASCADE anyway.
The previous coding failed to work correctly when we have a
multi-level partitioned hierarchy where tables at successive levels
have different attribute numbers for the partition key attributes. To
fix, have each PartitionDispatch object store a standalone
TupleTableSlot initialized with the TupleDesc of the corresponding
partitioned table, along with a TupleConversionMap to map tuples from
the its parent's rowtype to own rowtype. After tuple routing chooses
a leaf partition, we must use the leaf partition's tuple descriptor,
not the root table's. To that end, a dedicated TupleTableSlot for
tuple routing is now allocated in EState.
Amit Langote
Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own. The children are called
partitions and contain all of the actual data. Each partition has an
implicit partitioning constraint. Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed. Partitions
can't have extra columns and may not allow nulls unless the parent
does. Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.
Currently, tables can be range-partitioned or list-partitioned. List
partitioning is limited to a single column, but range partitioning can
involve multiple columns. A partitioning "column" can be an
expression.
Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations. The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.
Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
Previously, if an INSERT with multiple rows of VALUES had indirection
(array subscripting or field selection) in its target-columns list, the
parser handled that by applying transformAssignedExpr() to each element
of each VALUES row independently. This led to having ArrayRef assignment
nodes or FieldStore nodes in each row of the VALUES RTE. That works for
simple cases, but in bug #14265 Nuri Boardman points out that it fails
if there are multiple assignments to elements/fields of the same target
column. For such cases to work, rewriteTargetListIU() has to nest the
ArrayRefs or FieldStores together to produce a single expression to be
assigned to the column. But it failed to find them in the top-level
targetlist and issued an error about "multiple assignments to same column".
We could possibly fix this by teaching the rewriter to apply
rewriteTargetListIU to each VALUES row separately, but that would be messy
(it would change the output rowtype of the VALUES RTE, for example) and
inefficient. Instead, let's fix the parser so that the VALUES RTE outputs
are just the user-specified values, cast to the right type if necessary,
and then the ArrayRefs or FieldStores are applied in the top-level
targetlist to Vars representing the RTE's outputs. This is the same
parsetree representation already used for similar cases with INSERT/SELECT
syntax, so it allows simplifications in ruleutils.c, which no longer needs
to treat INSERT-from-multiple-VALUES as its own special case.
This implementation works by applying transformAssignedExpr to the VALUES
entries as before, and then stripping off any ArrayRefs or FieldStores it
adds. With lots of VALUES rows it would be noticeably more efficient to
not add those nodes in the first place. But that's just an optimization
not a bug fix, and there doesn't seem to be any good way to do it without
significant refactoring. (A non-invasive answer would be to apply
transformAssignedExpr + stripping to just the first VALUES row, and then
just forcibly cast remaining rows to the same data types exposed in the
first row. But this way would lead to different, not-INSERT-specific
errors being reported in casting failure cases, so it doesn't seem very
nice.) So leave that for later; this patch at least isn't making the
per-row parsing work worse, and it does make the finished parsetree
smaller, saving rewriter and planner work.
Catversion bump because stored rules containing such INSERTs would need
to change. Because of that, no back-patch, even though this is a very
long-standing bug.
Report: <20160727005725.7438.26021@wrigleys.postgresql.org>
Discussion: <9578.1469645245@sss.pgh.pa.us>
comments on one of the optimizer functions a lot more
clear, adds a summary of the recent KSQO discussion to the
comments in the code, adds regression tests for the bug with
sequence state Tom fixed recently and another reg. test, and
removes some PostQuel legacy stuff: ExecAppend -> ExecInsert,
ExecRetrieve -> ExecSelect, etc. This was changed because the
elog() messages from this routine are user-visible, so we
should be using the SQL terms.
Neil Conway
BAD: INSERT INTO tab (col1, col2) VALUES ('val1');
GOOD: INSERT INTO tab (col1, col2) VALUES ('val1', 'val2');
Regress tests against DEFAULT and normal values as they're managed
slightly different.
Rod Taylor