Commit Graph

42 Commits

Author SHA1 Message Date
Alvaro Herrera ec38694894
Move PartitioPruneInfo out of plan nodes into PlannedStmt
The planner will now add a given PartitioPruneInfo to
PlannedStmt.partPruneInfos instead of directly to the
Append/MergeAppend plan node.  What gets set instead in the
latter is an index field which points to the list element
of PlannedStmt.partPruneInfos containing the PartitioPruneInfo
belonging to the plan node.

A later commit will make AcquireExecutorLocks() do the initial
partition pruning to determine a minimal set of partitions to be
locked when validating a plan tree and it will need to consult the
PartitioPruneInfos referenced therein to do so.  It would be better
for the PartitioPruneInfos to be accessible directly than requiring
a walk of the plan tree to find them, which is easier when it can be
done by simply iterating over PlannedStmt.partPruneInfos.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
2022-12-01 12:56:21 +01:00
Alvaro Herrera a90641eac2
Revert "Rewrite some RI code to avoid using SPI"
This reverts commit 99392cdd78.
We'd rather rewrite ri_triggers.c as a whole rather than piecemeal.

Discussion: https://postgr.es/m/E1ncXX2-000mFt-Pe@gemulon.postgresql.org
2022-04-07 23:42:13 +02:00
Alvaro Herrera 99392cdd78
Rewrite some RI code to avoid using SPI
Modify the subroutines called by RI trigger functions that want to check
if a given referenced value exists in the referenced relation to simply
scan the foreign key constraint's unique index, instead of using SPI to
execute
  SELECT 1 FROM referenced_relation WHERE ref_key = $1
This saves a lot of work, especially when inserting into or updating a
referencing relation.

This rewrite allows to fix a PK row visibility bug caused by a partition
descriptor hack which requires ActiveSnapshot to be set to come up with
the correct set of partitions for the RI query running under REPEATABLE
READ isolation.  We now set that snapshot indepedently of the snapshot
to be used by the PK index scan, so the two no longer interfere.  The
buggy output in src/test/isolation/expected/fk-snapshot.out of the
relevant test case added by commit 00cb86e75d has been corrected.
(The bug still exists in branch 14, however, but this fix is too
invasive to backpatch.)

Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Li Japin <japinli@hotmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Discussion: https://postgr.es/m/CA+HiwqGkfJfYdeq5vHPh6eqPKjSbfpDDY+j-kXYFePQedtSLeg@mail.gmail.com
2022-04-07 21:10:03 +02:00
Alvaro Herrera 297daa9d43
Refactor and cleanup runtime partition prune code a little
* Move the execution pruning initialization steps that are common
between both ExecInitAppend() and ExecInitMergeAppend() into a new
function ExecInitPartitionPruning() defined in execPartition.c.
Those steps include creation of a PartitionPruneState to be used for
all instances of pruning and determining the minimal set of child
subplans that need to be initialized by performing initial pruning if
needed, and finally adjusting the subplan_map arrays in the
PartitionPruneState to reflect the new set of subplans remaining
after initial pruning if it was indeed performed.
ExecCreatePartitionPruneState() is no longer exported out of
execPartition.c and has been renamed to CreatePartitionPruneState()
as a local sub-routine of ExecInitPartitionPruning().

* Likewise, ExecFindInitialMatchingSubPlans() that was in charge of
performing initial pruning no longer needs to be exported.  In fact,
since it would now have the same body as the more generally named
ExecFindMatchingSubPlans(), except differing in the value of
initial_prune passed to the common subroutine
find_matching_subplans_recurse(), it seems better to remove it and add
an initial_prune argument to ExecFindMatchingSubPlans().

* Add an ExprContext field to PartitionPruneContext to remove the
implicit assumption in the runtime pruning code that the ExprContext to
use to compute pruning expressions that need one can always rely on the
PlanState providing it.  A future patch will allow runtime pruning (at
least the initial pruning steps) to be performed without the
corresponding PlanState yet having been created, so this will help.

Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqEYCpEqh2LMDOp9mT+4-QoVe8HgFMKBjntEMCTZLpcCCA@mail.gmail.com
2022-04-05 11:46:48 +02:00
Bruce Momjian 27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
Tom Lane c5b7ba4e67 Postpone some stuff out of ExecInitModifyTable.
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:

* Don't open result relations' indexes until needed.

* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.

This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable.  Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.

Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node.  Instead
use the hash added to ModifyTable itself by commit 86dc90056.

To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos.  We formerly set it
only in some, not terribly well-defined, cases.  This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too).  It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.

Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.

Amit Langote, reviewed at different times by Heikki Linnakangas and
myself

Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
2021-04-06 15:57:11 -04:00
Bruce Momjian ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Heikki Linnakangas fb5883da86 Remove PartitionRoutingInfo struct.
The extra indirection neeeded to access its members via its enclosing
ResultRelInfo seems pointless. Move all the fields from
PartitionRoutingInfo to ResultRelInfo.

Author: Amit Langote
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqFViT47Zbr_ASBejiK7iDG8%3DQ1swQ-tjM6caRPQ67pT%3Dw%40mail.gmail.com
2020-10-19 14:42:55 +03:00
Heikki Linnakangas 6973533650 Revise child-to-root tuple conversion map management.
Store the tuple conversion map to convert a tuple from a child table's
format to the root format in a new ri_ChildToRootMap field in
ResultRelInfo. It is initialized if transition tuple capture for FOR
STATEMENT triggers or INSERT tuple routing on a partitioned table is
needed. Previously, ModifyTable kept the maps in the per-subplan
ModifyTableState->mt_per_subplan_tupconv_maps array, or when tuple
routing was used, in
ResultRelInfo->ri_Partitioninfo->pi_PartitionToRootMap. The new field
replaces both of those.

Now that the child-to-root tuple conversion map is always available in
ResultRelInfo (when needed), remove the TransitionCaptureState.tcs_map
field. The callers of Exec*Trigger() functions no longer need to set or
save it, which is much less confusing and bug-prone. Also, as a future
optimization, this will allow us to delay creating the map for a given
result relation until the relation is actually processed during
execution.

Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqHtCWLdK-LO%3DNEsvOdHx%2B7yv4mE_zYK0i3BH7dXb-wxog%40mail.gmail.com
2020-10-19 14:42:55 +03:00
Bruce Momjian 7559d8ebfa Update copyrights for 2020
Backpatch-through: update all files in master, backpatch legal files through 9.4
2020-01-01 12:21:45 -05:00
Tom Lane 8255c7a5ee Phase 2 pgindent run for v12.
Switch to 2.1 version of pg_bsd_indent.  This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.

Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
2019-05-22 13:04:48 -04:00
Tom Lane 6630ccad7a Restructure creation of run-time pruning steps.
Previously, gen_partprune_steps() always built executor pruning steps
using all suitable clauses, including those containing PARAM_EXEC
Params.  This meant that the pruning steps were only completely safe
for executor run-time (scan start) pruning.  To prune at executor
startup, we had to ignore the steps involving exec Params.  But this
doesn't really work in general, since there may be logic changes
needed as well --- for example, pruning according to the last operator's
btree strategy is the wrong thing if we're not applying that operator.
The rules embodied in gen_partprune_steps() and its minions are
sufficiently complicated that tracking their incremental effects in
other logic seems quite impractical.

Short of a complete redesign, the only safe fix seems to be to run
gen_partprune_steps() twice, once to create executor startup pruning
steps and then again for run-time pruning steps.  We can save a few
cycles however by noting during the first scan whether we rejected
any clauses because they involved exec Params --- if not, we don't
need to do the second scan.

In support of this, refactor the internal APIs in partprune.c to make
more use of passing information in the GeneratePruningStepsContext
struct, rather than as separate arguments.

This is, I hope, the last piece of our response to a bug report from
Alan Jackson.  Back-patch to v11 where this code came in.

Discussion: https://postgr.es/m/FAD28A83-AC73-489E-A058-2681FA31D648@tvsquared.com
2019-05-17 19:44:34 -04:00
Robert Haas 898e5e3290 Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
We still require AccessExclusiveLock on the partition itself, because
otherwise an insert that violates the newly-imposed partition
constraint could be in progress at the same time that we're changing
that constraint; only the lock level on the parent relation is
weakened.

To make this safe, we have to cope with (at least) three separate
problems. First, relevant DDL might commit while we're in the process
of building a PartitionDesc.  If so, find_inheritance_children() might
see a new partition while the RELOID system cache still has the old
partition bound cached, and even before invalidation messages have
been queued.  To fix that, if we see that the pg_class tuple seems to
be missing or to have a null relpartbound, refetch the value directly
from the table. We can't get the wrong value, because DETACH PARTITION
still requires AccessExclusiveLock throughout; if we ever want to
change that, this will need more thought. In testing, I found it quite
difficult to hit even the null-relpartbound case; the race condition
is extremely tight, but the theoretical risk is there.

Second, successive calls to RelationGetPartitionDesc might not return
the same answer.  The query planner will get confused if lookup up the
PartitionDesc for a particular relation does not return a consistent
answer for the entire duration of query planning.  Likewise, query
execution will get confused if the same relation seems to have a
different PartitionDesc at different times.  Invent a new
PartitionDirectory concept and use it to ensure consistency.  This
ensures that a single invocation of either the planner or the executor
sees the same view of the PartitionDesc from beginning to end, but it
does not guarantee that the planner and the executor see the same
view.  Since this allows pointers to old PartitionDesc entries to
survive even after a relcache rebuild, also postpone removing the old
PartitionDesc entry until we're certain no one is using it.

For the most part, it seems to be OK for the planner and executor to
have different views of the PartitionDesc, because the executor will
just ignore any concurrently added partitions which were unknown at
plan time; those partitions won't be part of the inheritance
expansion, but invalidation messages will trigger replanning at some
point.  Normally, this happens by the time the very next command is
executed, but if the next command acquires no locks and executes a
prepared query, it can manage not to notice until a new transaction is
started.  We might want to tighten that up, but it's material for a
separate patch.  There would still be a small window where a query
that started just after an ATTACH PARTITION command committed might
fail to notice its results -- but only if the command starts before
the commit has been acknowledged to the user. All in all, the warts
here around serializability seem small enough to be worth accepting
for the considerable advantage of being able to add partitions without
a full table lock.

Although in general the consequences of new partitions showing up
between planning and execution are limited to the query not noticing
the new partitions, run-time partition pruning will get confused in
that case, so that's the third problem that this patch fixes.
Run-time partition pruning assumes that indexes into the PartitionDesc
are stable between planning and execution.  So, add code so that if
new partitions are added between plan time and execution time, the
indexes stored in the subplan_map[] and subpart_map[] arrays within
the plan's PartitionedRelPruneInfo get adjusted accordingly.  There
does not seem to be a simple way to generalize this scheme to cope
with partitions that are removed, mostly because they could then get
added back again with different bounds, but it works OK for added
partitions.

This code does not try to ensure that every backend participating in
a parallel query sees the same view of the PartitionDesc.  That
currently doesn't matter, because we never pass PartitionDesc
indexes between backends.  Each backend will ignore the concurrently
added partitions which it notices, and it doesn't matter if different
backends are ignoring different sets of concurrently added partitions.
If in the future that matters, for example because we allow writes in
parallel query and want all participants to do tuple routing to the same
set of partitions, the PartitionDirectory concept could be improved to
share PartitionDescs across backends.  There is a draft patch to
serialize and restore PartitionDescs on the thread where this patch
was discussed, which may be a useful place to start.

Patch by me.  Thanks to Alvaro Herrera, David Rowley, Simon Riggs,
Amit Langote, and Michael Paquier for discussion, and to Alvaro
Herrera for some review.

Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
2019-03-07 11:13:12 -05:00
Bruce Momjian 97c39498e5 Update copyright for 2019
Backpatch-through: certain files through 9.4
2019-01-02 12:44:25 -05:00
Alvaro Herrera 3f2393edef Redesign initialization of partition routing structures
This speeds up write operations (INSERT, UPDATE, DELETE, COPY, as well
as the future MERGE) on partitioned tables.

This changes the setup for tuple routing so that it does far less work
during the initial setup and pushes more work out to when partitions
receive tuples.  PartitionDispatchData structs for sub-partitioned
tables are only created when a tuple gets routed through it.  The
possibly large arrays in the PartitionTupleRouting struct have largely
been removed.  The partitions[] array remains but now never contains any
NULL gaps.  Previously the NULLs had to be skipped during
ExecCleanupTupleRouting(), which could add a large overhead to the
cleanup when the number of partitions was large.  The partitions[] array
is allocated small to start with and only enlarged when we route tuples
to enough partitions that it runs out of space. This allows us to keep
simple single-row partition INSERTs running quickly.  Redesign

The arrays in PartitionTupleRouting which stored the tuple translation maps
have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing also rename the tuple translation maps from being ParentToChild
and ChildToParent to being RootToPartition and PartitionToRoot. The old
names mislead you into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.

Authors: David Rowley and Amit Langote, heavily revised by Álvaro Herrera
Testing help from Jesper Pedersen and Kato Sho.
Discussion: https://postgr.es/m/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
2018-11-16 15:01:05 -03:00
Tom Lane 9ddef36278 Centralize executor's opening/closing of Relations for rangetable entries.
Create an array estate->es_relations[] paralleling the es_range_table,
and store references to Relations (relcache entries) there, so that any
given RT entry is opened and closed just once per executor run.  Scan
nodes typically still call ExecOpenScanRelation, but ExecCloseScanRelation
is no more; relation closing is now done centrally in ExecEndPlan.

This is slightly more complex than one would expect because of the
interactions with relcache references held in ResultRelInfo nodes.
The general convention is now that ResultRelInfo->ri_RelationDesc does
not represent a separate relcache reference and so does not need to be
explicitly closed; but there is an exception for ResultRelInfos in the
es_trig_target_relations list, which are manufactured by
ExecGetTriggerResultRel and have to be cleaned up by
ExecCleanUpTriggerState.  (That much was true all along, but these
ResultRelInfos are now more different from others than they used to be.)

To allow the partition pruning logic to make use of es_relations[] rather
than having its own relcache references, adjust PartitionedRelPruneInfo
to store an RT index rather than a relation OID.

Amit Langote, reviewed by David Rowley and Jesper Pedersen,
some mods by me

Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
2018-10-04 14:03:42 -04:00
Andres Freund cc2905e963 Use slots more widely in tuple mapping code and make naming more consistent.
It's inefficient to use a single slot for mapping between tuple
descriptors for multiple tuples, as previously done when using
ConvertPartitionTupleSlot(), as that means the slot's tuple descriptors
change for every tuple.

Previously we also, via ConvertPartitionTupleSlot(), built new tuples
after the mapping even in cases where we, immediately afterwards,
access individual columns again.

Refactor the code so one slot, on demand, is used for each
partition. That avoids having to change the descriptor (and allows to
use the more efficient "fixed" tuple slots). Then use slot->slot
mapping, to avoid unnecessarily forming a tuple.

As the naming between the tuple and slot mapping functions wasn't
consistent, rename them to execute_attr_map_{tuple,slot}.  It's likely
that we'll also rename convert_tuples_by_* to denote that these
functions "only" build a map, but that's left for later.

Author: Amit Khandekar and Amit Langote, editorialized by me
Reviewed-By: Amit Langote, Amit Khandekar, Andres Freund
Discussion:
    https://postgr.es/m/CAJ3gD9fR0wRNeAE8VqffNTyONS_UfFPRpqxhnD9Q42vZB+Jvpg@mail.gmail.com
    https://postgr.es/m/e4f9d743-cd4b-efb0-7574-da21d86a7f36%40lab.ntt.co.jp
Backpatch: -
2018-10-02 11:14:26 -07:00
Alvaro Herrera 6b78231d91 Move PartitionDispatchData struct definition to execPartition.c
There's no reason to expose the struct definition, so don't.

Author: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
Discussion: https://postgr.es/m/d3fa24c1-bc65-7133-81df-6474387ccc4f@lab.ntt.co.jp
2018-09-14 19:06:57 -03:00
Tom Lane 1c2cb2744b Fix run-time partition pruning for appends with multiple source rels.
The previous coding here supposed that if run-time partitioning applied to
a particular Append/MergeAppend plan, then all child plans of that node
must be members of a single partitioning hierarchy.  This is totally wrong,
since an Append could be formed from a UNION ALL: we could have multiple
hierarchies sharing the same Append, or child plans that aren't part of any
hierarchy.

To fix, restructure the related plan-time and execution-time data
structures so that we can have a separate list or array for each
partitioning hierarchy.  Also track subplans that are not part of any
hierarchy, and make sure they don't get pruned.

Per reports from Phil Florent and others.  Back-patch to v11, since
the bug originated there.

David Rowley, with a lot of cosmetic adjustments by me; thanks also
to Amit Langote for review.

Discussion: https://postgr.es/m/HE1PR03MB17068BB27404C90B5B788BCABA7B0@HE1PR03MB1706.eurprd03.prod.outlook.com
2018-08-01 19:42:52 -04:00
Peter Eisentraut 0d5f05cde0 Allow multi-inserts during COPY into a partitioned table
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables.  The reason for this appeared
to be that the tuple may not belong to the same partition as the
previous tuple did.  Not allowing multi-inserts here greatly slowed down
imports into partitioned tables.  These could take twice as long as a
copy to an equivalent non-partitioned table.  It seems wise to do
something about this, so this change allows the multi-inserts by
flushing the so-far inserted tuples to the partition when the next tuple
does not belong to the same partition, or when the buffer fills.  This
improves performance when the next tuple in the stream commonly belongs
to the same partition as the previous tuple.

In cases where the target partition changes on every tuple, using
multi-inserts slightly slows the performance.  To get around this we
track the average size of the batches that have been inserted and
adaptively enable or disable multi-inserts based on the size of the
batch.  Some testing was done and the regression only seems to exist
when the average size of the insert batch is close to 1, so let's just
enable multi-inserts when the average size is at least 1.3.  More
performance testing might reveal a better number for, this, but since
the slowdown was only 1-2% it does not seem critical enough to spend too
much time calculating it.  In any case it may depend on other factors
rather than just the size of the batch.

Allowing multi-inserts for partitions required a bit of work around the
per-tuple memory contexts as we must flush the tuples when the next
tuple does not belong the same partition.  In which case there is no
good time to reset the per-tuple context, as we've already built the new
tuple by this time.  In order to work around this we maintain two
per-tuple contexts and just switch between them every time the partition
changes and reset the old one.  This does mean that the first of each
batch of tuples is not allocated in the same memory context as the
others, but that does not matter since we only reset the context once
the previous batch has been inserted.

Author: David Rowley <david.rowley@2ndquadrant.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
2018-08-01 10:23:09 +02:00
Tom Lane e23bae82cf Fix up run-time partition pruning's use of relcache's partition data.
The previous coding saved pointers into the partitioned table's relcache
entry, but then closed the relcache entry, causing those pointers to
nominally become dangling.  Actual trouble would be seen in the field
only if a relcache flush occurred mid-query, but that's hardly out of
the question.

While we could fix this by copying all the data in question at query
start, it seems better to just hold the relcache entry open for the
whole query.

While at it, improve the handling of support-function lookups: do that
once per query not once per pruning test.  There's still something to be
desired here, in that we fail to exploit the possibility of caching data
across queries in the fn_extra fields of the relcache's FmgrInfo structs,
which could happen if we just used those structs in-place rather than
copying them.  However, combining that with the possibility of per-query
lookups of cross-type comparison functions seems to require changes in the
APIs of a lot of the pruning support functions, so it's too invasive to
consider as part of this patch.  A win would ensue only for complex
partition key data types (e.g. arrays), so it may not be worth the
trouble.

David Rowley and Tom Lane

Discussion: https://postgr.es/m/17850.1528755844@sss.pgh.pa.us
2018-06-13 12:03:26 -04:00
Tom Lane 4e23236403 Improve commentary about run-time partition pruning data structures.
No code changes except for a couple of new Asserts.

David Rowley and Tom Lane

Discussion: https://postgr.es/m/CAKJS1f-6GODRNgEtdPxCnAPme2h2hTztB6LmtfdmcYAAOE0kQg@mail.gmail.com
2018-06-11 17:35:53 -04:00
Tom Lane 321f648a31 Assorted cosmetic cleanup of run-time-partition-pruning code.
Use "subplan" rather than "subnode" to refer to the child plans of
a partitioning Append; this seems a bit more specific and hence
clearer.  Improve assorted comments.  No non-cosmetic changes.

David Rowley and Tom Lane

Discussion: https://postgr.es/m/CAFj8pRBjrufA3ocDm8o4LPGNye9Y+pm1b9kCwode4X04CULG3g@mail.gmail.com
2018-06-10 18:24:34 -04:00
Tom Lane 73b7f48f78 Improve run-time partition pruning to handle any stable expression.
The initial coding of the run-time-pruning feature only coped with cases
where the partition key(s) are compared to Params.  That is a bit silly;
we can allow it to work with any non-Var-containing stable expression, as
long as we take special care with expressions containing PARAM_EXEC Params.
The code is hardly any longer this way, and it's considerably clearer
(IMO at least).  Per gripe from Pavel Stehule.

David Rowley, whacked around a bit by me

Discussion: https://postgr.es/m/CAFj8pRBjrufA3ocDm8o4LPGNye9Y+pm1b9kCwode4X04CULG3g@mail.gmail.com
2018-06-10 15:22:32 -04:00
Bruce Momjian 7551d9bc40 C comment: add description of root_tuple_slot
Reported-by: Amit Langote

Discussion: https://postgr.es/m/d2e6674c-c5c6-fe89-1d0b-3534b9db0476@lab.ntt.co.jp
2018-04-26 14:55:03 -04:00
Tom Lane bdf46af748 Post-feature-freeze pgindent run.
Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
2018-04-26 14:47:16 -04:00
Alvaro Herrera da6f3e45dd Reorganize partitioning code
There's been a massive addition of partitioning code in PostgreSQL 11,
with little oversight on its placement, resulting in a
catalog/partition.c with poorly defined boundaries and responsibilities.
This commit tries to set a couple of distinct modules to separate things
a little bit.  There are no code changes here, only code movement.

There are three new files:
  src/backend/utils/cache/partcache.c
  src/include/partitioning/partdefs.h
  src/include/utils/partcache.h

The previous arrangement of #including catalog/partition.h almost
everywhere is no more.

Authors: Amit Langote and Álvaro Herrera
Discussion: https://postgr.es/m/98e8d509-790a-128c-be7f-e48a5b2d8d97@lab.ntt.co.jp
	https://postgr.es/m/11aa0c50-316b-18bb-722d-c23814f39059@lab.ntt.co.jp
	https://postgr.es/m/143ed9a4-6038-76d4-9a55-502035815e68@lab.ntt.co.jp
	https://postgr.es/m/20180413193503.nynq7bnmgh6vs5vm@alvherre.pgsql
2018-04-14 21:12:14 -03:00
Simon Riggs 08ea7a2291 Revert MERGE patch
This reverts commits d204ef6377,
83454e3c2b and a few more commits thereafter
(complete list at the end) related to MERGE feature.

While the feature was fully functional, with sufficient test coverage and
necessary documentation, it was felt that some parts of the executor and
parse-analyzer can use a different design and it wasn't possible to do that in
the available time. So it was decided to revert the patch for PG11 and retry
again in the future.

Thanks again to all reviewers and bug reporters.

List of commits reverted, in reverse chronological order:

 f1464c5380 Improve parse representation for MERGE
 ddb4158579 MERGE syntax diagram correction
 530e69e59b Allow cpluspluscheck to pass by renaming variable
 01b88b4df5 MERGE minor errata
 3af7b2b0d4 MERGE fix variable warning in non-assert builds
 a5d86181ec MERGE INSERT allows only one VALUES clause
 4b2d44031f MERGE post-commit review
 4923550c20 Tab completion for MERGE
 aa3faa3c7a WITH support in MERGE
 83454e3c2b New files for MERGE
 d204ef6377 MERGE SQL Command following SQL:2016

Author: Pavan Deolasee
Reviewed-by: Michael Paquier
2018-04-12 11:22:56 +01:00
Alvaro Herrera 499be013de Support partition pruning at execution time
Existing partition pruning is only able to work at plan time, for query
quals that appear in the parsed query.  This is good but limiting, as
there can be parameters that appear later that can be usefully used to
further prune partitions.

This commit adds support for pruning subnodes of Append which cannot
possibly contain any matching tuples, during execution, by evaluating
Params to determine the minimum set of subnodes that can possibly match.
We support more than just simple Params in WHERE clauses. Support
additionally includes:

1. Parameterized Nested Loop Joins: The parameter from the outer side of the
   join can be used to determine the minimum set of inner side partitions to
   scan.

2. Initplans: Once an initplan has been executed we can then determine which
   partitions match the value from the initplan.

Partition pruning is performed in two ways.  When Params external to the plan
are found to match the partition key we attempt to prune away unneeded Append
subplans during the initialization of the executor.  This allows us to bypass
the initialization of non-matching subplans meaning they won't appear in the
EXPLAIN or EXPLAIN ANALYZE output.

For parameters whose value is only known during the actual execution
then the pruning of these subplans must wait.  Subplans which are
eliminated during this stage of pruning are still visible in the EXPLAIN
output.  In order to determine if pruning has actually taken place, the
EXPLAIN ANALYZE must be viewed.  If a certain Append subplan was never
executed due to the elimination of the partition then the execution
timing area will state "(never executed)".  Whereas, if, for example in
the case of parameterized nested loops, the number of loops stated in
the EXPLAIN ANALYZE output for certain subplans may appear lower than
others due to the subplan having been scanned fewer times.  This is due
to the list of matching subnodes having to be evaluated whenever a
parameter which was found to match the partition key changes.

This commit required some additional infrastructure that permits the
building of a data structure which is able to perform the translation of
the matching partition IDs, as returned by get_matching_partitions, into
the list index of a subpaths list, as exist in node types such as
Append, MergeAppend and ModifyTable.  This allows us to translate a list
of clauses into a Bitmapset of all the subpath indexes which must be
included to satisfy the clause list.

Author: David Rowley, based on an earlier effort by Beena Emerson
Reviewers: Amit Langote, Robert Haas, Amul Sul, Rajkumar Raghuwanshi,
Jesper Pedersen
Discussion: https://postgr.es/m/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com
2018-04-07 17:54:39 -03:00
Robert Haas 3d956d9562 Allow insert and update tuple routing and COPY for foreign tables.
Also enable this for postgres_fdw.

Etsuro Fujita, based on an earlier patch by Amit Langote. The larger
patch series of which this is a part has been reviewed by Amit
Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost,
and me.  Minor documentation changes to the final version by me.

Discussion: http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f2b9@lab.ntt.co.jp
2018-04-06 19:22:03 -04:00
Simon Riggs d204ef6377 MERGE SQL Command following SQL:2016
MERGE performs actions that modify rows in the target table
using a source table or query. MERGE provides a single SQL
statement that can conditionally INSERT/UPDATE/DELETE rows
a task that would other require multiple PL statements.
e.g.

MERGE INTO target AS t
USING source AS s
ON t.tid = s.sid
WHEN MATCHED AND t.balance > s.delta THEN
  UPDATE SET balance = t.balance - s.delta
WHEN MATCHED THEN
  DELETE
WHEN NOT MATCHED AND s.delta > 0 THEN
  INSERT VALUES (s.sid, s.delta)
WHEN NOT MATCHED THEN
  DO NOTHING;

MERGE works with regular and partitioned tables, including
column and row security enforcement, as well as support for
row, statement and transition triggers.

MERGE is optimized for OLTP and is parameterizable, though
also useful for large scale ETL/ELT. MERGE is not intended
to be used in preference to existing single SQL commands
for INSERT, UPDATE or DELETE since there is some overhead.
MERGE can be used statically from PL/pgSQL.

MERGE does not yet support inheritance, write rules,
RETURNING clauses, updatable views or foreign tables.
MERGE follows SQL Standard per the most recent SQL:2016.

Includes full tests and documentation, including full
isolation tests to demonstrate the concurrent behavior.

This version written from scratch in 2017 by Simon Riggs,
using docs and tests originally written in 2009. Later work
from Pavan Deolasee has been both complex and deep, leaving
the lead author credit now in his hands.
Extensive discussion of concurrency from Peter Geoghegan,
with thanks for the time and effort contributed.

Various issues reported via sqlsmith by Andreas Seltenreich

Authors: Pavan Deolasee, Simon Riggs
Reviewer: Peter Geoghegan, Amit Langote, Tomas Vondra, Simon Riggs

Discussion:
https://postgr.es/m/CANP8+jKitBSrB7oTgT9CY2i1ObfOt36z0XMraQc+Xrz8QB0nXA@mail.gmail.com
https://postgr.es/m/CAH2-WzkJdBuxj9PO=2QaO9-3h3xGbQPZ34kJH=HukRekwM-GZg@mail.gmail.com
2018-04-03 09:28:16 +01:00
Simon Riggs 7cf8a5c302 Revert "Modified files for MERGE"
This reverts commit 354f13855e.
2018-04-02 21:34:15 +01:00
Simon Riggs 354f13855e Modified files for MERGE 2018-04-02 21:12:47 +01:00
Alvaro Herrera 364de25665 Update PartitionTupleRouting struct comment
Small review on edd44738bc.

Discussion: https://postgr.es/m/20180222165315.k27qfn4goskhoswj@alvherre.pgsql
Reviewed-by: Robert Haas, Amit Langote
2018-02-26 17:05:46 -03:00
Robert Haas edd44738bc Be lazier about partition tuple routing.
It's not necessary to fully initialize the executor data structures
for partitions to which no tuples are ever routed.  Consider, for
example, an INSERT statement that inserts only one row: it only cares
about the partition to which that one row is routed.  The new function
ExecInitPartitionInfo performs the initialization in question only
when a particular partition is about to receive a tuple. This includes
creating, validating, and saving a pointer to the ResultRelInfo,
setting up for speculative insertions, translating WCOs and
initializing the resulting expressions, translating returning lists
and building the appropriate projection information, and setting up a
tuple conversion map.

One thing that's not deferred is locking the child partitions; that
seems desirable but would need more thought.  Still, testing shows
that this makes single-row inserts significantly faster on a table
with many partitions without harming the bulk-insert case.

Amit Langote, reviewed by Etsuro Fujita, with a few changes by me

Discussion: http://postgr.es/m/8975331d-d961-cbdd-f862-fdd3d97dc2d0@lab.ntt.co.jp
2018-02-22 10:55:54 -05:00
Robert Haas 945f71db84 Avoid referencing off the end of subplan_partition_offsets.
Report by buildfarm member skink and Tom Lane.  Analysis by me.
Patch by Amit Khandekar.

Discussion: http://postgr.es/m/CAJ3gD9fVA1iXQYhfqHP5n_TEd4U9=V8TL_cc-oKRnRmxgdvJrQ@mail.gmail.com
2018-01-24 16:34:51 -05:00
Robert Haas 2f17844104 Allow UPDATE to move rows between partitions.
When an UPDATE causes a row to no longer match the partition
constraint, try to move it to a different partition where it does
match the partition constraint.  In essence, the UPDATE is split into
a DELETE from the old partition and an INSERT into the new one.  This
can lead to surprising behavior in concurrency scenarios because
EvalPlanQual rechecks won't work as they normally did; the known
problems are documented.  (There is a pending patch to improve the
situation further, but it needs more review.)

Amit Khandekar, reviewed and tested by Amit Langote, David Rowley,
Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro
Herrera, Amit Kapila, and me.  A few final revisions by me.

Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
2018-01-19 15:33:06 -05:00
Robert Haas cc6337d2fe Simplify and encapsulate tuple routing support code.
Instead of having ExecSetupPartitionTupleRouting return multiple out
parameters, have it return a pointer to a structure containing all of
those different things.  Also, provide and use a cleanup function,
ExecCleanupTupleRouting, instead of cleaning up all of the resources
allocated by ExecSetupPartitionTupleRouting individually.

Amit Khandekar, reviewed by Amit Langote, David Rowley, and me

Discussion: http://postgr.es/m/CAJ3gD9fWfxgKC+PfJZF3hkgAcNOy-LpfPxVYitDEXKHjeieWQQ@mail.gmail.com
2018-01-04 15:48:15 -05:00
Bruce Momjian 9d4649ca49 Update copyright for 2018
Backpatch-through: certain files through 9.3
2018-01-02 23:30:12 -05:00
Robert Haas 87c37e3291 Re-allow INSERT .. ON CONFLICT DO NOTHING on partitioned tables.
Commit 8355a011a0 was reverted in
f05230752d, but this attempt is
hopefully better-considered: we now pass the correct value to
ExecOpenIndices, which should avoid the crash that we hit before.

Amit Langote, reviewed by Simon Riggs and by me.  Some final
editing by me.

Discussion: http://postgr.es/m/7ff1e8ec-dc39-96b1-7f47-ff5965dceeac@lab.ntt.co.jp
2017-12-01 12:53:21 -05:00
Robert Haas eaedf0df71 Update typedefs.list and re-run pgindent
Discussion: http://postgr.es/m/CA+TgmoaA9=1RWKtBWpDaj+sF3Stgc8sHgf5z=KGtbjwPLQVDMA@mail.gmail.com
2017-11-29 09:24:24 -05:00
Robert Haas 4e5fe9ad19 Centralize executor-related partitioning code.
Some code is moved from partition.c, which has grown very quickly lately;
splitting the executor parts out might help to keep it from getting
totally out of control.  Other code is moved from execMain.c.  All is
moved to a new file execPartition.c.  get_partition_for_tuple now has
a new interface that more clearly separates executor concerns from
generic concerns.

Amit Langote.  A slight comment tweak by me.

Discussion: http://postgr.es/m/1f0985f8-3b61-8bc4-4350-baa6d804cb6d@lab.ntt.co.jp
2017-11-15 10:26:25 -05:00