2010-09-20 22:08:53 +02:00
|
|
|
src/backend/optimizer/README
|
2008-03-20 18:55:15 +01:00
|
|
|
|
|
|
|
Optimizer
|
2008-03-21 14:23:29 +01:00
|
|
|
=========
|
1999-02-09 04:51:42 +01:00
|
|
|
|
2000-02-07 05:41:04 +01:00
|
|
|
These directories take the Query structure returned by the parser, and
|
|
|
|
generate a plan used by the executor. The /plan directory generates the
|
|
|
|
actual output plan, the /path code generates all possible ways to join the
|
2000-11-12 01:37:02 +01:00
|
|
|
tables, and /prep handles various preprocessing steps for special cases.
|
|
|
|
/util is utility stuff. /geqo is the separate "genetic optimization" planner
|
|
|
|
--- it does a semi-random search through the join tree space, rather than
|
|
|
|
exhaustively considering all possible join trees. (But each join considered
|
|
|
|
by /geqo is given to /path to create paths for, so we consider all possible
|
2000-09-29 20:21:41 +02:00
|
|
|
implementation paths for each specific join pair even in GEQO mode.)
|
|
|
|
|
|
|
|
|
|
|
|
Paths and Join Pairs
|
|
|
|
--------------------
|
|
|
|
|
|
|
|
During the planning/optimizing process, we build "Path" trees representing
|
|
|
|
the different ways of doing a query. We select the cheapest Path that
|
|
|
|
generates the desired relation and turn it into a Plan to pass to the
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
executor. (There is pretty nearly a one-to-one correspondence between the
|
2000-09-29 20:21:41 +02:00
|
|
|
Path and Plan trees, but Path nodes omit info that won't be needed during
|
|
|
|
planning, and include info needed for planning that won't be needed by the
|
|
|
|
executor.)
|
|
|
|
|
|
|
|
The optimizer builds a RelOptInfo structure for each base relation used in
|
|
|
|
the query. Base rels are either primitive tables, or subquery subselects
|
|
|
|
that are planned via a separate recursive invocation of the planner. A
|
|
|
|
RelOptInfo is also built for each join relation that is considered during
|
|
|
|
planning. A join rel is simply a combination of base rels. There is only
|
|
|
|
one join RelOptInfo for any given set of baserels --- for example, the join
|
|
|
|
{A B C} is represented by the same RelOptInfo no matter whether we build it
|
|
|
|
by joining A and B first and then adding C, or joining B and C first and
|
|
|
|
then adding A, etc. These different means of building the joinrel are
|
|
|
|
represented as Paths. For each RelOptInfo we build a list of Paths that
|
|
|
|
represent plausible ways to implement the scan or join of that relation.
|
|
|
|
Once we've considered all the plausible Paths for a rel, we select the one
|
|
|
|
that is cheapest according to the planner's cost estimates. The final plan
|
|
|
|
is derived from the cheapest Path for the RelOptInfo that includes all the
|
|
|
|
base rels of the query.
|
|
|
|
|
|
|
|
Possible Paths for a primitive table relation include plain old sequential
|
2005-12-20 03:30:36 +01:00
|
|
|
scan, plus index scans for any indexes that exist on the table, plus bitmap
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
index scans using one or more indexes. Specialized RTE types, such as
|
|
|
|
function RTEs, may have only one possible Path.
|
2000-09-29 20:21:41 +02:00
|
|
|
|
|
|
|
Joins always occur using two RelOptInfos. One is outer, the other inner.
|
|
|
|
Outers drive lookups of values in the inner. In a nested loop, lookups of
|
|
|
|
values in the inner occur by scanning the inner path once per outer tuple
|
|
|
|
to find each matching inner row. In a mergejoin, inner and outer rows are
|
|
|
|
ordered, and are accessed in order, so only one scan is required to perform
|
|
|
|
the entire join: both inner and outer paths are scanned in-sync. (There's
|
|
|
|
not a lot of difference between inner and outer in a mergejoin...) In a
|
|
|
|
hashjoin, the inner is scanned first and all its rows are entered in a
|
|
|
|
hashtable, then the outer is scanned and for each row we lookup the join
|
|
|
|
key in the hashtable.
|
|
|
|
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
A Path for a join relation is actually a tree structure, with the topmost
|
|
|
|
Path node representing the last-applied join method. It has left and right
|
|
|
|
subpaths that represent the scan or join methods used for the two input
|
|
|
|
relations.
|
2000-02-07 05:41:04 +01:00
|
|
|
|
|
|
|
|
|
|
|
Join Tree Construction
|
|
|
|
----------------------
|
|
|
|
|
1999-08-16 04:17:58 +02:00
|
|
|
The optimizer generates optimal query plans by doing a more-or-less
|
2000-09-29 20:21:41 +02:00
|
|
|
exhaustive search through the ways of executing the query. The best Path
|
|
|
|
tree is found by a recursive process:
|
1999-08-16 04:17:58 +02:00
|
|
|
|
|
|
|
1) Take each base relation in the query, and make a RelOptInfo structure
|
|
|
|
for it. Find each potentially useful way of accessing the relation,
|
2008-04-09 03:00:46 +02:00
|
|
|
including sequential and index scans, and make Paths representing those
|
|
|
|
ways. All the Paths made for a given relation are placed in its
|
1999-08-16 04:17:58 +02:00
|
|
|
RelOptInfo.pathlist. (Actually, we discard Paths that are obviously
|
|
|
|
inferior alternatives before they ever get into the pathlist --- what
|
|
|
|
ends up in the pathlist is the cheapest way of generating each potentially
|
2012-01-28 01:26:38 +01:00
|
|
|
useful sort ordering and parameterization of the relation.) Also create a
|
|
|
|
RelOptInfo.joininfo list including all the join clauses that involve this
|
|
|
|
relation. For example, the WHERE clause "tab1.col1 = tab2.col1" generates
|
|
|
|
entries in both tab1 and tab2's joininfo lists.
|
1999-08-16 04:17:58 +02:00
|
|
|
|
2001-01-17 07:41:31 +01:00
|
|
|
If we have only a single base relation in the query, we are done.
|
2000-02-07 05:41:04 +01:00
|
|
|
Otherwise we have to figure out how to join the base relations into a
|
|
|
|
single join relation.
|
|
|
|
|
2005-12-20 03:30:36 +01:00
|
|
|
2) Normally, any explicit JOIN clauses are "flattened" so that we just
|
|
|
|
have a list of relations to join. However, FULL OUTER JOIN clauses are
|
|
|
|
never flattened, and other kinds of JOIN might not be either, if the
|
|
|
|
flattening process is stopped by join_collapse_limit or from_collapse_limit
|
|
|
|
restrictions. Therefore, we end up with a planning problem that contains
|
2007-01-20 21:45:41 +01:00
|
|
|
lists of relations to be joined in any order, where any individual item
|
|
|
|
might be a sub-list that has to be joined together before we can consider
|
|
|
|
joining it to its siblings. We process these sub-problems recursively,
|
|
|
|
bottom up. Note that the join list structure constrains the possible join
|
|
|
|
orders, but it doesn't constrain the join implementation method at each
|
|
|
|
join (nestloop, merge, hash), nor does it say which rel is considered outer
|
|
|
|
or inner at each join. We consider all these possibilities in building
|
|
|
|
Paths. We generate a Path for each feasible join method, and select the
|
|
|
|
cheapest Path.
|
|
|
|
|
|
|
|
For each planning problem, therefore, we will have a list of relations
|
|
|
|
that are either base rels or joinrels constructed per sub-join-lists.
|
|
|
|
We can join these rels together in any order the planner sees fit.
|
2000-09-29 20:21:41 +02:00
|
|
|
The standard (non-GEQO) planner does this as follows:
|
|
|
|
|
Restructure code that is responsible for ensuring that clauseless joins are
considered when it is necessary to do so because of a join-order restriction
(that is, an outer-join or IN-subselect construct). The former coding was a
bit ad-hoc and inconsistent, and it missed some cases, as exposed by Mario
Weilguni's recent bug report. His specific problem was that an IN could be
turned into a "clauseless" join due to constant-propagation removing the IN's
joinclause, and if the IN's subselect involved more than one relation and
there was more than one such IN linking to the same upper relation, then the
only valid join orders involve "bushy" plans but we would fail to consider the
specific paths needed to get there. (See the example case added to the join
regression test.) On examining the code I wonder if there weren't some other
problem cases too; in particular it seems that GEQO was defending against a
different set of corner cases than the main planner was. There was also an
efficiency problem, in that when we did realize we needed a clauseless join
because of an IN, we'd consider clauseless joins against every other relation
whether this was sensible or not. It seems a better design is to use the
outer-join and in-clause lists as a backup heuristic, just as the rule of
joining only where there are joinclauses is a heuristic: we'll join two
relations if they have a usable joinclause *or* this might be necessary to
satisfy an outer-join or IN-clause join order restriction. I refactored the
code to have just one place considering this instead of three, and made sure
that it covered all the cases that any of them had been considering.
Backpatch as far as 8.1 (which has only the IN-clause form of the disease).
By rights 8.0 and 7.4 should have the bug too, but they accidentally fail
to fail, because the joininfo structure used in those releases preserves some
memory of there having once been a joinclause between the inner and outer
sides of an IN, and so it leads the code in the right direction anyway.
I'll be conservative and not touch them.
2007-02-16 01:14:01 +01:00
|
|
|
Consider joining each RelOptInfo to each other RelOptInfo for which there
|
|
|
|
is a usable joinclause, and generate a Path for each possible join method
|
|
|
|
for each such pair. (If we have a RelOptInfo with no join clauses, we have
|
|
|
|
no choice but to generate a clauseless Cartesian-product join; so we
|
|
|
|
consider joining that rel to each other available rel. But in the presence
|
|
|
|
of join clauses we will only consider joins that use available join
|
|
|
|
clauses. Note that join-order restrictions induced by outer joins and
|
2008-08-14 20:48:00 +02:00
|
|
|
IN/EXISTS clauses are also checked, to ensure that we find a workable join
|
|
|
|
order in cases where those restrictions force a clauseless join to be done.)
|
2000-09-29 20:21:41 +02:00
|
|
|
|
2007-01-20 21:45:41 +01:00
|
|
|
If we only had two relations in the list, we are done: we just pick
|
2000-09-29 20:21:41 +02:00
|
|
|
the cheapest path for the join RelOptInfo. If we had more than two, we now
|
1999-08-16 04:17:58 +02:00
|
|
|
need to consider ways of joining join RelOptInfos to each other to make
|
2007-01-20 21:45:41 +01:00
|
|
|
join RelOptInfos that represent more than two list items.
|
2000-02-07 05:41:04 +01:00
|
|
|
|
|
|
|
The join tree is constructed using a "dynamic programming" algorithm:
|
|
|
|
in the first pass (already described) we consider ways to create join rels
|
2007-01-20 21:45:41 +01:00
|
|
|
representing exactly two list items. The second pass considers ways
|
|
|
|
to make join rels that represent exactly three list items; the next pass,
|
2000-09-29 20:21:41 +02:00
|
|
|
four items, etc. The last pass considers how to make the final join
|
2007-01-20 21:45:41 +01:00
|
|
|
relation that includes all list items --- obviously there can be only one
|
2000-02-07 05:41:04 +01:00
|
|
|
join rel at this top level, whereas there can be more than one join rel
|
|
|
|
at lower levels. At each level we use joins that follow available join
|
|
|
|
clauses, if possible, just as described for the first level.
|
1999-08-16 04:17:58 +02:00
|
|
|
|
|
|
|
For example:
|
1999-02-09 04:51:42 +01:00
|
|
|
|
1999-02-15 23:19:01 +01:00
|
|
|
SELECT *
|
|
|
|
FROM tab1, tab2, tab3, tab4
|
|
|
|
WHERE tab1.col = tab2.col AND
|
|
|
|
tab2.col = tab3.col AND
|
|
|
|
tab3.col = tab4.col
|
|
|
|
|
|
|
|
Tables 1, 2, 3, and 4 are joined as:
|
|
|
|
{1 2},{2 3},{3 4}
|
|
|
|
{1 2 3},{2 3 4}
|
|
|
|
{1 2 3 4}
|
2000-02-07 05:41:04 +01:00
|
|
|
(other possibilities will be excluded for lack of join clauses)
|
1999-02-15 23:19:01 +01:00
|
|
|
|
|
|
|
SELECT *
|
|
|
|
FROM tab1, tab2, tab3, tab4
|
|
|
|
WHERE tab1.col = tab2.col AND
|
|
|
|
tab1.col = tab3.col AND
|
|
|
|
tab1.col = tab4.col
|
|
|
|
|
|
|
|
Tables 1, 2, 3, and 4 are joined as:
|
|
|
|
{1 2},{1 3},{1 4}
|
2000-02-07 05:41:04 +01:00
|
|
|
{1 2 3},{1 3 4},{1 2 4}
|
1999-02-15 23:19:01 +01:00
|
|
|
{1 2 3 4}
|
|
|
|
|
2000-02-07 05:41:04 +01:00
|
|
|
We consider left-handed plans (the outer rel of an upper join is a joinrel,
|
2007-01-20 21:45:41 +01:00
|
|
|
but the inner is always a single list item); right-handed plans (outer rel
|
2000-09-29 20:21:41 +02:00
|
|
|
is always a single item); and bushy plans (both inner and outer can be
|
|
|
|
joins themselves). For example, when building {1 2 3 4} we consider
|
|
|
|
joining {1 2 3} to {4} (left-handed), {4} to {1 2 3} (right-handed), and
|
|
|
|
{1 2} to {3 4} (bushy), among other choices. Although the jointree
|
|
|
|
scanning code produces these potential join combinations one at a time,
|
|
|
|
all the ways to produce the same set of joined base rels will share the
|
|
|
|
same RelOptInfo, so the paths produced from different join combinations
|
2005-12-20 03:30:36 +01:00
|
|
|
that produce equivalent joinrels will compete in add_path().
|
2000-02-07 05:41:04 +01:00
|
|
|
|
2016-04-30 18:29:21 +02:00
|
|
|
The dynamic-programming approach has an important property that's not
|
|
|
|
immediately obvious: we will finish constructing all paths for a given
|
|
|
|
relation before we construct any paths for relations containing that rel.
|
|
|
|
This means that we can reliably identify the "cheapest path" for each rel
|
|
|
|
before higher-level relations need to know that. Also, we can safely
|
|
|
|
discard a path when we find that another path for the same rel is better,
|
|
|
|
without worrying that maybe there is already a reference to that path in
|
|
|
|
some higher-level join path. Without this, memory management for paths
|
|
|
|
would be much more complicated.
|
|
|
|
|
2000-02-07 05:41:04 +01:00
|
|
|
Once we have built the final join rel, we use either the cheapest path
|
|
|
|
for it or the cheapest path with the desired ordering (if that's cheaper
|
|
|
|
than applying a sort to the cheapest other path).
|
|
|
|
|
2005-12-20 03:30:36 +01:00
|
|
|
If the query contains one-sided outer joins (LEFT or RIGHT joins), or
|
2012-01-28 01:26:38 +01:00
|
|
|
IN or EXISTS WHERE clauses that were converted to semijoins or antijoins,
|
|
|
|
then some of the possible join orders may be illegal. These are excluded
|
|
|
|
by having join_is_legal consult a side list of such "special" joins to see
|
|
|
|
whether a proposed join is illegal. (The same consultation allows it to
|
|
|
|
see which join style should be applied for a valid join, ie, JOIN_INNER,
|
|
|
|
JOIN_LEFT, etc.)
|
2005-12-20 03:30:36 +01:00
|
|
|
|
|
|
|
|
2008-03-20 18:55:15 +01:00
|
|
|
Valid OUTER JOIN Optimizations
|
2005-12-20 03:30:36 +01:00
|
|
|
------------------------------
|
|
|
|
|
|
|
|
The planner's treatment of outer join reordering is based on the following
|
|
|
|
identities:
|
|
|
|
|
|
|
|
1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
|
|
|
|
= (A innerjoin C on (Pac)) leftjoin B on (Pab)
|
|
|
|
|
|
|
|
where Pac is a predicate referencing A and C, etc (in this case, clearly
|
|
|
|
Pac cannot reference B, or the transformation is nonsensical).
|
|
|
|
|
|
|
|
2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
|
|
|
|
= (A leftjoin C on (Pac)) leftjoin B on (Pab)
|
|
|
|
|
|
|
|
3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
|
|
|
|
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)
|
|
|
|
|
|
|
|
Identity 3 only holds if predicate Pbc must fail for all-null B rows
|
|
|
|
(that is, Pbc is strict for at least one column of B). If Pbc is not
|
|
|
|
strict, the first form might produce some rows with nonnull C columns
|
|
|
|
where the second form would make those entries null.
|
|
|
|
|
|
|
|
RIGHT JOIN is equivalent to LEFT JOIN after switching the two input
|
2009-02-27 23:41:38 +01:00
|
|
|
tables, so the same identities work for right joins.
|
2005-12-20 03:30:36 +01:00
|
|
|
|
|
|
|
An example of a case that does *not* work is moving an innerjoin into or
|
|
|
|
out of the nullable side of an outer join:
|
|
|
|
|
|
|
|
A leftjoin (B join C on (Pbc)) on (Pab)
|
|
|
|
!= (A leftjoin B on (Pab)) join C on (Pbc)
|
|
|
|
|
2009-02-27 23:41:38 +01:00
|
|
|
SEMI joins work a little bit differently. A semijoin can be reassociated
|
2009-07-21 04:02:44 +02:00
|
|
|
into or out of the lefthand side of another semijoin, left join, or
|
|
|
|
antijoin, but not into or out of the righthand side. Likewise, an inner
|
|
|
|
join, left join, or antijoin can be reassociated into or out of the
|
|
|
|
lefthand side of a semijoin, but not into or out of the righthand side.
|
2009-02-27 23:41:38 +01:00
|
|
|
|
|
|
|
ANTI joins work approximately like LEFT joins, except that identity 3
|
|
|
|
fails if the join to C is an antijoin (even if Pbc is strict, and in
|
|
|
|
both the cases where the other join is a leftjoin and where it is an
|
|
|
|
antijoin). So we can't reorder antijoins into or out of the RHS of a
|
|
|
|
leftjoin or antijoin, even if the relevant clause is strict.
|
|
|
|
|
|
|
|
The current code does not attempt to re-order FULL JOINs at all.
|
2005-12-20 03:30:36 +01:00
|
|
|
FULL JOIN ordering is enforced by not collapsing FULL JOIN nodes when
|
2009-02-27 23:41:38 +01:00
|
|
|
translating the jointree to "joinlist" representation. Other types of
|
2005-12-20 03:30:36 +01:00
|
|
|
JOIN nodes are normally collapsed so that they participate fully in the
|
|
|
|
join order search. To avoid generating illegal join orders, the planner
|
2009-02-27 23:41:38 +01:00
|
|
|
creates a SpecialJoinInfo node for each non-inner join, and join_is_legal
|
2005-12-20 03:30:36 +01:00
|
|
|
checks this list to decide if a proposed join is legal.
|
|
|
|
|
2008-08-14 20:48:00 +02:00
|
|
|
What we store in SpecialJoinInfo nodes are the minimum sets of Relids
|
2005-12-20 03:30:36 +01:00
|
|
|
required on each side of the join to form the outer join. Note that
|
|
|
|
these are minimums; there's no explicit maximum, since joining other
|
|
|
|
rels to the OJ's syntactic rels may be legal. Per identities 1 and 2,
|
|
|
|
non-FULL joins can be freely associated into the lefthand side of an
|
2009-02-27 23:41:38 +01:00
|
|
|
OJ, but in some cases they can't be associated into the righthand side.
|
2007-10-26 20:10:50 +02:00
|
|
|
So the restriction enforced by join_is_legal is that a proposed join
|
2007-02-13 03:31:03 +01:00
|
|
|
can't join a rel within or partly within an RHS boundary to one outside
|
2015-08-06 21:35:27 +02:00
|
|
|
the boundary, unless the proposed join is a LEFT join that can associate
|
|
|
|
into the SpecialJoinInfo's RHS using identity 3.
|
|
|
|
|
|
|
|
The use of minimum Relid sets has some pitfalls; consider a query like
|
|
|
|
A leftjoin (B leftjoin (C innerjoin D) on (Pbcd)) on Pa
|
|
|
|
where Pa doesn't mention B/C/D at all. In this case a naive computation
|
|
|
|
would give the upper leftjoin's min LHS as {A} and min RHS as {C,D} (since
|
|
|
|
we know that the innerjoin can't associate out of the leftjoin's RHS, and
|
|
|
|
enforce that by including its relids in the leftjoin's min RHS). And the
|
|
|
|
lower leftjoin has min LHS of {B} and min RHS of {C,D}. Given such
|
|
|
|
information, join_is_legal would think it's okay to associate the upper
|
|
|
|
join into the lower join's RHS, transforming the query to
|
|
|
|
B leftjoin (A leftjoin (C innerjoin D) on Pa) on (Pbcd)
|
2015-10-01 16:31:22 +02:00
|
|
|
which yields totally wrong answers. We prevent that by forcing the min RHS
|
2015-08-06 21:35:27 +02:00
|
|
|
for the upper join to include B. This is perhaps overly restrictive, but
|
|
|
|
such cases don't arise often so it's not clear that it's worth developing a
|
|
|
|
more complicated system.
|
2005-12-20 03:30:36 +01:00
|
|
|
|
2000-09-29 20:21:41 +02:00
|
|
|
|
2008-03-20 18:55:15 +01:00
|
|
|
Pulling Up Subqueries
|
2000-09-29 20:21:41 +02:00
|
|
|
---------------------
|
|
|
|
|
|
|
|
As we described above, a subquery appearing in the range table is planned
|
|
|
|
independently and treated as a "black box" during planning of the outer
|
|
|
|
query. This is necessary when the subquery uses features such as
|
|
|
|
aggregates, GROUP, or DISTINCT. But if the subquery is just a simple
|
|
|
|
scan or join, treating the subquery as a black box may produce a poor plan
|
|
|
|
compared to considering it as part of the entire plan search space.
|
|
|
|
Therefore, at the start of the planning process the planner looks for
|
|
|
|
simple subqueries and pulls them up into the main query's jointree.
|
|
|
|
|
|
|
|
Pulling up a subquery may result in FROM-list joins appearing below the top
|
|
|
|
of the join tree. Each FROM-list is planned using the dynamic-programming
|
|
|
|
search method described above.
|
|
|
|
|
|
|
|
If pulling up a subquery produces a FROM-list as a direct child of another
|
2005-12-20 03:30:36 +01:00
|
|
|
FROM-list, then we can merge the two FROM-lists together. Once that's
|
|
|
|
done, the subquery is an absolutely integral part of the outer query and
|
|
|
|
will not constrain the join tree search space at all. However, that could
|
|
|
|
result in unpleasant growth of planning time, since the dynamic-programming
|
|
|
|
search has runtime exponential in the number of FROM-items considered.
|
|
|
|
Therefore, we don't merge FROM-lists if the result would have too many
|
|
|
|
FROM-items in one list.
|
2000-09-12 23:07:18 +02:00
|
|
|
|
1999-02-08 05:29:25 +01:00
|
|
|
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
Vars and PlaceHolderVars
|
|
|
|
------------------------
|
|
|
|
|
|
|
|
A Var node is simply the parse-tree representation of a table column
|
|
|
|
reference. However, in the presence of outer joins, that concept is
|
|
|
|
more subtle than it might seem. We need to distinguish the values of
|
|
|
|
a Var "above" and "below" any outer join that could force the Var to
|
|
|
|
null. As an example, consider
|
|
|
|
|
|
|
|
SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y) WHERE foo(t2.z)
|
|
|
|
|
|
|
|
(Assume foo() is not strict, so that we can't reduce the left join to
|
|
|
|
a plain join.) A naive implementation might try to push the foo(t2.z)
|
|
|
|
call down to the scan of t2, but that is not correct because
|
|
|
|
(a) what foo() should actually see for a null-extended join row is NULL,
|
|
|
|
and (b) if foo() returns false, we should suppress the t1 row from the
|
|
|
|
join altogether, not emit it with a null-extended t2 row. On the other
|
|
|
|
hand, it *would* be correct (and desirable) to push that call down to
|
|
|
|
the scan level if the query were
|
|
|
|
|
|
|
|
SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y AND foo(t2.z))
|
|
|
|
|
|
|
|
This motivates considering "t2.z" within the left join's ON clause
|
|
|
|
to be a different value from "t2.z" outside the JOIN clause. The
|
|
|
|
former can be identified with t2.z as seen at the relation scan level,
|
|
|
|
but the latter can't.
|
|
|
|
|
|
|
|
Another example occurs in connection with EquivalenceClasses (discussed
|
|
|
|
below). Given
|
|
|
|
|
|
|
|
SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y) WHERE t1.x = 42
|
|
|
|
|
|
|
|
we would like to use the EquivalenceClass mechanisms to derive "t2.y = 42"
|
|
|
|
to use as a restriction clause for the scan of t2. (That works, because t2
|
|
|
|
rows having y different from 42 cannot affect the query result.) However,
|
|
|
|
it'd be wrong to conclude that t2.y will be equal to t1.x in every joined
|
|
|
|
row. Part of the solution to this problem is to deem that "t2.y" in the
|
|
|
|
ON clause refers to the relation-scan-level value of t2.y, but not to the
|
|
|
|
value that y will have in joined rows, where it might be NULL rather than
|
|
|
|
equal to t1.x.
|
|
|
|
|
|
|
|
Therefore, Var nodes are decorated with "varnullingrels", which are sets
|
|
|
|
of the rangetable indexes of outer joins that potentially null the Var
|
|
|
|
at the point where it appears in the query. (Using a set, not an ordered
|
|
|
|
list, is fine since it doesn't matter which join forced the value to null;
|
|
|
|
and that avoids having to change the representation when we consider
|
|
|
|
different outer-join orders.) In the examples above, all occurrences of
|
|
|
|
t1.x would have empty varnullingrels, since the left join doesn't null t1.
|
|
|
|
The t2 references within the JOIN ON clauses would also have empty
|
|
|
|
varnullingrels. But outside the JOIN clauses, any Vars referencing t2
|
|
|
|
would have varnullingrels containing the index of the JOIN's rangetable
|
|
|
|
entry (RTE), so that they'd be understood as potentially different from
|
|
|
|
the t2 values seen at scan level. Labeling t2.z in the WHERE clause with
|
|
|
|
the JOIN's RT index lets us recognize that that occurrence of foo(t2.z)
|
|
|
|
cannot be pushed down to the t2 scan level: we cannot evaluate that value
|
|
|
|
at the scan level, but only after the join has been done.
|
|
|
|
|
|
|
|
For LEFT and RIGHT outer joins, only Vars coming from the nullable side
|
|
|
|
of the join are marked with that join's RT index. For FULL joins, Vars
|
|
|
|
from both inputs are marked. (Such marking doesn't let us tell which
|
|
|
|
side of the full join a Var came from; but that information can be found
|
|
|
|
elsewhere at need.)
|
|
|
|
|
|
|
|
Notionally, a Var having nonempty varnullingrels can be thought of as
|
|
|
|
CASE WHEN any-of-these-outer-joins-produced-a-null-extended-row
|
|
|
|
THEN NULL
|
|
|
|
ELSE the-scan-level-value-of-the-column
|
|
|
|
END
|
|
|
|
It's only notional, because no such calculation is ever done explicitly.
|
|
|
|
In a finished plan, Vars occurring in scan-level plan nodes represent
|
|
|
|
the actual table column values, but upper-level Vars are always
|
|
|
|
references to outputs of lower-level plan nodes. When a join node emits
|
|
|
|
a null-extended row, it just returns nulls for the relevant output
|
|
|
|
columns rather than copying up values from its input. Because we don't
|
|
|
|
ever have to do this calculation explicitly, it's not necessary to
|
|
|
|
distinguish which side of an outer join got null-extended, which'd
|
|
|
|
otherwise be essential information for FULL JOIN cases.
|
|
|
|
|
|
|
|
Outer join identity 3 (discussed above) complicates this picture
|
|
|
|
a bit. In the form
|
|
|
|
A leftjoin (B leftjoin C on (Pbc)) on (Pab)
|
|
|
|
all of the Vars in clauses Pbc and Pab will have empty varnullingrels,
|
|
|
|
but if we start with
|
|
|
|
(A leftjoin B on (Pab)) leftjoin C on (Pbc)
|
|
|
|
then the parser will have marked Pbc's B Vars with the A/B join's
|
|
|
|
RT index, making this form artificially different from the first.
|
|
|
|
For discussion's sake, let's denote this marking with a star:
|
|
|
|
(A leftjoin B on (Pab)) leftjoin C on (Pb*c)
|
|
|
|
To cope with this, once we have detected that commuting these joins
|
|
|
|
is legal, we generate both the Pbc and Pb*c forms of that ON clause,
|
|
|
|
by either removing or adding the first join's RT index in the B Vars
|
|
|
|
that the parser created. While generating paths for a plan step that
|
|
|
|
joins B and C, we include as a relevant join qual only the form that
|
|
|
|
is appropriate depending on whether A has already been joined to B.
|
|
|
|
|
|
|
|
It's also worth noting that identity 3 makes "the left join's RT index"
|
|
|
|
itself a bit of a fuzzy concept, since the syntactic scope of each join
|
|
|
|
RTE will depend on which form was produced by the parser. We resolve
|
|
|
|
this by considering that a left join's identity is determined by its
|
|
|
|
minimum set of right-hand-side input relations. In both forms allowed
|
|
|
|
by identity 3, we can identify the first join as having minimum RHS B
|
|
|
|
and the second join as having minimum RHS C.
|
|
|
|
|
|
|
|
Another thing to notice is that C Vars appearing outside the nested
|
|
|
|
JOIN clauses will be marked as nulled by both left joins if the
|
|
|
|
original parser input was in the first form of identity 3, but if the
|
|
|
|
parser input was in the second form, such Vars will only be marked as
|
|
|
|
nulled by the second join. This is not really a semantic problem:
|
|
|
|
such Vars will be marked the same way throughout the upper part of the
|
|
|
|
query, so they will all look equal() which is correct; and they will not
|
|
|
|
look equal() to any C Var appearing in the JOIN ON clause or below these
|
|
|
|
joins. However, when building Vars representing the outputs of join
|
|
|
|
relations, we need to ensure that their varnullingrels are set to
|
|
|
|
values consistent with the syntactic join order, so that they will
|
|
|
|
appear equal() to pre-existing Vars in the upper part of the query.
|
|
|
|
|
|
|
|
Outer joins also complicate handling of subquery pull-up. Consider
|
|
|
|
|
|
|
|
SELECT ..., ss.x FROM tab1
|
|
|
|
LEFT JOIN (SELECT *, 42 AS x FROM tab2) ss ON ...
|
|
|
|
|
|
|
|
We want to be able to pull up the subquery as discussed previously,
|
|
|
|
but we can't just replace the "ss.x" Var in the top-level SELECT list
|
|
|
|
with the constant 42. That'd result in always emitting 42, rather
|
|
|
|
than emitting NULL in null-extended join rows.
|
|
|
|
|
|
|
|
To solve this, we introduce the concept of PlaceHolderVars.
|
|
|
|
A PlaceHolderVar is somewhat like a Var, in that its value originates
|
|
|
|
at a relation scan level and can then be forced to null by higher-level
|
|
|
|
outer joins; hence PlaceHolderVars carry a set of nulling rel IDs just
|
|
|
|
like Vars. Unlike a Var, whose original value comes from a table,
|
|
|
|
a PlaceHolderVar's original value is defined by a query-determined
|
|
|
|
expression ("42" in this example); so we represent the PlaceHolderVar
|
|
|
|
as a node with that expression as child. We insert a PlaceHolderVar
|
|
|
|
whenever subquery pullup needs to replace a subquery-referencing Var
|
|
|
|
that has nonempty varnullingrels with an expression that is not simply a
|
|
|
|
Var. (When the replacement expression is a pulled-up Var, we can just
|
|
|
|
add the replaced Var's varnullingrels to its set. Also, if the replaced
|
|
|
|
Var has empty varnullingrels, we don't need a PlaceHolderVar: there is
|
|
|
|
nothing that'd force the value to null, so the pulled-up expression is
|
|
|
|
fine to use as-is.) In a finished plan, a PlaceHolderVar becomes just
|
|
|
|
the contained expression at whatever plan level it's supposed to be
|
|
|
|
evaluated at, and then upper-level occurrences are replaced by Var
|
|
|
|
references to that output column of the lower plan level. That causes
|
|
|
|
the value to go to null when appropriate at an outer join, in the same
|
|
|
|
way as for normal Vars. Thus, PlaceHolderVars are never seen outside
|
|
|
|
the planner.
|
|
|
|
|
|
|
|
PlaceHolderVars (PHVs) are more complicated than Vars in another way:
|
|
|
|
their original value might need to be calculated at a join, not a
|
|
|
|
base-level relation scan. This can happen when a pulled-up subquery
|
|
|
|
contains a join. Because of this, a PHV can create a join order
|
|
|
|
constraint that wouldn't otherwise exist, to ensure that it can
|
|
|
|
be calculated before it is used. A PHV's expression can also contain
|
|
|
|
LATERAL references, adding complications that are discussed below.
|
|
|
|
|
|
|
|
|
|
|
|
Relation Identification and Qual Clause Placement
|
|
|
|
-------------------------------------------------
|
|
|
|
|
|
|
|
A qual clause obtained from WHERE or JOIN/ON can be enforced at the lowest
|
|
|
|
scan or join level that includes all relations used in the clause. For
|
|
|
|
this purpose we consider that outer joins listed in varnullingrels or
|
|
|
|
phnullingrels are used in the clause, since we can't compute the qual's
|
|
|
|
result correctly until we know whether such Vars have gone to null.
|
|
|
|
|
|
|
|
The one exception to this general rule is that a non-degenerate outer
|
|
|
|
JOIN/ON qual (one that references the non-nullable side of the join)
|
|
|
|
cannot be enforced below that join, even if it doesn't reference the
|
|
|
|
nullable side. Pushing it down into the non-nullable side would result
|
|
|
|
in rows disappearing from the join's result, rather than appearing as
|
|
|
|
null-extended rows. To handle that, when we identify such a qual we
|
|
|
|
artificially add the join's minimum input relid set to the set of
|
|
|
|
relations it is considered to use, forcing it to be evaluated exactly at
|
|
|
|
that join level. The same happens for outer-join quals that mention no
|
|
|
|
relations at all.
|
|
|
|
|
|
|
|
When attaching a qual clause to a join plan node that is performing an
|
|
|
|
outer join, the qual clause is considered a "join clause" (that is, it is
|
|
|
|
applied before the join performs null-extension) if it does not reference
|
|
|
|
that outer join in any varnullingrels or phnullingrels set, or a "filter
|
|
|
|
clause" (applied after null-extension) if it does reference that outer
|
|
|
|
join. A qual clause that originally appeared in that outer join's JOIN/ON
|
|
|
|
will fall into the first category, since the parser would not have marked
|
|
|
|
any of its Vars as referencing the outer join. A qual clause that
|
|
|
|
originally came from some upper ON clause or WHERE clause will be seen as
|
|
|
|
referencing the outer join if it references any of the nullable side's
|
|
|
|
Vars, since those Vars will be so marked by the parser. But, if such a
|
|
|
|
qual does not reference any nullable-side Vars, it's okay to push it down
|
|
|
|
into the non-nullable side, so it won't get attached to the join node in
|
|
|
|
the first place.
|
|
|
|
|
|
|
|
These things lead us to identify join relations within the planner
|
|
|
|
by the sets of base relation RT indexes plus outer join RT indexes
|
|
|
|
that they include. In that way, the sets of relations used by qual
|
|
|
|
clauses can be directly compared to join relations' relid sets to
|
|
|
|
see where to place the clauses. These identifying sets are unique
|
|
|
|
because, for any given collection of base relations, there is only
|
|
|
|
one valid set of outer joins to have performed along the way to
|
|
|
|
joining that set of base relations (although the order of applying
|
|
|
|
them could vary, as discussed above).
|
|
|
|
|
|
|
|
SEMI joins do not have RT indexes, because they are artifacts made by
|
|
|
|
the planner rather than the parser. (We could create rangetable
|
|
|
|
entries for them, but there seems no need at present.) This does not
|
|
|
|
cause a problem for qual placement, because the nullable side of a
|
|
|
|
semijoin is not referenceable from above the join, so there is never a
|
|
|
|
need to cite it in varnullingrels or phnullingrels. It does not cause a
|
|
|
|
problem for join relation identification either, since whether a semijoin
|
|
|
|
has been completed is again implicit in the set of base relations
|
|
|
|
included in the join.
|
|
|
|
|
2023-05-17 17:13:52 +02:00
|
|
|
As usual, outer join identity 3 complicates matters. If we start with
|
|
|
|
(A leftjoin B on (Pab)) leftjoin C on (Pbc)
|
|
|
|
then the parser will have marked any C Vars appearing above these joins
|
|
|
|
with the RT index of the B/C join. If we now transform to
|
|
|
|
A leftjoin (B leftjoin C on (Pbc)) on (Pab)
|
|
|
|
then it would appear that a clause using only such Vars could be pushed
|
|
|
|
down and applied as a filter clause (not a join clause) at the lower
|
|
|
|
B/C join. But *this might not give the right answer* since the clause
|
|
|
|
might see a non-null value for the C Var that will be replaced by null
|
|
|
|
once the A/B join is performed. We handle this by saying that the
|
|
|
|
pushed-down join hasn't completely performed the work of the B/C join
|
|
|
|
and hence is not entitled to include that outer join relid in its
|
|
|
|
relid set. When we form the A/B join, both outer joins' relids will
|
|
|
|
be added to its relid set, and then the upper clause will be applied
|
|
|
|
at the correct join level. (Note there is no problem when identity 3
|
|
|
|
is applied in the other direction: if we started with the second form
|
|
|
|
then upper C Vars are marked with both outer join relids, so they
|
|
|
|
cannot drop below whichever join is applied second.) Similarly,
|
|
|
|
Vars representing the output of a pushed-down join do not acquire
|
|
|
|
nullingrel bits for that join until after the upper join is performed.
|
|
|
|
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
There is one additional complication for qual clause placement, which
|
|
|
|
occurs when we have made multiple versions of an outer-join clause as
|
|
|
|
described previously (that is, we have both "Pbc" and "Pb*c" forms of
|
|
|
|
the same clause seen in outer join identity 3). When forming an outer
|
|
|
|
join we only want to apply one of the redundant versions of the clause.
|
|
|
|
If we are forming the B/C join without having yet computed the A/B
|
|
|
|
join, it's easy to reject the "Pb*c" form since its required relid
|
|
|
|
set includes the A/B join relid which is not in the input. However,
|
|
|
|
if we form B/C after A/B, then both forms of the clause are applicable
|
|
|
|
so far as that test can tell. We have to look more closely to notice
|
|
|
|
that the "Pbc" clause form refers to relation B which is no longer
|
2023-05-25 16:28:33 +02:00
|
|
|
directly accessible. While such a check could be performed using the
|
|
|
|
per-relation RelOptInfo.nulling_relids data, it would be annoyingly
|
|
|
|
expensive to do over and over as we consider different join paths.
|
|
|
|
To make this simple and reliable, we compute an "incompatible_relids"
|
|
|
|
set for each variant version (clone) of a redundant clause. A clone
|
|
|
|
clause should not be applied if any of the outer-join relids listed in
|
|
|
|
incompatible_relids has already been computed below the current join.
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
|
|
|
|
|
1999-02-04 04:19:11 +01:00
|
|
|
Optimizer Functions
|
|
|
|
-------------------
|
|
|
|
|
2000-03-21 06:12:12 +01:00
|
|
|
The primary entry point is planner().
|
|
|
|
|
1997-12-17 19:02:33 +01:00
|
|
|
planner()
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
set up for recursive handling of subqueries
|
2000-03-21 06:12:12 +01:00
|
|
|
-subquery_planner()
|
2008-08-14 20:48:00 +02:00
|
|
|
pull up sublinks and subqueries from rangetable, if possible
|
2000-03-21 06:12:12 +01:00
|
|
|
canonicalize qual
|
2003-12-30 22:49:19 +01:00
|
|
|
Attempt to simplify WHERE clause to the most useful form; this includes
|
|
|
|
flattening nested AND/ORs and detecting clauses that are duplicated in
|
|
|
|
different branches of an OR.
|
|
|
|
simplify constant expressions
|
2000-03-21 06:12:12 +01:00
|
|
|
process sublinks
|
|
|
|
convert Vars of outer query levels into Params
|
2000-11-12 01:37:02 +01:00
|
|
|
--grouping_planner()
|
|
|
|
preprocess target list for non-SELECT queries
|
|
|
|
handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates,
|
|
|
|
ORDER BY, DISTINCT, LIMIT
|
2020-05-22 08:45:00 +02:00
|
|
|
---query_planner()
|
2002-11-06 01:00:45 +01:00
|
|
|
make list of base relations used in query
|
|
|
|
split up the qual into restrictions (a=1) and joins (b=c)
|
|
|
|
find qual clauses that enable merge and hash joins
|
1999-02-15 23:19:01 +01:00
|
|
|
----make_one_rel()
|
2018-06-07 14:40:38 +02:00
|
|
|
set_base_rel_pathlists()
|
2007-09-26 20:51:51 +02:00
|
|
|
find seqscan and all index paths for each base relation
|
1999-02-15 23:19:01 +01:00
|
|
|
find selectivity of columns used in joins
|
2007-09-26 20:51:51 +02:00
|
|
|
make_rel_from_joinlist()
|
|
|
|
hand off join subproblems to a plugin, GEQO, or standard_join_search()
|
2020-05-22 08:45:00 +02:00
|
|
|
------standard_join_search()
|
2007-09-26 20:51:51 +02:00
|
|
|
call join_search_one_level() for each level of join tree needed
|
|
|
|
join_search_one_level():
|
2000-02-07 05:41:04 +01:00
|
|
|
For each joinrel of the prior level, do make_rels_by_clause_joins()
|
|
|
|
if it has join clauses, or make_rels_by_clauseless_joins() if not.
|
|
|
|
Also generate "bushy plan" joins between joinrels of lower levels.
|
2016-04-30 18:29:21 +02:00
|
|
|
Back at standard_join_search(), generate gather paths if needed for
|
|
|
|
each newly constructed joinrel, then apply set_cheapest() to extract
|
|
|
|
the cheapest path for it.
|
2000-02-07 05:41:04 +01:00
|
|
|
Loop back if this wasn't the top join level.
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
Back at grouping_planner:
|
|
|
|
do grouping (GROUP BY) and aggregation
|
|
|
|
do window functions
|
|
|
|
make unique (DISTINCT)
|
|
|
|
do sorting (ORDER BY)
|
|
|
|
do limit (LIMIT/OFFSET)
|
|
|
|
Back at planner():
|
|
|
|
convert finished Path tree into a Plan tree
|
|
|
|
do final cleanup after planning
|
1999-02-03 21:15:53 +01:00
|
|
|
|
|
|
|
|
1999-08-16 04:17:58 +02:00
|
|
|
Optimizer Data Structures
|
|
|
|
-------------------------
|
1999-02-04 04:19:11 +01:00
|
|
|
|
2007-02-19 08:03:34 +01:00
|
|
|
PlannerGlobal - global information for a single planner invocation
|
|
|
|
|
|
|
|
PlannerInfo - information for planning a particular Query (we make
|
|
|
|
a separate PlannerInfo node for each sub-Query)
|
2005-06-06 00:32:58 +02:00
|
|
|
|
1999-02-15 23:19:01 +01:00
|
|
|
RelOptInfo - a relation or joined relations
|
1999-02-04 04:19:11 +01:00
|
|
|
|
2003-01-15 20:35:48 +01:00
|
|
|
RestrictInfo - WHERE clauses, like "x = 3" or "y = z"
|
|
|
|
(note the same structure is used for restriction and
|
|
|
|
join clauses)
|
1999-02-04 04:19:11 +01:00
|
|
|
|
1999-02-15 23:19:01 +01:00
|
|
|
Path - every way to generate a RelOptInfo(sequential,index,joins)
|
In the planner, replace an empty FROM clause with a dummy RTE.
The fact that "SELECT expression" has no base relations has long been a
thorn in the side of the planner. It makes it hard to flatten a sub-query
that looks like that, or is a trivial VALUES() item, because the planner
generally uses relid sets to identify sub-relations, and such a sub-query
would have an empty relid set if we flattened it. prepjointree.c contains
some baroque logic that works around this in certain special cases --- but
there is a much better answer. We can replace an empty FROM clause with a
dummy RTE that acts like a table of one row and no columns, and then there
are no such corner cases to worry about. Instead we need some logic to
get rid of useless dummy RTEs, but that's simpler and covers more cases
than what was there before.
For really trivial cases, where the query is just "SELECT expression" and
nothing else, there's a hazard that adding the extra RTE makes for a
noticeable slowdown; even though it's not much processing, there's not
that much for the planner to do overall. However testing says that the
penalty is very small, close to the noise level. In more complex queries,
this is able to find optimizations that we could not find before.
The new RTE type is called RTE_RESULT, since the "scan" plan type it
gives rise to is a Result node (the same plan we produced for a "SELECT
expression" query before). To avoid confusion, rename the old ResultPath
path type to GroupResultPath, reflecting that it's only used in degenerate
grouping cases where we know the query produces just one grouped row.
(It wouldn't work to unify the two cases, because there are different
rules about where the associated quals live during query_planner.)
Note: although this touches readfuncs.c, I don't think a catversion
bump is required, because the added case can't occur in stored rules,
only plans.
Patch by me, reviewed by David Rowley and Mark Dilger
Discussion: https://postgr.es/m/15944.1521127664@sss.pgh.pa.us
2019-01-28 23:54:10 +01:00
|
|
|
A plain Path node can represent several simple plans, per its pathtype:
|
|
|
|
T_SeqScan - sequential scan
|
|
|
|
T_SampleScan - tablesample scan
|
|
|
|
T_FunctionScan - function-in-FROM scan
|
|
|
|
T_TableFuncScan - table function scan
|
|
|
|
T_ValuesScan - VALUES scan
|
|
|
|
T_CteScan - CTE (WITH) scan
|
|
|
|
T_NamedTuplestoreScan - ENR scan
|
|
|
|
T_WorkTableScan - scan worktable of a recursive CTE
|
|
|
|
T_Result - childless Result plan node (used for FROM-less SELECT)
|
2010-10-14 22:56:39 +02:00
|
|
|
IndexPath - index scan
|
2005-04-21 21:18:13 +02:00
|
|
|
BitmapHeapPath - top of a bitmapped index scan
|
2002-11-06 01:00:45 +01:00
|
|
|
TidPath - scan by CTID
|
2021-02-27 10:59:36 +01:00
|
|
|
TidRangePath - scan a contiguous range of CTIDs
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
SubqueryScanPath - scan a subquery-in-FROM
|
2016-04-21 19:30:48 +02:00
|
|
|
ForeignPath - scan a foreign table, foreign join or foreign upper-relation
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
CustomPath - for custom scan providers
|
2002-11-06 01:00:45 +01:00
|
|
|
AppendPath - append multiple subpaths together
|
2010-10-14 22:56:39 +02:00
|
|
|
MergeAppendPath - merge multiple subpaths, preserving their common sort order
|
In the planner, replace an empty FROM clause with a dummy RTE.
The fact that "SELECT expression" has no base relations has long been a
thorn in the side of the planner. It makes it hard to flatten a sub-query
that looks like that, or is a trivial VALUES() item, because the planner
generally uses relid sets to identify sub-relations, and such a sub-query
would have an empty relid set if we flattened it. prepjointree.c contains
some baroque logic that works around this in certain special cases --- but
there is a much better answer. We can replace an empty FROM clause with a
dummy RTE that acts like a table of one row and no columns, and then there
are no such corner cases to worry about. Instead we need some logic to
get rid of useless dummy RTEs, but that's simpler and covers more cases
than what was there before.
For really trivial cases, where the query is just "SELECT expression" and
nothing else, there's a hazard that adding the extra RTE makes for a
noticeable slowdown; even though it's not much processing, there's not
that much for the planner to do overall. However testing says that the
penalty is very small, close to the noise level. In more complex queries,
this is able to find optimizations that we could not find before.
The new RTE type is called RTE_RESULT, since the "scan" plan type it
gives rise to is a Result node (the same plan we produced for a "SELECT
expression" query before). To avoid confusion, rename the old ResultPath
path type to GroupResultPath, reflecting that it's only used in degenerate
grouping cases where we know the query produces just one grouped row.
(It wouldn't work to unify the two cases, because there are different
rules about where the associated quals live during query_planner.)
Note: although this touches readfuncs.c, I don't think a catversion
bump is required, because the added case can't occur in stored rules,
only plans.
Patch by me, reviewed by David Rowley and Mark Dilger
Discussion: https://postgr.es/m/15944.1521127664@sss.pgh.pa.us
2019-01-28 23:54:10 +01:00
|
|
|
GroupResultPath - childless Result plan node (used for degenerate grouping)
|
2002-11-30 06:21:03 +01:00
|
|
|
MaterialPath - a Material plan node
|
2021-07-14 02:43:58 +02:00
|
|
|
MemoizePath - a Memoize plan node for caching tuples from sub-paths
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
UniquePath - remove duplicate rows (either by hashing or sorting)
|
|
|
|
GatherPath - collect the results of parallel workers
|
Force rescanning of parallel-aware scan nodes below a Gather[Merge].
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2017-08-30 15:29:55 +02:00
|
|
|
GatherMergePath - collect parallel results, preserving their common sort order
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
ProjectionPath - a Result plan node with child (used for projection)
|
Move targetlist SRF handling from expression evaluation to new executor node.
Evaluation of set returning functions (SRFs_ in the targetlist (like SELECT
generate_series(1,5)) so far was done in the expression evaluation (i.e.
ExecEvalExpr()) and projection (i.e. ExecProject/ExecTargetList) code.
This meant that most executor nodes performing projection, and most
expression evaluation functions, had to deal with the possibility that an
evaluated expression could return a set of return values.
That's bad because it leads to repeated code in a lot of places. It also,
and that's my (Andres's) motivation, made it a lot harder to implement a
more efficient way of doing expression evaluation.
To fix this, introduce a new executor node (ProjectSet) that can evaluate
targetlists containing one or more SRFs. To avoid the complexity of the old
way of handling nested expressions returning sets (e.g. having to pass up
ExprDoneCond, and dealing with arguments to functions returning sets etc.),
those SRFs can only be at the top level of the node's targetlist. The
planner makes sure (via split_pathtarget_at_srfs()) that SRF evaluation is
only necessary in ProjectSet nodes and that SRFs are only present at the
top level of the node's targetlist. If there are nested SRFs the planner
creates multiple stacked ProjectSet nodes. The ProjectSet nodes always get
input from an underlying node.
We also discussed and prototyped evaluating targetlist SRFs using ROWS
FROM(), but that turned out to be more complicated than we'd hoped.
While moving SRF evaluation to ProjectSet would allow to retain the old
"least common multiple" behavior when multiple SRFs are present in one
targetlist (i.e. continue returning rows until all SRFs are at the end of
their input at the same time), we decided to instead only return rows till
all SRFs are exhausted, returning NULL for already exhausted ones. We
deemed the previous behavior to be too confusing, unexpected and actually
not particularly useful.
As a side effect, the previously prohibited case of multiple set returning
arguments to a function, is now allowed. Not because it's particularly
desirable, but because it ends up working and there seems to be no argument
for adding code to prohibit it.
Currently the behavior for COALESCE and CASE containing SRFs has changed,
returning multiple rows from the expression, even when the SRF containing
"arm" of the expression is not evaluated. That's because the SRFs are
evaluated in a separate ProjectSet node. As that's quite confusing, we're
likely to instead prohibit SRFs in those places. But that's still being
discussed, and the code would reside in places not touched here, so that's
a task for later.
There's a lot of, now superfluous, code dealing with set return expressions
around. But as the changes to get rid of those are verbose largely boring,
it seems better for readability to keep the cleanup as a separate commit.
Author: Tom Lane and Andres Freund
Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
2017-01-18 21:46:50 +01:00
|
|
|
ProjectSetPath - a ProjectSet plan node applied to some sub-path
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
SortPath - a Sort plan node applied to some sub-path
|
2020-11-30 22:32:56 +01:00
|
|
|
IncrementalSortPath - an IncrementalSort plan node applied to some sub-path
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
GroupPath - a Group plan node applied to some sub-path
|
|
|
|
UpperUniquePath - a Unique plan node applied to some sub-path
|
|
|
|
AggPath - an Agg plan node applied to some sub-path
|
|
|
|
GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
|
|
|
|
MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
|
|
|
|
WindowAggPath - a WindowAgg plan node applied to some sub-path
|
|
|
|
SetOpPath - a SetOp plan node applied to some sub-path
|
|
|
|
RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
|
|
|
|
LockRowsPath - a LockRows plan node applied to some sub-path
|
|
|
|
ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
|
|
|
|
LimitPath - a Limit plan node applied to some sub-path
|
1999-08-16 04:17:58 +02:00
|
|
|
NestPath - nested-loop joins
|
1999-02-15 23:19:01 +01:00
|
|
|
MergePath - merge joins
|
|
|
|
HashPath - hash joins
|
1999-02-04 02:47:02 +01:00
|
|
|
|
2007-01-20 21:45:41 +01:00
|
|
|
EquivalenceClass - a data structure representing a set of values known equal
|
|
|
|
|
|
|
|
PathKey - a data structure representing the sort ordering of a path
|
1999-08-16 04:17:58 +02:00
|
|
|
|
|
|
|
The optimizer spends a good deal of its time worrying about the ordering
|
|
|
|
of the tuples returned by a path. The reason this is useful is that by
|
|
|
|
knowing the sort ordering of a path, we may be able to use that path as
|
|
|
|
the left or right input of a mergejoin and avoid an explicit sort step.
|
|
|
|
Nestloops and hash joins don't really care what the order of their inputs
|
|
|
|
is, but mergejoin needs suitably ordered inputs. Therefore, all paths
|
|
|
|
generated during the optimization process are marked with their sort order
|
|
|
|
(to the extent that it is known) for possible use by a higher-level merge.
|
|
|
|
|
|
|
|
It is also possible to avoid an explicit sort step to implement a user's
|
2000-02-07 05:41:04 +01:00
|
|
|
ORDER BY clause if the final path has the right ordering already, so the
|
Simplify query_planner's API by having it return the top-level RelOptInfo.
Formerly, query_planner returned one or possibly two Paths for the topmost
join relation, so that grouping_planner didn't see the join RelOptInfo
(at least not directly; it didn't have any hesitation about examining
cheapest_path->parent, though). However, correct selection of the Paths
involved a significant amount of coupling between query_planner and
grouping_planner, a problem which has gotten worse over time. It seems
best to give up on this API choice and instead return the topmost
RelOptInfo explicitly. Then grouping_planner can pull out the Paths it
wants from the rel's path list. In this way we can remove all knowledge
of grouping behaviors from query_planner.
The only real benefit of the old way is that in the case of an empty
FROM clause, we never made any RelOptInfos at all, just a Path. Now
we have to gin up a dummy RelOptInfo to represent the empty FROM clause.
That's not a very big deal though.
While at it, simplify query_planner's API a bit more by having the caller
set up root->tuple_fraction and root->limit_tuples, rather than passing
those values as separate parameters. Since query_planner no longer does
anything with either value, requiring it to fill the PlannerInfo fields
seemed pretty arbitrary.
This patch just rearranges code; it doesn't (intentionally) change any
behaviors. Followup patches will do more interesting things.
2013-08-05 21:00:57 +02:00
|
|
|
sort ordering is of interest even at the top level. grouping_planner() will
|
2000-02-07 05:41:04 +01:00
|
|
|
look for the cheapest path with a sort order matching the desired order,
|
Simplify query_planner's API by having it return the top-level RelOptInfo.
Formerly, query_planner returned one or possibly two Paths for the topmost
join relation, so that grouping_planner didn't see the join RelOptInfo
(at least not directly; it didn't have any hesitation about examining
cheapest_path->parent, though). However, correct selection of the Paths
involved a significant amount of coupling between query_planner and
grouping_planner, a problem which has gotten worse over time. It seems
best to give up on this API choice and instead return the topmost
RelOptInfo explicitly. Then grouping_planner can pull out the Paths it
wants from the rel's path list. In this way we can remove all knowledge
of grouping behaviors from query_planner.
The only real benefit of the old way is that in the case of an empty
FROM clause, we never made any RelOptInfos at all, just a Path. Now
we have to gin up a dummy RelOptInfo to represent the empty FROM clause.
That's not a very big deal though.
While at it, simplify query_planner's API a bit more by having the caller
set up root->tuple_fraction and root->limit_tuples, rather than passing
those values as separate parameters. Since query_planner no longer does
anything with either value, requiring it to fill the PlannerInfo fields
seemed pretty arbitrary.
This patch just rearranges code; it doesn't (intentionally) change any
behaviors. Followup patches will do more interesting things.
2013-08-05 21:00:57 +02:00
|
|
|
then compare its cost to the cost of using the cheapest-overall path and
|
|
|
|
doing an explicit sort on that.
|
1999-08-16 04:17:58 +02:00
|
|
|
|
|
|
|
When we are generating paths for a particular RelOptInfo, we discard a path
|
|
|
|
if it is more expensive than another known path that has the same or better
|
|
|
|
sort order. We will never discard a path that is the only known way to
|
2000-02-07 05:41:04 +01:00
|
|
|
achieve a given sort order (without an explicit sort, that is). In this
|
|
|
|
way, the next level up will have the maximum freedom to build mergejoins
|
|
|
|
without sorting, since it can pick from any of the paths retained for its
|
|
|
|
inputs.
|
1999-08-16 04:17:58 +02:00
|
|
|
|
2000-07-24 05:11:01 +02:00
|
|
|
|
2007-01-20 21:45:41 +01:00
|
|
|
EquivalenceClasses
|
|
|
|
------------------
|
|
|
|
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
During the deconstruct_jointree() scan of the query's qual clauses, we
|
|
|
|
look for mergejoinable equality clauses A = B. When we find one, we
|
|
|
|
create an EquivalenceClass containing the expressions A and B to record
|
|
|
|
that they are equal. If we later find another equivalence clause B = C,
|
2007-01-20 21:45:41 +01:00
|
|
|
we add C to the existing EquivalenceClass for {A B}; this may require
|
|
|
|
merging two existing EquivalenceClasses. At the end of the scan, we have
|
|
|
|
sets of values that are known all transitively equal to each other. We can
|
|
|
|
therefore use a comparison of any pair of the values as a restriction or
|
|
|
|
join clause (when these values are available at the scan or join, of
|
|
|
|
course); furthermore, we need test only one such comparison, not all of
|
|
|
|
them. Therefore, equivalence clauses are removed from the standard qual
|
|
|
|
distribution process. Instead, when preparing a restriction or join clause
|
|
|
|
list, we examine each EquivalenceClass to see if it can contribute a
|
|
|
|
clause, and if so we select an appropriate pair of values to compare. For
|
|
|
|
example, if we are trying to join A's relation to C's, we can generate the
|
|
|
|
clause A = C, even though this appeared nowhere explicitly in the original
|
|
|
|
query. This may allow us to explore join paths that otherwise would have
|
|
|
|
been rejected as requiring Cartesian-product joins.
|
|
|
|
|
|
|
|
Sometimes an EquivalenceClass may contain a pseudo-constant expression
|
|
|
|
(i.e., one not containing Vars or Aggs of the current query level, nor
|
|
|
|
volatile functions). In this case we do not follow the policy of
|
|
|
|
dynamically generating join clauses: instead, we dynamically generate
|
|
|
|
restriction clauses "var = const" wherever one of the variable members of
|
|
|
|
the class can first be computed. For example, if we have A = B and B = 42,
|
|
|
|
we effectively generate the restriction clauses A = 42 and B = 42, and then
|
|
|
|
we need not bother with explicitly testing the join clause A = B when the
|
|
|
|
relations are joined. In effect, all the class members can be tested at
|
|
|
|
relation-scan level and there's never a need for join tests.
|
|
|
|
|
|
|
|
The precise technical interpretation of an EquivalenceClass is that it
|
|
|
|
asserts that at any plan node where more than one of its member values
|
|
|
|
can be computed, output rows in which the values are not all equal may
|
|
|
|
be discarded without affecting the query result. (We require all levels
|
|
|
|
of the plan to enforce EquivalenceClasses, hence a join need not recheck
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
equality of values that were computable by one of its children.)
|
|
|
|
|
|
|
|
Outer joins complicate this picture quite a bit, however. While we could
|
|
|
|
theoretically use mergejoinable equality clauses that appear in outer-join
|
|
|
|
conditions as sources of EquivalenceClasses, there's a serious difficulty:
|
|
|
|
the resulting deductions are not valid everywhere. For example, given
|
|
|
|
|
|
|
|
SELECT * FROM a LEFT JOIN b ON (a.x = b.y AND a.x = 42);
|
|
|
|
|
|
|
|
we can safely derive b.y = 42 and use that in the scan of B, because B
|
|
|
|
rows not having b.y = 42 will not contribute to the join result. However,
|
|
|
|
we cannot apply a.x = 42 at the scan of A, or we will remove rows that
|
|
|
|
should appear in the join result. We could apply a.x = 42 as an outer join
|
|
|
|
condition (and then it would be unnecessary to also check a.x = b.y).
|
|
|
|
This is not yet implemented, however.
|
|
|
|
|
|
|
|
A related issue is that constants appearing below an outer join are
|
|
|
|
less constant than they appear. Ordinarily, if we find "A = 1" and
|
|
|
|
"B = 1", it's okay to put A and B into the same EquivalenceClass.
|
|
|
|
But consider
|
|
|
|
|
|
|
|
SELECT * FROM a
|
|
|
|
LEFT JOIN (SELECT * FROM b WHERE b.z = 1) b ON (a.x = b.y)
|
|
|
|
WHERE a.x = 1;
|
|
|
|
|
|
|
|
It would be a serious error to conclude that a.x = b.z, so we cannot
|
|
|
|
form a single EquivalenceClass {a.x b.z 1}.
|
|
|
|
|
|
|
|
This leads to considering EquivalenceClasses as applying within "join
|
|
|
|
domains", which are sets of relations that are inner-joined to each other.
|
|
|
|
(We can treat semijoins as if they were inner joins for this purpose.)
|
|
|
|
There is a top-level join domain, and then each outer join in the query
|
|
|
|
creates a new join domain comprising its nullable side. Full joins create
|
|
|
|
two join domains, one for each side. EquivalenceClasses generated from
|
|
|
|
WHERE are associated with the top-level join domain. EquivalenceClasses
|
|
|
|
generated from the ON clause of an outer join are associated with the
|
|
|
|
domain created by that outer join. EquivalenceClasses generated from the
|
|
|
|
ON clause of an inner or semi join are associated with the syntactically
|
|
|
|
most closely nested join domain.
|
|
|
|
|
|
|
|
Having defined these domains, we can fix the not-so-constant-constants
|
|
|
|
problem by considering that constants only match EquivalenceClass members
|
|
|
|
when they come from clauses within the same join domain. In the above
|
|
|
|
example, this means we keep {a.x 1} and {b.z 1} as separate
|
|
|
|
EquivalenceClasses and don't erroneously merge them. We don't have to
|
|
|
|
worry about this for Vars (or expressions containing Vars), because
|
|
|
|
references to the "same" column from different join domains will have
|
|
|
|
different varnullingrels and thus won't be equal() anyway.
|
|
|
|
|
|
|
|
In the future, the join-domain concept may allow us to treat mergejoinable
|
|
|
|
outer-join conditions as sources of EquivalenceClasses. The idea would be
|
|
|
|
that conditions derived from such classes could only be enforced at scans
|
|
|
|
or joins that are within the appropriate join domain. This is not
|
|
|
|
implemented yet, however, as the details are trickier than they appear.
|
|
|
|
|
|
|
|
Another instructive example is:
|
|
|
|
|
|
|
|
SELECT *
|
|
|
|
FROM a LEFT JOIN
|
|
|
|
(SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss
|
|
|
|
ON a.x = ss.y
|
|
|
|
ORDER BY ss.y;
|
|
|
|
|
|
|
|
We can form the EquivalenceClass {b.y c.z 10} and thereby apply c.z = 10
|
|
|
|
while scanning C, as well as b.y = 10 while scanning B, so that no clause
|
|
|
|
needs to be checked at the inner join. The left-join clause "a.x = ss.y"
|
|
|
|
(really "a.x = b.y") is not considered an equivalence clause, so we do
|
|
|
|
not insert a.x into that same EquivalenceClass; if we did, we'd falsely
|
|
|
|
conclude a.x = 10. In the future though we might be able to do that,
|
|
|
|
if we can keep from applying a.x = 10 at the scan of A, which in principle
|
|
|
|
we could do by noting that the EquivalenceClass only applies within the
|
|
|
|
{B,C} join domain.
|
|
|
|
|
|
|
|
Also notice that ss.y in the ORDER BY is really b.y* (that is, the
|
|
|
|
possibly-nulled form of b.y), so we will not confuse it with the b.y member
|
|
|
|
of the lower EquivalenceClass. Thus, we won't mistakenly conclude that
|
|
|
|
that ss.y is equal to a constant, which if true would lead us to think that
|
|
|
|
sorting for the ORDER BY is unnecessary (see discussion of PathKeys below).
|
|
|
|
Instead, there will be a separate EquivalenceClass containing only b.y*,
|
|
|
|
which will form the basis for the PathKey describing the required sort
|
|
|
|
order.
|
|
|
|
|
|
|
|
Also consider this variant:
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
SELECT *
|
|
|
|
FROM a LEFT JOIN
|
|
|
|
(SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss
|
|
|
|
ON a.x = ss.y
|
|
|
|
WHERE a.x = 42;
|
|
|
|
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
We still form the EquivalenceClass {b.y c.z 10}, and additionally
|
|
|
|
we have an EquivalenceClass {a.x 42} belonging to a different join domain.
|
|
|
|
We cannot use "a.x = b.y" to merge these classes. However, we can compare
|
|
|
|
that outer join clause to the existing EquivalenceClasses and form the
|
|
|
|
derived clause "b.y = 42", which we can treat as a valid equivalence
|
|
|
|
within the lower join domain (since no row of that domain not having
|
|
|
|
b.y = 42 can contribute to the outer-join result). That makes the lower
|
|
|
|
EquivalenceClass {42 b.y c.z 10}, resulting in the contradiction 10 = 42,
|
|
|
|
which lets the planner deduce that the B/C join need not be computed at
|
|
|
|
all: the result of that whole join domain can be forced to empty.
|
|
|
|
(This gets implemented as a gating Result filter, since more usually the
|
|
|
|
potential contradiction involves Param values rather than just Consts, and
|
|
|
|
thus it has to be checked at runtime. We can use the join domain to
|
|
|
|
determine the join level at which to place the gating condition.)
|
|
|
|
|
|
|
|
There is an additional complication when re-ordering outer joins according
|
|
|
|
to identity 3. Recall that the two choices we consider for such joins are
|
|
|
|
|
|
|
|
A leftjoin (B leftjoin C on (Pbc)) on (Pab)
|
|
|
|
(A leftjoin B on (Pab)) leftjoin C on (Pb*c)
|
|
|
|
|
|
|
|
where the star denotes varnullingrels markers on B's Vars. When Pbc
|
|
|
|
is (or includes) a mergejoinable clause, we have something like
|
|
|
|
|
|
|
|
A leftjoin (B leftjoin C on (b.b = c.c)) on (Pab)
|
|
|
|
(A leftjoin B on (Pab)) leftjoin C on (b.b* = c.c)
|
|
|
|
|
|
|
|
We could generate an EquivalenceClause linking b.b and c.c, but if we
|
|
|
|
then also try to link b.b* and c.c, we end with a nonsensical conclusion
|
|
|
|
that b.b and b.b* are equal (at least in some parts of the plan tree).
|
|
|
|
In any case, the conclusions we could derive from such a thing would be
|
|
|
|
largely duplicative. Conditions involving b.b* can't be computed below
|
|
|
|
this join nest, while any conditions that can be computed would be
|
|
|
|
duplicative of what we'd get from the b.b/c.c combination. Therefore,
|
|
|
|
we choose to generate an EquivalenceClause linking b.b and c.c, but
|
|
|
|
"b.b* = c.c" is handled as just an ordinary clause.
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
To aid in determining the sort ordering(s) that can work with a mergejoin,
|
|
|
|
we mark each mergejoinable clause with the EquivalenceClasses of its left
|
|
|
|
and right inputs. For an equivalence clause, these are of course the same
|
|
|
|
EquivalenceClass. For a non-equivalence mergejoinable clause (such as an
|
|
|
|
outer-join qualification), we generate two separate EquivalenceClasses for
|
|
|
|
the left and right inputs. This may result in creating single-item
|
|
|
|
equivalence "classes", though of course these are still subject to merging
|
|
|
|
if other equivalence clauses are later found to bear on the same
|
|
|
|
expressions.
|
|
|
|
|
|
|
|
Another way that we may form a single-item EquivalenceClass is in creation
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
of a PathKey to represent a desired sort order (see below). This happens
|
|
|
|
if an ORDER BY or GROUP BY key is not mentioned in any equivalence
|
|
|
|
clause. We need to reason about sort orders in such queries, and our
|
|
|
|
representation of sort ordering is a PathKey which depends on an
|
|
|
|
EquivalenceClass, so we have to make an EquivalenceClass. This is a bit
|
2007-01-20 21:45:41 +01:00
|
|
|
different from the above cases because such an EquivalenceClass might
|
|
|
|
contain an aggregate function or volatile expression. (A clause containing
|
|
|
|
a volatile function will never be considered mergejoinable, even if its top
|
|
|
|
operator is mergejoinable, so there is no way for a volatile expression to
|
|
|
|
get into EquivalenceClasses otherwise. Aggregates are disallowed in WHERE
|
|
|
|
altogether, so will never be found in a mergejoinable clause.) This is just
|
|
|
|
a convenience to maintain a uniform PathKey representation: such an
|
2009-09-29 03:20:34 +02:00
|
|
|
EquivalenceClass will never be merged with any other. Note in particular
|
|
|
|
that a single-item EquivalenceClass {a.x} is *not* meant to imply an
|
|
|
|
assertion that a.x = a.x; the practical effect of this is that a.x could
|
|
|
|
be NULL.
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
An EquivalenceClass also contains a list of btree opfamily OIDs, which
|
|
|
|
determines what the equalities it represents actually "mean". All the
|
|
|
|
equivalence clauses that contribute to an EquivalenceClass must have
|
|
|
|
equality operators that belong to the same set of opfamilies. (Note: most
|
|
|
|
of the time, a particular equality operator belongs to only one family, but
|
|
|
|
it's possible that it belongs to more than one. We keep track of all the
|
|
|
|
families to ensure that we can make use of an index belonging to any one of
|
|
|
|
the families for mergejoin purposes.)
|
|
|
|
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
For the same sort of reason, an EquivalenceClass is also associated
|
|
|
|
with a particular collation, if its datatype(s) care about collation.
|
|
|
|
|
Revisit handling of UNION ALL subqueries with non-Var output columns.
In commit 57664ed25e5dea117158a2e663c29e60b3546e1c I tried to fix a bug
reported by Teodor Sigaev by making non-simple-Var output columns distinct
(by wrapping their expressions with dummy PlaceHolderVar nodes). This did
not work too well. Commit b28ffd0fcc583c1811e5295279e7d4366c3cae6c fixed
some ensuing problems with matching to child indexes, but per a recent
report from Claus Stadler, constraint exclusion of UNION ALL subqueries was
still broken, because constant-simplification didn't handle the injected
PlaceHolderVars well either. On reflection, the original patch was quite
misguided: there is no reason to expect that EquivalenceClass child members
will be distinct. So instead of trying to make them so, we should ensure
that we can cope with the situation when they're not.
Accordingly, this patch reverts the code changes in the above-mentioned
commits (though the regression test cases they added stay). Instead, I've
added assorted defenses to make sure that duplicate EC child members don't
cause any problems. Teodor's original problem ("MergeAppend child's
targetlist doesn't match MergeAppend") is addressed more directly by
revising prepare_sort_from_pathkeys to let the parent MergeAppend's sort
list guide creation of each child's sort list.
In passing, get rid of add_sort_column; as far as I can tell, testing for
duplicate sort keys at this stage is dead code. Certainly it doesn't
trigger often enough to be worth expending cycles on in ordinary queries.
And keeping the test would've greatly complicated the new logic in
prepare_sort_from_pathkeys, because comparing pathkey list entries against
a previous output array requires that we not skip any entries in the list.
Back-patch to 9.1, like the previous patches. The only known issue in
this area that wasn't caused by the ill-advised previous patches was the
MergeAppend planning failure, which of course is not relevant before 9.1.
It's possible that we need some of the new defenses against duplicate child
EC entries in older branches, but until there's some clear evidence of that
I'm going to refrain from back-patching further.
2012-03-16 18:11:12 +01:00
|
|
|
An EquivalenceClass can contain "em_is_child" members, which are copies
|
|
|
|
of members that contain appendrel parent relation Vars, transposed to
|
|
|
|
contain the equivalent child-relation variables or expressions. These
|
|
|
|
members are *not* full-fledged members of the EquivalenceClass and do not
|
|
|
|
affect the class's overall properties at all. They are kept only to
|
|
|
|
simplify matching of child-relation expressions to EquivalenceClasses.
|
|
|
|
Most operations on EquivalenceClasses should ignore child members.
|
|
|
|
|
2007-01-20 21:45:41 +01:00
|
|
|
|
2000-07-24 05:11:01 +02:00
|
|
|
PathKeys
|
|
|
|
--------
|
|
|
|
|
|
|
|
The PathKeys data structure represents what is known about the sort order
|
2007-01-20 21:45:41 +01:00
|
|
|
of the tuples generated by a particular Path. A path's pathkeys field is a
|
|
|
|
list of PathKey nodes, where the n'th item represents the n'th sort key of
|
|
|
|
the result. Each PathKey contains these fields:
|
2000-07-24 05:11:01 +02:00
|
|
|
|
2007-01-20 21:45:41 +01:00
|
|
|
* a reference to an EquivalenceClass
|
|
|
|
* a btree opfamily OID (must match one of those in the EC)
|
|
|
|
* a sort direction (ascending or descending)
|
|
|
|
* a nulls-first-or-last flag
|
|
|
|
|
|
|
|
The EquivalenceClass represents the value being sorted on. Since the
|
|
|
|
various members of an EquivalenceClass are known equal according to the
|
|
|
|
opfamily, we can consider a path sorted by any one of them to be sorted by
|
|
|
|
any other too; this is what justifies referencing the whole
|
|
|
|
EquivalenceClass rather than just one member of it.
|
2000-07-24 05:11:01 +02:00
|
|
|
|
|
|
|
In single/base relation RelOptInfo's, the Paths represent various ways
|
|
|
|
of scanning the relation and the resulting ordering of the tuples.
|
|
|
|
Sequential scan Paths have NIL pathkeys, indicating no known ordering.
|
|
|
|
Index scans have Path.pathkeys that represent the chosen index's ordering,
|
2007-01-20 21:45:41 +01:00
|
|
|
if any. A single-key index would create a single-PathKey list, while a
|
2018-04-07 22:00:39 +02:00
|
|
|
multi-column index generates a list with one element per key index column.
|
|
|
|
Non-key columns specified in the INCLUDE clause of covering indexes don't
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
have corresponding PathKeys in the list, because they have no influence on
|
2018-04-07 22:00:39 +02:00
|
|
|
index ordering. (Actually, since an index can be scanned either forward or
|
|
|
|
backward, there are two possible sort orders and two possible PathKey lists
|
|
|
|
it can generate.)
|
2007-01-20 21:45:41 +01:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
Note that a bitmap scan has NIL pathkeys since we can say nothing about
|
|
|
|
the overall order of its result. Also, an indexscan on an unordered type
|
|
|
|
of index generates NIL pathkeys. However, we can always create a pathkey
|
|
|
|
by doing an explicit sort. The pathkeys for a Sort plan's output just
|
|
|
|
represent the sort key fields and the ordering operators used.
|
2000-07-24 05:11:01 +02:00
|
|
|
|
|
|
|
Things get more interesting when we consider joins. Suppose we do a
|
|
|
|
mergejoin between A and B using the mergeclause A.X = B.Y. The output
|
2007-01-20 21:45:41 +01:00
|
|
|
of the mergejoin is sorted by X --- but it is also sorted by Y. Again,
|
|
|
|
this can be represented by a PathKey referencing an EquivalenceClass
|
|
|
|
containing both X and Y.
|
|
|
|
|
|
|
|
With a little further thought, it becomes apparent that nestloop joins
|
|
|
|
can also produce sorted output. For example, if we do a nestloop join
|
|
|
|
between outer relation A and inner relation B, then any pathkeys relevant
|
|
|
|
to A are still valid for the join result: we have not altered the order of
|
|
|
|
the tuples from A. Even more interesting, if there was an equivalence clause
|
|
|
|
A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert
|
|
|
|
that B.Y is a pathkey for the join result; X was ordered before and still
|
|
|
|
is, and the joined values of Y are equal to the joined values of X, so Y
|
2000-07-24 05:11:01 +02:00
|
|
|
must now be ordered too. This is true even though we used neither an
|
2007-01-20 21:45:41 +01:00
|
|
|
explicit sort nor a mergejoin on Y. (Note: hash joins cannot be counted
|
|
|
|
on to preserve the order of their outer relation, because the executor
|
|
|
|
might decide to "batch" the join, so we always set pathkeys to NIL for
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
a hashjoin path.)
|
|
|
|
|
|
|
|
An outer join doesn't preserve the ordering of its nullable input
|
|
|
|
relation(s), because it might insert nulls at random points in the
|
|
|
|
ordering. We don't need to think about this explicitly in the PathKey
|
|
|
|
representation, because a PathKey representing a post-join variable
|
|
|
|
will contain varnullingrel bits, making it not equal to a PathKey
|
|
|
|
representing the pre-join value.
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
In general, we can justify using EquivalenceClasses as the basis for
|
|
|
|
pathkeys because, whenever we scan a relation containing multiple
|
|
|
|
EquivalenceClass members or join two relations each containing
|
|
|
|
EquivalenceClass members, we apply restriction or join clauses derived from
|
|
|
|
the EquivalenceClass. This guarantees that any two values listed in the
|
|
|
|
EquivalenceClass are in fact equal in all tuples emitted by the scan or
|
|
|
|
join, and therefore that if the tuples are sorted by one of the values,
|
|
|
|
they can be considered sorted by any other as well. It does not matter
|
|
|
|
whether the test clause is used as a mergeclause, or merely enforced
|
|
|
|
after-the-fact as a qpqual filter.
|
|
|
|
|
|
|
|
Note that there is no particular difficulty in labeling a path's sort
|
|
|
|
order with a PathKey referencing an EquivalenceClass that contains
|
|
|
|
variables not yet joined into the path's output. We can simply ignore
|
|
|
|
such entries as not being relevant (yet). This makes it possible to
|
|
|
|
use the same EquivalenceClasses throughout the join planning process.
|
|
|
|
In fact, by being careful not to generate multiple identical PathKey
|
|
|
|
objects, we can reduce comparison of EquivalenceClasses and PathKeys
|
|
|
|
to simple pointer comparison, which is a huge savings because add_path
|
|
|
|
has to make a large number of PathKey comparisons in deciding whether
|
|
|
|
competing Paths are equivalently sorted.
|
2000-07-24 05:11:01 +02:00
|
|
|
|
|
|
|
Pathkeys are also useful to represent an ordering that we wish to achieve,
|
|
|
|
since they are easily compared to the pathkeys of a potential candidate
|
2008-08-02 23:32:01 +02:00
|
|
|
path. So, SortGroupClause lists are turned into pathkeys lists for use
|
|
|
|
inside the optimizer.
|
2000-07-24 05:11:01 +02:00
|
|
|
|
2000-12-14 23:30:45 +01:00
|
|
|
An additional refinement we can make is to insist that canonical pathkey
|
2007-01-20 21:45:41 +01:00
|
|
|
lists (sort orderings) do not mention the same EquivalenceClass more than
|
|
|
|
once. For example, in all these cases the second sort column is redundant,
|
|
|
|
because it cannot distinguish values that are the same according to the
|
|
|
|
first sort column:
|
|
|
|
SELECT ... ORDER BY x, x
|
|
|
|
SELECT ... ORDER BY x, x DESC
|
|
|
|
SELECT ... WHERE x = y ORDER BY x, y
|
|
|
|
Although a user probably wouldn't write "ORDER BY x,x" directly, such
|
|
|
|
redundancies are more probable once equivalence classes have been
|
|
|
|
considered. Also, the system may generate redundant pathkey lists when
|
|
|
|
computing the sort ordering needed for a mergejoin. By eliminating the
|
|
|
|
redundancy, we save time and improve planning, since the planner will more
|
|
|
|
easily recognize equivalent orderings as being equivalent.
|
|
|
|
|
|
|
|
Another interesting property is that if the underlying EquivalenceClass
|
Make Vars be outer-join-aware.
Traditionally we used the same Var struct to represent the value
of a table column everywhere in parse and plan trees. This choice
predates our support for SQL outer joins, and it's really a pretty
bad idea with outer joins, because the Var's value can depend on
where it is in the tree: it might go to NULL above an outer join.
So expression nodes that are equal() per equalfuncs.c might not
represent the same value, which is a huge correctness hazard for
the planner.
To improve this, decorate Var nodes with a bitmapset showing
which outer joins (identified by RTE indexes) may have nulled
them at the point in the parse tree where the Var appears.
This allows us to trust that equal() Vars represent the same value.
A certain amount of klugery is still needed to cope with cases
where we re-order two outer joins, but it's possible to make it
work without sacrificing that core principle. PlaceHolderVars
receive similar decoration for the same reason.
In the planner, we include these outer join bitmapsets into the relids
that an expression is considered to depend on, and in consequence also
add outer-join relids to the relids of join RelOptInfos. This allows
us to correctly perceive whether an expression can be calculated above
or below a particular outer join.
This change affects FDWs that want to plan foreign joins. They *must*
follow suit when labeling foreign joins in order to match with the
core planner, but for many purposes (if postgres_fdw is any guide)
they'd prefer to consider only base relations within the join.
To support both requirements, redefine ForeignScan.fs_relids as
base+OJ relids, and add a new field fs_base_relids that's set up by
the core planner.
Large though it is, this commit just does the minimum necessary to
install the new mechanisms and get check-world passing again.
Follow-up patches will perform some cleanup. (The README additions
and comments mention some stuff that will appear in the follow-up.)
Patch by me; thanks to Richard Guo for review.
Discussion: https://postgr.es/m/830269.1656693747@sss.pgh.pa.us
2023-01-30 19:16:20 +01:00
|
|
|
contains a constant, then the pathkey is completely redundant and need not
|
|
|
|
be sorted by at all! Every interesting row must contain the same value,
|
|
|
|
so there's no need to sort. This might seem pointless because users
|
2007-01-20 21:45:41 +01:00
|
|
|
are unlikely to write "... WHERE x = 42 ORDER BY x", but it allows us to
|
|
|
|
recognize when particular index columns are irrelevant to the sort order:
|
|
|
|
if we have "... WHERE x = 42 ORDER BY y", scanning an index on (x,y)
|
|
|
|
produces correctly ordered data without a sort step. We used to have very
|
|
|
|
ugly ad-hoc code to recognize that in limited contexts, but discarding
|
|
|
|
constant ECs from pathkeys makes it happen cleanly and automatically.
|
|
|
|
|
|
|
|
|
2010-10-29 17:52:16 +02:00
|
|
|
Order of processing for EquivalenceClasses and PathKeys
|
|
|
|
-------------------------------------------------------
|
|
|
|
|
|
|
|
As alluded to above, there is a specific sequence of phases in the
|
|
|
|
processing of EquivalenceClasses and PathKeys during planning. During the
|
|
|
|
initial scanning of the query's quals (deconstruct_jointree followed by
|
|
|
|
reconsider_outer_join_clauses), we construct EquivalenceClasses based on
|
|
|
|
mergejoinable clauses found in the quals. At the end of this process,
|
|
|
|
we know all we can know about equivalence of different variables, so
|
|
|
|
subsequently there will be no further merging of EquivalenceClasses.
|
|
|
|
At that point it is possible to consider the EquivalenceClasses as
|
Postpone creation of pathkeys lists to fix bug #8049.
This patch gets rid of the concept of, and infrastructure for,
non-canonical PathKeys; we now only ever create canonical pathkey lists.
The need for non-canonical pathkeys came from the desire to have
grouping_planner initialize query_pathkeys and related pathkey lists before
calling query_planner. However, since query_planner didn't actually *do*
anything with those lists before they'd been made canonical, we can get rid
of the whole mess by just not creating the lists at all until the point
where we formerly canonicalized them.
There are several ways in which we could implement that without making
query_planner itself deal with grouping/sorting features (which are
supposed to be the province of grouping_planner). I chose to add a
callback function to query_planner's API; other alternatives would have
required adding more fields to PlannerInfo, which while not bad in itself
would create an ABI break for planner-related plugins in the 9.2 release
series. This still breaks ABI for anything that calls query_planner
directly, but it seems somewhat unlikely that there are any such plugins.
I had originally conceived of this change as merely a step on the way to
fixing bug #8049 from Teun Hoogendoorn; but it turns out that this fixes
that bug all by itself, as per the added regression test. The reason is
that now get_eclass_for_sort_expr is adding the ORDER BY expression at the
end of EquivalenceClass creation not the start, and so anything that is in
a multi-member EquivalenceClass has already been created with correct
em_nullable_relids. I am suspicious that there are related scenarios in
which we still need to teach get_eclass_for_sort_expr to compute correct
nullable_relids, but am not eager to risk destabilizing either 9.2 or 9.3
to fix bugs that are only hypothetical. So for the moment, do this and
stop here.
Back-patch to 9.2 but not to earlier branches, since they don't exhibit
this bug for lack of join-clause-movement logic that depends on
em_nullable_relids being correct. (We might have to revisit that choice
if any related bugs turn up.) In 9.2, don't change the signature of
make_pathkeys_for_sortclauses nor remove canonicalize_pathkeys, so as
not to risk more plugin breakage than we have to.
2013-04-29 20:49:01 +02:00
|
|
|
"canonical" and build canonical PathKeys that reference them. At this
|
|
|
|
time we construct PathKeys for the query's ORDER BY and related clauses.
|
|
|
|
(Any ordering expressions that do not appear elsewhere will result in
|
|
|
|
the creation of new EquivalenceClasses, but this cannot result in merging
|
|
|
|
existing classes, so canonical-ness is not lost.)
|
2010-10-29 17:52:16 +02:00
|
|
|
|
|
|
|
Because all the EquivalenceClasses are known before we begin path
|
|
|
|
generation, we can use them as a guide to which indexes are of interest:
|
|
|
|
if an index's column is not mentioned in any EquivalenceClass then that
|
|
|
|
index's sort order cannot possibly be helpful for the query. This allows
|
|
|
|
short-circuiting of much of the processing of create_index_paths() for
|
|
|
|
irrelevant indexes.
|
|
|
|
|
|
|
|
There are some cases where planner.c constructs additional
|
|
|
|
EquivalenceClasses and PathKeys after query_planner has completed.
|
|
|
|
In these cases, the extra ECs/PKs are needed to represent sort orders
|
|
|
|
that were not considered during query_planner. Such situations should be
|
|
|
|
minimized since it is impossible for query_planner to return a plan
|
2015-05-20 15:18:11 +02:00
|
|
|
producing such a sort order, meaning an explicit sort will always be needed.
|
2010-10-29 17:52:16 +02:00
|
|
|
Currently this happens only for queries involving multiple window functions
|
|
|
|
with different orderings, for which extra sorts are needed anyway.
|
2000-12-14 23:30:45 +01:00
|
|
|
|
2002-08-26 00:39:37 +02:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
Parameterized Paths
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
The naive way to join two relations using a clause like WHERE A.X = B.Y
|
|
|
|
is to generate a nestloop plan like this:
|
|
|
|
|
|
|
|
NestLoop
|
|
|
|
Filter: A.X = B.Y
|
|
|
|
-> Seq Scan on A
|
|
|
|
-> Seq Scan on B
|
|
|
|
|
|
|
|
We can make this better by using a merge or hash join, but it still
|
|
|
|
requires scanning all of both input relations. If A is very small and B is
|
|
|
|
very large, but there is an index on B.Y, it can be enormously better to do
|
|
|
|
something like this:
|
|
|
|
|
|
|
|
NestLoop
|
|
|
|
-> Seq Scan on A
|
|
|
|
-> Index Scan using B_Y_IDX on B
|
|
|
|
Index Condition: B.Y = A.X
|
|
|
|
|
|
|
|
Here, we are expecting that for each row scanned from A, the nestloop
|
|
|
|
plan node will pass down the current value of A.X into the scan of B.
|
|
|
|
That allows the indexscan to treat A.X as a constant for any one
|
|
|
|
invocation, and thereby use it as an index key. This is the only plan type
|
|
|
|
that can avoid fetching all of B, and for small numbers of rows coming from
|
|
|
|
A, that will dominate every other consideration. (As A gets larger, this
|
|
|
|
gets less attractive, and eventually a merge or hash join will win instead.
|
|
|
|
So we have to cost out all the alternatives to decide what to do.)
|
|
|
|
|
|
|
|
It can be useful for the parameter value to be passed down through
|
|
|
|
intermediate layers of joins, for example:
|
|
|
|
|
|
|
|
NestLoop
|
|
|
|
-> Seq Scan on A
|
|
|
|
Hash Join
|
|
|
|
Join Condition: B.Y = C.W
|
|
|
|
-> Seq Scan on B
|
|
|
|
-> Index Scan using C_Z_IDX on C
|
|
|
|
Index Condition: C.Z = A.X
|
|
|
|
|
2015-02-28 18:43:04 +01:00
|
|
|
If all joins are plain inner joins then this is usually unnecessary,
|
|
|
|
because it's possible to reorder the joins so that a parameter is used
|
2012-01-28 01:26:38 +01:00
|
|
|
immediately below the nestloop node that provides it. But in the
|
2015-02-28 18:43:04 +01:00
|
|
|
presence of outer joins, such join reordering may not be possible.
|
|
|
|
|
|
|
|
Also, the bottom-level scan might require parameters from more than one
|
|
|
|
other relation. In principle we could join the other relations first
|
|
|
|
so that all the parameters are supplied from a single nestloop level.
|
|
|
|
But if those other relations have no join clause in common (which is
|
|
|
|
common in star-schema queries for instance), the planner won't consider
|
|
|
|
joining them directly to each other. In such a case we need to be able
|
|
|
|
to create a plan like
|
|
|
|
|
|
|
|
NestLoop
|
|
|
|
-> Seq Scan on SmallTable1 A
|
|
|
|
NestLoop
|
|
|
|
-> Seq Scan on SmallTable2 B
|
2017-01-23 15:38:36 +01:00
|
|
|
-> Index Scan using XYIndex on LargeTable C
|
|
|
|
Index Condition: C.X = A.AID and C.Y = B.BID
|
2015-02-28 18:43:04 +01:00
|
|
|
|
|
|
|
so we should be willing to pass down A.AID through a join even though
|
|
|
|
there is no join order constraint forcing the plan to look like this.
|
|
|
|
|
|
|
|
Before version 9.2, Postgres used ad-hoc methods for planning and
|
|
|
|
executing nestloop queries of this kind, and those methods could not
|
|
|
|
handle passing parameters down through multiple join levels.
|
2012-01-28 01:26:38 +01:00
|
|
|
|
|
|
|
To plan such queries, we now use a notion of a "parameterized path",
|
|
|
|
which is a path that makes use of a join clause to a relation that's not
|
2015-02-28 18:43:04 +01:00
|
|
|
scanned by the path. In the example two above, we would construct a
|
2012-01-28 01:26:38 +01:00
|
|
|
path representing the possibility of doing this:
|
|
|
|
|
|
|
|
-> Index Scan using C_Z_IDX on C
|
|
|
|
Index Condition: C.Z = A.X
|
|
|
|
|
|
|
|
This path will be marked as being parameterized by relation A. (Note that
|
|
|
|
this is only one of the possible access paths for C; we'd still have a
|
|
|
|
plain unparameterized seqscan, and perhaps other possibilities.) The
|
|
|
|
parameterization marker does not prevent joining the path to B, so one of
|
|
|
|
the paths generated for the joinrel {B C} will represent
|
|
|
|
|
|
|
|
Hash Join
|
|
|
|
Join Condition: B.Y = C.W
|
|
|
|
-> Seq Scan on B
|
|
|
|
-> Index Scan using C_Z_IDX on C
|
|
|
|
Index Condition: C.Z = A.X
|
|
|
|
|
|
|
|
This path is still marked as being parameterized by A. When we attempt to
|
|
|
|
join {B C} to A to form the complete join tree, such a path can only be
|
|
|
|
used as the inner side of a nestloop join: it will be ignored for other
|
|
|
|
possible join types. So we will form a join path representing the query
|
|
|
|
plan shown above, and it will compete in the usual way with paths built
|
|
|
|
from non-parameterized scans.
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
While all ordinary paths for a particular relation generate the same set
|
|
|
|
of rows (since they must all apply the same set of restriction clauses),
|
|
|
|
parameterized paths typically generate fewer rows than less-parameterized
|
|
|
|
paths, since they have additional clauses to work with. This means we
|
|
|
|
must consider the number of rows generated as an additional figure of
|
|
|
|
merit. A path that costs more than another, but generates fewer rows,
|
|
|
|
must be kept since the smaller number of rows might save work at some
|
|
|
|
intermediate join level. (It would not save anything if joined
|
|
|
|
immediately to the source of the parameters.)
|
|
|
|
|
|
|
|
To keep cost estimation rules relatively simple, we make an implementation
|
|
|
|
restriction that all paths for a given relation of the same parameterization
|
|
|
|
(i.e., the same set of outer relations supplying parameters) must have the
|
|
|
|
same rowcount estimate. This is justified by insisting that each such path
|
|
|
|
apply *all* join clauses that are available with the named outer relations.
|
|
|
|
Different paths might, for instance, choose different join clauses to use
|
|
|
|
as index clauses; but they must then apply any other join clauses available
|
|
|
|
from the same outer relations as filter conditions, so that the set of rows
|
|
|
|
returned is held constant. This restriction doesn't degrade the quality of
|
|
|
|
the finished plan: it amounts to saying that we should always push down
|
|
|
|
movable join clauses to the lowest possible evaluation level, which is a
|
|
|
|
good thing anyway. The restriction is useful in particular to support
|
|
|
|
pre-filtering of join paths in add_path_precheck. Without this rule we
|
|
|
|
could never reject a parameterized path in advance of computing its rowcount
|
|
|
|
estimate, which would greatly reduce the value of the pre-filter mechanism.
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
To limit planning time, we have to avoid generating an unreasonably large
|
|
|
|
number of parameterized paths. We do this by only generating parameterized
|
|
|
|
relation scan paths for index scans, and then only for indexes for which
|
|
|
|
suitable join clauses are available. There are also heuristics in join
|
|
|
|
planning that try to limit the number of parameterized paths considered.
|
|
|
|
|
|
|
|
In particular, there's been a deliberate policy decision to favor hash
|
|
|
|
joins over merge joins for parameterized join steps (those occurring below
|
|
|
|
a nestloop that provides parameters to the lower join's inputs). While we
|
|
|
|
do not ignore merge joins entirely, joinpath.c does not fully explore the
|
|
|
|
space of potential merge joins with parameterized inputs. Also, add_path
|
|
|
|
treats parameterized paths as having no pathkeys, so that they compete
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
only on cost and rowcount; they don't get preference for producing a
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
special sort order. This creates additional bias against merge joins,
|
|
|
|
since we might discard a path that could have been useful for performing
|
|
|
|
a merge without an explicit sort step. Since a parameterized path must
|
|
|
|
ultimately be used on the inside of a nestloop, where its sort order is
|
|
|
|
uninteresting, these choices do not affect any requirement for the final
|
|
|
|
output order of a query --- they only make it harder to use a merge join
|
|
|
|
at a lower level. The savings in planning work justifies that.
|
2012-01-28 01:26:38 +01:00
|
|
|
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
Similarly, parameterized paths do not normally get preference in add_path
|
|
|
|
for having cheap startup cost; that's seldom of much value when on the
|
|
|
|
inside of a nestloop, so it seems not worth keeping extra paths solely for
|
|
|
|
that. An exception occurs for parameterized paths for the RHS relation of
|
|
|
|
a SEMI or ANTI join: in those cases, we can stop the inner scan after the
|
|
|
|
first match, so it's primarily startup not total cost that we care about.
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
|
Adjust definition of cheapest_total_path to work better with LATERAL.
In the initial cut at LATERAL, I kept the rule that cheapest_total_path
was always unparameterized, which meant it had to be NULL if the relation
has no unparameterized paths. It turns out to work much more nicely if
we always have *some* path nominated as cheapest-total for each relation.
In particular, let's still say it's the cheapest unparameterized path if
there is one; if not, take the cheapest-total-cost path among those of
the minimum available parameterization. (The first rule is actually
a special case of the second.)
This allows reversion of some temporary lobotomizations I'd put in place.
In particular, the planner can now consider hash and merge joins for
joins below a parameter-supplying nestloop, even if there aren't any
unparameterized paths available. This should bring planning of
LATERAL-containing queries to the same level as queries not using that
feature.
Along the way, simplify management of parameterized paths in add_path()
and friends. In the original coding for parameterized paths in 9.2,
I tried to minimize the logic changes in add_path(), so it just treated
parameterization as yet another dimension of comparison for paths.
We later made it ignore pathkeys (sort ordering) of parameterized paths,
on the grounds that ordering isn't a useful property for the path on the
inside of a nestloop, so we might as well get rid of useless parameterized
paths as quickly as possible. But we didn't take that reasoning as far as
we should have. Startup cost isn't a useful property inside a nestloop
either, so add_path() ought to discount startup cost of parameterized paths
as well. Having done that, the secondary sorting I'd implemented (in
add_parameterized_path) is no longer needed --- any parameterized path that
survives add_path() at all is worth considering at higher levels. So this
should be a bit faster as well as simpler.
2012-08-30 04:05:27 +02:00
|
|
|
LATERAL subqueries
|
|
|
|
------------------
|
|
|
|
|
|
|
|
As of 9.3 we support SQL-standard LATERAL references from subqueries in
|
|
|
|
FROM (and also functions in FROM). The planner implements these by
|
|
|
|
generating parameterized paths for any RTE that contains lateral
|
|
|
|
references. In such cases, *all* paths for that relation will be
|
|
|
|
parameterized by at least the set of relations used in its lateral
|
|
|
|
references. (And in turn, join relations including such a subquery might
|
|
|
|
not have any unparameterized paths.) All the other comments made above for
|
|
|
|
parameterized paths still apply, though; in particular, each such path is
|
|
|
|
still expected to enforce any join clauses that can be pushed down to it,
|
|
|
|
so that all paths of the same parameterization have the same rowcount.
|
|
|
|
|
2013-08-18 02:22:37 +02:00
|
|
|
We also allow LATERAL subqueries to be flattened (pulled up into the parent
|
2013-08-19 19:19:25 +02:00
|
|
|
query) by the optimizer, but only when this does not introduce lateral
|
|
|
|
references into JOIN/ON quals that would refer to relations outside the
|
|
|
|
lowest outer join at/above that qual. The semantics of such a qual would
|
|
|
|
be unclear. Note that even with this restriction, pullup of a LATERAL
|
2013-08-18 02:22:37 +02:00
|
|
|
subquery can result in creating PlaceHolderVars that contain lateral
|
|
|
|
references to relations outside their syntactic scope. We still evaluate
|
|
|
|
such PHVs at their syntactic location or lower, but the presence of such a
|
|
|
|
PHV in the quals or targetlist of a plan node requires that node to appear
|
|
|
|
on the inside of a nestloop join relative to the rel(s) supplying the
|
|
|
|
lateral reference. (Perhaps now that that stuff works, we could relax the
|
|
|
|
pullup restriction?)
|
|
|
|
|
Adjust definition of cheapest_total_path to work better with LATERAL.
In the initial cut at LATERAL, I kept the rule that cheapest_total_path
was always unparameterized, which meant it had to be NULL if the relation
has no unparameterized paths. It turns out to work much more nicely if
we always have *some* path nominated as cheapest-total for each relation.
In particular, let's still say it's the cheapest unparameterized path if
there is one; if not, take the cheapest-total-cost path among those of
the minimum available parameterization. (The first rule is actually
a special case of the second.)
This allows reversion of some temporary lobotomizations I'd put in place.
In particular, the planner can now consider hash and merge joins for
joins below a parameter-supplying nestloop, even if there aren't any
unparameterized paths available. This should bring planning of
LATERAL-containing queries to the same level as queries not using that
feature.
Along the way, simplify management of parameterized paths in add_path()
and friends. In the original coding for parameterized paths in 9.2,
I tried to minimize the logic changes in add_path(), so it just treated
parameterization as yet another dimension of comparison for paths.
We later made it ignore pathkeys (sort ordering) of parameterized paths,
on the grounds that ordering isn't a useful property for the path on the
inside of a nestloop, so we might as well get rid of useless parameterized
paths as quickly as possible. But we didn't take that reasoning as far as
we should have. Startup cost isn't a useful property inside a nestloop
either, so add_path() ought to discount startup cost of parameterized paths
as well. Having done that, the secondary sorting I'd implemented (in
add_parameterized_path) is no longer needed --- any parameterized path that
survives add_path() at all is worth considering at higher levels. So this
should be a bit faster as well as simpler.
2012-08-30 04:05:27 +02:00
|
|
|
|
Improve RLS planning by marking individual quals with security levels.
In an RLS query, we must ensure that security filter quals are evaluated
before ordinary query quals, in case the latter contain "leaky" functions
that could expose the contents of sensitive rows. The original
implementation of RLS planning ensured this by pushing the scan of a
secured table into a sub-query that it marked as a security-barrier view.
Unfortunately this results in very inefficient plans in many cases, because
the sub-query cannot be flattened and gets planned independently of the
rest of the query.
To fix, drop the use of sub-queries to enforce RLS qual order, and instead
mark each qual (RestrictInfo) with a security_level field establishing its
priority for evaluation. Quals must be evaluated in security_level order,
except that "leakproof" quals can be allowed to go ahead of quals of lower
security_level, if it's helpful to do so. This has to be enforced within
the ordering of any one list of quals to be evaluated at a table scan node,
and we also have to ensure that quals are not chosen for early evaluation
(i.e., use as an index qual or TID scan qual) if they're not allowed to go
ahead of other quals at the scan node.
This is sufficient to fix the problem for RLS quals, since we only support
RLS policies on simple tables and thus RLS quals will always exist at the
table scan level only. Eventually these qual ordering rules should be
enforced for join quals as well, which would permit improving planning for
explicit security-barrier views; but that's a task for another patch.
Note that FDWs would need to be aware of these rules --- and not, for
example, send an insecure qual for remote execution --- but since we do
not yet allow RLS policies on foreign tables, the case doesn't arise.
This will need to be addressed before we can allow such policies.
Patch by me, reviewed by Stephen Frost and Dean Rasheed.
Discussion: https://postgr.es/m/8185.1477432701@sss.pgh.pa.us
2017-01-18 18:58:20 +01:00
|
|
|
Security-level constraints on qual clauses
|
|
|
|
------------------------------------------
|
|
|
|
|
|
|
|
To support row-level security and security-barrier views efficiently,
|
|
|
|
we mark qual clauses (RestrictInfo nodes) with a "security_level" field.
|
|
|
|
The basic concept is that a qual with a lower security_level must be
|
|
|
|
evaluated before one with a higher security_level. This ensures that
|
|
|
|
"leaky" quals that might expose sensitive data are not evaluated until
|
|
|
|
after the security barrier quals that are supposed to filter out
|
|
|
|
security-sensitive rows. However, many qual conditions are "leakproof",
|
|
|
|
that is we trust the functions they use to not expose data. To avoid
|
|
|
|
unnecessarily inefficient plans, a leakproof qual is not delayed by
|
|
|
|
security-level considerations, even if it has a higher syntactic
|
|
|
|
security_level than another qual.
|
|
|
|
|
|
|
|
In a query that contains no use of RLS or security-barrier views, all
|
|
|
|
quals will have security_level zero, so that none of these restrictions
|
|
|
|
kick in; we don't even need to check leakproofness of qual conditions.
|
|
|
|
|
|
|
|
If there are security-barrier quals, they get security_level zero (and
|
|
|
|
possibly higher, if there are multiple layers of barriers). Regular quals
|
|
|
|
coming from the query text get a security_level one more than the highest
|
|
|
|
level used for barrier quals.
|
|
|
|
|
|
|
|
When new qual clauses are generated by EquivalenceClass processing,
|
|
|
|
they must be assigned a security_level. This is trickier than it seems.
|
|
|
|
One's first instinct is that it would be safe to use the largest level
|
|
|
|
found among the source quals for the EquivalenceClass, but that isn't
|
|
|
|
safe at all, because it allows unwanted delays of security-barrier quals.
|
|
|
|
Consider a barrier qual "t.x = t.y" plus a query qual "t.x = constant",
|
|
|
|
and suppose there is another query qual "leaky_function(t.z)" that
|
|
|
|
we mustn't evaluate before the barrier qual has been checked.
|
|
|
|
We will have an EC {t.x, t.y, constant} which will lead us to replace
|
|
|
|
the EC quals with "t.x = constant AND t.y = constant". (We do not want
|
|
|
|
to give up that behavior, either, since the latter condition could allow
|
|
|
|
use of an index on t.y, which we would never discover from the original
|
|
|
|
quals.) If these generated quals are assigned the same security_level as
|
|
|
|
the query quals, then it's possible for the leaky_function qual to be
|
|
|
|
evaluated first, allowing leaky_function to see data from rows that
|
|
|
|
possibly don't pass the barrier condition.
|
|
|
|
|
|
|
|
Instead, our handling of security levels with ECs works like this:
|
|
|
|
* Quals are not accepted as source clauses for ECs in the first place
|
|
|
|
unless they are leakproof or have security_level zero.
|
|
|
|
* EC-derived quals are assigned the minimum (not maximum) security_level
|
|
|
|
found among the EC's source clauses.
|
|
|
|
* If the maximum security_level found among the EC's source clauses is
|
|
|
|
above zero, then the equality operators selected for derived quals must
|
|
|
|
be leakproof. When no such operator can be found, the EC is treated as
|
|
|
|
"broken" and we fall back to emitting its source clauses without any
|
|
|
|
additional derived quals.
|
|
|
|
|
|
|
|
These rules together ensure that an untrusted qual clause (one with
|
|
|
|
security_level above zero) cannot cause an EC to generate a leaky derived
|
|
|
|
clause. This makes it safe to use the minimum not maximum security_level
|
|
|
|
for derived clauses. The rules could result in poor plans due to not
|
|
|
|
being able to generate derived clauses at all, but the risk of that is
|
|
|
|
small in practice because most btree equality operators are leakproof.
|
|
|
|
Also, by making exceptions for level-zero quals, we ensure that there is
|
|
|
|
no plan degradation when no barrier quals are present.
|
|
|
|
|
|
|
|
Once we have security levels assigned to all clauses, enforcement
|
|
|
|
of barrier-qual ordering restrictions boils down to two rules:
|
|
|
|
|
|
|
|
* Table scan plan nodes must not select quals for early execution
|
|
|
|
(for example, use them as index qualifiers in an indexscan) unless
|
|
|
|
they are leakproof or have security_level no higher than any other
|
|
|
|
qual that is due to be executed at the same plan node. (Use the
|
|
|
|
utility function restriction_is_securely_promotable() to check
|
|
|
|
whether it's okay to select a qual for early execution.)
|
|
|
|
|
|
|
|
* Normal execution of a list of quals must execute them in an order
|
|
|
|
that satisfies the same security rule, ie higher security_levels must
|
|
|
|
be evaluated later unless leakproof. (This is handled in a single place
|
|
|
|
by order_qual_clauses() in createplan.c.)
|
|
|
|
|
|
|
|
order_qual_clauses() uses a heuristic to decide exactly what to do with
|
|
|
|
leakproof clauses. Normally it sorts clauses by security_level then cost,
|
|
|
|
being careful that the sort is stable so that we don't reorder clauses
|
|
|
|
without a clear reason. But this could result in a very expensive qual
|
|
|
|
being done before a cheaper one that is of higher security_level.
|
|
|
|
If the cheaper qual is leaky we have no choice, but if it is leakproof
|
|
|
|
we could put it first. We choose to sort leakproof quals as if they
|
|
|
|
have security_level zero, but only when their cost is less than 10X
|
|
|
|
cpu_operator_cost; that restriction alleviates the opposite problem of
|
|
|
|
doing expensive quals first just because they're leakproof.
|
|
|
|
|
|
|
|
Additional rules will be needed to support safe handling of join quals
|
|
|
|
when there is a mix of security levels among join quals; for example, it
|
|
|
|
will be necessary to prevent leaky higher-security-level quals from being
|
|
|
|
evaluated at a lower join level than other quals of lower security level.
|
|
|
|
Currently there is no need to consider that since security-prioritized
|
|
|
|
quals can only be single-table restriction quals coming from RLS policies
|
|
|
|
or security-barrier views, and security-barrier view subqueries are never
|
|
|
|
flattened into the parent query. Hence enforcement of security-prioritized
|
|
|
|
quals only happens at the table scan level. With extra rules for safe
|
|
|
|
handling of security levels among join quals, it should be possible to let
|
|
|
|
security-barrier views be flattened into the parent query, allowing more
|
|
|
|
flexibility of planning while still preserving required ordering of qual
|
|
|
|
evaluation. But that will come later.
|
|
|
|
|
|
|
|
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
Post scan/join planning
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
So far we have discussed only scan/join planning, that is, implementation
|
|
|
|
of the FROM and WHERE clauses of a SQL query. But the planner must also
|
|
|
|
determine how to deal with GROUP BY, aggregation, and other higher-level
|
|
|
|
features of queries; and in many cases there are multiple ways to do these
|
|
|
|
steps and thus opportunities for optimization choices. These steps, like
|
|
|
|
scan/join planning, are handled by constructing Paths representing the
|
|
|
|
different ways to do a step, then choosing the cheapest Path.
|
|
|
|
|
|
|
|
Since all Paths require a RelOptInfo as "parent", we create RelOptInfos
|
|
|
|
representing the outputs of these upper-level processing steps. These
|
|
|
|
RelOptInfos are mostly dummy, but their pathlist lists hold all the Paths
|
|
|
|
considered useful for each step. Currently, we may create these types of
|
|
|
|
additional RelOptInfos during upper-level planning:
|
|
|
|
|
|
|
|
UPPERREL_SETOP result of UNION/INTERSECT/EXCEPT, if any
|
Add a new upper planner relation for partially-aggregated results.
Up until now, we've abused grouped_rel->partial_pathlist as a place to
store partial paths that have been partially aggregate, but that's
really not correct, because a partial path for a relation is supposed
to be one which produces the correct results with the addition of only
a Gather or Gather Merge node, and these paths also require a Finalize
Aggregate step. Instead, add a new partially_group_rel which can hold
either partial paths (which need to be gathered and then have
aggregation finalized) or non-partial paths (which only need to have
aggregation finalized). This allows us to reuse generate_gather_paths
for partially_grouped_rel instead of writing new code, so that this
patch actually basically no net new code while making things cleaner,
simplifying things for pending patches for partition-wise aggregate.
Robert Haas and Jeevan Chalke. The larger patch series of which this
patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CA+TgmobrzFYS3+U8a_BCy3-hOvh5UyJbC18rEcYehxhpw5=ETA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZyQEjdBNuoG9-wC5GQ5GrO4544Myo13dVptvx+uLg9uQ@mail.gmail.com
2018-02-26 15:30:12 +01:00
|
|
|
UPPERREL_PARTIAL_GROUP_AGG result of partial grouping/aggregation, if any
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
UPPERREL_GROUP_AGG result of grouping/aggregation, if any
|
|
|
|
UPPERREL_WINDOW result of window functions, if any
|
2021-08-22 13:31:16 +02:00
|
|
|
UPPERREL_PARTIAL_DISTINCT result of partial "SELECT DISTINCT", if any
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
UPPERREL_DISTINCT result of "SELECT DISTINCT", if any
|
|
|
|
UPPERREL_ORDERED result of ORDER BY, if any
|
|
|
|
UPPERREL_FINAL result of any remaining top-level actions
|
|
|
|
|
|
|
|
UPPERREL_FINAL is used to represent any final processing steps, currently
|
|
|
|
LockRows (SELECT FOR UPDATE), LIMIT/OFFSET, and ModifyTable. There is no
|
|
|
|
flexibility about the order in which these steps are done, and thus no need
|
|
|
|
to subdivide this stage more finely.
|
|
|
|
|
|
|
|
These "upper relations" are identified by the UPPERREL enum values shown
|
|
|
|
above, plus a relids set, which allows there to be more than one upperrel
|
|
|
|
of the same kind. We use NULL for the relids if there's no need for more
|
|
|
|
than one upperrel of the same kind. Currently, in fact, the relids set
|
|
|
|
is vestigial because it's always NULL, but that's expected to change in
|
2016-03-17 12:26:20 +01:00
|
|
|
the future. For example, in planning set operations, we might need the
|
|
|
|
relids to denote which subset of the leaf SELECTs has been combined in a
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
particular group of Paths that are competing with each other.
|
|
|
|
|
|
|
|
The result of subquery_planner() is always returned as a set of Paths
|
|
|
|
stored in the UPPERREL_FINAL rel with NULL relids. The other types of
|
|
|
|
upperrels are created only if needed for the particular query.
|
|
|
|
|
|
|
|
|
2016-01-20 20:29:22 +01:00
|
|
|
Parallel Query and Partial Paths
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
Parallel query involves dividing up the work that needs to be performed
|
|
|
|
either by an entire query or some portion of the query in such a way that
|
|
|
|
some of that work can be done by one or more worker processes, which are
|
|
|
|
called parallel workers. Parallel workers are a subtype of dynamic
|
|
|
|
background workers; see src/backend/access/transam/README.parallel for a
|
Force rescanning of parallel-aware scan nodes below a Gather[Merge].
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2017-08-30 15:29:55 +02:00
|
|
|
fuller description. The academic literature on parallel query suggests
|
2016-01-20 20:29:22 +01:00
|
|
|
that parallel execution strategies can be divided into essentially two
|
|
|
|
categories: pipelined parallelism, where the execution of the query is
|
|
|
|
divided into multiple stages and each stage is handled by a separate
|
|
|
|
process; and partitioning parallelism, where the data is split between
|
|
|
|
multiple processes and each process handles a subset of it. The
|
|
|
|
literature, however, suggests that gains from pipeline parallelism are
|
|
|
|
often very limited due to the difficulty of avoiding pipeline stalls.
|
|
|
|
Consequently, we do not currently attempt to generate query plans that
|
|
|
|
use this technique.
|
|
|
|
|
2016-03-08 03:48:17 +01:00
|
|
|
Instead, we focus on partitioning parallelism, which does not require
|
2016-01-20 20:29:22 +01:00
|
|
|
that the underlying table be partitioned. It only requires that (1)
|
|
|
|
there is some method of dividing the data from at least one of the base
|
|
|
|
tables involved in the relation across multiple processes, (2) allowing
|
|
|
|
each process to handle its own portion of the data, and then (3)
|
Force rescanning of parallel-aware scan nodes below a Gather[Merge].
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
2017-08-30 15:29:55 +02:00
|
|
|
collecting the results. Requirements (2) and (3) are satisfied by the
|
|
|
|
executor node Gather (or GatherMerge), which launches any number of worker
|
|
|
|
processes and executes its single child plan in all of them, and perhaps
|
|
|
|
in the leader also, if the children aren't generating enough data to keep
|
|
|
|
the leader busy. Requirement (1) is handled by the table scan node: when
|
|
|
|
invoked with parallel_aware = true, this node will, in effect, partition
|
|
|
|
the table on a block by block basis, returning a subset of the tuples from
|
|
|
|
the relation in each worker where that scan node is executed.
|
2016-01-20 20:29:22 +01:00
|
|
|
|
|
|
|
Just as we do for non-parallel access methods, we build Paths to
|
|
|
|
represent access strategies that can be used in a parallel plan. These
|
|
|
|
are, in essence, the same strategies that are available in the
|
|
|
|
non-parallel plan, but there is an important difference: a path that
|
|
|
|
will run beneath a Gather node returns only a subset of the query
|
|
|
|
results in each worker, not all of them. To form a path that can
|
|
|
|
actually be executed, the (rather large) cost of the Gather node must be
|
|
|
|
accounted for. For this reason among others, paths intended to run
|
|
|
|
beneath a Gather node - which we call "partial" paths since they return
|
|
|
|
only a subset of the results in each worker - must be kept separate from
|
|
|
|
ordinary paths (see RelOptInfo's partial_pathlist and the function
|
|
|
|
add_partial_path).
|
|
|
|
|
|
|
|
One of the keys to making parallel query effective is to run as much of
|
|
|
|
the query in parallel as possible. Therefore, we expect it to generally
|
|
|
|
be desirable to postpone the Gather stage until as near to the top of the
|
|
|
|
plan as possible. Expanding the range of cases in which more work can be
|
2016-03-17 12:26:20 +01:00
|
|
|
pushed below the Gather (and costing them accurately) is likely to keep us
|
2016-01-20 20:29:22 +01:00
|
|
|
busy for a long time to come.
|
Basic partition-wise join functionality.
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
2017-10-06 17:11:10 +02:00
|
|
|
|
2018-02-16 16:33:59 +01:00
|
|
|
Partitionwise joins
|
|
|
|
-------------------
|
Implement partition-wise grouping/aggregation.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 17:49:48 +01:00
|
|
|
|
Basic partition-wise join functionality.
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
2017-10-06 17:11:10 +02:00
|
|
|
A join between two similarly partitioned tables can be broken down into joins
|
|
|
|
between their matching partitions if there exists an equi-join condition
|
|
|
|
between the partition keys of the joining tables. The equi-join between
|
|
|
|
partition keys implies that all join partners for a given row in one
|
|
|
|
partitioned table must be in the corresponding partition of the other
|
|
|
|
partitioned table. Because of this the join between partitioned tables to be
|
|
|
|
broken into joins between the matching partitions. The resultant join is
|
|
|
|
partitioned in the same way as the joining relations, thus allowing an N-way
|
|
|
|
join between similarly partitioned tables having equi-join condition between
|
|
|
|
their partition keys to be broken down into N-way joins between their matching
|
2017-10-28 11:14:23 +02:00
|
|
|
partitions. This technique of breaking down a join between partitioned tables
|
2018-02-16 16:33:59 +01:00
|
|
|
into joins between their partitions is called partitionwise join. We will use
|
Basic partition-wise join functionality.
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
2017-10-06 17:11:10 +02:00
|
|
|
term "partitioned relation" for either a partitioned table or a join between
|
|
|
|
compatibly partitioned tables.
|
|
|
|
|
Allow partitionwise joins in more cases.
Previously, the partitionwise join technique only allowed partitionwise
join when input partitioned tables had exactly the same partition
bounds. This commit extends the technique to some cases when the tables
have different partition bounds, by using an advanced partition-matching
algorithm introduced by this commit. For both the input partitioned
tables, the algorithm checks whether every partition of one input
partitioned table only matches one partition of the other input
partitioned table at most, and vice versa. In such a case the join
between the tables can be broken down into joins between the matching
partitions, so the algorithm produces the pairs of the matching
partitions, plus the partition bounds for the join relation, to allow
partitionwise join for computing the join. Currently, the algorithm
works for list-partitioned and range-partitioned tables, but not
hash-partitioned tables. See comments in partition_bounds_merge().
Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar
Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by
Dmitry Dolgov and Amul Sul, with additional review at various points by
Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote,
Justin Pryzby, and Tomas Vondra
Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
2020-04-08 03:25:00 +02:00
|
|
|
Even if the joining relations don't have exactly the same partition bounds,
|
|
|
|
partitionwise join can still be applied by using an advanced
|
|
|
|
partition-matching algorithm. For both the joining relations, the algorithm
|
2020-05-18 11:55:11 +02:00
|
|
|
checks whether every partition of one joining relation only matches one
|
Allow partitionwise joins in more cases.
Previously, the partitionwise join technique only allowed partitionwise
join when input partitioned tables had exactly the same partition
bounds. This commit extends the technique to some cases when the tables
have different partition bounds, by using an advanced partition-matching
algorithm introduced by this commit. For both the input partitioned
tables, the algorithm checks whether every partition of one input
partitioned table only matches one partition of the other input
partitioned table at most, and vice versa. In such a case the join
between the tables can be broken down into joins between the matching
partitions, so the algorithm produces the pairs of the matching
partitions, plus the partition bounds for the join relation, to allow
partitionwise join for computing the join. Currently, the algorithm
works for list-partitioned and range-partitioned tables, but not
hash-partitioned tables. See comments in partition_bounds_merge().
Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar
Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by
Dmitry Dolgov and Amul Sul, with additional review at various points by
Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote,
Justin Pryzby, and Tomas Vondra
Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
2020-04-08 03:25:00 +02:00
|
|
|
partition of the other joining relation at most. In such a case the join
|
|
|
|
between the joining relations can be broken down into joins between the
|
2020-05-18 11:55:11 +02:00
|
|
|
matching partitions. The join relation can then be considered partitioned.
|
Allow partitionwise joins in more cases.
Previously, the partitionwise join technique only allowed partitionwise
join when input partitioned tables had exactly the same partition
bounds. This commit extends the technique to some cases when the tables
have different partition bounds, by using an advanced partition-matching
algorithm introduced by this commit. For both the input partitioned
tables, the algorithm checks whether every partition of one input
partitioned table only matches one partition of the other input
partitioned table at most, and vice versa. In such a case the join
between the tables can be broken down into joins between the matching
partitions, so the algorithm produces the pairs of the matching
partitions, plus the partition bounds for the join relation, to allow
partitionwise join for computing the join. Currently, the algorithm
works for list-partitioned and range-partitioned tables, but not
hash-partitioned tables. See comments in partition_bounds_merge().
Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar
Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by
Dmitry Dolgov and Amul Sul, with additional review at various points by
Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote,
Justin Pryzby, and Tomas Vondra
Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
2020-04-08 03:25:00 +02:00
|
|
|
The algorithm produces the pairs of the matching partitions, plus the
|
|
|
|
partition bounds for the join relation, to allow partitionwise join for
|
|
|
|
computing the join. The algorithm is implemented in partition_bounds_merge().
|
|
|
|
For an N-way join relation considered partitioned this way, not every pair of
|
|
|
|
joining relations can use partitionwise join. For example:
|
|
|
|
|
|
|
|
(A leftjoin B on (Pab)) innerjoin C on (Pac)
|
|
|
|
|
|
|
|
where A, B, and C are partitioned tables, and A has an extra partition
|
|
|
|
compared to B and C. When considering partitionwise join for the join {A B},
|
|
|
|
the extra partition of A doesn't have a matching partition on the nullable
|
|
|
|
side, which is the case that the current implementation of partitionwise join
|
|
|
|
can't handle. So {A B} is not considered partitioned, and the pair of {A B}
|
|
|
|
and C considered for the 3-way join can't use partitionwise join. On the
|
|
|
|
other hand, the pair of {A C} and B can use partitionwise join because {A C}
|
|
|
|
is considered partitioned by eliminating the extra partition (see identity 1
|
|
|
|
on outer join reordering). Whether an N-way join can use partitionwise join
|
|
|
|
is determined based on the first pair of joining relations that are both
|
|
|
|
partitioned and can use partitionwise join.
|
|
|
|
|
Basic partition-wise join functionality.
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
2017-10-06 17:11:10 +02:00
|
|
|
The partitioning properties of a partitioned relation are stored in its
|
|
|
|
RelOptInfo. The information about data types of partition keys are stored in
|
|
|
|
PartitionSchemeData structure. The planner maintains a list of canonical
|
|
|
|
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
|
|
|
|
any two partitioned relations with same partitioning scheme point to the same
|
|
|
|
PartitionSchemeData object. This reduces memory consumed by
|
|
|
|
PartitionSchemeData objects and makes it easy to compare the partition schemes
|
|
|
|
of joining relations.
|
Implement partition-wise grouping/aggregation.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 17:49:48 +01:00
|
|
|
|
2018-08-09 09:41:28 +02:00
|
|
|
Partitionwise aggregates/grouping
|
|
|
|
---------------------------------
|
Implement partition-wise grouping/aggregation.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 17:49:48 +01:00
|
|
|
|
2018-08-31 09:40:17 +02:00
|
|
|
If the GROUP BY clause contains all of the partition keys, all the rows
|
Implement partition-wise grouping/aggregation.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 17:49:48 +01:00
|
|
|
that belong to a given group must come from a single partition; therefore,
|
|
|
|
aggregation can be done completely separately for each partition. Otherwise,
|
|
|
|
partial aggregates can be computed for each partition, and then finalized
|
|
|
|
after appending the results from the individual partitions. This technique of
|
|
|
|
breaking down aggregation or grouping over a partitioned relation into
|
|
|
|
aggregation or grouping over its partitions is called partitionwise
|
|
|
|
aggregation. Especially when the partition keys match the GROUP BY clause,
|
|
|
|
this can be significantly faster than the regular method.
|