1996-07-09 08:22:35 +02:00
|
|
|
/*-------------------------------------------------------------------------
|
|
|
|
*
|
1999-02-14 00:22:53 +01:00
|
|
|
* costsize.c
|
1997-09-07 07:04:48 +02:00
|
|
|
* Routines to compute (and set) relation sizes and path costs
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2006-06-05 04:49:58 +02:00
|
|
|
* Path costs are measured in arbitrary units established by these basic
|
|
|
|
* parameters:
|
2000-02-15 21:49:31 +01:00
|
|
|
*
|
2006-06-05 04:49:58 +02:00
|
|
|
* seq_page_cost Cost of a sequential page fetch
|
2000-02-15 21:49:31 +01:00
|
|
|
* random_page_cost Cost of a non-sequential page fetch
|
|
|
|
* cpu_tuple_cost Cost of typical CPU time to process a tuple
|
|
|
|
* cpu_index_tuple_cost Cost of typical CPU time to process an index tuple
|
2006-06-05 04:49:58 +02:00
|
|
|
* cpu_operator_cost Cost of CPU time to execute an operator or function
|
2020-06-14 23:22:47 +02:00
|
|
|
* parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader backend
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
* parallel_setup_cost Cost of setting up shared memory for parallelism
|
2006-06-05 04:49:58 +02:00
|
|
|
*
|
|
|
|
* We expect that the kernel will typically do some amount of read-ahead
|
|
|
|
* optimization; this in conjunction with seek costs means that seq_page_cost
|
|
|
|
* is normally considerably less than random_page_cost. (However, if the
|
|
|
|
* database is fully cached in RAM, it is reasonable to set them equal.)
|
2000-02-15 21:49:31 +01:00
|
|
|
*
|
|
|
|
* We also use a rough estimate "effective_cache_size" of the number of
|
|
|
|
* disk pages in Postgres + OS-level disk cache. (We can't simply use
|
|
|
|
* NBuffers for this purpose because that would ignore the effects of
|
|
|
|
* the kernel's disk cache.)
|
|
|
|
*
|
|
|
|
* Obviously, taking constants for these values is an oversimplification,
|
|
|
|
* but it's tough enough to get any useful estimates even at this level of
|
2014-05-06 18:12:18 +02:00
|
|
|
* detail. Note that all of these parameters are user-settable, in case
|
2000-02-15 21:49:31 +01:00
|
|
|
* the default values are drastically off for a particular platform.
|
|
|
|
*
|
2010-01-05 22:54:00 +01:00
|
|
|
* seq_page_cost and random_page_cost can also be overridden for an individual
|
|
|
|
* tablespace, in case some data is on a fast disk and other data is on a slow
|
|
|
|
* disk. Per-tablespace overrides never apply to temporary work files such as
|
|
|
|
* an external sort or a materialize node that overflows work_mem.
|
|
|
|
*
|
2000-02-15 21:49:31 +01:00
|
|
|
* We compute two separate costs for each path:
|
|
|
|
* total_cost: total estimated cost to fetch all tuples
|
|
|
|
* startup_cost: cost that is expended before first tuple is fetched
|
|
|
|
* In some scenarios, such as when there is a LIMIT or we are implementing
|
|
|
|
* an EXISTS(...) sub-select, it is not necessary to fetch all tuples of the
|
|
|
|
* path's result. A caller can estimate the cost of fetching a partial
|
|
|
|
* result by interpolating between startup_cost and total_cost. In detail:
|
|
|
|
* actual_cost = startup_cost +
|
2012-01-28 01:26:38 +01:00
|
|
|
* (total_cost - startup_cost) * tuples_to_fetch / path->rows;
|
2001-05-10 01:13:37 +02:00
|
|
|
* Note that a base relation's rows count (and, by extension, plan_rows for
|
|
|
|
* plan nodes below the LIMIT node) are set without regard to any LIMIT, so
|
2016-03-26 17:03:12 +01:00
|
|
|
* that this equation works properly. (Note: while path->rows is never zero
|
|
|
|
* for ordinary relations, it is zero for paths for provably-empty relations,
|
2016-04-06 17:34:02 +02:00
|
|
|
* so beware of division-by-zero.) The LIMIT is applied as a top-level
|
2016-03-26 17:03:12 +01:00
|
|
|
* plan node.
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
2003-01-27 21:51:54 +01:00
|
|
|
* For largely historical reasons, most of the routines in this module use
|
2012-01-28 01:26:38 +01:00
|
|
|
* the passed result Path only to store their results (rows, startup_cost and
|
|
|
|
* total_cost) into. All the input data they need is passed as separate
|
2003-01-27 21:51:54 +01:00
|
|
|
* parameters, even though much of it could be extracted from the Path.
|
|
|
|
* An exception is made for the cost_XXXjoin() routines, which expect all
|
2012-01-28 01:26:38 +01:00
|
|
|
* the other fields of the passed XXXPath to be filled in, and similarly
|
2011-12-25 01:03:21 +01:00
|
|
|
* cost_index() assumes the passed IndexPath is valid except for its output
|
|
|
|
* values.
|
2003-01-27 21:51:54 +01:00
|
|
|
*
|
2000-04-12 19:17:23 +02:00
|
|
|
*
|
2021-01-02 19:06:25 +01:00
|
|
|
* Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
|
2000-01-26 06:58:53 +01:00
|
|
|
* Portions Copyright (c) 1994, Regents of the University of California
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
* IDENTIFICATION
|
2010-09-20 22:08:53 +02:00
|
|
|
* src/backend/optimizer/path/costsize.c
|
1996-07-09 08:22:35 +02:00
|
|
|
*
|
|
|
|
*-------------------------------------------------------------------------
|
|
|
|
*/
|
1997-01-08 11:33:46 +01:00
|
|
|
|
1999-07-16 07:00:38 +02:00
|
|
|
#include "postgres.h"
|
1999-08-06 06:00:17 +02:00
|
|
|
|
2000-01-09 01:26:47 +01:00
|
|
|
#include <math.h>
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2019-12-27 00:09:00 +01:00
|
|
|
#include "access/amapi.h"
|
2012-08-30 22:15:44 +02:00
|
|
|
#include "access/htup_details.h"
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
#include "access/tsmapi.h"
|
2009-09-13 00:12:09 +02:00
|
|
|
#include "executor/executor.h"
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
#include "executor/nodeAgg.h"
|
2000-04-18 07:43:02 +02:00
|
|
|
#include "executor/nodeHash.h"
|
1999-08-06 06:00:17 +02:00
|
|
|
#include "miscadmin.h"
|
2019-01-29 21:26:44 +01:00
|
|
|
#include "nodes/makefuncs.h"
|
2008-08-26 00:42:34 +02:00
|
|
|
#include "nodes/nodeFuncs.h"
|
2000-02-15 21:49:31 +01:00
|
|
|
#include "optimizer/clauses.h"
|
1996-07-09 08:22:35 +02:00
|
|
|
#include "optimizer/cost.h"
|
2019-01-29 21:48:51 +01:00
|
|
|
#include "optimizer/optimizer.h"
|
2001-05-07 02:43:27 +02:00
|
|
|
#include "optimizer/pathnode.h"
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
#include "optimizer/paths.h"
|
2008-10-21 22:42:53 +02:00
|
|
|
#include "optimizer/placeholder.h"
|
2010-11-19 23:31:50 +01:00
|
|
|
#include "optimizer/plancat.h"
|
2007-01-22 02:35:23 +01:00
|
|
|
#include "optimizer/planmain.h"
|
2009-05-10 00:51:41 +02:00
|
|
|
#include "optimizer/restrictinfo.h"
|
2001-05-07 02:43:27 +02:00
|
|
|
#include "parser/parsetree.h"
|
1998-01-13 05:05:12 +01:00
|
|
|
#include "utils/lsyscache.h"
|
2006-07-11 18:35:33 +02:00
|
|
|
#include "utils/selfuncs.h"
|
2010-01-05 22:54:00 +01:00
|
|
|
#include "utils/spccache.h"
|
2006-02-19 06:54:06 +01:00
|
|
|
#include "utils/tuplesort.h"
|
1996-07-09 08:22:35 +02:00
|
|
|
|
1996-10-31 06:58:01 +01:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
#define LOG2(x) (log(x) / 0.693147180559945)
|
|
|
|
|
Charge cpu_tuple_cost * 0.5 for Append and MergeAppend nodes.
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
2018-02-22 05:09:27 +01:00
|
|
|
/*
|
|
|
|
* Append and MergeAppend nodes are less expensive than some other operations
|
|
|
|
* which use cpu_tuple_cost; instead of adding a separate GUC, estimate the
|
|
|
|
* per-tuple cost as cpu_tuple_cost multiplied by this value.
|
|
|
|
*/
|
|
|
|
#define APPEND_CPU_COST_MULTIPLIER 0.5
|
|
|
|
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
/*
|
|
|
|
* Maximum value for row estimates. We cap row estimates to this to help
|
|
|
|
* ensure that costs based on these estimates remain within the range of what
|
|
|
|
* double can represent. add_path() wouldn't act sanely given infinite or NaN
|
|
|
|
* cost values.
|
|
|
|
*/
|
|
|
|
#define MAXIMUM_ROWCOUNT 1e100
|
2014-01-27 06:05:49 +01:00
|
|
|
|
2006-06-05 04:49:58 +02:00
|
|
|
double seq_page_cost = DEFAULT_SEQ_PAGE_COST;
|
2001-03-22 05:01:46 +01:00
|
|
|
double random_page_cost = DEFAULT_RANDOM_PAGE_COST;
|
|
|
|
double cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
|
|
|
|
double cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
|
|
|
|
double cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
|
|
|
|
double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
|
2000-01-23 00:50:30 +01:00
|
|
|
|
2014-05-09 02:49:38 +02:00
|
|
|
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
|
2006-06-05 04:49:58 +02:00
|
|
|
|
2009-04-17 17:33:33 +02:00
|
|
|
Cost disable_cost = 1.0e10;
|
2000-01-23 00:50:30 +01:00
|
|
|
|
2016-06-09 15:08:27 +02:00
|
|
|
int max_parallel_workers_per_gather = 2;
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
|
2000-01-23 00:50:30 +01:00
|
|
|
bool enable_seqscan = true;
|
|
|
|
bool enable_indexscan = true;
|
2011-10-08 02:13:02 +02:00
|
|
|
bool enable_indexonlyscan = true;
|
2005-04-21 21:18:13 +02:00
|
|
|
bool enable_bitmapscan = true;
|
2000-01-23 00:50:30 +01:00
|
|
|
bool enable_tidscan = true;
|
|
|
|
bool enable_sort = true;
|
2020-07-05 11:41:52 +02:00
|
|
|
bool enable_incremental_sort = true;
|
2002-11-21 01:42:20 +01:00
|
|
|
bool enable_hashagg = true;
|
2000-01-23 00:50:30 +01:00
|
|
|
bool enable_nestloop = true;
|
2010-04-19 02:55:26 +02:00
|
|
|
bool enable_material = true;
|
2000-01-23 00:50:30 +01:00
|
|
|
bool enable_mergejoin = true;
|
|
|
|
bool enable_hashjoin = true;
|
2017-03-09 13:40:36 +01:00
|
|
|
bool enable_gathermerge = true;
|
2018-02-16 16:33:59 +01:00
|
|
|
bool enable_partitionwise_join = false;
|
Implement partition-wise grouping/aggregation.
If the partition keys of input relation are part of the GROUP BY
clause, all the rows belonging to a given group come from a single
partition. This allows aggregation/grouping over a partitioned
relation to be broken down * into aggregation/grouping on each
partition. This should be no worse, and often better, than the normal
approach.
If the GROUP BY clause does not contain all the partition keys, we can
still perform partial aggregation for each partition and then finalize
aggregation after appending the partial results. This is less certain
to be a win, but it's still useful.
Jeevan Chalke, Ashutosh Bapat, Robert Haas. The larger patch series
of which this patch is a part was also reviewed and tested by Antonin
Houska, Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin
Knizhnik, Pascal Legrand, and Rafia Sabih.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
2018-03-22 17:49:48 +01:00
|
|
|
bool enable_partitionwise_aggregate = false;
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
bool enable_parallel_append = true;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
bool enable_parallel_hash = true;
|
2018-04-23 22:57:43 +02:00
|
|
|
bool enable_partition_pruning = true;
|
Add support for asynchronous execution.
This implements asynchronous execution, which runs multiple parts of a
non-parallel-aware Append concurrently rather than serially to improve
performance when possible. Currently, the only node type that can be
run concurrently is a ForeignScan that is an immediate child of such an
Append. In the case where such ForeignScans access data on different
remote servers, this would run those ForeignScans concurrently, and
overlap the remote operations to be performed simultaneously, so it'll
improve the performance especially when the operations involve
time-consuming ones such as remote join and remote aggregation.
We may extend this to other node types such as joins or aggregates over
ForeignScans in the future.
This also adds the support for postgres_fdw, which is enabled by the
table-level/server-level option "async_capable". The default is false.
Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself. This commit
is mostly based on the patch proposed by Robert Haas, but also uses
stuff from the patch proposed by Kyotaro Horiguchi and from the patch
proposed by Thomas Munro. Reviewed by Kyotaro Horiguchi, Konstantin
Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and
others.
Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com
Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
2021-03-31 11:45:00 +02:00
|
|
|
bool enable_async_append = true;
|
2000-01-23 00:50:30 +01:00
|
|
|
|
2007-02-22 23:00:26 +01:00
|
|
|
typedef struct
|
|
|
|
{
|
|
|
|
PlannerInfo *root;
|
|
|
|
QualCost total;
|
2007-11-15 23:25:18 +01:00
|
|
|
} cost_qual_eval_context;
|
2000-01-23 00:50:30 +01:00
|
|
|
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
static List *extract_nonindex_conditions(List *qual_clauses, List *indexclauses);
|
2007-01-22 21:00:40 +01:00
|
|
|
static MergeScanSelCache *cached_scansel(PlannerInfo *root,
|
2019-05-22 19:04:48 +02:00
|
|
|
RestrictInfo *rinfo,
|
|
|
|
PathKey *pathkey);
|
2009-09-13 00:12:09 +02:00
|
|
|
static void cost_rescan(PlannerInfo *root, Path *path,
|
2019-05-22 19:04:48 +02:00
|
|
|
Cost *rescan_startup_cost, Cost *rescan_total_cost);
|
2007-11-15 23:25:18 +01:00
|
|
|
static bool cost_qual_eval_walker(Node *node, cost_qual_eval_context *context);
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
static void get_restriction_qual_cost(PlannerInfo *root, RelOptInfo *baserel,
|
2019-05-22 19:04:48 +02:00
|
|
|
ParamPathInfo *param_info,
|
|
|
|
QualCost *qpqual_cost);
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
static bool has_indexed_join_quals(NestPath *joinpath);
|
2008-08-16 02:01:38 +02:00
|
|
|
static double approx_tuple_count(PlannerInfo *root, JoinPath *path,
|
2019-05-22 19:04:48 +02:00
|
|
|
List *quals);
|
2012-01-28 01:26:38 +01:00
|
|
|
static double calc_joinrel_size_estimate(PlannerInfo *root,
|
2019-05-22 19:04:48 +02:00
|
|
|
RelOptInfo *joinrel,
|
|
|
|
RelOptInfo *outer_rel,
|
|
|
|
RelOptInfo *inner_rel,
|
|
|
|
double outer_rows,
|
|
|
|
double inner_rows,
|
|
|
|
SpecialJoinInfo *sjinfo,
|
|
|
|
List *restrictlist);
|
2016-06-18 21:22:34 +02:00
|
|
|
static Selectivity get_foreign_key_join_selectivity(PlannerInfo *root,
|
2019-05-22 19:04:48 +02:00
|
|
|
Relids outer_relids,
|
|
|
|
Relids inner_relids,
|
|
|
|
SpecialJoinInfo *sjinfo,
|
|
|
|
List **restrictlist);
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
static Cost append_nonpartial_cost(List *subpaths, int numpaths,
|
2019-05-22 19:04:48 +02:00
|
|
|
int parallel_workers);
|
2005-06-06 00:32:58 +02:00
|
|
|
static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
|
2000-01-09 01:26:47 +01:00
|
|
|
static double relation_byte_size(double tuples, int width);
|
|
|
|
static double page_size(double tuples, int width);
|
2017-01-13 19:29:31 +01:00
|
|
|
static double get_parallel_divisor(Path *path);
|
1999-08-06 06:00:17 +02:00
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2004-01-05 06:07:36 +01:00
|
|
|
/*
|
|
|
|
* clamp_row_est
|
|
|
|
* Force a row-count estimate to a sane value.
|
|
|
|
*/
|
|
|
|
double
|
|
|
|
clamp_row_est(double nrows)
|
|
|
|
{
|
|
|
|
/*
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
* Avoid infinite and NaN row estimates. Costs derived from such values
|
|
|
|
* are going to be useless. Also force the estimate to be at least one
|
|
|
|
* row, to make explain output look better and to avoid possible
|
|
|
|
* divide-by-zero when interpolating costs. Make it an integer, too.
|
2004-01-05 06:07:36 +01:00
|
|
|
*/
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
if (nrows > MAXIMUM_ROWCOUNT || isnan(nrows))
|
|
|
|
nrows = MAXIMUM_ROWCOUNT;
|
|
|
|
else if (nrows <= 1.0)
|
2004-01-05 06:07:36 +01:00
|
|
|
nrows = 1.0;
|
|
|
|
else
|
2005-04-21 04:28:02 +02:00
|
|
|
nrows = rint(nrows);
|
2004-01-05 06:07:36 +01:00
|
|
|
|
|
|
|
return nrows;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
1999-02-14 00:22:53 +01:00
|
|
|
* cost_seqscan
|
1997-09-07 07:04:48 +02:00
|
|
|
* Determines and returns the cost of scanning a relation sequentially.
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_seqscan(Path *path, PlannerInfo *root,
|
2016-01-20 20:29:22 +01:00
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
2016-01-20 20:29:22 +01:00
|
|
|
Cost cpu_run_cost;
|
|
|
|
Cost disk_run_cost;
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
double spc_seq_page_cost;
|
|
|
|
QualCost qpqual_cost;
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost cpu_per_tuple;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2000-01-09 01:26:47 +01:00
|
|
|
/* Should only be applied to base relations */
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(baserel->relid > 0);
|
2002-05-12 22:10:05 +02:00
|
|
|
Assert(baserel->rtekind == RTE_RELATION);
|
2000-01-09 01:26:47 +01:00
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2000-01-23 00:50:30 +01:00
|
|
|
if (!enable_seqscan)
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += disable_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2010-01-05 22:54:00 +01:00
|
|
|
/* fetch estimated page cost for tablespace containing table */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
NULL,
|
|
|
|
&spc_seq_page_cost);
|
|
|
|
|
2000-06-19 00:44:35 +02:00
|
|
|
/*
|
|
|
|
* disk costs
|
|
|
|
*/
|
2016-01-20 20:29:22 +01:00
|
|
|
disk_run_cost = spc_seq_page_cost * baserel->pages;
|
2000-01-09 01:26:47 +01:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/* CPU costs */
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
2016-01-20 20:29:22 +01:00
|
|
|
cpu_run_cost = cpu_per_tuple * baserel->tuples;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2016-01-20 20:29:22 +01:00
|
|
|
/* Adjust costing for parallelism, if used. */
|
2016-06-09 15:08:27 +02:00
|
|
|
if (path->parallel_workers > 0)
|
2016-01-20 20:29:22 +01:00
|
|
|
{
|
2017-01-13 19:29:31 +01:00
|
|
|
double parallel_divisor = get_parallel_divisor(path);
|
2016-01-20 20:29:22 +01:00
|
|
|
|
|
|
|
/* The CPU cost is divided among all the workers. */
|
|
|
|
cpu_run_cost /= parallel_divisor;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It may be possible to amortize some of the I/O cost, but probably
|
|
|
|
* not very much, because most operating systems already do aggressive
|
|
|
|
* prefetching. For now, we assume that the disk run cost can't be
|
|
|
|
* amortized at all.
|
|
|
|
*/
|
2017-01-13 19:29:31 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* In the case of a parallel plan, the row count needs to represent
|
|
|
|
* the number of tuples processed per worker.
|
|
|
|
*/
|
|
|
|
path->rows = clamp_row_est(path->rows / parallel_divisor);
|
2016-01-20 20:29:22 +01:00
|
|
|
}
|
2015-11-11 14:57:52 +01:00
|
|
|
|
2015-05-15 20:37:10 +02:00
|
|
|
path->startup_cost = startup_cost;
|
2016-01-20 20:29:22 +01:00
|
|
|
path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;
|
2015-05-15 20:37:10 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cost_samplescan
|
|
|
|
* Determines and returns the cost of scanning a relation using sampling.
|
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
|
|
|
*/
|
|
|
|
void
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
cost_samplescan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
2015-05-15 20:37:10 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
RangeTblEntry *rte;
|
|
|
|
TableSampleClause *tsc;
|
|
|
|
TsmRoutine *tsm;
|
2015-05-15 20:37:10 +02:00
|
|
|
double spc_seq_page_cost,
|
|
|
|
spc_random_page_cost,
|
|
|
|
spc_page_cost;
|
|
|
|
QualCost qpqual_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
/* Should only be applied to base relations with tablesample clauses */
|
2015-05-15 20:37:10 +02:00
|
|
|
Assert(baserel->relid > 0);
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
rte = planner_rt_fetch(baserel->relid, root);
|
|
|
|
Assert(rte->rtekind == RTE_RELATION);
|
|
|
|
tsc = rte->tablesample;
|
|
|
|
Assert(tsc != NULL);
|
|
|
|
tsm = GetTsmRoutine(tsc->tsmhandler);
|
2015-05-15 20:37:10 +02:00
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
2015-05-15 20:37:10 +02:00
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
|
|
|
/* fetch estimated page cost for tablespace containing table */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
&spc_random_page_cost,
|
|
|
|
&spc_seq_page_cost);
|
|
|
|
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
/* if NextSampleBlock is used, assume random access, else sequential */
|
|
|
|
spc_page_cost = (tsm->NextSampleBlock != NULL) ?
|
|
|
|
spc_random_page_cost : spc_seq_page_cost;
|
2015-05-15 20:37:10 +02:00
|
|
|
|
|
|
|
/*
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
* disk costs (recall that baserel->pages has already been set to the
|
|
|
|
* number of pages the sampling method will visit)
|
2015-05-15 20:37:10 +02:00
|
|
|
*/
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
run_cost += spc_page_cost * baserel->pages;
|
2015-05-15 20:37:10 +02:00
|
|
|
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
/*
|
|
|
|
* CPU costs (recall that baserel->tuples has already been set to the
|
|
|
|
* number of tuples the sampling method will select). Note that we ignore
|
|
|
|
* execution cost of the TABLESAMPLE parameter expressions; they will be
|
|
|
|
* evaluated only once per scan, and in most usages they'll likely be
|
|
|
|
* simple constants anyway. We also don't charge anything for the
|
|
|
|
* calculations the sampling method might do internally.
|
|
|
|
*/
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
2015-05-15 20:37:10 +02:00
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
Redesign tablesample method API, and do extensive code review.
The original implementation of TABLESAMPLE modeled the tablesample method
API on index access methods, which wasn't a good choice because, without
specialized DDL commands, there's no way to build an extension that can
implement a TSM. (Raw inserts into system catalogs are not an acceptable
thing to do, because we can't undo them during DROP EXTENSION, nor will
pg_upgrade behave sanely.) Instead adopt an API more like procedural
language handlers or foreign data wrappers, wherein the only SQL-level
support object needed is a single handler function identified by having
a special return type. This lets us get rid of the supporting catalog
altogether, so that no custom DDL support is needed for the feature.
Adjust the API so that it can support non-constant tablesample arguments
(the original coding assumed we could evaluate the argument expressions at
ExecInitSampleScan time, which is undesirable even if it weren't outright
unsafe), and discourage sampling methods from looking at invisible tuples.
Make sure that the BERNOULLI and SYSTEM methods are genuinely repeatable
within and across queries, as required by the SQL standard, and deal more
honestly with methods that can't support that requirement.
Make a full code-review pass over the tablesample additions, and fix
assorted bugs, omissions, infelicities, and cosmetic issues (such as
failure to put the added code stanzas in a consistent ordering).
Improve EXPLAIN's output of tablesample plans, too.
Back-patch to 9.5 so that we don't have to support the original API
in production.
2015-07-25 20:39:00 +02:00
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
2015-05-15 20:37:10 +02:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
/*
|
|
|
|
* cost_gather
|
|
|
|
* Determines and returns the cost of gather path.
|
|
|
|
*
|
|
|
|
* 'rel' is the relation to be operated upon
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
2016-03-21 14:20:53 +01:00
|
|
|
* 'rows' may be used to point to a row estimate; if non-NULL, it overrides
|
|
|
|
* both 'rel' and 'param_info'. This is useful when the path doesn't exactly
|
|
|
|
* correspond to any particular RelOptInfo.
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_gather(GatherPath *path, PlannerInfo *root,
|
2016-03-21 14:20:53 +01:00
|
|
|
RelOptInfo *rel, ParamPathInfo *param_info,
|
|
|
|
double *rows)
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
2016-03-21 14:20:53 +01:00
|
|
|
if (rows)
|
|
|
|
path->path.rows = *rows;
|
|
|
|
else if (param_info)
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
path->path.rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->path.rows = rel->rows;
|
|
|
|
|
|
|
|
startup_cost = path->subpath->startup_cost;
|
|
|
|
|
|
|
|
run_cost = path->subpath->total_cost - path->subpath->startup_cost;
|
|
|
|
|
|
|
|
/* Parallel setup and communication cost. */
|
|
|
|
startup_cost += parallel_setup_cost;
|
Generate parallel sequential scan plans in simple cases.
Add a new flag, consider_parallel, to each RelOptInfo, indicating
whether a plan for that relation could conceivably be run inside of
a parallel worker. Right now, we're pretty conservative: for example,
it might be possible to defer applying a parallel-restricted qual
in a worker, and later do it in the leader, but right now we just
don't try to parallelize access to that relation. That's probably
the right decision in most cases, anyway.
Using the new flag, generate parallel sequential scan plans for plain
baserels, meaning that we now have parallel sequential scan in
PostgreSQL. The logic here is pretty unsophisticated right now: the
costing model probably isn't right in detail, and we can't push joins
beneath Gather nodes, so the number of plans that can actually benefit
from this is pretty limited right now. Lots more work is needed.
Nevertheless, it seems time to enable this functionality so that all
this code can actually be tested easily by users and developers.
Note that, if you wish to test this functionality, it will be
necessary to set max_parallel_degree to a value greater than the
default of 0. Once a few more loose ends have been tidied up here, we
might want to consider changing the default value of this GUC, but
I'm leaving it alone for now.
Along the way, fix a bug in cost_gather: the previous coding thought
that a Gather node's transfer overhead should be costed on the basis of
the relation size rather than the number of tuples that actually need
to be passed off to the leader.
Patch by me, reviewed in earlier versions by Amit Kapila.
2015-11-11 15:02:52 +01:00
|
|
|
run_cost += parallel_tuple_cost * path->path.rows;
|
Add a Gather executor node.
A Gather executor node runs any number of copies of a plan in an equal
number of workers and merges all of the results into a single tuple
stream. It can also run the plan itself, if the workers are
unavailable or haven't started up yet. It is intended to work with
the Partial Seq Scan node which will be added in future commits.
It could also be used to implement parallel query of a different sort
by itself, without help from Partial Seq Scan, if the single_copy mode
is used. In that mode, a worker executes the plan, and the parallel
leader does not, merely collecting the worker's results. So, a Gather
node could be inserted into a plan to split the execution of that plan
across two processes. Nested Gather nodes aren't currently supported,
but we might want to add support for that in the future.
There's nothing in the planner to actually generate Gather nodes yet,
so it's not quite time to break out the champagne. But we're getting
close.
Amit Kapila. Some designs suggestions were provided by me, and I also
reviewed the patch. Single-copy mode, documentation, and other minor
changes also by me.
2015-10-01 01:23:36 +02:00
|
|
|
|
|
|
|
path->path.startup_cost = startup_cost;
|
|
|
|
path->path.total_cost = (startup_cost + run_cost);
|
|
|
|
}
|
|
|
|
|
2017-03-09 13:40:36 +01:00
|
|
|
/*
|
|
|
|
* cost_gather_merge
|
|
|
|
* Determines and returns the cost of gather merge path.
|
|
|
|
*
|
|
|
|
* GatherMerge merges several pre-sorted input streams, using a heap that at
|
|
|
|
* any given instant holds the next tuple from each stream. If there are N
|
|
|
|
* streams, we need about N*log2(N) tuple comparisons to construct the heap at
|
|
|
|
* startup, and then for each output tuple, about log2(N) comparisons to
|
|
|
|
* replace the top heap entry with the next tuple from the same stream.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *rel, ParamPathInfo *param_info,
|
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double *rows)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
Cost comparison_cost;
|
|
|
|
double N;
|
|
|
|
double logN;
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (rows)
|
|
|
|
path->path.rows = *rows;
|
|
|
|
else if (param_info)
|
|
|
|
path->path.rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->path.rows = rel->rows;
|
|
|
|
|
|
|
|
if (!enable_gathermerge)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add one to the number of workers to account for the leader. This might
|
|
|
|
* be overgenerous since the leader will do less work than other workers
|
|
|
|
* in typical cases, but we'll go with it for now.
|
|
|
|
*/
|
|
|
|
Assert(path->num_workers > 0);
|
|
|
|
N = (double) path->num_workers + 1;
|
|
|
|
logN = LOG2(N);
|
|
|
|
|
|
|
|
/* Assumed cost per tuple comparison */
|
|
|
|
comparison_cost = 2.0 * cpu_operator_cost;
|
|
|
|
|
|
|
|
/* Heap creation cost */
|
|
|
|
startup_cost += comparison_cost * N * logN;
|
|
|
|
|
|
|
|
/* Per-tuple heap maintenance cost */
|
|
|
|
run_cost += path->path.rows * comparison_cost * logN;
|
|
|
|
|
|
|
|
/* small cost for heap management, like cost_merge_append */
|
|
|
|
run_cost += cpu_operator_cost * path->path.rows;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Parallel setup and communication cost. Since Gather Merge, unlike
|
|
|
|
* Gather, requires us to block until a tuple is available from every
|
|
|
|
* worker, we bump the IPC cost up a little bit as compared with Gather.
|
|
|
|
* For lack of a better idea, charge an extra 5%.
|
|
|
|
*/
|
|
|
|
startup_cost += parallel_setup_cost;
|
|
|
|
run_cost += parallel_tuple_cost * path->path.rows * 1.05;
|
|
|
|
|
|
|
|
path->path.startup_cost = startup_cost + input_startup_cost;
|
|
|
|
path->path.total_cost = (startup_cost + run_cost + input_total_cost);
|
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
1999-02-14 00:22:53 +01:00
|
|
|
* cost_index
|
1997-09-07 07:04:48 +02:00
|
|
|
* Determines and returns the cost of scanning a relation using an index.
|
|
|
|
*
|
2011-12-25 01:03:21 +01:00
|
|
|
* 'path' describes the indexscan under consideration, and is complete
|
|
|
|
* except for the fields to be set by this routine
|
2012-01-28 01:26:38 +01:00
|
|
|
* 'loop_count' is the number of repetitions of the indexscan to factor into
|
|
|
|
* estimates of caching behavior
|
2000-01-09 01:26:47 +01:00
|
|
|
*
|
2012-01-28 01:26:38 +01:00
|
|
|
* In addition to rows, startup_cost and total_cost, cost_index() sets the
|
|
|
|
* path's indextotalcost and indexselectivity fields. These values will be
|
|
|
|
* needed if the IndexPath is used in a BitmapIndexScan.
|
2005-04-21 04:28:02 +02:00
|
|
|
*
|
2011-12-25 01:03:21 +01:00
|
|
|
* NOTE: path->indexquals must contain only clauses usable as index
|
|
|
|
* restrictions. Any additional quals evaluated as qpquals may reduce the
|
|
|
|
* number of returned tuples, but they won't reduce the number of tuples
|
|
|
|
* we have to fetch from the table, so they don't reduce the scan cost.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2017-02-15 19:53:24 +01:00
|
|
|
cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
|
|
|
|
bool partial_path)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2011-12-25 01:03:21 +01:00
|
|
|
IndexOptInfo *index = path->indexinfo;
|
2005-03-27 08:29:49 +02:00
|
|
|
RelOptInfo *baserel = index->rel;
|
2011-12-25 01:03:21 +01:00
|
|
|
bool indexonly = (path->path.pathtype == T_IndexOnlyScan);
|
Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-18 01:36:59 +01:00
|
|
|
amcostestimate_function amcostestimate;
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
List *qpquals;
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2017-02-15 19:53:24 +01:00
|
|
|
Cost cpu_run_cost = 0;
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost indexStartupCost;
|
|
|
|
Cost indexTotalCost;
|
2000-04-12 19:17:23 +02:00
|
|
|
Selectivity indexSelectivity;
|
2001-05-10 01:13:37 +02:00
|
|
|
double indexCorrelation,
|
|
|
|
csquared;
|
2010-01-05 22:54:00 +01:00
|
|
|
double spc_seq_page_cost,
|
|
|
|
spc_random_page_cost;
|
2001-05-10 01:13:37 +02:00
|
|
|
Cost min_IO_cost,
|
|
|
|
max_IO_cost;
|
2012-04-12 02:24:17 +02:00
|
|
|
QualCost qpqual_cost;
|
2001-05-10 01:13:37 +02:00
|
|
|
Cost cpu_per_tuple;
|
2000-02-15 21:49:31 +01:00
|
|
|
double tuples_fetched;
|
|
|
|
double pages_fetched;
|
2017-02-15 19:53:24 +01:00
|
|
|
double rand_heap_pages;
|
|
|
|
double index_pages;
|
2000-01-09 01:26:47 +01:00
|
|
|
|
|
|
|
/* Should only be applied to base relations */
|
2002-05-12 22:10:05 +02:00
|
|
|
Assert(IsA(baserel, RelOptInfo) &&
|
|
|
|
IsA(index, IndexOptInfo));
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(baserel->relid > 0);
|
2002-05-12 22:10:05 +02:00
|
|
|
Assert(baserel->rtekind == RTE_RELATION);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
/*
|
|
|
|
* Mark the path with the correct row estimate, and identify which quals
|
Support using index-only scans with partial indexes in more cases.
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate. This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all. So index-only scans were
essentially unavailable with such partial indexes.
To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner. This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes. However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes. We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed. We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.
Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
2016-03-31 20:48:56 +02:00
|
|
|
* will need to be enforced as qpquals. We need not check any quals that
|
|
|
|
* are implied by the index's predicate, so we can use indrestrictinfo not
|
|
|
|
* baserestrictinfo as the list of relevant restriction clauses for the
|
|
|
|
* rel.
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
*/
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
if (path->path.param_info)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
path->path.rows = path->path.param_info->ppi_rows;
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
/* qpquals come from the rel's restriction clauses and ppi_clauses */
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
qpquals = list_concat(extract_nonindex_conditions(path->indexinfo->indrestrictinfo,
|
|
|
|
path->indexclauses),
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
extract_nonindex_conditions(path->path.param_info->ppi_clauses,
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
path->indexclauses));
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
path->path.rows = baserel->rows;
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
/* qpquals come from just the rel's restriction clauses */
|
Support using index-only scans with partial indexes in more cases.
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate. This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all. So index-only scans were
essentially unavailable with such partial indexes.
To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner. This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes. However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes. We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed. We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.
Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
2016-03-31 20:48:56 +02:00
|
|
|
qpquals = extract_nonindex_conditions(path->indexinfo->indrestrictinfo,
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
path->indexclauses);
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
|
|
|
|
2002-11-30 06:21:03 +01:00
|
|
|
if (!enable_indexscan)
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += disable_cost;
|
2011-10-08 16:41:17 +02:00
|
|
|
/* we don't need to check enable_indexonlyscan; indxpath.c does that */
|
1996-07-09 08:22:35 +02:00
|
|
|
|
1999-05-25 18:15:34 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Call index-access-method-specific code to estimate the processing cost
|
|
|
|
* for scanning the index, as well as the selectivity of the index (ie,
|
|
|
|
* the fraction of main-table tuples we will have to retrieve) and its
|
2016-01-18 04:56:16 +01:00
|
|
|
* correlation to the main-table tuple order. We need a cast here because
|
2019-12-27 00:09:00 +01:00
|
|
|
* pathnodes.h uses a weak function type to avoid including amapi.h.
|
1999-04-30 06:01:44 +02:00
|
|
|
*/
|
2016-01-18 04:56:16 +01:00
|
|
|
amcostestimate = (amcostestimate_function) index->amcostestimate;
|
Restructure index access method API to hide most of it at the C level.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
2016-01-18 01:36:59 +01:00
|
|
|
amcostestimate(root, path, loop_count,
|
|
|
|
&indexStartupCost, &indexTotalCost,
|
2017-02-15 19:53:24 +01:00
|
|
|
&indexSelectivity, &indexCorrelation,
|
|
|
|
&index_pages);
|
1999-04-30 06:01:44 +02:00
|
|
|
|
2005-04-21 04:28:02 +02:00
|
|
|
/*
|
2005-04-21 21:18:13 +02:00
|
|
|
* Save amcostestimate's results for possible use in bitmap scan planning.
|
2005-10-15 04:49:52 +02:00
|
|
|
* We don't bother to save indexStartupCost or indexCorrelation, because a
|
|
|
|
* bitmap scan doesn't care about either.
|
2005-04-21 04:28:02 +02:00
|
|
|
*/
|
|
|
|
path->indextotalcost = indexTotalCost;
|
|
|
|
path->indexselectivity = indexSelectivity;
|
|
|
|
|
2000-01-23 00:50:30 +01:00
|
|
|
/* all costs for touching index itself included here */
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += indexStartupCost;
|
|
|
|
run_cost += indexTotalCost - indexStartupCost;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
/* estimate number of main-table tuples fetched */
|
|
|
|
tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
|
|
|
|
|
2010-01-05 22:54:00 +01:00
|
|
|
/* fetch estimated page costs for tablespace containing table */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
&spc_random_page_cost,
|
|
|
|
&spc_seq_page_cost);
|
|
|
|
|
2001-05-10 01:13:37 +02:00
|
|
|
/*----------
|
2006-06-06 19:59:58 +02:00
|
|
|
* Estimate number of main-table pages fetched, and compute I/O cost.
|
2000-01-09 01:26:47 +01:00
|
|
|
*
|
2001-05-10 01:13:37 +02:00
|
|
|
* When the index ordering is uncorrelated with the table ordering,
|
2006-06-06 19:59:58 +02:00
|
|
|
* we use an approximation proposed by Mackert and Lohman (see
|
|
|
|
* index_pages_fetched() for details) to compute the number of pages
|
2010-01-05 22:54:00 +01:00
|
|
|
* fetched, and then charge spc_random_page_cost per page fetched.
|
2001-05-10 01:13:37 +02:00
|
|
|
*
|
|
|
|
* When the index ordering is exactly correlated with the table ordering
|
|
|
|
* (just after a CLUSTER, for example), the number of pages fetched should
|
2006-06-06 19:59:58 +02:00
|
|
|
* be exactly selectivity * table_size. What's more, all but the first
|
|
|
|
* will be sequential fetches, not the random fetches that occur in the
|
|
|
|
* uncorrelated case. So if the number of pages is more than 1, we
|
|
|
|
* ought to charge
|
2010-01-05 22:54:00 +01:00
|
|
|
* spc_random_page_cost + (pages_fetched - 1) * spc_seq_page_cost
|
2006-06-06 19:59:58 +02:00
|
|
|
* For partially-correlated indexes, we ought to charge somewhere between
|
|
|
|
* these two estimates. We currently interpolate linearly between the
|
|
|
|
* estimates based on the correlation squared (XXX is that appropriate?).
|
2011-10-08 02:13:02 +02:00
|
|
|
*
|
|
|
|
* If it's an index-only scan, then we will not need to fetch any heap
|
|
|
|
* pages for which the visibility map shows all tuples are visible.
|
2011-10-14 23:23:01 +02:00
|
|
|
* Hence, reduce the estimated number of heap fetches accordingly.
|
|
|
|
* We use the measured fraction of the entire heap that is all-visible,
|
|
|
|
* which might not be particularly relevant to the subset of the heap
|
|
|
|
* that this query will fetch; but it's not clear how to do better.
|
2001-05-10 01:13:37 +02:00
|
|
|
*----------
|
1999-04-30 06:01:44 +02:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
if (loop_count > 1)
|
2001-05-10 01:13:37 +02:00
|
|
|
{
|
2006-06-06 19:59:58 +02:00
|
|
|
/*
|
2006-12-15 19:42:26 +01:00
|
|
|
* For repeated indexscans, the appropriate estimate for the
|
|
|
|
* uncorrelated case is to scale up the number of tuples fetched in
|
2006-10-04 02:30:14 +02:00
|
|
|
* the Mackert and Lohman formula by the number of scans, so that we
|
2006-12-15 19:42:26 +01:00
|
|
|
* estimate the number of pages fetched by all the scans; then
|
2006-10-04 02:30:14 +02:00
|
|
|
* pro-rate the costs for one scan. In this case we assume all the
|
2006-12-15 19:42:26 +01:00
|
|
|
* fetches are random accesses.
|
2006-06-06 19:59:58 +02:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
|
2006-06-06 19:59:58 +02:00
|
|
|
baserel->pages,
|
2006-09-20 00:49:53 +02:00
|
|
|
(double) index->pages,
|
|
|
|
root);
|
2006-06-06 19:59:58 +02:00
|
|
|
|
2011-10-08 16:41:17 +02:00
|
|
|
if (indexonly)
|
2011-10-14 23:23:01 +02:00
|
|
|
pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
|
2011-10-08 02:13:02 +02:00
|
|
|
|
2017-02-15 19:53:24 +01:00
|
|
|
rand_heap_pages = pages_fetched;
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
|
2006-12-15 19:42:26 +01:00
|
|
|
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* In the perfectly correlated case, the number of pages touched by
|
|
|
|
* each scan is selectivity * table_size, and we can use the Mackert
|
|
|
|
* and Lohman formula at the page level to estimate how much work is
|
|
|
|
* saved by caching across scans. We still assume all the fetches are
|
|
|
|
* random, though, which is an overestimate that's hard to correct for
|
|
|
|
* without double-counting the cache effects. (But in most cases
|
|
|
|
* where such a plan is actually interesting, only one page would get
|
|
|
|
* fetched per scan anyway, so it shouldn't matter much.)
|
2006-12-15 19:42:26 +01:00
|
|
|
*/
|
|
|
|
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
pages_fetched = index_pages_fetched(pages_fetched * loop_count,
|
2006-12-15 19:42:26 +01:00
|
|
|
baserel->pages,
|
|
|
|
(double) index->pages,
|
|
|
|
root);
|
|
|
|
|
2011-10-08 16:41:17 +02:00
|
|
|
if (indexonly)
|
2011-10-14 23:23:01 +02:00
|
|
|
pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
|
2011-10-08 02:13:02 +02:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
|
2001-05-10 01:13:37 +02:00
|
|
|
}
|
2000-02-15 21:49:31 +01:00
|
|
|
else
|
2001-05-10 01:13:37 +02:00
|
|
|
{
|
2006-06-06 19:59:58 +02:00
|
|
|
/*
|
|
|
|
* Normal case: apply the Mackert and Lohman formula, and then
|
|
|
|
* interpolate between that and the correlation-derived result.
|
|
|
|
*/
|
|
|
|
pages_fetched = index_pages_fetched(tuples_fetched,
|
|
|
|
baserel->pages,
|
2006-09-20 00:49:53 +02:00
|
|
|
(double) index->pages,
|
|
|
|
root);
|
2006-06-06 19:59:58 +02:00
|
|
|
|
2011-10-08 16:41:17 +02:00
|
|
|
if (indexonly)
|
2011-10-14 23:23:01 +02:00
|
|
|
pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
|
2011-10-08 02:13:02 +02:00
|
|
|
|
2017-02-15 19:53:24 +01:00
|
|
|
rand_heap_pages = pages_fetched;
|
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
|
2010-01-05 22:54:00 +01:00
|
|
|
max_IO_cost = pages_fetched * spc_random_page_cost;
|
2006-06-06 19:59:58 +02:00
|
|
|
|
|
|
|
/* min_IO_cost is for the perfectly correlated case (csquared=1) */
|
|
|
|
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
|
2011-10-08 02:13:02 +02:00
|
|
|
|
2011-10-08 16:41:17 +02:00
|
|
|
if (indexonly)
|
2011-10-14 23:23:01 +02:00
|
|
|
pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
|
2011-10-08 02:13:02 +02:00
|
|
|
|
2011-10-16 21:39:24 +02:00
|
|
|
if (pages_fetched > 0)
|
|
|
|
{
|
|
|
|
min_IO_cost = spc_random_page_cost;
|
|
|
|
if (pages_fetched > 1)
|
|
|
|
min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
min_IO_cost = 0;
|
2006-12-15 19:42:26 +01:00
|
|
|
}
|
2006-06-06 19:59:58 +02:00
|
|
|
|
2017-02-15 19:53:24 +01:00
|
|
|
if (partial_path)
|
|
|
|
{
|
2017-03-14 19:33:14 +01:00
|
|
|
/*
|
|
|
|
* For index only scans compute workers based on number of index pages
|
2017-05-17 22:31:56 +02:00
|
|
|
* fetched; the number of heap pages we fetch might be so small as to
|
|
|
|
* effectively rule out parallelism, which we don't want to do.
|
2017-03-14 19:33:14 +01:00
|
|
|
*/
|
|
|
|
if (indexonly)
|
|
|
|
rand_heap_pages = -1;
|
|
|
|
|
2017-02-15 19:53:24 +01:00
|
|
|
/*
|
|
|
|
* Estimate the number of parallel workers required to scan index. Use
|
|
|
|
* the number of heap pages computed considering heap fetches won't be
|
|
|
|
* sequential as for parallel scans the pages are accessed in random
|
|
|
|
* order.
|
|
|
|
*/
|
|
|
|
path->path.parallel_workers = compute_parallel_worker(baserel,
|
Support parallel btree index builds.
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
2018-02-02 19:25:55 +01:00
|
|
|
rand_heap_pages,
|
|
|
|
index_pages,
|
|
|
|
max_parallel_workers_per_gather);
|
2017-02-15 19:53:24 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fall out if workers can't be assigned for parallel scan, because in
|
|
|
|
* such a case this path will be rejected. So there is no benefit in
|
|
|
|
* doing extra computation.
|
|
|
|
*/
|
|
|
|
if (path->path.parallel_workers <= 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
path->path.parallel_aware = true;
|
|
|
|
}
|
|
|
|
|
2006-12-15 19:42:26 +01:00
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* Now interpolate based on estimated index order correlation to get total
|
|
|
|
* disk I/O cost for main table accesses.
|
2006-12-15 19:42:26 +01:00
|
|
|
*/
|
|
|
|
csquared = indexCorrelation * indexCorrelation;
|
2006-06-06 19:59:58 +02:00
|
|
|
|
2006-12-15 19:42:26 +01:00
|
|
|
run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
|
2000-02-15 21:49:31 +01:00
|
|
|
|
|
|
|
/*
|
2001-05-10 01:13:37 +02:00
|
|
|
* Estimate CPU costs per tuple.
|
|
|
|
*
|
2012-04-12 02:24:17 +02:00
|
|
|
* What we want here is cpu_tuple_cost plus the evaluation costs of any
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
* qual clauses that we have to evaluate as qpquals.
|
2000-02-15 21:49:31 +01:00
|
|
|
*/
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
cost_qual_eval(&qpqual_cost, qpquals, root);
|
2001-05-10 01:13:37 +02:00
|
|
|
|
2012-04-12 02:24:17 +02:00
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2017-02-15 19:53:24 +01:00
|
|
|
cpu_run_cost += cpu_per_tuple * tuples_fetched;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->path.pathtarget->cost.startup;
|
2017-02-15 19:53:24 +01:00
|
|
|
cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
|
|
|
|
|
|
|
|
/* Adjust costing for parallelism, if used. */
|
|
|
|
if (path->path.parallel_workers > 0)
|
|
|
|
{
|
|
|
|
double parallel_divisor = get_parallel_divisor(&path->path);
|
|
|
|
|
|
|
|
path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
|
|
|
|
|
|
|
|
/* The CPU cost is divided among all the workers. */
|
|
|
|
cpu_run_cost /= parallel_divisor;
|
|
|
|
}
|
|
|
|
|
|
|
|
run_cost += cpu_run_cost;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
|
2005-04-21 04:28:02 +02:00
|
|
|
path->path.startup_cost = startup_cost;
|
|
|
|
path->path.total_cost = startup_cost + run_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
/*
|
|
|
|
* extract_nonindex_conditions
|
|
|
|
*
|
|
|
|
* Given a list of quals to be enforced in an indexscan, extract the ones that
|
|
|
|
* will have to be applied as qpquals (ie, the index machinery won't handle
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
* them). Here we detect only whether a qual clause is directly redundant
|
|
|
|
* with some indexclause. If the index path is chosen for use, createplan.c
|
|
|
|
* will try a bit harder to get rid of redundant qual conditions; specifically
|
|
|
|
* it will see if quals can be proven to be implied by the indexquals. But
|
|
|
|
* it does not seem worth the cycles to try to factor that in at this stage,
|
|
|
|
* since we're only trying to estimate qual eval costs. Otherwise this must
|
|
|
|
* match the logic in create_indexscan_plan().
|
|
|
|
*
|
|
|
|
* qual_clauses, and the result, are lists of RestrictInfos.
|
|
|
|
* indexclauses is a list of IndexClauses.
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
*/
|
|
|
|
static List *
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
extract_nonindex_conditions(List *qual_clauses, List *indexclauses)
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
{
|
|
|
|
List *result = NIL;
|
|
|
|
ListCell *lc;
|
|
|
|
|
|
|
|
foreach(lc, qual_clauses)
|
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
|
|
|
|
if (rinfo->pseudoconstant)
|
|
|
|
continue; /* we may drop pseudoconstants here */
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
if (is_redundant_with_indexclauses(rinfo, indexclauses))
|
|
|
|
continue; /* dup or derived from same EquivalenceClass */
|
Support using index-only scans with partial indexes in more cases.
Previously, the planner would reject an index-only scan if any restriction
clause for its table used a column not available from the index, even
if that restriction clause would later be dropped from the plan entirely
because it's implied by the index's predicate. This is a fairly common
situation for partial indexes because predicates using columns not included
in the index are often the most useful kind of predicate, and we have to
duplicate (or at least imply) the predicate in the WHERE clause in order
to get the index to be considered at all. So index-only scans were
essentially unavailable with such partial indexes.
To fix, we have to do detection of implied-by-predicate clauses much
earlier in the planner. This patch puts it in check_index_predicates
(nee check_partial_indexes), meaning it gets done for every partial index,
whereas we previously only considered this issue at createplan time,
so that the work was only done for an index actually selected for use.
That could result in a noticeable planning slowdown for queries against
tables with many partial indexes. However, testing suggested that there
isn't really a significant cost, especially not with reasonable numbers
of partial indexes. We do get a small additional benefit, which is that
cost_index is more accurate since it correctly discounts the evaluation
cost of clauses that will be removed. We can also avoid considering such
clauses as potential indexquals, which saves useless matching cycles in
the case where the predicate columns aren't in the index, and prevents
generating bogus plans that double-count the clause's selectivity when
the columns are in the index.
Tomas Vondra and Kyotaro Horiguchi, reviewed by Kevin Grittner and
Konstantin Knizhnik, and whacked around a little by me
2016-03-31 20:48:56 +02:00
|
|
|
/* ... skip the predicate proof attempt createplan.c will try ... */
|
Fix long-obsolete code for separating filter conditions in cost_index().
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
2015-03-04 03:19:42 +01:00
|
|
|
result = lappend(result, rinfo);
|
|
|
|
}
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
/*
|
|
|
|
* index_pages_fetched
|
|
|
|
* Estimate the number of pages actually fetched after accounting for
|
|
|
|
* cache effects.
|
|
|
|
*
|
|
|
|
* We use an approximation proposed by Mackert and Lohman, "Index Scans
|
|
|
|
* Using a Finite LRU Buffer: A Validated I/O Model", ACM Transactions
|
|
|
|
* on Database Systems, Vol. 14, No. 3, September 1989, Pages 401-424.
|
|
|
|
* The Mackert and Lohman approximation is that the number of pages
|
|
|
|
* fetched is
|
|
|
|
* PF =
|
|
|
|
* min(2TNs/(2T+Ns), T) when T <= b
|
|
|
|
* 2TNs/(2T+Ns) when T > b and Ns <= 2Tb/(2T-b)
|
|
|
|
* b + (Ns - 2Tb/(2T-b))*(T-b)/T when T > b and Ns > 2Tb/(2T-b)
|
|
|
|
* where
|
|
|
|
* T = # pages in table
|
|
|
|
* N = # tuples in table
|
|
|
|
* s = selectivity = fraction of table to be scanned
|
|
|
|
* b = # buffer pages available (we include kernel space here)
|
|
|
|
*
|
|
|
|
* We assume that effective_cache_size is the total number of buffer pages
|
2006-09-20 00:49:53 +02:00
|
|
|
* available for the whole query, and pro-rate that space across all the
|
|
|
|
* tables in the query and the index currently under consideration. (This
|
|
|
|
* ignores space needed for other indexes used by the query, but since we
|
|
|
|
* don't know which indexes will get used, we can't estimate that very well;
|
|
|
|
* and in any case counting all the tables may well be an overestimate, since
|
|
|
|
* depending on the join plan not all the tables may be scanned concurrently.)
|
2006-06-06 19:59:58 +02:00
|
|
|
*
|
|
|
|
* The product Ns is the number of tuples fetched; we pass in that
|
2006-09-20 00:49:53 +02:00
|
|
|
* product rather than calculating it here. "pages" is the number of pages
|
|
|
|
* in the object under consideration (either an index or a table).
|
|
|
|
* "index_pages" is the amount to add to the total table space, which was
|
2019-02-13 08:31:20 +01:00
|
|
|
* computed for us by make_one_rel.
|
2006-06-06 19:59:58 +02:00
|
|
|
*
|
|
|
|
* Caller is expected to have ensured that tuples_fetched is greater than zero
|
2014-05-06 18:12:18 +02:00
|
|
|
* and rounded to integer (see clamp_row_est). The result will likewise be
|
2006-06-06 19:59:58 +02:00
|
|
|
* greater than zero and integral.
|
|
|
|
*/
|
|
|
|
double
|
|
|
|
index_pages_fetched(double tuples_fetched, BlockNumber pages,
|
2006-09-20 00:49:53 +02:00
|
|
|
double index_pages, PlannerInfo *root)
|
2006-06-06 19:59:58 +02:00
|
|
|
{
|
|
|
|
double pages_fetched;
|
2006-09-20 00:49:53 +02:00
|
|
|
double total_pages;
|
2006-06-06 19:59:58 +02:00
|
|
|
double T,
|
|
|
|
b;
|
|
|
|
|
|
|
|
/* T is # pages in table, but don't allow it to be zero */
|
|
|
|
T = (pages > 1) ? (double) pages : 1.0;
|
|
|
|
|
2006-09-20 00:49:53 +02:00
|
|
|
/* Compute number of pages assumed to be competing for cache space */
|
|
|
|
total_pages = root->total_table_pages + index_pages;
|
|
|
|
total_pages = Max(total_pages, 1.0);
|
|
|
|
Assert(T <= total_pages);
|
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
/* b is pro-rated share of effective_cache_size */
|
2017-06-21 20:39:04 +02:00
|
|
|
b = (double) effective_cache_size * T / total_pages;
|
2006-10-04 02:30:14 +02:00
|
|
|
|
2006-06-06 19:59:58 +02:00
|
|
|
/* force it positive and integral */
|
|
|
|
if (b <= 1.0)
|
|
|
|
b = 1.0;
|
|
|
|
else
|
|
|
|
b = ceil(b);
|
|
|
|
|
|
|
|
/* This part is the Mackert and Lohman formula */
|
|
|
|
if (T <= b)
|
|
|
|
{
|
|
|
|
pages_fetched =
|
|
|
|
(2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
|
|
|
|
if (pages_fetched >= T)
|
|
|
|
pages_fetched = T;
|
|
|
|
else
|
|
|
|
pages_fetched = ceil(pages_fetched);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
double lim;
|
|
|
|
|
|
|
|
lim = (2.0 * T * b) / (2.0 * T - b);
|
|
|
|
if (tuples_fetched <= lim)
|
|
|
|
{
|
|
|
|
pages_fetched =
|
|
|
|
(2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
pages_fetched =
|
|
|
|
b + (tuples_fetched - lim) * (T - b) / T;
|
|
|
|
}
|
|
|
|
pages_fetched = ceil(pages_fetched);
|
|
|
|
}
|
|
|
|
return pages_fetched;
|
|
|
|
}
|
|
|
|
|
2006-09-20 00:49:53 +02:00
|
|
|
/*
|
|
|
|
* get_indexpath_pages
|
|
|
|
* Determine the total size of the indexes used in a bitmap index path.
|
|
|
|
*
|
|
|
|
* Note: if the same index is used more than once in a bitmap tree, we will
|
|
|
|
* count it multiple times, which perhaps is the wrong thing ... but it's
|
|
|
|
* not completely clear, and detecting duplicates is difficult, so ignore it
|
|
|
|
* for now.
|
|
|
|
*/
|
|
|
|
static double
|
|
|
|
get_indexpath_pages(Path *bitmapqual)
|
|
|
|
{
|
|
|
|
double result = 0;
|
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
if (IsA(bitmapqual, BitmapAndPath))
|
|
|
|
{
|
|
|
|
BitmapAndPath *apath = (BitmapAndPath *) bitmapqual;
|
|
|
|
|
|
|
|
foreach(l, apath->bitmapquals)
|
|
|
|
{
|
|
|
|
result += get_indexpath_pages((Path *) lfirst(l));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else if (IsA(bitmapqual, BitmapOrPath))
|
|
|
|
{
|
|
|
|
BitmapOrPath *opath = (BitmapOrPath *) bitmapqual;
|
|
|
|
|
|
|
|
foreach(l, opath->bitmapquals)
|
|
|
|
{
|
|
|
|
result += get_indexpath_pages((Path *) lfirst(l));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else if (IsA(bitmapqual, IndexPath))
|
|
|
|
{
|
|
|
|
IndexPath *ipath = (IndexPath *) bitmapqual;
|
|
|
|
|
|
|
|
result = (double) ipath->indexinfo->pages;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
elog(ERROR, "unrecognized node type: %d", nodeTag(bitmapqual));
|
|
|
|
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
/*
|
2005-04-21 21:18:13 +02:00
|
|
|
* cost_bitmap_heap_scan
|
2005-04-20 00:35:18 +02:00
|
|
|
* Determines and returns the cost of scanning a relation using a bitmap
|
|
|
|
* index-then-heap plan.
|
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
2005-04-21 21:18:13 +02:00
|
|
|
* 'bitmapqual' is a tree of IndexPaths, BitmapAndPaths, and BitmapOrPaths
|
2012-01-28 01:26:38 +01:00
|
|
|
* 'loop_count' is the number of repetitions of the indexscan to factor into
|
|
|
|
* estimates of caching behavior
|
2006-06-06 19:59:58 +02:00
|
|
|
*
|
2012-01-28 01:26:38 +01:00
|
|
|
* Note: the component IndexPaths in bitmapqual should have been costed
|
|
|
|
* using the same loop_count.
|
2005-04-20 00:35:18 +02:00
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
ParamPathInfo *param_info,
|
2012-01-28 01:26:38 +01:00
|
|
|
Path *bitmapqual, double loop_count)
|
2005-04-20 00:35:18 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2005-04-21 04:28:02 +02:00
|
|
|
Cost indexTotalCost;
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
QualCost qpqual_cost;
|
2005-04-21 04:28:02 +02:00
|
|
|
Cost cpu_per_tuple;
|
|
|
|
Cost cost_per_page;
|
Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.
Dilip Kumar, with some corrections and cosmetic changes by me. The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 18:05:43 +01:00
|
|
|
Cost cpu_run_cost;
|
2005-04-21 04:28:02 +02:00
|
|
|
double tuples_fetched;
|
|
|
|
double pages_fetched;
|
2010-01-05 22:54:00 +01:00
|
|
|
double spc_seq_page_cost,
|
|
|
|
spc_random_page_cost;
|
2005-04-21 04:28:02 +02:00
|
|
|
double T;
|
2005-04-20 00:35:18 +02:00
|
|
|
|
|
|
|
/* Should only be applied to base relations */
|
|
|
|
Assert(IsA(baserel, RelOptInfo));
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_RELATION);
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
2005-04-21 21:18:13 +02:00
|
|
|
if (!enable_bitmapscan)
|
2005-04-21 04:28:02 +02:00
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
2017-01-27 22:22:11 +01:00
|
|
|
pages_fetched = compute_bitmap_pages(root, baserel, bitmapqual,
|
|
|
|
loop_count, &indexTotalCost,
|
|
|
|
&tuples_fetched);
|
2005-04-21 04:28:02 +02:00
|
|
|
|
|
|
|
startup_cost += indexTotalCost;
|
2017-01-27 22:22:11 +01:00
|
|
|
T = (baserel->pages > 1) ? (double) baserel->pages : 1.0;
|
2005-04-21 04:28:02 +02:00
|
|
|
|
2010-01-05 22:54:00 +01:00
|
|
|
/* Fetch estimated page costs for tablespace containing table. */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
&spc_random_page_cost,
|
|
|
|
&spc_seq_page_cost);
|
|
|
|
|
2005-04-21 04:28:02 +02:00
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* For small numbers of pages we should charge spc_random_page_cost
|
|
|
|
* apiece, while if nearly all the table's pages are being read, it's more
|
2014-05-06 18:12:18 +02:00
|
|
|
* appropriate to charge spc_seq_page_cost apiece. The effect is
|
2010-02-26 03:01:40 +01:00
|
|
|
* nonlinear, too. For lack of a better idea, interpolate like this to
|
|
|
|
* determine the cost per page.
|
2005-04-21 04:28:02 +02:00
|
|
|
*/
|
2005-04-22 23:58:32 +02:00
|
|
|
if (pages_fetched >= 2.0)
|
2010-01-05 22:54:00 +01:00
|
|
|
cost_per_page = spc_random_page_cost -
|
|
|
|
(spc_random_page_cost - spc_seq_page_cost)
|
|
|
|
* sqrt(pages_fetched / T);
|
2005-04-22 23:58:32 +02:00
|
|
|
else
|
2010-01-05 22:54:00 +01:00
|
|
|
cost_per_page = spc_random_page_cost;
|
2005-04-21 04:28:02 +02:00
|
|
|
|
|
|
|
run_cost += pages_fetched * cost_per_page;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate CPU costs per tuple.
|
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Often the indexquals don't need to be rechecked at each tuple ... but
|
|
|
|
* not always, especially not if there are enough tuples involved that the
|
2005-10-15 04:49:52 +02:00
|
|
|
* bitmaps become lossy. For the moment, just assume they will be
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* rechecked always. This means we charge the full freight for all the
|
|
|
|
* scan clauses.
|
2005-04-21 04:28:02 +02:00
|
|
|
*/
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.
Dilip Kumar, with some corrections and cosmetic changes by me. The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 18:05:43 +01:00
|
|
|
cpu_run_cost = cpu_per_tuple * tuples_fetched;
|
|
|
|
|
|
|
|
/* Adjust costing for parallelism, if used. */
|
|
|
|
if (path->parallel_workers > 0)
|
|
|
|
{
|
|
|
|
double parallel_divisor = get_parallel_divisor(path);
|
|
|
|
|
|
|
|
/* The CPU cost is divided among all the workers. */
|
|
|
|
cpu_run_cost /= parallel_divisor;
|
2005-04-21 04:28:02 +02:00
|
|
|
|
Support parallel bitmap heap scans.
The index is scanned by a single process, but then all cooperating
processes can iterate jointly over the resulting set of heap blocks.
In the future, we might also want to support using a parallel bitmap
index scan to set up for a parallel bitmap heap scan, but that's a
job for another day.
Dilip Kumar, with some corrections and cosmetic changes by me. The
larger patch set of which this is a part has been reviewed and tested
by (at least) Andres Freund, Amit Khandekar, Tushar Ahuja, Rafia
Sabih, Haribabu Kommi, Thomas Munro, and me.
Discussion: http://postgr.es/m/CAFiTN-uc4=0WxRGfCzs-xfkMYcSEWUC-Fon6thkJGjkh9i=13A@mail.gmail.com
2017-03-08 18:05:43 +01:00
|
|
|
path->rows = clamp_row_est(path->rows / parallel_divisor);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
run_cost += cpu_run_cost;
|
2005-04-20 00:35:18 +02:00
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
2005-04-20 00:35:18 +02:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2005-04-21 04:28:02 +02:00
|
|
|
/*
|
2005-04-21 21:18:13 +02:00
|
|
|
* cost_bitmap_tree_node
|
|
|
|
* Extract cost and selectivity from a bitmap tree node (index/and/or)
|
2005-04-21 04:28:02 +02:00
|
|
|
*/
|
2005-04-22 23:58:32 +02:00
|
|
|
void
|
2005-04-21 21:18:13 +02:00
|
|
|
cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec)
|
2005-04-21 04:28:02 +02:00
|
|
|
{
|
2005-04-21 21:18:13 +02:00
|
|
|
if (IsA(path, IndexPath))
|
2005-04-21 04:28:02 +02:00
|
|
|
{
|
2005-04-21 21:18:13 +02:00
|
|
|
*cost = ((IndexPath *) path)->indextotalcost;
|
|
|
|
*selec = ((IndexPath *) path)->indexselectivity;
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2006-12-15 19:42:26 +01:00
|
|
|
/*
|
|
|
|
* Charge a small amount per retrieved tuple to reflect the costs of
|
|
|
|
* manipulating the bitmap. This is mostly to make sure that a bitmap
|
2007-11-15 22:14:46 +01:00
|
|
|
* scan doesn't look to be the same cost as an indexscan to retrieve a
|
|
|
|
* single tuple.
|
2006-12-15 19:42:26 +01:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
*cost += 0.1 * cpu_operator_cost * path->rows;
|
2005-04-21 04:28:02 +02:00
|
|
|
}
|
2005-04-21 21:18:13 +02:00
|
|
|
else if (IsA(path, BitmapAndPath))
|
2005-04-21 04:28:02 +02:00
|
|
|
{
|
2005-04-21 21:18:13 +02:00
|
|
|
*cost = path->total_cost;
|
|
|
|
*selec = ((BitmapAndPath *) path)->bitmapselectivity;
|
2005-04-21 04:28:02 +02:00
|
|
|
}
|
2005-04-21 21:18:13 +02:00
|
|
|
else if (IsA(path, BitmapOrPath))
|
2005-04-21 04:28:02 +02:00
|
|
|
{
|
2005-04-21 21:18:13 +02:00
|
|
|
*cost = path->total_cost;
|
|
|
|
*selec = ((BitmapOrPath *) path)->bitmapselectivity;
|
2005-04-21 04:28:02 +02:00
|
|
|
}
|
|
|
|
else
|
2006-11-11 02:14:19 +01:00
|
|
|
{
|
2005-04-21 21:18:13 +02:00
|
|
|
elog(ERROR, "unrecognized node type: %d", nodeTag(path));
|
2006-11-11 02:14:19 +01:00
|
|
|
*cost = *selec = 0; /* keep compiler quiet */
|
|
|
|
}
|
2005-04-21 21:18:13 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cost_bitmap_and_node
|
|
|
|
* Estimate the cost of a BitmapAnd node
|
|
|
|
*
|
|
|
|
* Note that this considers only the costs of index scanning and bitmap
|
2014-05-06 18:12:18 +02:00
|
|
|
* creation, not the eventual heap access. In that sense the object isn't
|
2005-04-21 21:18:13 +02:00
|
|
|
* truly a Path, but it has enough path-like properties (costs in particular)
|
2012-01-28 01:26:38 +01:00
|
|
|
* to warrant treating it as one. We don't bother to set the path rows field,
|
|
|
|
* however.
|
2005-04-21 21:18:13 +02:00
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root)
|
2005-04-21 21:18:13 +02:00
|
|
|
{
|
|
|
|
Cost totalCost;
|
2005-10-15 04:49:52 +02:00
|
|
|
Selectivity selec;
|
2005-04-21 21:18:13 +02:00
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We estimate AND selectivity on the assumption that the inputs are
|
|
|
|
* independent. This is probably often wrong, but we don't have the info
|
|
|
|
* to do better.
|
2005-04-21 21:18:13 +02:00
|
|
|
*
|
|
|
|
* The runtime cost of the BitmapAnd itself is estimated at 100x
|
2005-10-15 04:49:52 +02:00
|
|
|
* cpu_operator_cost for each tbm_intersect needed. Probably too small,
|
|
|
|
* definitely too simplistic?
|
2005-04-21 21:18:13 +02:00
|
|
|
*/
|
|
|
|
totalCost = 0.0;
|
|
|
|
selec = 1.0;
|
|
|
|
foreach(l, path->bitmapquals)
|
2005-04-21 04:28:02 +02:00
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
Cost subCost;
|
2005-04-21 21:18:13 +02:00
|
|
|
Selectivity subselec;
|
|
|
|
|
|
|
|
cost_bitmap_tree_node(subpath, &subCost, &subselec);
|
|
|
|
|
|
|
|
selec *= subselec;
|
|
|
|
|
|
|
|
totalCost += subCost;
|
|
|
|
if (l != list_head(path->bitmapquals))
|
|
|
|
totalCost += 100.0 * cpu_operator_cost;
|
2005-04-21 04:28:02 +02:00
|
|
|
}
|
2005-04-21 21:18:13 +02:00
|
|
|
path->bitmapselectivity = selec;
|
2012-01-28 01:26:38 +01:00
|
|
|
path->path.rows = 0; /* per above, not used */
|
2005-04-21 21:18:13 +02:00
|
|
|
path->path.startup_cost = totalCost;
|
|
|
|
path->path.total_cost = totalCost;
|
|
|
|
}
|
2005-04-21 04:28:02 +02:00
|
|
|
|
2005-04-21 21:18:13 +02:00
|
|
|
/*
|
|
|
|
* cost_bitmap_or_node
|
|
|
|
* Estimate the cost of a BitmapOr node
|
|
|
|
*
|
|
|
|
* See comments for cost_bitmap_and_node.
|
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_bitmap_or_node(BitmapOrPath *path, PlannerInfo *root)
|
2005-04-21 21:18:13 +02:00
|
|
|
{
|
|
|
|
Cost totalCost;
|
2005-10-15 04:49:52 +02:00
|
|
|
Selectivity selec;
|
2005-04-21 21:18:13 +02:00
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We estimate OR selectivity on the assumption that the inputs are
|
|
|
|
* non-overlapping, since that's often the case in "x IN (list)" type
|
2014-05-06 18:12:18 +02:00
|
|
|
* situations. Of course, we clamp to 1.0 at the end.
|
2005-04-21 21:18:13 +02:00
|
|
|
*
|
|
|
|
* The runtime cost of the BitmapOr itself is estimated at 100x
|
2005-10-15 04:49:52 +02:00
|
|
|
* cpu_operator_cost for each tbm_union needed. Probably too small,
|
|
|
|
* definitely too simplistic? We are aware that the tbm_unions are
|
|
|
|
* optimized out when the inputs are BitmapIndexScans.
|
2005-04-21 21:18:13 +02:00
|
|
|
*/
|
|
|
|
totalCost = 0.0;
|
|
|
|
selec = 0.0;
|
|
|
|
foreach(l, path->bitmapquals)
|
|
|
|
{
|
2005-10-15 04:49:52 +02:00
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
Cost subCost;
|
2005-04-21 21:18:13 +02:00
|
|
|
Selectivity subselec;
|
|
|
|
|
|
|
|
cost_bitmap_tree_node(subpath, &subCost, &subselec);
|
|
|
|
|
|
|
|
selec += subselec;
|
|
|
|
|
|
|
|
totalCost += subCost;
|
|
|
|
if (l != list_head(path->bitmapquals) &&
|
|
|
|
!IsA(subpath, IndexPath))
|
|
|
|
totalCost += 100.0 * cpu_operator_cost;
|
|
|
|
}
|
|
|
|
path->bitmapselectivity = Min(selec, 1.0);
|
2012-01-28 01:26:38 +01:00
|
|
|
path->path.rows = 0; /* per above, not used */
|
2005-04-21 21:18:13 +02:00
|
|
|
path->path.startup_cost = totalCost;
|
|
|
|
path->path.total_cost = totalCost;
|
2005-04-21 04:28:02 +02:00
|
|
|
}
|
|
|
|
|
1999-11-23 21:07:06 +01:00
|
|
|
/*
|
|
|
|
* cost_tidscan
|
2001-06-05 07:26:05 +02:00
|
|
|
* Determines and returns the cost of scanning a relation using TIDs.
|
2012-08-27 04:48:55 +02:00
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'tidquals' is the list of TID-checkable quals
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
1999-11-23 21:07:06 +01:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_tidscan(Path *path, PlannerInfo *root,
|
2012-08-27 04:48:55 +02:00
|
|
|
RelOptInfo *baserel, List *tidquals, ParamPathInfo *param_info)
|
1999-11-23 21:07:06 +01:00
|
|
|
{
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2007-10-24 20:37:09 +02:00
|
|
|
bool isCurrentOf = false;
|
2012-08-27 04:48:55 +02:00
|
|
|
QualCost qpqual_cost;
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost cpu_per_tuple;
|
2007-06-11 03:16:30 +02:00
|
|
|
QualCost tid_qual_cost;
|
2005-11-26 23:14:57 +01:00
|
|
|
int ntuples;
|
|
|
|
ListCell *l;
|
2010-01-05 22:54:00 +01:00
|
|
|
double spc_random_page_cost;
|
1999-11-23 21:07:06 +01:00
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
/* Should only be applied to base relations */
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(baserel->relid > 0);
|
2002-05-12 22:10:05 +02:00
|
|
|
Assert(baserel->rtekind == RTE_RELATION);
|
|
|
|
|
2012-08-27 04:48:55 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2005-11-26 23:14:57 +01:00
|
|
|
/* Count how many tuples we expect to retrieve */
|
|
|
|
ntuples = 0;
|
|
|
|
foreach(l, tidquals)
|
|
|
|
{
|
2018-12-30 21:24:28 +01:00
|
|
|
RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
|
|
|
|
Expr *qual = rinfo->clause;
|
|
|
|
|
|
|
|
if (IsA(qual, ScalarArrayOpExpr))
|
2005-11-26 23:14:57 +01:00
|
|
|
{
|
|
|
|
/* Each element of the array yields 1 tuple */
|
2018-12-30 21:24:28 +01:00
|
|
|
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) qual;
|
2006-10-04 02:30:14 +02:00
|
|
|
Node *arraynode = (Node *) lsecond(saop->args);
|
2005-11-26 23:14:57 +01:00
|
|
|
|
|
|
|
ntuples += estimate_array_length(arraynode);
|
|
|
|
}
|
2018-12-30 21:24:28 +01:00
|
|
|
else if (IsA(qual, CurrentOfExpr))
|
2007-10-24 20:37:09 +02:00
|
|
|
{
|
|
|
|
/* CURRENT OF yields 1 tuple */
|
|
|
|
isCurrentOf = true;
|
|
|
|
ntuples++;
|
|
|
|
}
|
2005-11-26 23:14:57 +01:00
|
|
|
else
|
|
|
|
{
|
|
|
|
/* It's just CTID = something, count 1 tuple */
|
|
|
|
ntuples++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-10-24 20:37:09 +02:00
|
|
|
/*
|
|
|
|
* We must force TID scan for WHERE CURRENT OF, because only nodeTidscan.c
|
2014-05-06 18:12:18 +02:00
|
|
|
* understands how to do it correctly. Therefore, honor enable_tidscan
|
2007-10-24 20:37:09 +02:00
|
|
|
* only when CURRENT OF isn't present. Also note that cost_qual_eval
|
|
|
|
* counts a CurrentOfExpr as having startup cost disable_cost, which we
|
|
|
|
* subtract off here; that's to prevent other plan types such as seqscan
|
|
|
|
* from winning.
|
|
|
|
*/
|
|
|
|
if (isCurrentOf)
|
|
|
|
{
|
|
|
|
Assert(baserel->baserestrictcost.startup >= disable_cost);
|
|
|
|
startup_cost -= disable_cost;
|
|
|
|
}
|
|
|
|
else if (!enable_tidscan)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
2007-06-11 03:16:30 +02:00
|
|
|
/*
|
|
|
|
* The TID qual expressions will be computed once, any other baserestrict
|
2015-09-05 10:35:49 +02:00
|
|
|
* quals once per retrieved tuple.
|
2007-06-11 03:16:30 +02:00
|
|
|
*/
|
|
|
|
cost_qual_eval(&tid_qual_cost, tidquals, root);
|
|
|
|
|
2010-01-05 22:54:00 +01:00
|
|
|
/* fetch estimated page cost for tablespace containing table */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
&spc_random_page_cost,
|
|
|
|
NULL);
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/* disk costs --- assume each tuple on a different page */
|
2010-01-05 22:54:00 +01:00
|
|
|
run_cost += spc_random_page_cost * ntuples;
|
1999-11-23 21:07:06 +01:00
|
|
|
|
2012-08-27 04:48:55 +02:00
|
|
|
/* Add scanning CPU costs */
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
/* XXX currently we assume TID quals are a subset of qpquals */
|
|
|
|
startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
|
2007-06-11 03:16:30 +02:00
|
|
|
tid_qual_cost.per_tuple;
|
2000-02-15 21:49:31 +01:00
|
|
|
run_cost += cpu_per_tuple * ntuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
1999-11-23 21:07:06 +01:00
|
|
|
}
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2021-02-27 10:59:36 +01:00
|
|
|
/*
|
|
|
|
* cost_tidrangescan
|
|
|
|
* Determines and sets the costs of scanning a relation using a range of
|
|
|
|
* TIDs for 'path'
|
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'tidrangequals' is the list of TID-checkable range quals
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_tidrangescan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, List *tidrangequals,
|
|
|
|
ParamPathInfo *param_info)
|
|
|
|
{
|
|
|
|
Selectivity selectivity;
|
|
|
|
double pages;
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
QualCost qpqual_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
QualCost tid_qual_cost;
|
|
|
|
double ntuples;
|
|
|
|
double nseqpages;
|
|
|
|
double spc_random_page_cost;
|
|
|
|
double spc_seq_page_cost;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_RELATION);
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
|
|
|
/* Count how many tuples and pages we expect to scan */
|
|
|
|
selectivity = clauselist_selectivity(root, tidrangequals, baserel->relid,
|
|
|
|
JOIN_INNER, NULL);
|
|
|
|
pages = ceil(selectivity * baserel->pages);
|
|
|
|
|
|
|
|
if (pages <= 0.0)
|
|
|
|
pages = 1.0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The first page in a range requires a random seek, but each subsequent
|
|
|
|
* page is just a normal sequential page read. NOTE: it's desirable for
|
|
|
|
* TID Range Scans to cost more than the equivalent Sequential Scans,
|
|
|
|
* because Seq Scans have some performance advantages such as scan
|
|
|
|
* synchronization and parallelizability, and we'd prefer one of them to
|
|
|
|
* be picked unless a TID Range Scan really is better.
|
|
|
|
*/
|
|
|
|
ntuples = selectivity * baserel->tuples;
|
|
|
|
nseqpages = pages - 1.0;
|
|
|
|
|
|
|
|
if (!enable_tidscan)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The TID qual expressions will be computed once, any other baserestrict
|
|
|
|
* quals once per retrieved tuple.
|
|
|
|
*/
|
|
|
|
cost_qual_eval(&tid_qual_cost, tidrangequals, root);
|
|
|
|
|
|
|
|
/* fetch estimated page cost for tablespace containing table */
|
|
|
|
get_tablespace_page_costs(baserel->reltablespace,
|
|
|
|
&spc_random_page_cost,
|
|
|
|
&spc_seq_page_cost);
|
|
|
|
|
|
|
|
/* disk costs; 1 random page and the remainder as seq pages */
|
|
|
|
run_cost += spc_random_page_cost + spc_seq_page_cost * nseqpages;
|
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* XXX currently we assume TID quals are a subset of qpquals at this
|
|
|
|
* point; they will be removed (if possible) when we create the plan, so
|
|
|
|
* we subtract their cost from the total qpqual cost. (If the TID quals
|
|
|
|
* can't be removed, this is a mistake and we're going to underestimate
|
|
|
|
* the CPU cost a bit.)
|
|
|
|
*/
|
|
|
|
startup_cost += qpqual_cost.startup + tid_qual_cost.per_tuple;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple -
|
|
|
|
tid_qual_cost.per_tuple;
|
|
|
|
run_cost += cpu_per_tuple * ntuples;
|
|
|
|
|
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2003-07-15 00:35:54 +02:00
|
|
|
/*
|
|
|
|
* cost_subqueryscan
|
|
|
|
* Determines and returns the cost of scanning a subquery RTE.
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
2003-07-15 00:35:54 +02:00
|
|
|
*/
|
|
|
|
void
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
2003-07-15 00:35:54 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost;
|
|
|
|
Cost run_cost;
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
QualCost qpqual_cost;
|
2003-07-15 00:35:54 +02:00
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are subqueries */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_SUBQUERY);
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
path->path.rows = param_info->ppi_rows;
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
else
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
path->path.rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2003-07-15 00:35:54 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Cost of path is cost of evaluating the subplan, plus cost of evaluating
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
* any restriction clauses and tlist that will be attached to the
|
|
|
|
* SubqueryScan node, plus cpu_tuple_cost to account for selection and
|
|
|
|
* projection overhead.
|
2003-07-15 00:35:54 +02:00
|
|
|
*/
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
path->path.startup_cost = path->subpath->startup_cost;
|
|
|
|
path->path.total_cost = path->subpath->total_cost;
|
2003-07-15 00:35:54 +02:00
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost = qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
2003-07-15 00:35:54 +02:00
|
|
|
run_cost = cpu_per_tuple * baserel->tuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
startup_cost += path->path.pathtarget->cost.startup;
|
|
|
|
run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
path->path.startup_cost += startup_cost;
|
|
|
|
path->path.total_cost += startup_cost + run_cost;
|
2003-07-15 00:35:54 +02:00
|
|
|
}
|
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
/*
|
|
|
|
* cost_functionscan
|
|
|
|
* Determines and returns the cost of scanning a function RTE.
|
2012-08-08 01:02:54 +02:00
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
2002-05-12 22:10:05 +02:00
|
|
|
*/
|
|
|
|
void
|
2012-08-08 01:02:54 +02:00
|
|
|
cost_functionscan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
2002-05-12 22:10:05 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-08-08 01:02:54 +02:00
|
|
|
QualCost qpqual_cost;
|
2002-05-12 22:10:05 +02:00
|
|
|
Cost cpu_per_tuple;
|
2007-01-22 02:35:23 +01:00
|
|
|
RangeTblEntry *rte;
|
|
|
|
QualCost exprcost;
|
2002-05-12 22:10:05 +02:00
|
|
|
|
|
|
|
/* Should only be applied to base relations that are functions */
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(baserel->relid > 0);
|
2007-04-21 23:01:45 +02:00
|
|
|
rte = planner_rt_fetch(baserel->relid, root);
|
2007-01-22 02:35:23 +01:00
|
|
|
Assert(rte->rtekind == RTE_FUNCTION);
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2012-08-08 01:02:54 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
* Estimate costs of executing the function expression(s).
|
2009-09-13 00:12:09 +02:00
|
|
|
*
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
* Currently, nodeFunctionscan.c always executes the functions to
|
2009-09-13 00:12:09 +02:00
|
|
|
* completion before returning any rows, and caches the results in a
|
2014-05-06 18:12:18 +02:00
|
|
|
* tuplestore. So the function eval cost is all startup cost, and per-row
|
2010-02-26 03:01:40 +01:00
|
|
|
* costs are minimal.
|
2009-09-13 00:12:09 +02:00
|
|
|
*
|
|
|
|
* XXX in principle we ought to charge tuplestore spill costs if the
|
|
|
|
* number of rows is large. However, given how phony our rowcount
|
2010-02-26 03:01:40 +01:00
|
|
|
* estimates for functions tend to be, there's not a lot of point in that
|
|
|
|
* refinement right now.
|
2009-09-13 00:12:09 +02:00
|
|
|
*/
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
cost_qual_eval_node(&exprcost, (Node *) rte->functions, root);
|
2007-01-22 02:35:23 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
startup_cost += exprcost.startup + exprcost.per_tuple;
|
2002-05-12 22:10:05 +02:00
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
2012-08-08 01:02:54 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
2002-05-12 22:10:05 +02:00
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2017-03-08 16:39:37 +01:00
|
|
|
/*
|
|
|
|
* cost_tablefuncscan
|
|
|
|
* Determines and returns the cost of scanning a table function.
|
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_tablefuncscan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
QualCost qpqual_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
RangeTblEntry *rte;
|
|
|
|
QualCost exprcost;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are functions */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
rte = planner_rt_fetch(baserel->relid, root);
|
|
|
|
Assert(rte->rtekind == RTE_TABLEFUNC);
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate costs of executing the table func expression(s).
|
|
|
|
*
|
|
|
|
* XXX in principle we ought to charge tuplestore spill costs if the
|
|
|
|
* number of rows is large. However, given how phony our rowcount
|
|
|
|
* estimates for tablefuncs tend to be, there's not a lot of point in that
|
|
|
|
* refinement right now.
|
|
|
|
*/
|
|
|
|
cost_qual_eval_node(&exprcost, (Node *) rte->tablefunc, root);
|
|
|
|
|
|
|
|
startup_cost += exprcost.startup + exprcost.per_tuple;
|
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2006-08-02 03:59:48 +02:00
|
|
|
/*
|
|
|
|
* cost_valuesscan
|
|
|
|
* Determines and returns the cost of scanning a VALUES RTE.
|
2012-08-12 22:01:26 +02:00
|
|
|
*
|
|
|
|
* 'baserel' is the relation to be scanned
|
|
|
|
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
|
2006-08-02 03:59:48 +02:00
|
|
|
*/
|
|
|
|
void
|
2012-08-12 22:01:26 +02:00
|
|
|
cost_valuesscan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
2006-08-02 03:59:48 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-08-12 22:01:26 +02:00
|
|
|
QualCost qpqual_cost;
|
2006-08-02 03:59:48 +02:00
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are values lists */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_VALUES);
|
|
|
|
|
2012-08-12 22:01:26 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2006-08-02 03:59:48 +02:00
|
|
|
/*
|
2006-10-04 02:30:14 +02:00
|
|
|
* For now, estimate list evaluation cost at one operator eval per list
|
|
|
|
* (probably pretty bogus, but is it worth being smarter?)
|
2006-08-02 03:59:48 +02:00
|
|
|
*/
|
|
|
|
cpu_per_tuple = cpu_operator_cost;
|
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
2012-08-12 22:01:26 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple += cpu_tuple_cost + qpqual_cost.per_tuple;
|
2006-08-02 03:59:48 +02:00
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
2006-08-02 03:59:48 +02:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
/*
|
|
|
|
* cost_ctescan
|
|
|
|
* Determines and returns the cost of scanning a CTE RTE.
|
|
|
|
*
|
|
|
|
* Note: this is used for both self-reference and regular CTEs; the
|
|
|
|
* possible cost differences are below the threshold of what we could
|
2014-05-06 18:12:18 +02:00
|
|
|
* estimate accurately anyway. Note that the costs of evaluating the
|
2008-10-04 23:56:55 +02:00
|
|
|
* referenced CTE query are added into the final plan as initplan costs,
|
|
|
|
* and should NOT be counted here.
|
|
|
|
*/
|
|
|
|
void
|
2012-08-27 04:48:55 +02:00
|
|
|
cost_ctescan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
2008-10-04 23:56:55 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-08-27 04:48:55 +02:00
|
|
|
QualCost qpqual_cost;
|
2008-10-04 23:56:55 +02:00
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are CTEs */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_CTE);
|
|
|
|
|
2012-08-27 04:48:55 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
/* Charge one CPU tuple cost per row for tuplestore manipulation */
|
|
|
|
cpu_per_tuple = cpu_tuple_cost;
|
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
2012-08-27 04:48:55 +02:00
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple += cpu_tuple_cost + qpqual_cost.per_tuple;
|
2008-10-04 23:56:55 +02:00
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->pathtarget->cost.startup;
|
|
|
|
run_cost += path->pathtarget->cost.per_tuple * path->rows;
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2017-04-01 06:17:18 +02:00
|
|
|
/*
|
|
|
|
* cost_namedtuplestorescan
|
|
|
|
* Determines and returns the cost of scanning a named tuplestore.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_namedtuplestorescan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
QualCost qpqual_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are Tuplestores */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_NAMEDTUPLESTORE);
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
|
|
|
/* Charge one CPU tuple cost per row for tuplestore manipulation */
|
|
|
|
cpu_per_tuple = cpu_tuple_cost;
|
|
|
|
|
|
|
|
/* Add scanning CPU costs */
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple += cpu_tuple_cost + qpqual_cost.per_tuple;
|
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
In the planner, replace an empty FROM clause with a dummy RTE.
The fact that "SELECT expression" has no base relations has long been a
thorn in the side of the planner. It makes it hard to flatten a sub-query
that looks like that, or is a trivial VALUES() item, because the planner
generally uses relid sets to identify sub-relations, and such a sub-query
would have an empty relid set if we flattened it. prepjointree.c contains
some baroque logic that works around this in certain special cases --- but
there is a much better answer. We can replace an empty FROM clause with a
dummy RTE that acts like a table of one row and no columns, and then there
are no such corner cases to worry about. Instead we need some logic to
get rid of useless dummy RTEs, but that's simpler and covers more cases
than what was there before.
For really trivial cases, where the query is just "SELECT expression" and
nothing else, there's a hazard that adding the extra RTE makes for a
noticeable slowdown; even though it's not much processing, there's not
that much for the planner to do overall. However testing says that the
penalty is very small, close to the noise level. In more complex queries,
this is able to find optimizations that we could not find before.
The new RTE type is called RTE_RESULT, since the "scan" plan type it
gives rise to is a Result node (the same plan we produced for a "SELECT
expression" query before). To avoid confusion, rename the old ResultPath
path type to GroupResultPath, reflecting that it's only used in degenerate
grouping cases where we know the query produces just one grouped row.
(It wouldn't work to unify the two cases, because there are different
rules about where the associated quals live during query_planner.)
Note: although this touches readfuncs.c, I don't think a catversion
bump is required, because the added case can't occur in stored rules,
only plans.
Patch by me, reviewed by David Rowley and Mark Dilger
Discussion: https://postgr.es/m/15944.1521127664@sss.pgh.pa.us
2019-01-28 23:54:10 +01:00
|
|
|
/*
|
|
|
|
* cost_resultscan
|
|
|
|
* Determines and returns the cost of scanning an RTE_RESULT relation.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_resultscan(Path *path, PlannerInfo *root,
|
|
|
|
RelOptInfo *baserel, ParamPathInfo *param_info)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
QualCost qpqual_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
|
|
|
|
/* Should only be applied to RTE_RESULT base relations */
|
|
|
|
Assert(baserel->relid > 0);
|
|
|
|
Assert(baserel->rtekind == RTE_RESULT);
|
|
|
|
|
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (param_info)
|
|
|
|
path->rows = param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->rows = baserel->rows;
|
|
|
|
|
|
|
|
/* We charge qual cost plus cpu_tuple_cost */
|
|
|
|
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
|
|
|
|
|
|
|
|
startup_cost += qpqual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
|
|
|
|
run_cost += cpu_per_tuple * baserel->tuples;
|
|
|
|
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
/*
|
|
|
|
* cost_recursive_union
|
|
|
|
* Determines and returns the cost of performing a recursive union,
|
|
|
|
* and also the estimated output size.
|
|
|
|
*
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
* We are given Paths for the nonrecursive and recursive terms.
|
2008-10-04 23:56:55 +02:00
|
|
|
*/
|
|
|
|
void
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
|
2008-10-04 23:56:55 +02:00
|
|
|
{
|
|
|
|
Cost startup_cost;
|
|
|
|
Cost total_cost;
|
|
|
|
double total_rows;
|
|
|
|
|
|
|
|
/* We probably have decent estimates for the non-recursive term */
|
|
|
|
startup_cost = nrterm->startup_cost;
|
|
|
|
total_cost = nrterm->total_cost;
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
total_rows = nrterm->rows;
|
2008-10-04 23:56:55 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We arbitrarily assume that about 10 recursive iterations will be
|
2009-06-11 16:49:15 +02:00
|
|
|
* needed, and that we've managed to get a good fix on the cost and output
|
|
|
|
* size of each one of them. These are mighty shaky assumptions but it's
|
|
|
|
* hard to see how to do better.
|
2008-10-04 23:56:55 +02:00
|
|
|
*/
|
|
|
|
total_cost += 10 * rterm->total_cost;
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
total_rows += 10 * rterm->rows;
|
2008-10-04 23:56:55 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Also charge cpu_tuple_cost per row to account for the costs of
|
|
|
|
* manipulating the tuplestores. (We don't worry about possible
|
|
|
|
* spill-to-disk costs.)
|
|
|
|
*/
|
|
|
|
total_cost += cpu_tuple_cost * total_rows;
|
|
|
|
|
|
|
|
runion->startup_cost = startup_cost;
|
|
|
|
runion->total_cost = total_cost;
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
runion->rows = total_rows;
|
|
|
|
runion->pathtarget->width = Max(nrterm->pathtarget->width,
|
|
|
|
rterm->pathtarget->width);
|
2008-10-04 23:56:55 +02:00
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
* cost_tuplesort
|
|
|
|
* Determines and returns the cost of sorting a relation using tuplesort,
|
|
|
|
* not including the cost of reading the input data.
|
2000-02-15 21:49:31 +01:00
|
|
|
*
|
2010-10-08 02:00:28 +02:00
|
|
|
* If the total volume of data to sort is less than sort_mem, we will do
|
2000-01-09 01:26:47 +01:00
|
|
|
* an in-memory sort, which requires no I/O and about t*log2(t) tuple
|
2000-02-15 21:49:31 +01:00
|
|
|
* comparisons for t tuples.
|
2000-01-09 01:26:47 +01:00
|
|
|
*
|
2010-10-08 02:00:28 +02:00
|
|
|
* If the total volume exceeds sort_mem, we switch to a tape-style merge
|
2000-01-09 01:26:47 +01:00
|
|
|
* algorithm. There will still be about t*log2(t) tuple comparisons in
|
|
|
|
* total, but we will also need to write and read each tuple once per
|
2014-05-06 18:12:18 +02:00
|
|
|
* merge pass. We expect about ceil(logM(r)) merge passes where r is the
|
2006-02-19 06:54:06 +01:00
|
|
|
* number of initial runs formed and M is the merge order used by tuplesort.c.
|
Use quicksort, not replacement selection, for external sorting.
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
2016-04-08 08:36:26 +02:00
|
|
|
* Since the average initial run should be about sort_mem, we have
|
|
|
|
* disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
|
2000-02-15 21:49:31 +01:00
|
|
|
* cpu = comparison_cost * t * log2(t)
|
|
|
|
*
|
2007-05-04 03:13:45 +02:00
|
|
|
* If the sort is bounded (i.e., only the first k result tuples are needed)
|
2010-10-08 02:00:28 +02:00
|
|
|
* and k tuples can fit into sort_mem, we use a heap method that keeps only
|
2007-05-04 03:13:45 +02:00
|
|
|
* k tuples in the heap; this will require about t*log2(k) tuple comparisons.
|
|
|
|
*
|
2006-06-05 22:56:33 +02:00
|
|
|
* The disk traffic is assumed to be 3/4ths sequential and 1/4th random
|
2000-02-15 21:49:31 +01:00
|
|
|
* accesses (XXX can't we refine that guess?)
|
|
|
|
*
|
2010-10-08 02:00:28 +02:00
|
|
|
* By default, we charge two operator evals per tuple comparison, which should
|
2014-05-06 18:12:18 +02:00
|
|
|
* be in the right ballpark in most cases. The caller can tweak this by
|
2010-10-08 02:00:28 +02:00
|
|
|
* specifying nonzero comparison_cost; typically that's used for any extra
|
|
|
|
* work that has to be done to prepare the inputs to the comparison operators.
|
1997-09-07 07:04:48 +02:00
|
|
|
*
|
1996-07-09 08:22:35 +02:00
|
|
|
* 'tuples' is the number of tuples in the relation
|
|
|
|
* 'width' is the average tuple width in bytes
|
2010-10-08 02:00:28 +02:00
|
|
|
* 'comparison_cost' is the extra cost per comparison, if any
|
|
|
|
* 'sort_mem' is the number of kilobytes of work memory allowed for the sort
|
2007-05-04 03:13:45 +02:00
|
|
|
* 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
static void
|
|
|
|
cost_tuplesort(Cost *startup_cost, Cost *run_cost,
|
|
|
|
double tuples, int width,
|
|
|
|
Cost comparison_cost, int sort_mem,
|
|
|
|
double limit_tuples)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2007-05-04 03:13:45 +02:00
|
|
|
double input_bytes = relation_byte_size(tuples, width);
|
|
|
|
double output_bytes;
|
|
|
|
double output_tuples;
|
2010-10-08 02:00:28 +02:00
|
|
|
long sort_mem_bytes = sort_mem * 1024L;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1999-05-25 18:15:34 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* We want to be sure the cost of a sort is never estimated as zero, even
|
|
|
|
* if passed-in tuple count is zero. Besides, mustn't do log(0)...
|
1999-04-30 06:01:44 +02:00
|
|
|
*/
|
2000-01-09 01:26:47 +01:00
|
|
|
if (tuples < 2.0)
|
|
|
|
tuples = 2.0;
|
1999-04-30 06:01:44 +02:00
|
|
|
|
2010-10-08 02:00:28 +02:00
|
|
|
/* Include the default cost-per-comparison */
|
|
|
|
comparison_cost += 2.0 * cpu_operator_cost;
|
|
|
|
|
2007-05-04 03:13:45 +02:00
|
|
|
/* Do we have a useful LIMIT? */
|
|
|
|
if (limit_tuples > 0 && limit_tuples < tuples)
|
|
|
|
{
|
|
|
|
output_tuples = limit_tuples;
|
|
|
|
output_bytes = relation_byte_size(output_tuples, width);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
output_tuples = tuples;
|
|
|
|
output_bytes = input_bytes;
|
|
|
|
}
|
1999-04-30 06:01:44 +02:00
|
|
|
|
2010-10-08 02:00:28 +02:00
|
|
|
if (output_bytes > sort_mem_bytes)
|
2000-01-09 01:26:47 +01:00
|
|
|
{
|
2007-05-04 03:13:45 +02:00
|
|
|
/*
|
|
|
|
* We'll have to use a disk-based sort of all the tuples
|
|
|
|
*/
|
|
|
|
double npages = ceil(input_bytes / BLCKSZ);
|
Use quicksort, not replacement selection, for external sorting.
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
2016-04-08 08:36:26 +02:00
|
|
|
double nruns = input_bytes / sort_mem_bytes;
|
2010-10-08 02:00:28 +02:00
|
|
|
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
|
2006-02-19 06:54:06 +01:00
|
|
|
double log_runs;
|
2000-02-15 21:49:31 +01:00
|
|
|
double npageaccesses;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2007-05-04 03:13:45 +02:00
|
|
|
/*
|
|
|
|
* CPU costs
|
|
|
|
*
|
2010-10-08 02:00:28 +02:00
|
|
|
* Assume about N log2 N comparisons
|
2007-05-04 03:13:45 +02:00
|
|
|
*/
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
*startup_cost = comparison_cost * tuples * LOG2(tuples);
|
2007-05-04 03:13:45 +02:00
|
|
|
|
|
|
|
/* Disk costs */
|
|
|
|
|
2006-02-19 06:54:06 +01:00
|
|
|
/* Compute logM(r) as log(r) / log(M) */
|
|
|
|
if (nruns > mergeorder)
|
|
|
|
log_runs = ceil(log(nruns) / log(mergeorder));
|
|
|
|
else
|
2000-01-09 01:26:47 +01:00
|
|
|
log_runs = 1.0;
|
2000-02-15 21:49:31 +01:00
|
|
|
npageaccesses = 2.0 * npages * log_runs;
|
2006-06-05 22:56:33 +02:00
|
|
|
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
*startup_cost += npageaccesses *
|
2006-06-05 22:56:33 +02:00
|
|
|
(seq_page_cost * 0.75 + random_page_cost * 0.25);
|
2000-01-09 01:26:47 +01:00
|
|
|
}
|
2010-10-08 02:00:28 +02:00
|
|
|
else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
|
2007-05-04 03:13:45 +02:00
|
|
|
{
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* We'll use a bounded heap-sort keeping just K tuples in memory, for
|
|
|
|
* a total number of tuple comparisons of N log2 K; but the constant
|
|
|
|
* factor is a bit higher than for quicksort. Tweak it so that the
|
|
|
|
* cost curve is continuous at the crossover point.
|
2007-05-04 03:13:45 +02:00
|
|
|
*/
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
*startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
|
2007-05-04 03:13:45 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* We'll use plain quicksort on all the input tuples */
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
*startup_cost = comparison_cost * tuples * LOG2(tuples);
|
2007-05-04 03:13:45 +02:00
|
|
|
}
|
1999-04-30 06:01:44 +02:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Also charge a small amount (arbitrarily set equal to operator cost) per
|
2010-02-19 22:49:10 +01:00
|
|
|
* extracted tuple. We don't charge cpu_tuple_cost because a Sort node
|
|
|
|
* doesn't do qual-checking or projection, so it has less overhead than
|
|
|
|
* most plan nodes. Note it's correct to use tuples not output_tuples
|
2007-05-04 03:13:45 +02:00
|
|
|
* here --- the upper LIMIT will pro-rate the run cost so we'd be double
|
|
|
|
* counting the LIMIT otherwise.
|
2000-02-15 21:49:31 +01:00
|
|
|
*/
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
*run_cost = cpu_operator_cost * tuples;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cost_incremental_sort
|
|
|
|
* Determines and returns the cost of sorting a relation incrementally, when
|
|
|
|
* the input path is presorted by a prefix of the pathkeys.
|
|
|
|
*
|
|
|
|
* 'presorted_keys' is the number of leading pathkeys by which the input path
|
|
|
|
* is sorted.
|
|
|
|
*
|
|
|
|
* We estimate the number of groups into which the relation is divided by the
|
|
|
|
* leading pathkeys, and then calculate the cost of sorting a single group
|
|
|
|
* with tuplesort using cost_tuplesort().
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_incremental_sort(Path *path,
|
|
|
|
PlannerInfo *root, List *pathkeys, int presorted_keys,
|
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double input_tuples, int width, Cost comparison_cost, int sort_mem,
|
|
|
|
double limit_tuples)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0,
|
|
|
|
run_cost = 0,
|
|
|
|
input_run_cost = input_total_cost - input_startup_cost;
|
|
|
|
double group_tuples,
|
|
|
|
input_groups;
|
|
|
|
Cost group_startup_cost,
|
|
|
|
group_run_cost,
|
|
|
|
group_input_run_cost;
|
|
|
|
List *presortedExprs = NIL;
|
|
|
|
ListCell *l;
|
|
|
|
int i = 0;
|
2020-04-23 00:15:24 +02:00
|
|
|
bool unknown_varno = false;
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
|
|
|
|
Assert(presorted_keys != 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We want to be sure the cost of a sort is never estimated as zero, even
|
|
|
|
* if passed-in tuple count is zero. Besides, mustn't do log(0)...
|
|
|
|
*/
|
|
|
|
if (input_tuples < 2.0)
|
|
|
|
input_tuples = 2.0;
|
|
|
|
|
2020-04-23 00:15:24 +02:00
|
|
|
/* Default estimate of number of groups, capped to one group per row. */
|
|
|
|
input_groups = Min(input_tuples, DEFAULT_NUM_DISTINCT);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Extract presorted keys as list of expressions.
|
|
|
|
*
|
2020-05-14 19:06:38 +02:00
|
|
|
* We need to be careful about Vars containing "varno 0" which might have
|
|
|
|
* been introduced by generate_append_tlist, which would confuse
|
2020-04-23 00:15:24 +02:00
|
|
|
* estimate_num_groups (in fact it'd fail for such expressions). See
|
|
|
|
* recurse_set_operations which has to deal with the same issue.
|
|
|
|
*
|
2020-05-14 19:06:38 +02:00
|
|
|
* Unlike recurse_set_operations we can't access the original target list
|
|
|
|
* here, and even if we could it's not very clear how useful would that be
|
|
|
|
* for a set operation combining multiple tables. So we simply detect if
|
|
|
|
* there are any expressions with "varno 0" and use the default
|
|
|
|
* DEFAULT_NUM_DISTINCT in that case.
|
2020-04-23 00:15:24 +02:00
|
|
|
*
|
2020-05-14 19:06:38 +02:00
|
|
|
* We might also use either 1.0 (a single group) or input_tuples (each row
|
|
|
|
* being a separate group), pretty much the worst and best case for
|
2020-04-23 00:15:24 +02:00
|
|
|
* incremental sort. But those are extreme cases and using something in
|
|
|
|
* between seems reasonable. Furthermore, generate_append_tlist is used
|
|
|
|
* for set operations, which are likely to produce mostly unique output
|
|
|
|
* anyway - from that standpoint the DEFAULT_NUM_DISTINCT is defensive
|
|
|
|
* while maintaining lower startup cost.
|
|
|
|
*/
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
foreach(l, pathkeys)
|
|
|
|
{
|
|
|
|
PathKey *key = (PathKey *) lfirst(l);
|
|
|
|
EquivalenceMember *member = (EquivalenceMember *)
|
|
|
|
linitial(key->pk_eclass->ec_members);
|
|
|
|
|
2020-04-23 00:15:24 +02:00
|
|
|
/*
|
|
|
|
* Check if the expression contains Var with "varno 0" so that we
|
|
|
|
* don't call estimate_num_groups in that case.
|
|
|
|
*/
|
Fix pull_varnos' miscomputation of relids set for a PlaceHolderVar.
Previously, pull_varnos() took the relids of a PlaceHolderVar as being
equal to the relids in its contents, but that fails to account for the
possibility that we have to postpone evaluation of the PHV due to outer
joins. This could result in a malformed plan. The known cases end up
triggering the "failed to assign all NestLoopParams to plan nodes"
sanity check in createplan.c, but other symptoms may be possible.
The right value to use is the join level we actually intend to evaluate
the PHV at. We can get that from the ph_eval_at field of the associated
PlaceHolderInfo. However, there are some places that call pull_varnos()
before the PlaceHolderInfos have been created; in that case, fall back
to the conservative assumption that the PHV will be evaluated at its
syntactic level. (In principle this might result in missing some legal
optimization, but I'm not aware of any cases where it's an issue in
practice.) Things are also a bit ticklish for calls occurring during
deconstruct_jointree(), but AFAICS the ph_eval_at fields should have
reached their final values by the time we need them.
The main problem in making this work is that pull_varnos() has no
way to get at the PlaceHolderInfos. We can fix that easily, if a
bit tediously, in HEAD by passing it the planner "root" pointer.
In the back branches that'd cause an unacceptable API/ABI break for
extensions, so leave the existing entry points alone and add new ones
with the additional parameter. (If an old entry point is called and
encounters a PHV, it'll fall back to using the syntactic level,
again possibly missing some valid optimization.)
Back-patch to v12. The computation is surely also wrong before that,
but it appears that we cannot reach a bad plan thanks to join order
restrictions imposed on the subquery that the PlaceHolderVar came from.
The error only became reachable when commit 4be058fe9 allowed trivial
subqueries to be collapsed out completely, eliminating their join order
restrictions.
Per report from Stephan Springl.
Discussion: https://postgr.es/m/171041.1610849523@sss.pgh.pa.us
2021-01-21 21:37:23 +01:00
|
|
|
if (bms_is_member(0, pull_varnos(root, (Node *) member->em_expr)))
|
2020-04-23 00:15:24 +02:00
|
|
|
{
|
|
|
|
unknown_varno = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* expression not containing any Vars with "varno 0" */
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
presortedExprs = lappend(presortedExprs, member->em_expr);
|
|
|
|
|
|
|
|
i++;
|
|
|
|
if (i >= presorted_keys)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2020-04-23 00:15:24 +02:00
|
|
|
/* Estimate number of groups with equal presorted keys. */
|
|
|
|
if (!unknown_varno)
|
2021-03-30 09:52:46 +02:00
|
|
|
input_groups = estimate_num_groups(root, presortedExprs, input_tuples,
|
|
|
|
NULL, NULL);
|
2020-04-23 00:15:24 +02:00
|
|
|
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
group_tuples = input_tuples / input_groups;
|
|
|
|
group_input_run_cost = input_run_cost / input_groups;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate average cost of sorting of one group where presorted keys are
|
|
|
|
* equal. Incremental sort is sensitive to distribution of tuples to the
|
|
|
|
* groups, where we're relying on quite rough assumptions. Thus, we're
|
|
|
|
* pessimistic about incremental sort performance and increase its average
|
|
|
|
* group size by half.
|
|
|
|
*/
|
|
|
|
cost_tuplesort(&group_startup_cost, &group_run_cost,
|
|
|
|
1.5 * group_tuples, width, comparison_cost, sort_mem,
|
|
|
|
limit_tuples);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Startup cost of incremental sort is the startup cost of its first group
|
|
|
|
* plus the cost of its input.
|
|
|
|
*/
|
|
|
|
startup_cost += group_startup_cost
|
|
|
|
+ input_startup_cost + group_input_run_cost;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After we started producing tuples from the first group, the cost of
|
|
|
|
* producing all the tuples is given by the cost to finish processing this
|
|
|
|
* group, plus the total cost to process the remaining groups, plus the
|
|
|
|
* remaining cost of input.
|
|
|
|
*/
|
|
|
|
run_cost += group_run_cost
|
|
|
|
+ (group_run_cost + group_startup_cost) * (input_groups - 1)
|
|
|
|
+ group_input_run_cost * (input_groups - 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Incremental sort adds some overhead by itself. Firstly, it has to
|
|
|
|
* detect the sort groups. This is roughly equal to one extra copy and
|
|
|
|
* comparison per tuple. Secondly, it has to reset the tuplesort context
|
|
|
|
* for every group.
|
|
|
|
*/
|
|
|
|
run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
|
|
|
|
run_cost += 2.0 * cpu_tuple_cost * input_groups;
|
2002-03-01 21:50:20 +01:00
|
|
|
|
Implement Incremental Sort
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
2020-04-06 21:33:28 +02:00
|
|
|
path->rows = input_tuples;
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cost_sort
|
|
|
|
* Determines and returns the cost of sorting a relation, including
|
|
|
|
* the cost of reading the input data.
|
|
|
|
*
|
|
|
|
* NOTE: some callers currently pass NIL for pathkeys because they
|
|
|
|
* can't conveniently supply the sort keys. Since this routine doesn't
|
|
|
|
* currently do anything with pathkeys anyway, that doesn't matter...
|
|
|
|
* but if it ever does, it should react gracefully to lack of key data.
|
|
|
|
* (Actually, the thing we'd most likely be interested in is just the number
|
|
|
|
* of sort keys, which all callers *could* supply.)
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_sort(Path *path, PlannerInfo *root,
|
|
|
|
List *pathkeys, Cost input_cost, double tuples, int width,
|
|
|
|
Cost comparison_cost, int sort_mem,
|
|
|
|
double limit_tuples)
|
|
|
|
|
|
|
|
{
|
|
|
|
Cost startup_cost;
|
|
|
|
Cost run_cost;
|
|
|
|
|
|
|
|
cost_tuplesort(&startup_cost, &run_cost,
|
|
|
|
tuples, width,
|
|
|
|
comparison_cost, sort_mem,
|
|
|
|
limit_tuples);
|
|
|
|
|
|
|
|
if (!enable_sort)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
|
|
|
startup_cost += input_cost;
|
|
|
|
|
|
|
|
path->rows = tuples;
|
2000-02-15 21:49:31 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
/*
|
|
|
|
* append_nonpartial_cost
|
|
|
|
* Estimate the cost of the non-partial paths in a Parallel Append.
|
|
|
|
* The non-partial paths are assumed to be the first "numpaths" paths
|
|
|
|
* from the subpaths list, and to be in order of decreasing cost.
|
|
|
|
*/
|
|
|
|
static Cost
|
|
|
|
append_nonpartial_cost(List *subpaths, int numpaths, int parallel_workers)
|
|
|
|
{
|
|
|
|
Cost *costarr;
|
|
|
|
int arrlen;
|
|
|
|
ListCell *l;
|
|
|
|
ListCell *cell;
|
|
|
|
int i;
|
|
|
|
int path_index;
|
|
|
|
int min_index;
|
|
|
|
int max_index;
|
|
|
|
|
|
|
|
if (numpaths == 0)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
2019-08-13 06:53:41 +02:00
|
|
|
* Array length is number of workers or number of relevant paths,
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
* whichever is less.
|
|
|
|
*/
|
|
|
|
arrlen = Min(parallel_workers, numpaths);
|
|
|
|
costarr = (Cost *) palloc(sizeof(Cost) * arrlen);
|
|
|
|
|
|
|
|
/* The first few paths will each be claimed by a different worker. */
|
|
|
|
path_index = 0;
|
|
|
|
foreach(cell, subpaths)
|
|
|
|
{
|
|
|
|
Path *subpath = (Path *) lfirst(cell);
|
|
|
|
|
|
|
|
if (path_index == arrlen)
|
|
|
|
break;
|
|
|
|
costarr[path_index++] = subpath->total_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since subpaths are sorted by decreasing cost, the last one will have
|
|
|
|
* the minimum cost.
|
|
|
|
*/
|
|
|
|
min_index = arrlen - 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each of the remaining subpaths, add its cost to the array element
|
|
|
|
* with minimum cost.
|
|
|
|
*/
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
for_each_cell(l, subpaths, cell)
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
{
|
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* Consider only the non-partial paths */
|
|
|
|
if (path_index++ == numpaths)
|
|
|
|
break;
|
|
|
|
|
|
|
|
costarr[min_index] += subpath->total_cost;
|
|
|
|
|
|
|
|
/* Update the new min cost array index */
|
|
|
|
for (min_index = i = 0; i < arrlen; i++)
|
|
|
|
{
|
|
|
|
if (costarr[i] < costarr[min_index])
|
|
|
|
min_index = i;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Return the highest cost from the array */
|
|
|
|
for (max_index = i = 0; i < arrlen; i++)
|
|
|
|
{
|
|
|
|
if (costarr[i] > costarr[max_index])
|
|
|
|
max_index = i;
|
|
|
|
}
|
|
|
|
|
|
|
|
return costarr[max_index];
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cost_append
|
|
|
|
* Determines and returns the cost of an Append node.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_append(AppendPath *apath)
|
|
|
|
{
|
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
apath->path.startup_cost = 0;
|
|
|
|
apath->path.total_cost = 0;
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
apath->path.rows = 0;
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
|
|
|
|
if (apath->subpaths == NIL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!apath->path.parallel_aware)
|
|
|
|
{
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
List *pathkeys = apath->path.pathkeys;
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
if (pathkeys == NIL)
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
{
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
Path *subpath = (Path *) linitial(apath->subpaths);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For an unordered, non-parallel-aware Append we take the startup
|
|
|
|
* cost as the startup cost of the first subpath.
|
|
|
|
*/
|
|
|
|
apath->path.startup_cost = subpath->startup_cost;
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
/* Compute rows and costs as sums of subplan rows and costs. */
|
|
|
|
foreach(l, apath->subpaths)
|
|
|
|
{
|
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
|
|
|
|
apath->path.rows += subpath->rows;
|
|
|
|
apath->path.total_cost += subpath->total_cost;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* For an ordered, non-parallel-aware Append we take the startup
|
|
|
|
* cost as the sum of the subpath startup costs. This ensures
|
|
|
|
* that we don't underestimate the startup cost when a query's
|
|
|
|
* LIMIT is such that several of the children have to be run to
|
|
|
|
* satisfy it. This might be overkill --- another plausible hack
|
|
|
|
* would be to take the Append's startup cost as the maximum of
|
|
|
|
* the child startup costs. But we don't want to risk believing
|
|
|
|
* that an ORDER BY LIMIT query can be satisfied at small cost
|
|
|
|
* when the first child has small startup cost but later ones
|
|
|
|
* don't. (If we had the ability to deal with nonlinear cost
|
|
|
|
* interpolation for partial retrievals, we would not need to be
|
|
|
|
* so conservative about this.)
|
|
|
|
*
|
|
|
|
* This case is also different from the above in that we have to
|
|
|
|
* account for possibly injecting sorts into subpaths that aren't
|
|
|
|
* natively ordered.
|
|
|
|
*/
|
|
|
|
foreach(l, apath->subpaths)
|
|
|
|
{
|
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
Path sort_path; /* dummy for result of cost_sort */
|
|
|
|
|
|
|
|
if (!pathkeys_contained_in(pathkeys, subpath->pathkeys))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We'll need to insert a Sort node, so include costs for
|
|
|
|
* that. We can use the parent's LIMIT if any, since we
|
|
|
|
* certainly won't pull more than that many tuples from
|
|
|
|
* any child.
|
|
|
|
*/
|
|
|
|
cost_sort(&sort_path,
|
|
|
|
NULL, /* doesn't currently need root */
|
|
|
|
pathkeys,
|
|
|
|
subpath->total_cost,
|
|
|
|
subpath->rows,
|
|
|
|
subpath->pathtarget->width,
|
|
|
|
0.0,
|
|
|
|
work_mem,
|
|
|
|
apath->limit_tuples);
|
|
|
|
subpath = &sort_path;
|
|
|
|
}
|
|
|
|
|
|
|
|
apath->path.rows += subpath->rows;
|
|
|
|
apath->path.startup_cost += subpath->startup_cost;
|
|
|
|
apath->path.total_cost += subpath->total_cost;
|
|
|
|
}
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
else /* parallel-aware */
|
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
double parallel_divisor = get_parallel_divisor(&apath->path);
|
|
|
|
|
Use Append rather than MergeAppend for scanning ordered partitions.
If we need ordered output from a scan of a partitioned table, but
the ordering matches the partition ordering, then we don't need to
use a MergeAppend to combine the pre-ordered per-partition scan
results: a plain Append will produce the same results. This
both saves useless comparison work inside the MergeAppend proper,
and allows us to start returning tuples after istarting up just
the first child node not all of them.
However, all is not peaches and cream, because if some of the
child nodes have high startup costs then there will be big
discontinuities in the tuples-returned-versus-elapsed-time curve.
The planner's cost model cannot handle that (yet, anyway).
If we model the Append's startup cost as being just the first
child's startup cost, we may drastically underestimate the cost
of fetching slightly more tuples than are available from the first
child. Since we've had bad experiences with over-optimistic choices
of "fast start" plans for ORDER BY LIMIT queries, that seems scary.
As a klugy workaround, set the startup cost estimate for an ordered
Append to be the sum of its children's startup costs (as MergeAppend
would). This doesn't really describe reality, but it's less likely
to cause a bad plan choice than an underestimated startup cost would.
In practice, the cases where we really care about this optimization
will have child plans that are IndexScans with zero startup cost,
so that the overly conservative estimate is still just zero.
David Rowley, reviewed by Julien Rouhaud and Antonin Houska
Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
2019-04-06 01:20:30 +02:00
|
|
|
/* Parallel-aware Append never produces ordered output. */
|
|
|
|
Assert(apath->path.pathkeys == NIL);
|
|
|
|
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
/* Calculate startup cost. */
|
|
|
|
foreach(l, apath->subpaths)
|
|
|
|
{
|
|
|
|
Path *subpath = (Path *) lfirst(l);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Append will start returning tuples when the child node having
|
|
|
|
* lowest startup cost is done setting up. We consider only the
|
|
|
|
* first few subplans that immediately get a worker assigned.
|
|
|
|
*/
|
|
|
|
if (i == 0)
|
|
|
|
apath->path.startup_cost = subpath->startup_cost;
|
|
|
|
else if (i < apath->path.parallel_workers)
|
|
|
|
apath->path.startup_cost = Min(apath->path.startup_cost,
|
|
|
|
subpath->startup_cost);
|
|
|
|
|
|
|
|
/*
|
2018-01-04 13:56:09 +01:00
|
|
|
* Apply parallel divisor to subpaths. Scale the number of rows
|
|
|
|
* for each partial subpath based on the ratio of the parallel
|
|
|
|
* divisor originally used for the subpath to the one we adopted.
|
|
|
|
* Also add the cost of partial paths to the total cost, but
|
|
|
|
* ignore non-partial paths for now.
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
*/
|
|
|
|
if (i < apath->first_partial_path)
|
|
|
|
apath->path.rows += subpath->rows / parallel_divisor;
|
|
|
|
else
|
|
|
|
{
|
2018-01-04 13:56:09 +01:00
|
|
|
double subpath_parallel_divisor;
|
|
|
|
|
|
|
|
subpath_parallel_divisor = get_parallel_divisor(subpath);
|
|
|
|
apath->path.rows += subpath->rows * (subpath_parallel_divisor /
|
|
|
|
parallel_divisor);
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
apath->path.total_cost += subpath->total_cost;
|
|
|
|
}
|
|
|
|
|
2018-01-04 13:56:09 +01:00
|
|
|
apath->path.rows = clamp_row_est(apath->path.rows);
|
|
|
|
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
i++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Add cost for non-partial subpaths. */
|
|
|
|
apath->path.total_cost +=
|
|
|
|
append_nonpartial_cost(apath->subpaths,
|
|
|
|
apath->first_partial_path,
|
|
|
|
apath->path.parallel_workers);
|
|
|
|
}
|
Charge cpu_tuple_cost * 0.5 for Append and MergeAppend nodes.
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
2018-02-22 05:09:27 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Although Append does not do any selection or projection, it's not free;
|
|
|
|
* add a small per-tuple overhead.
|
|
|
|
*/
|
|
|
|
apath->path.total_cost +=
|
|
|
|
cpu_tuple_cost * APPEND_CPU_COST_MULTIPLIER * apath->path.rows;
|
Support Parallel Append plan nodes.
When we create an Append node, we can spread out the workers over the
subplans instead of piling on to each subplan one at a time, which
should typically be a bit more efficient, both because the startup
cost of any plan executed entirely by one worker is paid only once and
also because of reduced contention. We can also construct Append
plans using a mix of partial and non-partial subplans, which may allow
for parallelism in places that otherwise couldn't support it.
Unfortunately, this patch doesn't handle the important case of
parallelizing UNION ALL by running each branch in a separate worker;
the executor infrastructure is added here, but more planner work is
needed.
Amit Khandekar, Robert Haas, Amul Sul, reviewed and tested by
Ashutosh Bapat, Amit Langote, Rafia Sabih, Amit Kapila, and
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com
2017-12-05 23:28:39 +01:00
|
|
|
}
|
|
|
|
|
2010-10-14 22:56:39 +02:00
|
|
|
/*
|
|
|
|
* cost_merge_append
|
|
|
|
* Determines and returns the cost of a MergeAppend node.
|
|
|
|
*
|
|
|
|
* MergeAppend merges several pre-sorted input streams, using a heap that
|
2014-05-06 18:12:18 +02:00
|
|
|
* at any given instant holds the next tuple from each stream. If there
|
2010-10-14 22:56:39 +02:00
|
|
|
* are N streams, we need about N*log2(N) tuple comparisons to construct
|
|
|
|
* the heap at startup, and then for each output tuple, about log2(N)
|
2016-11-05 18:48:11 +01:00
|
|
|
* comparisons to replace the top entry.
|
2010-10-14 22:56:39 +02:00
|
|
|
*
|
|
|
|
* (The effective value of N will drop once some of the input streams are
|
|
|
|
* exhausted, but it seems unlikely to be worth trying to account for that.)
|
|
|
|
*
|
|
|
|
* The heap is never spilled to disk, since we assume N is not very large.
|
|
|
|
* So this is much simpler than cost_sort.
|
|
|
|
*
|
|
|
|
* As in cost_sort, we charge two operator evals per tuple comparison.
|
|
|
|
*
|
|
|
|
* 'pathkeys' is a list of sort keys
|
|
|
|
* 'n_streams' is the number of input streams
|
|
|
|
* 'input_startup_cost' is the sum of the input streams' startup costs
|
|
|
|
* 'input_total_cost' is the sum of the input streams' total costs
|
|
|
|
* 'tuples' is the number of tuples in all the streams
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_merge_append(Path *path, PlannerInfo *root,
|
|
|
|
List *pathkeys, int n_streams,
|
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double tuples)
|
|
|
|
{
|
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
|
|
|
Cost comparison_cost;
|
|
|
|
double N;
|
|
|
|
double logN;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Avoid log(0)...
|
|
|
|
*/
|
|
|
|
N = (n_streams < 2) ? 2.0 : (double) n_streams;
|
|
|
|
logN = LOG2(N);
|
|
|
|
|
|
|
|
/* Assumed cost per tuple comparison */
|
|
|
|
comparison_cost = 2.0 * cpu_operator_cost;
|
|
|
|
|
|
|
|
/* Heap creation cost */
|
|
|
|
startup_cost += comparison_cost * N * logN;
|
|
|
|
|
|
|
|
/* Per-tuple heap maintenance cost */
|
2016-11-05 18:48:11 +01:00
|
|
|
run_cost += tuples * comparison_cost * logN;
|
2010-10-14 22:56:39 +02:00
|
|
|
|
|
|
|
/*
|
Charge cpu_tuple_cost * 0.5 for Append and MergeAppend nodes.
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
2018-02-22 05:09:27 +01:00
|
|
|
* Although MergeAppend does not do any selection or projection, it's not
|
|
|
|
* free; add a small per-tuple overhead.
|
2010-10-14 22:56:39 +02:00
|
|
|
*/
|
Charge cpu_tuple_cost * 0.5 for Append and MergeAppend nodes.
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
2018-02-22 05:09:27 +01:00
|
|
|
run_cost += cpu_tuple_cost * APPEND_CPU_COST_MULTIPLIER * tuples;
|
2010-10-14 22:56:39 +02:00
|
|
|
|
|
|
|
path->startup_cost = startup_cost + input_startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost + input_total_cost;
|
|
|
|
}
|
|
|
|
|
2002-11-30 06:21:03 +01:00
|
|
|
/*
|
|
|
|
* cost_material
|
|
|
|
* Determines and returns the cost of materializing a relation, including
|
|
|
|
* the cost of reading the input data.
|
|
|
|
*
|
2004-02-03 18:34:04 +01:00
|
|
|
* If the total volume of data to materialize exceeds work_mem, we will need
|
2002-11-30 06:21:03 +01:00
|
|
|
* to write it to disk, so the cost is much higher in that case.
|
2009-09-13 00:12:09 +02:00
|
|
|
*
|
|
|
|
* Note that here we are estimating the costs for the first scan of the
|
|
|
|
* relation, so the materialization is all overhead --- any savings will
|
|
|
|
* occur only on rescan, which is estimated in cost_rescan.
|
2002-11-30 06:21:03 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_material(Path *path,
|
2009-09-13 00:12:09 +02:00
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double tuples, int width)
|
2002-11-30 06:21:03 +01:00
|
|
|
{
|
2009-09-13 00:12:09 +02:00
|
|
|
Cost startup_cost = input_startup_cost;
|
|
|
|
Cost run_cost = input_total_cost - input_startup_cost;
|
2002-11-30 06:21:03 +01:00
|
|
|
double nbytes = relation_byte_size(tuples, width);
|
2004-02-03 18:34:04 +01:00
|
|
|
long work_mem_bytes = work_mem * 1024L;
|
2002-11-30 06:21:03 +01:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
path->rows = tuples;
|
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
2010-02-19 22:49:10 +01:00
|
|
|
* Whether spilling or not, charge 2x cpu_operator_cost per tuple to
|
|
|
|
* reflect bookkeeping overhead. (This rate must be more than what
|
|
|
|
* cost_rescan charges for materialize, ie, cpu_operator_cost per tuple;
|
2009-09-13 00:12:09 +02:00
|
|
|
* if it is exactly the same then there will be a cost tie between
|
|
|
|
* nestloop with A outer, materialized B inner and nestloop with B outer,
|
|
|
|
* materialized A inner. The extra cost ensures we'll prefer
|
2010-02-26 03:01:40 +01:00
|
|
|
* materializing the smaller rel.) Note that this is normally a good deal
|
2010-02-19 22:49:10 +01:00
|
|
|
* less than cpu_tuple_cost; which is OK because a Material plan node
|
|
|
|
* doesn't do qual-checking or projection, so it's got less overhead than
|
|
|
|
* most plan nodes.
|
2009-09-13 00:12:09 +02:00
|
|
|
*/
|
2010-02-19 22:49:10 +01:00
|
|
|
run_cost += 2 * cpu_operator_cost * tuples;
|
2009-09-13 00:12:09 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we will spill to disk, charge at the rate of seq_page_cost per page.
|
|
|
|
* This cost is assumed to be evenly spread through the plan run phase,
|
|
|
|
* which isn't exactly accurate but our cost model doesn't allow for
|
|
|
|
* nonuniform costs within the run phase.
|
|
|
|
*/
|
2004-02-03 18:34:04 +01:00
|
|
|
if (nbytes > work_mem_bytes)
|
2002-11-30 06:21:03 +01:00
|
|
|
{
|
|
|
|
double npages = ceil(nbytes / BLCKSZ);
|
|
|
|
|
2006-06-05 04:49:58 +02:00
|
|
|
run_cost += seq_page_cost * npages;
|
2002-11-30 06:21:03 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = startup_cost + run_cost;
|
|
|
|
}
|
|
|
|
|
2002-11-21 01:42:20 +01:00
|
|
|
/*
|
|
|
|
* cost_agg
|
|
|
|
* Determines and returns the cost of performing an Agg plan node,
|
|
|
|
* including the cost of its input.
|
|
|
|
*
|
2011-04-24 22:55:20 +02:00
|
|
|
* aggcosts can be NULL when there are no actual aggregate functions (i.e.,
|
|
|
|
* we are using a hashed Agg node just to do grouping).
|
|
|
|
*
|
2002-11-21 01:42:20 +01:00
|
|
|
* Note: when aggstrategy == AGG_SORTED, caller must ensure that input costs
|
|
|
|
* are for appropriately-sorted input.
|
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_agg(Path *path, PlannerInfo *root,
|
2011-04-24 22:55:20 +02:00
|
|
|
AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
|
2002-11-21 01:42:20 +01:00
|
|
|
int numGroupCols, double numGroups,
|
2017-11-02 16:24:12 +01:00
|
|
|
List *quals,
|
2002-11-21 01:42:20 +01:00
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
double input_tuples, double input_width)
|
2002-11-21 01:42:20 +01:00
|
|
|
{
|
2012-01-28 01:26:38 +01:00
|
|
|
double output_tuples;
|
2002-11-21 01:42:20 +01:00
|
|
|
Cost startup_cost;
|
|
|
|
Cost total_cost;
|
2011-04-24 22:55:20 +02:00
|
|
|
AggClauseCosts dummy_aggcosts;
|
|
|
|
|
|
|
|
/* Use all-zero per-aggregate costs if NULL is passed */
|
|
|
|
if (aggcosts == NULL)
|
|
|
|
{
|
|
|
|
Assert(aggstrategy == AGG_HASHED);
|
|
|
|
MemSet(&dummy_aggcosts, 0, sizeof(AggClauseCosts));
|
|
|
|
aggcosts = &dummy_aggcosts;
|
|
|
|
}
|
2002-11-21 01:42:20 +01:00
|
|
|
|
|
|
|
/*
|
2011-04-24 22:55:20 +02:00
|
|
|
* The transCost.per_tuple component of aggcosts should be charged once
|
|
|
|
* per input tuple, corresponding to the costs of evaluating the aggregate
|
2019-02-10 00:32:23 +01:00
|
|
|
* transfns and their input expressions. The finalCost.per_tuple component
|
|
|
|
* is charged once per output tuple, corresponding to the costs of
|
|
|
|
* evaluating the finalfns. Startup costs are of course charged but once.
|
2011-04-24 22:55:20 +02:00
|
|
|
*
|
|
|
|
* If we are grouping, we charge an additional cpu_operator_cost per
|
|
|
|
* grouping column per input tuple for grouping comparisons.
|
2002-11-21 01:42:20 +01:00
|
|
|
*
|
2003-08-04 02:43:34 +02:00
|
|
|
* We will produce a single output tuple if not grouping, and a tuple per
|
2005-08-28 00:37:00 +02:00
|
|
|
* group otherwise. We charge cpu_tuple_cost for each output tuple.
|
2003-02-15 22:39:58 +01:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Note: in this cost model, AGG_SORTED and AGG_HASHED have exactly the
|
2014-05-06 18:12:18 +02:00
|
|
|
* same total CPU cost, but AGG_SORTED has lower startup cost. If the
|
2005-11-22 19:17:34 +01:00
|
|
|
* input path is already sorted appropriately, AGG_SORTED should be
|
|
|
|
* preferred (since it has no risk of memory overflow). This will happen
|
|
|
|
* as long as the computed total costs are indeed exactly equal --- but if
|
|
|
|
* there's roundoff error we might do the wrong thing. So be sure that
|
|
|
|
* the computations below form the same intermediate values in the same
|
|
|
|
* order.
|
2002-11-21 01:42:20 +01:00
|
|
|
*/
|
|
|
|
if (aggstrategy == AGG_PLAIN)
|
|
|
|
{
|
|
|
|
startup_cost = input_total_cost;
|
2011-04-24 22:55:20 +02:00
|
|
|
startup_cost += aggcosts->transCost.startup;
|
|
|
|
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
|
2019-02-10 00:32:23 +01:00
|
|
|
startup_cost += aggcosts->finalCost.startup;
|
|
|
|
startup_cost += aggcosts->finalCost.per_tuple;
|
2002-11-21 01:42:20 +01:00
|
|
|
/* we aren't grouping */
|
2005-08-28 00:37:00 +02:00
|
|
|
total_cost = startup_cost + cpu_tuple_cost;
|
2012-01-28 01:26:38 +01:00
|
|
|
output_tuples = 1;
|
2002-11-21 01:42:20 +01:00
|
|
|
}
|
2017-03-27 05:20:54 +02:00
|
|
|
else if (aggstrategy == AGG_SORTED || aggstrategy == AGG_MIXED)
|
2002-11-21 01:42:20 +01:00
|
|
|
{
|
|
|
|
/* Here we are able to deliver output on-the-fly */
|
|
|
|
startup_cost = input_startup_cost;
|
|
|
|
total_cost = input_total_cost;
|
2017-03-27 05:20:54 +02:00
|
|
|
if (aggstrategy == AGG_MIXED && !enable_hashagg)
|
|
|
|
{
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
total_cost += disable_cost;
|
|
|
|
}
|
2003-02-15 22:39:58 +01:00
|
|
|
/* calcs phrased this way to match HASHED case, see note above */
|
2011-04-24 22:55:20 +02:00
|
|
|
total_cost += aggcosts->transCost.startup;
|
|
|
|
total_cost += aggcosts->transCost.per_tuple * input_tuples;
|
|
|
|
total_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
|
2019-02-10 00:32:23 +01:00
|
|
|
total_cost += aggcosts->finalCost.startup;
|
|
|
|
total_cost += aggcosts->finalCost.per_tuple * numGroups;
|
2005-08-28 00:37:00 +02:00
|
|
|
total_cost += cpu_tuple_cost * numGroups;
|
2012-01-28 01:26:38 +01:00
|
|
|
output_tuples = numGroups;
|
2002-11-21 01:42:20 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* must be AGG_HASHED */
|
|
|
|
startup_cost = input_total_cost;
|
2016-03-21 14:20:53 +01:00
|
|
|
if (!enable_hashagg)
|
|
|
|
startup_cost += disable_cost;
|
2011-04-24 22:55:20 +02:00
|
|
|
startup_cost += aggcosts->transCost.startup;
|
|
|
|
startup_cost += aggcosts->transCost.per_tuple * input_tuples;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
/* cost of computing hash value */
|
2011-04-24 22:55:20 +02:00
|
|
|
startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
|
2019-02-10 00:32:23 +01:00
|
|
|
startup_cost += aggcosts->finalCost.startup;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
|
2002-11-21 01:42:20 +01:00
|
|
|
total_cost = startup_cost;
|
2019-02-10 00:32:23 +01:00
|
|
|
total_cost += aggcosts->finalCost.per_tuple * numGroups;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
/* cost of retrieving from hash table */
|
2005-08-28 00:37:00 +02:00
|
|
|
total_cost += cpu_tuple_cost * numGroups;
|
2012-01-28 01:26:38 +01:00
|
|
|
output_tuples = numGroups;
|
2002-11-21 01:42:20 +01:00
|
|
|
}
|
|
|
|
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
/*
|
|
|
|
* Add the disk costs of hash aggregation that spills to disk.
|
|
|
|
*
|
2020-05-14 19:06:38 +02:00
|
|
|
* Groups that go into the hash table stay in memory until finalized, so
|
|
|
|
* spilling and reprocessing tuples doesn't incur additional invocations
|
|
|
|
* of transCost or finalCost. Furthermore, the computed hash value is
|
|
|
|
* stored with the spilled tuples, so we don't incur extra invocations of
|
|
|
|
* the hash function.
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
*
|
2020-05-14 19:06:38 +02:00
|
|
|
* Hash Agg begins returning tuples after the first batch is complete.
|
|
|
|
* Accrue writes (spilled tuples) to startup_cost and to total_cost;
|
|
|
|
* accrue reads only to total_cost.
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
*/
|
|
|
|
if (aggstrategy == AGG_HASHED || aggstrategy == AGG_MIXED)
|
|
|
|
{
|
2020-05-14 19:06:38 +02:00
|
|
|
double pages;
|
|
|
|
double pages_written = 0.0;
|
|
|
|
double pages_read = 0.0;
|
2020-09-07 22:31:59 +02:00
|
|
|
double spill_cost;
|
2020-05-14 19:06:38 +02:00
|
|
|
double hashentrysize;
|
|
|
|
double nbatches;
|
|
|
|
Size mem_limit;
|
|
|
|
uint64 ngroups_limit;
|
|
|
|
int num_partitions;
|
|
|
|
int depth;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate number of batches based on the computed limits. If less
|
|
|
|
* than or equal to one, all groups are expected to fit in memory;
|
|
|
|
* otherwise we expect to spill.
|
|
|
|
*/
|
2020-11-24 09:45:00 +01:00
|
|
|
hashentrysize = hash_agg_entry_size(list_length(root->aggtransinfos),
|
|
|
|
input_width,
|
2020-05-14 19:06:38 +02:00
|
|
|
aggcosts->transitionSpace);
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
|
|
|
|
&ngroups_limit, &num_partitions);
|
|
|
|
|
2020-05-14 19:06:38 +02:00
|
|
|
nbatches = Max((numGroups * hashentrysize) / mem_limit,
|
|
|
|
numGroups / ngroups_limit);
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
|
2020-03-28 18:53:01 +01:00
|
|
|
nbatches = Max(ceil(nbatches), 1.0);
|
|
|
|
num_partitions = Max(num_partitions, 2);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The number of partitions can change at different levels of
|
|
|
|
* recursion; but for the purposes of this calculation assume it stays
|
|
|
|
* constant.
|
|
|
|
*/
|
2020-05-14 19:06:38 +02:00
|
|
|
depth = ceil(log(nbatches) / log(num_partitions));
|
2020-03-28 18:53:01 +01:00
|
|
|
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
/*
|
|
|
|
* Estimate number of pages read and written. For each level of
|
|
|
|
* recursion, a tuple must be written and then later read.
|
|
|
|
*/
|
2020-03-28 18:53:01 +01:00
|
|
|
pages = relation_byte_size(input_tuples, input_width) / BLCKSZ;
|
|
|
|
pages_written = pages_read = pages * depth;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
|
2020-09-07 22:31:59 +02:00
|
|
|
/*
|
|
|
|
* HashAgg has somewhat worse IO behavior than Sort on typical
|
|
|
|
* hardware/OS combinations. Account for this with a generic penalty.
|
|
|
|
*/
|
|
|
|
pages_read *= 2.0;
|
|
|
|
pages_written *= 2.0;
|
|
|
|
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
startup_cost += pages_written * random_page_cost;
|
|
|
|
total_cost += pages_written * random_page_cost;
|
|
|
|
total_cost += pages_read * seq_page_cost;
|
2020-09-07 22:31:59 +02:00
|
|
|
|
|
|
|
/* account for CPU cost of spilling a tuple and reading it back */
|
|
|
|
spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost;
|
|
|
|
startup_cost += spill_cost;
|
|
|
|
total_cost += spill_cost;
|
Disk-based Hash Aggregation.
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
2020-03-18 23:42:02 +01:00
|
|
|
}
|
|
|
|
|
2017-11-02 16:24:12 +01:00
|
|
|
/*
|
|
|
|
* If there are quals (HAVING quals), account for their cost and
|
|
|
|
* selectivity.
|
|
|
|
*/
|
|
|
|
if (quals)
|
|
|
|
{
|
|
|
|
QualCost qual_cost;
|
|
|
|
|
|
|
|
cost_qual_eval(&qual_cost, quals, root);
|
|
|
|
startup_cost += qual_cost.startup;
|
|
|
|
total_cost += qual_cost.startup + output_tuples * qual_cost.per_tuple;
|
|
|
|
|
|
|
|
output_tuples = clamp_row_est(output_tuples *
|
|
|
|
clauselist_selectivity(root,
|
|
|
|
quals,
|
|
|
|
0,
|
|
|
|
JOIN_INNER,
|
|
|
|
NULL));
|
|
|
|
}
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
path->rows = output_tuples;
|
2002-11-21 01:42:20 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = total_cost;
|
|
|
|
}
|
|
|
|
|
2008-12-28 19:54:01 +01:00
|
|
|
/*
|
|
|
|
* cost_windowagg
|
|
|
|
* Determines and returns the cost of performing a WindowAgg plan node,
|
|
|
|
* including the cost of its input.
|
|
|
|
*
|
|
|
|
* Input is assumed already properly sorted.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_windowagg(Path *path, PlannerInfo *root,
|
2011-04-24 22:55:20 +02:00
|
|
|
List *windowFuncs, int numPartCols, int numOrderCols,
|
2008-12-28 19:54:01 +01:00
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double input_tuples)
|
|
|
|
{
|
|
|
|
Cost startup_cost;
|
|
|
|
Cost total_cost;
|
2011-04-24 22:55:20 +02:00
|
|
|
ListCell *lc;
|
2008-12-28 19:54:01 +01:00
|
|
|
|
|
|
|
startup_cost = input_startup_cost;
|
|
|
|
total_cost = input_total_cost;
|
|
|
|
|
|
|
|
/*
|
2011-04-24 22:55:20 +02:00
|
|
|
* Window functions are assumed to cost their stated execution cost, plus
|
|
|
|
* the cost of evaluating their input expressions, per tuple. Since they
|
|
|
|
* may in fact evaluate their inputs at multiple rows during each cycle,
|
|
|
|
* this could be a drastic underestimate; but without a way to know how
|
|
|
|
* many rows the window function will fetch, it's hard to do better. In
|
|
|
|
* any case, it's a good estimate for all the built-in window functions,
|
|
|
|
* so we'll just do this for now.
|
|
|
|
*/
|
|
|
|
foreach(lc, windowFuncs)
|
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
WindowFunc *wfunc = lfirst_node(WindowFunc, lc);
|
2011-04-24 22:55:20 +02:00
|
|
|
Cost wfunccost;
|
|
|
|
QualCost argcosts;
|
|
|
|
|
2019-02-10 00:32:23 +01:00
|
|
|
argcosts.startup = argcosts.per_tuple = 0;
|
|
|
|
add_function_cost(root, wfunc->winfnoid, (Node *) wfunc,
|
|
|
|
&argcosts);
|
|
|
|
startup_cost += argcosts.startup;
|
|
|
|
wfunccost = argcosts.per_tuple;
|
2011-04-24 22:55:20 +02:00
|
|
|
|
|
|
|
/* also add the input expressions' cost to per-input-row costs */
|
|
|
|
cost_qual_eval_node(&argcosts, (Node *) wfunc->args, root);
|
|
|
|
startup_cost += argcosts.startup;
|
|
|
|
wfunccost += argcosts.per_tuple;
|
|
|
|
|
2013-07-17 02:15:36 +02:00
|
|
|
/*
|
|
|
|
* Add the filter's cost to per-input-row costs. XXX We should reduce
|
|
|
|
* input expression costs according to filter selectivity.
|
|
|
|
*/
|
|
|
|
cost_qual_eval_node(&argcosts, (Node *) wfunc->aggfilter, root);
|
|
|
|
startup_cost += argcosts.startup;
|
|
|
|
wfunccost += argcosts.per_tuple;
|
|
|
|
|
2011-04-24 22:55:20 +02:00
|
|
|
total_cost += wfunccost * input_tuples;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We also charge cpu_operator_cost per grouping column per tuple for
|
|
|
|
* grouping comparisons, plus cpu_tuple_cost per tuple for general
|
|
|
|
* overhead.
|
|
|
|
*
|
|
|
|
* XXX this neglects costs of spooling the data to disk when it overflows
|
|
|
|
* work_mem. Sooner or later that should get accounted for.
|
2008-12-28 19:54:01 +01:00
|
|
|
*/
|
2011-04-24 22:55:20 +02:00
|
|
|
total_cost += cpu_operator_cost * (numPartCols + numOrderCols) * input_tuples;
|
2008-12-28 19:54:01 +01:00
|
|
|
total_cost += cpu_tuple_cost * input_tuples;
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
path->rows = input_tuples;
|
2008-12-28 19:54:01 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = total_cost;
|
|
|
|
}
|
|
|
|
|
2002-11-21 01:42:20 +01:00
|
|
|
/*
|
|
|
|
* cost_group
|
|
|
|
* Determines and returns the cost of performing a Group plan node,
|
|
|
|
* including the cost of its input.
|
|
|
|
*
|
|
|
|
* Note: caller must ensure that input costs are for appropriately-sorted
|
|
|
|
* input.
|
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
cost_group(Path *path, PlannerInfo *root,
|
2002-11-21 01:42:20 +01:00
|
|
|
int numGroupCols, double numGroups,
|
2017-11-02 16:24:12 +01:00
|
|
|
List *quals,
|
2002-11-21 01:42:20 +01:00
|
|
|
Cost input_startup_cost, Cost input_total_cost,
|
|
|
|
double input_tuples)
|
|
|
|
{
|
2017-11-02 16:24:12 +01:00
|
|
|
double output_tuples;
|
2002-11-21 01:42:20 +01:00
|
|
|
Cost startup_cost;
|
|
|
|
Cost total_cost;
|
|
|
|
|
2017-11-02 16:24:12 +01:00
|
|
|
output_tuples = numGroups;
|
2002-11-21 01:42:20 +01:00
|
|
|
startup_cost = input_startup_cost;
|
|
|
|
total_cost = input_total_cost;
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Charge one cpu_operator_cost per comparison per input tuple. We assume
|
|
|
|
* all columns get compared at most of the tuples.
|
2002-11-21 01:42:20 +01:00
|
|
|
*/
|
|
|
|
total_cost += cpu_operator_cost * input_tuples * numGroupCols;
|
|
|
|
|
2017-11-02 16:24:12 +01:00
|
|
|
/*
|
|
|
|
* If there are quals (HAVING quals), account for their cost and
|
|
|
|
* selectivity.
|
|
|
|
*/
|
|
|
|
if (quals)
|
|
|
|
{
|
|
|
|
QualCost qual_cost;
|
|
|
|
|
|
|
|
cost_qual_eval(&qual_cost, quals, root);
|
|
|
|
startup_cost += qual_cost.startup;
|
|
|
|
total_cost += qual_cost.startup + output_tuples * qual_cost.per_tuple;
|
|
|
|
|
|
|
|
output_tuples = clamp_row_est(output_tuples *
|
|
|
|
clauselist_selectivity(root,
|
|
|
|
quals,
|
|
|
|
0,
|
|
|
|
JOIN_INNER,
|
|
|
|
NULL));
|
|
|
|
}
|
|
|
|
|
|
|
|
path->rows = output_tuples;
|
2002-11-21 01:42:20 +01:00
|
|
|
path->startup_cost = startup_cost;
|
|
|
|
path->total_cost = total_cost;
|
|
|
|
}
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2006-02-05 03:59:17 +01:00
|
|
|
/*
|
2012-01-28 01:26:38 +01:00
|
|
|
* initial_cost_nestloop
|
|
|
|
* Preliminary estimate of the cost of a nestloop join path.
|
|
|
|
*
|
|
|
|
* This must quickly produce lower-bound estimates of the path's startup and
|
|
|
|
* total costs. If we are unable to eliminate the proposed path from
|
|
|
|
* consideration using the lower bounds, final_cost_nestloop will be called
|
|
|
|
* to obtain the final estimates.
|
|
|
|
*
|
|
|
|
* The exact division of labor between this function and final_cost_nestloop
|
|
|
|
* is private to them, and represents a tradeoff between speed of the initial
|
|
|
|
* estimate and getting a tight lower bound. We choose to not examine the
|
|
|
|
* join quals here, since that's by far the most expensive part of the
|
|
|
|
* calculations. The end result is that CPU-cost considerations must be
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
* left for the second phase; and for SEMI/ANTI joins, we must also postpone
|
|
|
|
* incorporation of the inner path's run cost.
|
2012-01-28 01:26:38 +01:00
|
|
|
*
|
|
|
|
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
|
|
|
|
* other data to be used by final_cost_nestloop
|
|
|
|
* 'jointype' is the type of join to be performed
|
|
|
|
* 'outer_path' is the outer input to the join
|
|
|
|
* 'inner_path' is the inner input to the join
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2012-01-28 01:26:38 +01:00
|
|
|
initial_cost_nestloop(PlannerInfo *root, JoinCostWorkspace *workspace,
|
|
|
|
JoinType jointype,
|
|
|
|
Path *outer_path, Path *inner_path,
|
2017-04-08 04:20:03 +02:00
|
|
|
JoinPathExtraData *extra)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-01-28 01:26:38 +01:00
|
|
|
double outer_path_rows = outer_path->rows;
|
2009-09-13 00:12:09 +02:00
|
|
|
Cost inner_rescan_start_cost;
|
|
|
|
Cost inner_rescan_total_cost;
|
2009-05-10 00:51:41 +02:00
|
|
|
Cost inner_run_cost;
|
2009-09-13 00:12:09 +02:00
|
|
|
Cost inner_rescan_run_cost;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/* estimate costs to rescan the inner relation */
|
|
|
|
cost_rescan(root, inner_path,
|
|
|
|
&inner_rescan_start_cost,
|
|
|
|
&inner_rescan_total_cost);
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/* cost of source data */
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/*
|
2001-04-26 00:04:37 +02:00
|
|
|
* NOTE: clearly, we must pay both outer and inner paths' startup_cost
|
2001-10-25 07:50:21 +02:00
|
|
|
* before we can start returning tuples, so the join's startup cost is
|
2009-09-13 00:12:09 +02:00
|
|
|
* their sum. We'll also pay the inner path's rescan startup cost
|
|
|
|
* multiple times.
|
2000-02-15 21:49:31 +01:00
|
|
|
*/
|
|
|
|
startup_cost += outer_path->startup_cost + inner_path->startup_cost;
|
|
|
|
run_cost += outer_path->total_cost - outer_path->startup_cost;
|
2009-09-13 00:12:09 +02:00
|
|
|
if (outer_path_rows > 1)
|
|
|
|
run_cost += (outer_path_rows - 1) * inner_rescan_start_cost;
|
|
|
|
|
2009-05-10 00:51:41 +02:00
|
|
|
inner_run_cost = inner_path->total_cost - inner_path->startup_cost;
|
2009-09-13 00:12:09 +02:00
|
|
|
inner_rescan_run_cost = inner_rescan_total_cost - inner_rescan_start_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2017-04-08 04:20:03 +02:00
|
|
|
if (jointype == JOIN_SEMI || jointype == JOIN_ANTI ||
|
|
|
|
extra->inner_unique)
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
|
|
|
/*
|
2017-04-08 04:20:03 +02:00
|
|
|
* With a SEMI or ANTI join, or if the innerrel is known unique, the
|
|
|
|
* executor will stop after the first match.
|
2009-05-10 00:51:41 +02:00
|
|
|
*
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
* Getting decent estimates requires inspection of the join quals,
|
|
|
|
* which we choose to postpone to final_cost_nestloop.
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
|
|
|
|
|
|
|
/* Save private data for final_cost_nestloop */
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
workspace->inner_run_cost = inner_run_cost;
|
|
|
|
workspace->inner_rescan_run_cost = inner_rescan_run_cost;
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Normal case; we'll scan whole input rel for each outer row */
|
|
|
|
run_cost += inner_run_cost;
|
|
|
|
if (outer_path_rows > 1)
|
|
|
|
run_cost += (outer_path_rows - 1) * inner_rescan_run_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* CPU costs left for later */
|
|
|
|
|
|
|
|
/* Public result fields */
|
|
|
|
workspace->startup_cost = startup_cost;
|
|
|
|
workspace->total_cost = startup_cost + run_cost;
|
|
|
|
/* Save private data for final_cost_nestloop */
|
|
|
|
workspace->run_cost = run_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* final_cost_nestloop
|
|
|
|
* Final estimate of the cost and result size of a nestloop join path.
|
|
|
|
*
|
|
|
|
* 'path' is already filled in except for the rows and cost fields
|
|
|
|
* 'workspace' is the result from initial_cost_nestloop
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
final_cost_nestloop(PlannerInfo *root, NestPath *path,
|
|
|
|
JoinCostWorkspace *workspace,
|
2017-04-08 04:20:03 +02:00
|
|
|
JoinPathExtraData *extra)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
|
|
|
Path *outer_path = path->outerjoinpath;
|
|
|
|
Path *inner_path = path->innerjoinpath;
|
|
|
|
double outer_path_rows = outer_path->rows;
|
|
|
|
double inner_path_rows = inner_path->rows;
|
|
|
|
Cost startup_cost = workspace->startup_cost;
|
|
|
|
Cost run_cost = workspace->run_cost;
|
|
|
|
Cost cpu_per_tuple;
|
|
|
|
QualCost restrict_qual_cost;
|
|
|
|
double ntuples;
|
|
|
|
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
/* Protect some assumptions below that rowcounts aren't zero */
|
|
|
|
if (outer_path_rows <= 0)
|
2016-03-26 17:03:12 +01:00
|
|
|
outer_path_rows = 1;
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
if (inner_path_rows <= 0)
|
2016-03-26 17:03:12 +01:00
|
|
|
inner_path_rows = 1;
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (path->path.param_info)
|
|
|
|
path->path.rows = path->path.param_info->ppi_rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
else
|
|
|
|
path->path.rows = path->path.parent->rows;
|
|
|
|
|
2017-01-13 19:29:31 +01:00
|
|
|
/* For partial paths, scale row estimate. */
|
|
|
|
if (path->path.parallel_workers > 0)
|
2017-03-15 17:28:54 +01:00
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
double parallel_divisor = get_parallel_divisor(&path->path);
|
2017-03-15 17:28:54 +01:00
|
|
|
|
|
|
|
path->path.rows =
|
|
|
|
clamp_row_est(path->path.rows / parallel_divisor);
|
|
|
|
}
|
2017-01-13 19:29:31 +01:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
|
|
|
* We could include disable_cost in the preliminary estimate, but that
|
|
|
|
* would amount to optimizing for the case where the join method is
|
|
|
|
* disabled, which doesn't seem like the way to bet.
|
|
|
|
*/
|
|
|
|
if (!enable_nestloop)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
/* cost of inner-relation source data (we already dealt with outer rel) */
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2017-04-08 04:20:03 +02:00
|
|
|
if (path->jointype == JOIN_SEMI || path->jointype == JOIN_ANTI ||
|
|
|
|
extra->inner_unique)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
|
|
|
/*
|
2017-04-08 04:20:03 +02:00
|
|
|
* With a SEMI or ANTI join, or if the innerrel is known unique, the
|
|
|
|
* executor will stop after the first match.
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
Cost inner_run_cost = workspace->inner_run_cost;
|
|
|
|
Cost inner_rescan_run_cost = workspace->inner_rescan_run_cost;
|
|
|
|
double outer_matched_rows;
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
double outer_unmatched_rows;
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
Selectivity inner_scan_frac;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
/*
|
|
|
|
* For an outer-rel row that has at least one match, we can expect the
|
|
|
|
* inner scan to stop after a fraction 1/(match_count+1) of the inner
|
|
|
|
* rows, if the matches are evenly distributed. Since they probably
|
|
|
|
* aren't quite evenly distributed, we apply a fuzz factor of 2.0 to
|
|
|
|
* that fraction. (If we used a larger fuzz factor, we'd have to
|
|
|
|
* clamp inner_scan_frac to at most 1.0; but since match_count is at
|
|
|
|
* least 1, no such clamp is needed now.)
|
|
|
|
*/
|
2017-04-08 04:20:03 +02:00
|
|
|
outer_matched_rows = rint(outer_path_rows * extra->semifactors.outer_match_frac);
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
outer_unmatched_rows = outer_path_rows - outer_matched_rows;
|
2017-04-08 04:20:03 +02:00
|
|
|
inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute number of tuples processed (not number emitted!). First,
|
|
|
|
* account for successfully-matched outer rows.
|
|
|
|
*/
|
2009-05-10 00:51:41 +02:00
|
|
|
ntuples = outer_matched_rows * inner_path_rows * inner_scan_frac;
|
|
|
|
|
|
|
|
/*
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
* Now we need to estimate the actual costs of scanning the inner
|
|
|
|
* relation, which may be quite a bit less than N times inner_run_cost
|
|
|
|
* due to early scan stops. We consider two cases. If the inner path
|
|
|
|
* is an indexscan using all the joinquals as indexquals, then an
|
|
|
|
* unmatched outer row results in an indexscan returning no rows,
|
|
|
|
* which is probably quite cheap. Otherwise, the executor will have
|
|
|
|
* to scan the whole inner rel for an unmatched row; not so cheap.
|
2009-05-10 00:51:41 +02:00
|
|
|
*/
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
if (has_indexed_join_quals(path))
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
/*
|
|
|
|
* Successfully-matched outer rows will only require scanning
|
|
|
|
* inner_scan_frac of the inner relation. In this case, we don't
|
|
|
|
* need to charge the full inner_run_cost even when that's more
|
|
|
|
* than inner_rescan_run_cost, because we can assume that none of
|
|
|
|
* the inner scans ever scan the whole inner relation. So it's
|
|
|
|
* okay to assume that all the inner scan executions can be
|
|
|
|
* fractions of the full cost, even if materialization is reducing
|
|
|
|
* the rescan cost. At this writing, it's impossible to get here
|
|
|
|
* for a materialized inner scan, so inner_run_cost and
|
|
|
|
* inner_rescan_run_cost will be the same anyway; but just in
|
|
|
|
* case, use inner_run_cost for the first matched tuple and
|
|
|
|
* inner_rescan_run_cost for additional ones.
|
|
|
|
*/
|
|
|
|
run_cost += inner_run_cost * inner_scan_frac;
|
|
|
|
if (outer_matched_rows > 1)
|
|
|
|
run_cost += (outer_matched_rows - 1) * inner_rescan_run_cost * inner_scan_frac;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the cost of inner-scan executions for unmatched outer rows.
|
|
|
|
* We estimate this as the same cost as returning the first tuple
|
|
|
|
* of a nonempty scan. We consider that these are all rescans,
|
|
|
|
* since we used inner_run_cost once already.
|
|
|
|
*/
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
run_cost += outer_unmatched_rows *
|
2009-09-13 00:12:09 +02:00
|
|
|
inner_rescan_run_cost / inner_path_rows;
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
* We won't be evaluating any quals at all for unmatched rows, so
|
2010-02-26 03:01:40 +01:00
|
|
|
* don't add them to ntuples.
|
2009-09-13 00:12:09 +02:00
|
|
|
*/
|
2009-05-10 00:51:41 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
/*
|
|
|
|
* Here, a complicating factor is that rescans may be cheaper than
|
|
|
|
* first scans. If we never scan all the way to the end of the
|
|
|
|
* inner rel, it might be (depending on the plan type) that we'd
|
|
|
|
* never pay the whole inner first-scan run cost. However it is
|
|
|
|
* difficult to estimate whether that will happen (and it could
|
|
|
|
* not happen if there are any unmatched outer rows!), so be
|
|
|
|
* conservative and always charge the whole first-scan cost once.
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
* We consider this charge to correspond to the first unmatched
|
|
|
|
* outer row, unless there isn't one in our estimate, in which
|
|
|
|
* case blame it on the first matched row.
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
*/
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
|
|
|
|
/* First, count all unmatched join tuples as being processed */
|
|
|
|
ntuples += outer_unmatched_rows * inner_path_rows;
|
|
|
|
|
|
|
|
/* Now add the forced full scan, and decrement appropriate count */
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
run_cost += inner_run_cost;
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
if (outer_unmatched_rows >= 1)
|
|
|
|
outer_unmatched_rows -= 1;
|
|
|
|
else
|
|
|
|
outer_matched_rows -= 1;
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
|
|
|
|
/* Add inner run cost for additional outer tuples having matches */
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
if (outer_matched_rows > 0)
|
|
|
|
run_cost += outer_matched_rows * inner_rescan_run_cost * inner_scan_frac;
|
Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick. The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case. If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is. To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.
Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths. That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic. In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem. To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.
To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans. There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.
Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.
Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 17:58:47 +02:00
|
|
|
|
Fix old corner-case logic error in final_cost_nestloop().
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
2017-06-03 19:48:15 +02:00
|
|
|
/* Add inner run cost for additional unmatched outer tuples */
|
|
|
|
if (outer_unmatched_rows > 0)
|
|
|
|
run_cost += outer_unmatched_rows * inner_rescan_run_cost;
|
2009-05-10 00:51:41 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2012-01-28 01:26:38 +01:00
|
|
|
/* Normal-case source costs were included in preliminary estimate */
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
/* Compute number of tuples processed (not number emitted!) */
|
|
|
|
ntuples = outer_path_rows * inner_path_rows;
|
|
|
|
}
|
2000-01-09 01:26:47 +01:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/* CPU costs */
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
cost_qual_eval(&restrict_qual_cost, path->joinrestrictinfo, root);
|
2003-01-12 23:35:29 +01:00
|
|
|
startup_cost += restrict_qual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + restrict_qual_cost.per_tuple;
|
2000-02-15 21:49:31 +01:00
|
|
|
run_cost += cpu_per_tuple * ntuples;
|
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->path.pathtarget->cost.startup;
|
|
|
|
run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
|
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
path->path.startup_cost = startup_cost;
|
|
|
|
path->path.total_cost = startup_cost + run_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2012-01-28 01:26:38 +01:00
|
|
|
* initial_cost_mergejoin
|
|
|
|
* Preliminary estimate of the cost of a mergejoin path.
|
|
|
|
*
|
|
|
|
* This must quickly produce lower-bound estimates of the path's startup and
|
|
|
|
* total costs. If we are unable to eliminate the proposed path from
|
|
|
|
* consideration using the lower bounds, final_cost_mergejoin will be called
|
|
|
|
* to obtain the final estimates.
|
|
|
|
*
|
|
|
|
* The exact division of labor between this function and final_cost_mergejoin
|
|
|
|
* is private to them, and represents a tradeoff between speed of the initial
|
|
|
|
* estimate and getting a tight lower bound. We choose to not examine the
|
|
|
|
* join quals here, except for obtaining the scan selectivity estimate which
|
|
|
|
* is really essential (but fortunately, use of caching keeps the cost of
|
|
|
|
* getting that down to something reasonable).
|
|
|
|
* We also assume that cost_sort is cheap enough to use here.
|
|
|
|
*
|
|
|
|
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
|
|
|
|
* other data to be used by final_cost_mergejoin
|
|
|
|
* 'jointype' is the type of join to be performed
|
|
|
|
* 'mergeclauses' is the list of joinclauses to be used as merge clauses
|
|
|
|
* 'outer_path' is the outer input to the join
|
|
|
|
* 'inner_path' is the inner input to the join
|
|
|
|
* 'outersortkeys' is the list of sort keys for the outer path
|
|
|
|
* 'innersortkeys' is the list of sort keys for the inner path
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
2003-01-27 21:51:54 +01:00
|
|
|
*
|
2012-01-28 01:26:38 +01:00
|
|
|
* Note: outersortkeys and innersortkeys should be NIL if no explicit
|
|
|
|
* sort is needed because the respective source path is already ordered.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2012-01-28 01:26:38 +01:00
|
|
|
initial_cost_mergejoin(PlannerInfo *root, JoinCostWorkspace *workspace,
|
|
|
|
JoinType jointype,
|
|
|
|
List *mergeclauses,
|
|
|
|
Path *outer_path, Path *inner_path,
|
|
|
|
List *outersortkeys, List *innersortkeys,
|
2017-04-08 04:20:03 +02:00
|
|
|
JoinPathExtraData *extra)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-01-28 01:26:38 +01:00
|
|
|
double outer_path_rows = outer_path->rows;
|
|
|
|
double inner_path_rows = inner_path->rows;
|
|
|
|
Cost inner_run_cost;
|
2002-03-01 05:09:28 +01:00
|
|
|
double outer_rows,
|
2007-12-08 22:05:11 +01:00
|
|
|
inner_rows,
|
|
|
|
outer_skip_rows,
|
|
|
|
inner_skip_rows;
|
|
|
|
Selectivity outerstartsel,
|
|
|
|
outerendsel,
|
|
|
|
innerstartsel,
|
|
|
|
innerendsel;
|
2000-02-15 21:49:31 +01:00
|
|
|
Path sort_path; /* dummy for result of cost_sort */
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
/* Protect some assumptions below that rowcounts aren't zero */
|
|
|
|
if (outer_path_rows <= 0)
|
2008-03-24 22:53:04 +01:00
|
|
|
outer_path_rows = 1;
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
if (inner_path_rows <= 0)
|
2008-03-24 22:53:04 +01:00
|
|
|
inner_path_rows = 1;
|
|
|
|
|
2002-03-01 05:09:28 +01:00
|
|
|
/*
|
2005-04-04 03:43:12 +02:00
|
|
|
* A merge join will stop as soon as it exhausts either input stream
|
|
|
|
* (unless it's an outer join, in which case the outer side has to be
|
|
|
|
* scanned all the way anyway). Estimate fraction of the left and right
|
2007-12-08 22:05:11 +01:00
|
|
|
* inputs that will actually need to be scanned. Likewise, we can
|
2009-06-11 16:49:15 +02:00
|
|
|
* estimate the number of rows that will be skipped before the first join
|
|
|
|
* pair is found, which should be factored into startup cost. We use only
|
|
|
|
* the first (most significant) merge clause for this purpose. Since
|
|
|
|
* mergejoinscansel() is a fairly expensive computation, we cache the
|
|
|
|
* results in the merge clause RestrictInfo.
|
2002-03-01 05:09:28 +01:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
if (mergeclauses && jointype != JOIN_FULL)
|
2002-03-01 21:50:20 +01:00
|
|
|
{
|
2007-01-20 21:45:41 +01:00
|
|
|
RestrictInfo *firstclause = (RestrictInfo *) linitial(mergeclauses);
|
|
|
|
List *opathkeys;
|
|
|
|
List *ipathkeys;
|
2007-11-15 22:14:46 +01:00
|
|
|
PathKey *opathkey;
|
|
|
|
PathKey *ipathkey;
|
2007-01-22 21:00:40 +01:00
|
|
|
MergeScanSelCache *cache;
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
/* Get the input pathkeys to determine the sort-order details */
|
|
|
|
opathkeys = outersortkeys ? outersortkeys : outer_path->pathkeys;
|
|
|
|
ipathkeys = innersortkeys ? innersortkeys : inner_path->pathkeys;
|
|
|
|
Assert(opathkeys);
|
|
|
|
Assert(ipathkeys);
|
|
|
|
opathkey = (PathKey *) linitial(opathkeys);
|
|
|
|
ipathkey = (PathKey *) linitial(ipathkeys);
|
|
|
|
/* debugging check */
|
|
|
|
if (opathkey->pk_opfamily != ipathkey->pk_opfamily ||
|
2011-03-20 01:29:08 +01:00
|
|
|
opathkey->pk_eclass->ec_collation != ipathkey->pk_eclass->ec_collation ||
|
2007-01-20 21:45:41 +01:00
|
|
|
opathkey->pk_strategy != ipathkey->pk_strategy ||
|
|
|
|
opathkey->pk_nulls_first != ipathkey->pk_nulls_first)
|
|
|
|
elog(ERROR, "left and right pathkeys do not match in mergejoin");
|
|
|
|
|
2007-01-22 21:00:40 +01:00
|
|
|
/* Get the selectivity with caching */
|
|
|
|
cache = cached_scansel(root, firstclause, opathkey);
|
2007-01-20 21:45:41 +01:00
|
|
|
|
|
|
|
if (bms_is_subset(firstclause->left_relids,
|
|
|
|
outer_path->parent->relids))
|
2004-04-06 20:46:03 +02:00
|
|
|
{
|
|
|
|
/* left side of clause is outer */
|
2007-12-08 22:05:11 +01:00
|
|
|
outerstartsel = cache->leftstartsel;
|
|
|
|
outerendsel = cache->leftendsel;
|
|
|
|
innerstartsel = cache->rightstartsel;
|
|
|
|
innerendsel = cache->rightendsel;
|
2004-04-06 20:46:03 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* left side of clause is inner */
|
2007-12-08 22:05:11 +01:00
|
|
|
outerstartsel = cache->rightstartsel;
|
|
|
|
outerendsel = cache->rightendsel;
|
|
|
|
innerstartsel = cache->leftstartsel;
|
|
|
|
innerendsel = cache->leftendsel;
|
2004-04-06 20:46:03 +02:00
|
|
|
}
|
2012-01-28 01:26:38 +01:00
|
|
|
if (jointype == JOIN_LEFT ||
|
|
|
|
jointype == JOIN_ANTI)
|
2007-12-08 22:05:11 +01:00
|
|
|
{
|
|
|
|
outerstartsel = 0.0;
|
|
|
|
outerendsel = 1.0;
|
|
|
|
}
|
2012-01-28 01:26:38 +01:00
|
|
|
else if (jointype == JOIN_RIGHT)
|
2007-12-08 22:05:11 +01:00
|
|
|
{
|
|
|
|
innerstartsel = 0.0;
|
|
|
|
innerendsel = 1.0;
|
|
|
|
}
|
2002-03-01 21:50:20 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2005-04-04 03:43:12 +02:00
|
|
|
/* cope with clauseless or full mergejoin */
|
2007-12-08 22:05:11 +01:00
|
|
|
outerstartsel = innerstartsel = 0.0;
|
|
|
|
outerendsel = innerendsel = 1.0;
|
2002-03-01 21:50:20 +01:00
|
|
|
}
|
|
|
|
|
2007-12-08 22:05:11 +01:00
|
|
|
/*
|
|
|
|
* Convert selectivities to row counts. We force outer_rows and
|
|
|
|
* inner_rows to be at least 1, but the skip_rows estimates can be zero.
|
|
|
|
*/
|
|
|
|
outer_skip_rows = rint(outer_path_rows * outerstartsel);
|
|
|
|
inner_skip_rows = rint(inner_path_rows * innerstartsel);
|
|
|
|
outer_rows = clamp_row_est(outer_path_rows * outerendsel);
|
|
|
|
inner_rows = clamp_row_est(inner_path_rows * innerendsel);
|
|
|
|
|
|
|
|
Assert(outer_skip_rows <= outer_rows);
|
|
|
|
Assert(inner_skip_rows <= inner_rows);
|
2003-01-22 21:16:42 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Readjust scan selectivities to account for above rounding. This is
|
2005-10-15 04:49:52 +02:00
|
|
|
* normally an insignificant effect, but when there are only a few rows in
|
|
|
|
* the inputs, failing to do this makes for a large percentage error.
|
2003-01-22 21:16:42 +01:00
|
|
|
*/
|
2007-12-08 22:05:11 +01:00
|
|
|
outerstartsel = outer_skip_rows / outer_path_rows;
|
|
|
|
innerstartsel = inner_skip_rows / inner_path_rows;
|
|
|
|
outerendsel = outer_rows / outer_path_rows;
|
|
|
|
innerendsel = inner_rows / inner_path_rows;
|
|
|
|
|
2011-12-30 23:58:15 +01:00
|
|
|
Assert(outerstartsel <= outerendsel);
|
|
|
|
Assert(innerstartsel <= innerendsel);
|
|
|
|
|
2000-01-09 01:26:47 +01:00
|
|
|
/* cost of source data */
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
if (outersortkeys) /* do we need to sort outer? */
|
|
|
|
{
|
|
|
|
cost_sort(&sort_path,
|
2001-06-05 07:26:05 +02:00
|
|
|
root,
|
2000-02-15 21:49:31 +01:00
|
|
|
outersortkeys,
|
2002-11-21 01:42:20 +01:00
|
|
|
outer_path->total_cost,
|
2003-01-27 21:51:54 +01:00
|
|
|
outer_path_rows,
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
outer_path->pathtarget->width,
|
2010-10-08 02:00:28 +02:00
|
|
|
0.0,
|
|
|
|
work_mem,
|
2007-05-04 03:13:45 +02:00
|
|
|
-1.0);
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += sort_path.startup_cost;
|
2007-12-08 22:05:11 +01:00
|
|
|
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
|
|
|
|
* outerstartsel;
|
2002-03-01 05:09:28 +01:00
|
|
|
run_cost += (sort_path.total_cost - sort_path.startup_cost)
|
2007-12-08 22:05:11 +01:00
|
|
|
* (outerendsel - outerstartsel);
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
startup_cost += outer_path->startup_cost;
|
2007-12-08 22:05:11 +01:00
|
|
|
startup_cost += (outer_path->total_cost - outer_path->startup_cost)
|
|
|
|
* outerstartsel;
|
2002-03-01 05:09:28 +01:00
|
|
|
run_cost += (outer_path->total_cost - outer_path->startup_cost)
|
2007-12-08 22:05:11 +01:00
|
|
|
* (outerendsel - outerstartsel);
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
2000-01-09 01:26:47 +01:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
if (innersortkeys) /* do we need to sort inner? */
|
|
|
|
{
|
|
|
|
cost_sort(&sort_path,
|
2001-06-05 07:26:05 +02:00
|
|
|
root,
|
2000-02-15 21:49:31 +01:00
|
|
|
innersortkeys,
|
2002-11-21 01:42:20 +01:00
|
|
|
inner_path->total_cost,
|
2003-01-27 21:51:54 +01:00
|
|
|
inner_path_rows,
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
inner_path->pathtarget->width,
|
2010-10-08 02:00:28 +02:00
|
|
|
0.0,
|
|
|
|
work_mem,
|
2007-05-04 03:13:45 +02:00
|
|
|
-1.0);
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += sort_path.startup_cost;
|
2007-12-08 22:05:11 +01:00
|
|
|
startup_cost += (sort_path.total_cost - sort_path.startup_cost)
|
2009-11-15 03:45:35 +01:00
|
|
|
* innerstartsel;
|
|
|
|
inner_run_cost = (sort_path.total_cost - sort_path.startup_cost)
|
|
|
|
* (innerendsel - innerstartsel);
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
startup_cost += inner_path->startup_cost;
|
2007-12-08 22:05:11 +01:00
|
|
|
startup_cost += (inner_path->total_cost - inner_path->startup_cost)
|
2009-11-15 03:45:35 +01:00
|
|
|
* innerstartsel;
|
|
|
|
inner_run_cost = (inner_path->total_cost - inner_path->startup_cost)
|
|
|
|
* (innerendsel - innerstartsel);
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
1999-04-30 06:01:44 +02:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
|
|
|
* We can't yet determine whether rescanning occurs, or whether
|
|
|
|
* materialization of the inner input should be done. The minimum
|
|
|
|
* possible inner input cost, regardless of rescan and materialization
|
|
|
|
* considerations, is inner_run_cost. We include that in
|
|
|
|
* workspace->total_cost, but not yet in run_cost.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* CPU costs left for later */
|
|
|
|
|
|
|
|
/* Public result fields */
|
|
|
|
workspace->startup_cost = startup_cost;
|
|
|
|
workspace->total_cost = startup_cost + run_cost + inner_run_cost;
|
|
|
|
/* Save private data for final_cost_mergejoin */
|
|
|
|
workspace->run_cost = run_cost;
|
|
|
|
workspace->inner_run_cost = inner_run_cost;
|
|
|
|
workspace->outer_rows = outer_rows;
|
|
|
|
workspace->inner_rows = inner_rows;
|
|
|
|
workspace->outer_skip_rows = outer_skip_rows;
|
|
|
|
workspace->inner_skip_rows = inner_skip_rows;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* final_cost_mergejoin
|
|
|
|
* Final estimate of the cost and result size of a mergejoin path.
|
|
|
|
*
|
2017-04-08 04:20:03 +02:00
|
|
|
* Unlike other costsize functions, this routine makes two actual decisions:
|
|
|
|
* whether the executor will need to do mark/restore, and whether we should
|
|
|
|
* materialize the inner path. It would be logically cleaner to build
|
|
|
|
* separate paths testing these alternatives, but that would require repeating
|
|
|
|
* most of the cost calculations, which are not all that cheap. Since the
|
|
|
|
* choice will not affect output pathkeys or startup cost, only total cost,
|
|
|
|
* there is no possibility of wanting to keep more than one path. So it seems
|
|
|
|
* best to make the decisions here and record them in the path's
|
|
|
|
* skip_mark_restore and materialize_inner fields.
|
|
|
|
*
|
|
|
|
* Mark/restore overhead is usually required, but can be skipped if we know
|
|
|
|
* that the executor need find only one match per outer tuple, and that the
|
|
|
|
* mergeclauses are sufficient to identify a match.
|
|
|
|
*
|
|
|
|
* We materialize the inner path if we need mark/restore and either the inner
|
|
|
|
* path can't support mark/restore, or it's cheaper to use an interposed
|
|
|
|
* Material node to handle mark/restore.
|
2012-01-28 01:26:38 +01:00
|
|
|
*
|
|
|
|
* 'path' is already filled in except for the rows and cost fields and
|
2017-04-08 04:20:03 +02:00
|
|
|
* skip_mark_restore and materialize_inner
|
2012-01-28 01:26:38 +01:00
|
|
|
* 'workspace' is the result from initial_cost_mergejoin
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
final_cost_mergejoin(PlannerInfo *root, MergePath *path,
|
|
|
|
JoinCostWorkspace *workspace,
|
2017-04-08 04:20:03 +02:00
|
|
|
JoinPathExtraData *extra)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
|
|
|
Path *outer_path = path->jpath.outerjoinpath;
|
|
|
|
Path *inner_path = path->jpath.innerjoinpath;
|
|
|
|
double inner_path_rows = inner_path->rows;
|
|
|
|
List *mergeclauses = path->path_mergeclauses;
|
|
|
|
List *innersortkeys = path->innersortkeys;
|
|
|
|
Cost startup_cost = workspace->startup_cost;
|
|
|
|
Cost run_cost = workspace->run_cost;
|
|
|
|
Cost inner_run_cost = workspace->inner_run_cost;
|
|
|
|
double outer_rows = workspace->outer_rows;
|
|
|
|
double inner_rows = workspace->inner_rows;
|
|
|
|
double outer_skip_rows = workspace->outer_skip_rows;
|
|
|
|
double inner_skip_rows = workspace->inner_skip_rows;
|
|
|
|
Cost cpu_per_tuple,
|
|
|
|
bare_inner_cost,
|
|
|
|
mat_inner_cost;
|
|
|
|
QualCost merge_qual_cost;
|
|
|
|
QualCost qp_qual_cost;
|
|
|
|
double mergejointuples,
|
|
|
|
rescannedtuples;
|
|
|
|
double rescanratio;
|
|
|
|
|
Prevent overly large and NaN row estimates in relations
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
2020-10-18 23:53:52 +02:00
|
|
|
/* Protect some assumptions below that rowcounts aren't zero */
|
|
|
|
if (inner_path_rows <= 0)
|
2012-01-28 01:26:38 +01:00
|
|
|
inner_path_rows = 1;
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (path->jpath.path.param_info)
|
|
|
|
path->jpath.path.rows = path->jpath.path.param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->jpath.path.rows = path->jpath.path.parent->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2017-01-13 19:29:31 +01:00
|
|
|
/* For partial paths, scale row estimate. */
|
|
|
|
if (path->jpath.path.parallel_workers > 0)
|
2017-03-15 17:28:54 +01:00
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
double parallel_divisor = get_parallel_divisor(&path->jpath.path);
|
2017-03-15 17:28:54 +01:00
|
|
|
|
|
|
|
path->jpath.path.rows =
|
|
|
|
clamp_row_est(path->jpath.path.rows / parallel_divisor);
|
|
|
|
}
|
2017-01-13 19:29:31 +01:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
|
|
|
* We could include disable_cost in the preliminary estimate, but that
|
|
|
|
* would amount to optimizing for the case where the join method is
|
|
|
|
* disabled, which doesn't seem like the way to bet.
|
|
|
|
*/
|
|
|
|
if (!enable_mergejoin)
|
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute cost of the mergequals and qpquals (other restriction clauses)
|
|
|
|
* separately.
|
|
|
|
*/
|
|
|
|
cost_qual_eval(&merge_qual_cost, mergeclauses, root);
|
|
|
|
cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
|
|
|
|
qp_qual_cost.startup -= merge_qual_cost.startup;
|
|
|
|
qp_qual_cost.per_tuple -= merge_qual_cost.per_tuple;
|
|
|
|
|
2017-04-08 04:20:03 +02:00
|
|
|
/*
|
|
|
|
* With a SEMI or ANTI join, or if the innerrel is known unique, the
|
|
|
|
* executor will stop scanning for matches after the first match. When
|
|
|
|
* all the joinclauses are merge clauses, this means we don't ever need to
|
|
|
|
* back up the merge, and so we can skip mark/restore overhead.
|
|
|
|
*/
|
|
|
|
if ((path->jpath.jointype == JOIN_SEMI ||
|
|
|
|
path->jpath.jointype == JOIN_ANTI ||
|
|
|
|
extra->inner_unique) &&
|
|
|
|
(list_length(path->jpath.joinrestrictinfo) ==
|
|
|
|
list_length(path->path_mergeclauses)))
|
|
|
|
path->skip_mark_restore = true;
|
|
|
|
else
|
|
|
|
path->skip_mark_restore = false;
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Get approx # tuples passing the mergequals. We use approx_tuple_count
|
2012-01-28 01:26:38 +01:00
|
|
|
* here because we need an estimate done with JOIN_INNER semantics.
|
|
|
|
*/
|
|
|
|
mergejointuples = approx_tuple_count(root, &path->jpath, mergeclauses);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When there are equal merge keys in the outer relation, the mergejoin
|
|
|
|
* must rescan any matching tuples in the inner relation. This means
|
|
|
|
* re-fetching inner tuples; we have to estimate how often that happens.
|
|
|
|
*
|
|
|
|
* For regular inner and outer joins, the number of re-fetches can be
|
|
|
|
* estimated approximately as size of merge join output minus size of
|
|
|
|
* inner relation. Assume that the distinct key values are 1, 2, ..., and
|
|
|
|
* denote the number of values of each key in the outer relation as m1,
|
2014-05-06 18:12:18 +02:00
|
|
|
* m2, ...; in the inner relation, n1, n2, ... Then we have
|
2012-01-28 01:26:38 +01:00
|
|
|
*
|
|
|
|
* size of join = m1 * n1 + m2 * n2 + ...
|
|
|
|
*
|
|
|
|
* number of rescanned tuples = (m1 - 1) * n1 + (m2 - 1) * n2 + ... = m1 *
|
|
|
|
* n1 + m2 * n2 + ... - (n1 + n2 + ...) = size of join - size of inner
|
|
|
|
* relation
|
|
|
|
*
|
|
|
|
* This equation works correctly for outer tuples having no inner match
|
|
|
|
* (nk = 0), but not for inner tuples having no outer match (mk = 0); we
|
|
|
|
* are effectively subtracting those from the number of rescanned tuples,
|
2014-05-06 18:12:18 +02:00
|
|
|
* when we should not. Can we do better without expensive selectivity
|
2012-01-28 01:26:38 +01:00
|
|
|
* computations?
|
|
|
|
*
|
|
|
|
* The whole issue is moot if we are working from a unique-ified outer
|
2017-04-08 04:20:03 +02:00
|
|
|
* input, or if we know we don't need to mark/restore at all.
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
2020-05-16 17:54:51 +02:00
|
|
|
if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
|
2012-01-28 01:26:38 +01:00
|
|
|
rescannedtuples = 0;
|
|
|
|
else
|
|
|
|
{
|
|
|
|
rescannedtuples = mergejointuples - inner_path_rows;
|
|
|
|
/* Must clamp because of possible underestimate */
|
|
|
|
if (rescannedtuples < 0)
|
|
|
|
rescannedtuples = 0;
|
|
|
|
}
|
2018-12-18 17:19:38 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We'll inflate various costs this much to account for rescanning. Note
|
|
|
|
* that this is to be multiplied by something involving inner_rows, or
|
|
|
|
* another number related to the portion of the inner rel we'll scan.
|
|
|
|
*/
|
|
|
|
rescanratio = 1.0 + (rescannedtuples / inner_rows);
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2009-11-15 03:45:35 +01:00
|
|
|
/*
|
|
|
|
* Decide whether we want to materialize the inner input to shield it from
|
2014-05-06 18:12:18 +02:00
|
|
|
* mark/restore and performing re-fetches. Our cost model for regular
|
2009-11-15 03:45:35 +01:00
|
|
|
* re-fetches is that a re-fetch costs the same as an original fetch,
|
|
|
|
* which is probably an overestimate; but on the other hand we ignore the
|
|
|
|
* bookkeeping costs of mark/restore. Not clear if it's worth developing
|
2010-02-26 03:01:40 +01:00
|
|
|
* a more refined model. So we just need to inflate the inner run cost by
|
|
|
|
* rescanratio.
|
2009-11-15 03:45:35 +01:00
|
|
|
*/
|
|
|
|
bare_inner_cost = inner_run_cost * rescanratio;
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-11-15 03:45:35 +01:00
|
|
|
/*
|
|
|
|
* When we interpose a Material node the re-fetch cost is assumed to be
|
2010-02-19 22:49:10 +01:00
|
|
|
* just cpu_operator_cost per tuple, independently of the underlying
|
|
|
|
* plan's cost; and we charge an extra cpu_operator_cost per original
|
|
|
|
* fetch as well. Note that we're assuming the materialize node will
|
|
|
|
* never spill to disk, since it only has to remember tuples back to the
|
|
|
|
* last mark. (If there are a huge number of duplicates, our other cost
|
2009-11-15 03:45:35 +01:00
|
|
|
* factors will make the path so expensive that it probably won't get
|
2010-02-26 03:01:40 +01:00
|
|
|
* chosen anyway.) So we don't use cost_rescan here.
|
2009-11-15 03:45:35 +01:00
|
|
|
*
|
|
|
|
* Note: keep this estimate in sync with create_mergejoin_plan's labeling
|
|
|
|
* of the generated Material node.
|
|
|
|
*/
|
|
|
|
mat_inner_cost = inner_run_cost +
|
2018-12-18 17:19:38 +01:00
|
|
|
cpu_operator_cost * inner_rows * rescanratio;
|
2009-11-15 03:45:35 +01:00
|
|
|
|
2017-04-08 04:20:03 +02:00
|
|
|
/*
|
|
|
|
* If we don't need mark/restore at all, we don't need materialization.
|
|
|
|
*/
|
|
|
|
if (path->skip_mark_restore)
|
|
|
|
path->materialize_inner = false;
|
|
|
|
|
2010-04-19 02:55:26 +02:00
|
|
|
/*
|
2010-07-06 21:19:02 +02:00
|
|
|
* Prefer materializing if it looks cheaper, unless the user has asked to
|
|
|
|
* suppress materialization.
|
2010-04-19 02:55:26 +02:00
|
|
|
*/
|
2017-04-08 04:20:03 +02:00
|
|
|
else if (enable_material && mat_inner_cost < bare_inner_cost)
|
2009-11-15 03:45:35 +01:00
|
|
|
path->materialize_inner = true;
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-11-15 03:45:35 +01:00
|
|
|
/*
|
|
|
|
* Even if materializing doesn't look cheaper, we *must* do it if the
|
|
|
|
* inner path is to be used directly (without sorting) and it doesn't
|
|
|
|
* support mark/restore.
|
|
|
|
*
|
|
|
|
* Since the inner side must be ordered, and only Sorts and IndexScans can
|
|
|
|
* create order to begin with, and they both support mark/restore, you
|
|
|
|
* might think there's no problem --- but you'd be wrong. Nestloop and
|
|
|
|
* merge joins can *preserve* the order of their inputs, so they can be
|
|
|
|
* selected as the input of a mergejoin, and they don't support
|
|
|
|
* mark/restore at present.
|
2010-04-19 02:55:26 +02:00
|
|
|
*
|
2010-07-06 21:19:02 +02:00
|
|
|
* We don't test the value of enable_material here, because
|
|
|
|
* materialization is required for correctness in this case, and turning
|
|
|
|
* it off does not entitle us to deliver an invalid plan.
|
2009-11-15 03:45:35 +01:00
|
|
|
*/
|
|
|
|
else if (innersortkeys == NIL &&
|
2014-11-07 23:26:02 +01:00
|
|
|
!ExecSupportsMarkRestore(inner_path))
|
2009-11-15 03:45:35 +01:00
|
|
|
path->materialize_inner = true;
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-11-15 03:45:35 +01:00
|
|
|
/*
|
|
|
|
* Also, force materializing if the inner path is to be sorted and the
|
|
|
|
* sort is expected to spill to disk. This is because the final merge
|
|
|
|
* pass can be done on-the-fly if it doesn't have to support mark/restore.
|
|
|
|
* We don't try to adjust the cost estimates for this consideration,
|
|
|
|
* though.
|
2010-04-19 02:55:26 +02:00
|
|
|
*
|
2010-07-06 21:19:02 +02:00
|
|
|
* Since materialization is a performance optimization in this case,
|
|
|
|
* rather than necessary for correctness, we skip it if enable_material is
|
|
|
|
* off.
|
2009-11-15 03:45:35 +01:00
|
|
|
*/
|
2010-04-19 02:55:26 +02:00
|
|
|
else if (enable_material && innersortkeys != NIL &&
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
relation_byte_size(inner_path_rows,
|
|
|
|
inner_path->pathtarget->width) >
|
2009-11-15 03:45:35 +01:00
|
|
|
(work_mem * 1024L))
|
|
|
|
path->materialize_inner = true;
|
|
|
|
else
|
|
|
|
path->materialize_inner = false;
|
|
|
|
|
|
|
|
/* Charge the right incremental cost for the chosen case */
|
|
|
|
if (path->materialize_inner)
|
|
|
|
run_cost += mat_inner_cost;
|
|
|
|
else
|
|
|
|
run_cost += bare_inner_cost;
|
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
/* CPU costs */
|
|
|
|
|
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* The number of tuple comparisons needed is approximately number of outer
|
|
|
|
* rows plus number of inner rows plus number of rescanned tuples (can we
|
|
|
|
* refine this?). At each one, we need to evaluate the mergejoin quals.
|
2003-01-27 21:51:54 +01:00
|
|
|
*/
|
|
|
|
startup_cost += merge_qual_cost.startup;
|
2007-12-08 22:05:11 +01:00
|
|
|
startup_cost += merge_qual_cost.per_tuple *
|
|
|
|
(outer_skip_rows + inner_skip_rows * rescanratio);
|
2003-01-27 21:51:54 +01:00
|
|
|
run_cost += merge_qual_cost.per_tuple *
|
2007-12-08 22:05:11 +01:00
|
|
|
((outer_rows - outer_skip_rows) +
|
|
|
|
(inner_rows - inner_skip_rows) * rescanratio);
|
2001-06-05 07:26:05 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For each tuple that gets through the mergejoin proper, we charge
|
|
|
|
* cpu_tuple_cost plus the cost of evaluating additional restriction
|
2014-05-06 18:12:18 +02:00
|
|
|
* clauses that are to be applied at the join. (This is pessimistic since
|
2008-08-16 02:01:38 +02:00
|
|
|
* not all of the quals may get evaluated at each tuple.)
|
2009-05-10 00:51:41 +02:00
|
|
|
*
|
2009-06-11 16:49:15 +02:00
|
|
|
* Note: we could adjust for SEMI/ANTI joins skipping some qual
|
|
|
|
* evaluations here, but it's probably not worth the trouble.
|
2001-06-05 07:26:05 +02:00
|
|
|
*/
|
2003-01-27 21:51:54 +01:00
|
|
|
startup_cost += qp_qual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
|
2008-08-16 02:01:38 +02:00
|
|
|
run_cost += cpu_per_tuple * mergejointuples;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->jpath.path.pathtarget->cost.startup;
|
|
|
|
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;
|
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
path->jpath.path.startup_cost = startup_cost;
|
|
|
|
path->jpath.path.total_cost = startup_cost + run_cost;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
2007-01-22 21:00:40 +01:00
|
|
|
/*
|
|
|
|
* run mergejoinscansel() with caching
|
|
|
|
*/
|
|
|
|
static MergeScanSelCache *
|
2007-11-15 23:25:18 +01:00
|
|
|
cached_scansel(PlannerInfo *root, RestrictInfo *rinfo, PathKey *pathkey)
|
2007-01-22 21:00:40 +01:00
|
|
|
{
|
|
|
|
MergeScanSelCache *cache;
|
|
|
|
ListCell *lc;
|
2007-12-08 22:05:11 +01:00
|
|
|
Selectivity leftstartsel,
|
|
|
|
leftendsel,
|
|
|
|
rightstartsel,
|
|
|
|
rightendsel;
|
2007-01-22 21:00:40 +01:00
|
|
|
MemoryContext oldcontext;
|
|
|
|
|
|
|
|
/* Do we have this result already? */
|
|
|
|
foreach(lc, rinfo->scansel_cache)
|
|
|
|
{
|
|
|
|
cache = (MergeScanSelCache *) lfirst(lc);
|
|
|
|
if (cache->opfamily == pathkey->pk_opfamily &&
|
2011-03-20 01:29:08 +01:00
|
|
|
cache->collation == pathkey->pk_eclass->ec_collation &&
|
2007-01-22 21:00:40 +01:00
|
|
|
cache->strategy == pathkey->pk_strategy &&
|
|
|
|
cache->nulls_first == pathkey->pk_nulls_first)
|
|
|
|
return cache;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Nope, do the computation */
|
|
|
|
mergejoinscansel(root,
|
|
|
|
(Node *) rinfo->clause,
|
|
|
|
pathkey->pk_opfamily,
|
|
|
|
pathkey->pk_strategy,
|
|
|
|
pathkey->pk_nulls_first,
|
2007-12-08 22:05:11 +01:00
|
|
|
&leftstartsel,
|
|
|
|
&leftendsel,
|
|
|
|
&rightstartsel,
|
|
|
|
&rightendsel);
|
2007-01-22 21:00:40 +01:00
|
|
|
|
|
|
|
/* Cache the result in suitably long-lived workspace */
|
|
|
|
oldcontext = MemoryContextSwitchTo(root->planner_cxt);
|
|
|
|
|
|
|
|
cache = (MergeScanSelCache *) palloc(sizeof(MergeScanSelCache));
|
|
|
|
cache->opfamily = pathkey->pk_opfamily;
|
2011-03-20 01:29:08 +01:00
|
|
|
cache->collation = pathkey->pk_eclass->ec_collation;
|
2007-01-22 21:00:40 +01:00
|
|
|
cache->strategy = pathkey->pk_strategy;
|
|
|
|
cache->nulls_first = pathkey->pk_nulls_first;
|
2007-12-08 22:05:11 +01:00
|
|
|
cache->leftstartsel = leftstartsel;
|
|
|
|
cache->leftendsel = leftendsel;
|
|
|
|
cache->rightstartsel = rightstartsel;
|
|
|
|
cache->rightendsel = rightendsel;
|
2007-01-22 21:00:40 +01:00
|
|
|
|
|
|
|
rinfo->scansel_cache = lappend(rinfo->scansel_cache, cache);
|
|
|
|
|
|
|
|
MemoryContextSwitchTo(oldcontext);
|
|
|
|
|
|
|
|
return cache;
|
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2012-01-28 01:26:38 +01:00
|
|
|
* initial_cost_hashjoin
|
|
|
|
* Preliminary estimate of the cost of a hashjoin path.
|
|
|
|
*
|
|
|
|
* This must quickly produce lower-bound estimates of the path's startup and
|
|
|
|
* total costs. If we are unable to eliminate the proposed path from
|
|
|
|
* consideration using the lower bounds, final_cost_hashjoin will be called
|
|
|
|
* to obtain the final estimates.
|
|
|
|
*
|
|
|
|
* The exact division of labor between this function and final_cost_hashjoin
|
|
|
|
* is private to them, and represents a tradeoff between speed of the initial
|
|
|
|
* estimate and getting a tight lower bound. We choose to not examine the
|
|
|
|
* join quals here (other than by counting the number of hash clauses),
|
|
|
|
* so we can't do much with CPU costs. We do assume that
|
|
|
|
* ExecChooseHashTableSize is cheap enough to use here.
|
|
|
|
*
|
|
|
|
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
|
|
|
|
* other data to be used by final_cost_hashjoin
|
|
|
|
* 'jointype' is the type of join to be performed
|
|
|
|
* 'hashclauses' is the list of joinclauses to be used as hash clauses
|
|
|
|
* 'outer_path' is the outer input to the join
|
|
|
|
* 'inner_path' is the inner input to the join
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
2018-03-07 03:54:37 +01:00
|
|
|
* 'parallel_hash' indicates that inner_path is partial and that a shared
|
|
|
|
* hash table will be built in parallel
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-02-15 21:49:31 +01:00
|
|
|
void
|
2012-01-28 01:26:38 +01:00
|
|
|
initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
|
|
|
|
JoinType jointype,
|
|
|
|
List *hashclauses,
|
|
|
|
Path *outer_path, Path *inner_path,
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
JoinPathExtraData *extra,
|
|
|
|
bool parallel_hash)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-02-15 21:49:31 +01:00
|
|
|
Cost startup_cost = 0;
|
|
|
|
Cost run_cost = 0;
|
2012-01-28 01:26:38 +01:00
|
|
|
double outer_path_rows = outer_path->rows;
|
|
|
|
double inner_path_rows = inner_path->rows;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
double inner_path_rows_total = inner_path_rows;
|
2004-05-31 01:40:41 +02:00
|
|
|
int num_hashclauses = list_length(hashclauses);
|
2005-03-06 23:15:05 +01:00
|
|
|
int numbuckets;
|
2002-12-30 16:21:23 +01:00
|
|
|
int numbatches;
|
2009-03-21 01:04:40 +01:00
|
|
|
int num_skew_mcvs;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
size_t space_allowed; /* unused */
|
2003-01-27 21:51:54 +01:00
|
|
|
|
1999-04-05 04:07:07 +02:00
|
|
|
/* cost of source data */
|
2000-02-15 21:49:31 +01:00
|
|
|
startup_cost += outer_path->startup_cost;
|
|
|
|
run_cost += outer_path->total_cost - outer_path->startup_cost;
|
|
|
|
startup_cost += inner_path->total_cost;
|
1999-04-05 04:07:07 +02:00
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Cost of computing hash function: must do it once per input tuple. We
|
2007-01-08 17:09:22 +01:00
|
|
|
* charge one cpu_operator_cost for each column's hash function. Also,
|
|
|
|
* tack on one cpu_tuple_cost per inner row, to model the costs of
|
|
|
|
* inserting the row into the hashtable.
|
2003-01-27 21:51:54 +01:00
|
|
|
*
|
2005-10-15 04:49:52 +02:00
|
|
|
* XXX when a hashclause is more complex than a single operator, we really
|
|
|
|
* should charge the extra eval costs of the left or right side, as
|
|
|
|
* appropriate, here. This seems more work than it's worth at the moment.
|
2003-01-27 21:51:54 +01:00
|
|
|
*/
|
2007-01-08 17:09:22 +01:00
|
|
|
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
|
|
|
|
* inner_path_rows;
|
2003-01-27 21:51:54 +01:00
|
|
|
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;
|
1999-04-05 04:07:07 +02:00
|
|
|
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
/*
|
|
|
|
* If this is a parallel hash build, then the value we have for
|
|
|
|
* inner_rows_total currently refers only to the rows returned by each
|
|
|
|
* participant. For shared hash table size estimation, we need the total
|
|
|
|
* number, so we need to undo the division.
|
|
|
|
*/
|
|
|
|
if (parallel_hash)
|
|
|
|
inner_path_rows_total *= get_parallel_divisor(inner_path);
|
|
|
|
|
2009-03-21 01:04:40 +01:00
|
|
|
/*
|
|
|
|
* Get hash table size that executor would use for inner relation.
|
|
|
|
*
|
|
|
|
* XXX for the moment, always assume that skew optimization will be
|
2020-07-29 23:14:58 +02:00
|
|
|
* performed. As long as SKEW_HASH_MEM_PERCENT is small, it's not worth
|
2009-03-21 01:04:40 +01:00
|
|
|
* trying to determine that for sure.
|
|
|
|
*
|
|
|
|
* XXX at some point it might be interesting to try to account for skew
|
|
|
|
* optimization in the cost estimate, but for now, we don't.
|
|
|
|
*/
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
ExecChooseHashTableSize(inner_path_rows_total,
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
inner_path->pathtarget->width,
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
true, /* useskew */
|
2020-07-29 23:14:58 +02:00
|
|
|
parallel_hash, /* try_combined_hash_mem */
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
outer_path->parallel_workers,
|
|
|
|
&space_allowed,
|
2005-03-06 23:15:05 +01:00
|
|
|
&numbuckets,
|
2009-03-21 01:04:40 +01:00
|
|
|
&numbatches,
|
|
|
|
&num_skew_mcvs);
|
2012-01-28 01:26:38 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If inner relation is too big then we will need to "batch" the join,
|
|
|
|
* which implies writing and reading most of the tuples to disk an extra
|
|
|
|
* time. Charge seq_page_cost per page, since the I/O should be nice and
|
2014-05-06 18:12:18 +02:00
|
|
|
* sequential. Writing the inner rel counts as startup cost, all the rest
|
2012-01-28 01:26:38 +01:00
|
|
|
* as run cost.
|
|
|
|
*/
|
|
|
|
if (numbatches > 1)
|
|
|
|
{
|
|
|
|
double outerpages = page_size(outer_path_rows,
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
outer_path->pathtarget->width);
|
2012-01-28 01:26:38 +01:00
|
|
|
double innerpages = page_size(inner_path_rows,
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
inner_path->pathtarget->width);
|
2012-01-28 01:26:38 +01:00
|
|
|
|
|
|
|
startup_cost += seq_page_cost * innerpages;
|
|
|
|
run_cost += seq_page_cost * (innerpages + 2 * outerpages);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* CPU costs left for later */
|
|
|
|
|
|
|
|
/* Public result fields */
|
|
|
|
workspace->startup_cost = startup_cost;
|
|
|
|
workspace->total_cost = startup_cost + run_cost;
|
|
|
|
/* Save private data for final_cost_hashjoin */
|
|
|
|
workspace->run_cost = run_cost;
|
|
|
|
workspace->numbuckets = numbuckets;
|
|
|
|
workspace->numbatches = numbatches;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
workspace->inner_rows_total = inner_path_rows_total;
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* final_cost_hashjoin
|
|
|
|
* Final estimate of the cost and result size of a hashjoin path.
|
|
|
|
*
|
|
|
|
* Note: the numbatches estimate is also saved into 'path' for use later
|
|
|
|
*
|
|
|
|
* 'path' is already filled in except for the rows and cost fields and
|
|
|
|
* num_batches
|
|
|
|
* 'workspace' is the result from initial_cost_hashjoin
|
2017-04-08 04:20:03 +02:00
|
|
|
* 'extra' contains miscellaneous information about the join
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
|
|
|
|
JoinCostWorkspace *workspace,
|
2017-04-08 04:20:03 +02:00
|
|
|
JoinPathExtraData *extra)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
|
|
|
Path *outer_path = path->jpath.outerjoinpath;
|
|
|
|
Path *inner_path = path->jpath.innerjoinpath;
|
|
|
|
double outer_path_rows = outer_path->rows;
|
|
|
|
double inner_path_rows = inner_path->rows;
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
double inner_path_rows_total = workspace->inner_rows_total;
|
2012-01-28 01:26:38 +01:00
|
|
|
List *hashclauses = path->path_hashclauses;
|
|
|
|
Cost startup_cost = workspace->startup_cost;
|
|
|
|
Cost run_cost = workspace->run_cost;
|
|
|
|
int numbuckets = workspace->numbuckets;
|
|
|
|
int numbatches = workspace->numbatches;
|
2020-07-29 23:14:58 +02:00
|
|
|
int hash_mem;
|
2012-01-28 01:26:38 +01:00
|
|
|
Cost cpu_per_tuple;
|
|
|
|
QualCost hash_qual_cost;
|
|
|
|
QualCost qp_qual_cost;
|
|
|
|
double hashjointuples;
|
|
|
|
double virtualbuckets;
|
|
|
|
Selectivity innerbucketsize;
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
Selectivity innermcvfreq;
|
2012-01-28 01:26:38 +01:00
|
|
|
ListCell *hcl;
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Mark the path with the correct row estimate */
|
|
|
|
if (path->jpath.path.param_info)
|
|
|
|
path->jpath.path.rows = path->jpath.path.param_info->ppi_rows;
|
|
|
|
else
|
|
|
|
path->jpath.path.rows = path->jpath.path.parent->rows;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2017-01-13 19:29:31 +01:00
|
|
|
/* For partial paths, scale row estimate. */
|
|
|
|
if (path->jpath.path.parallel_workers > 0)
|
2017-03-15 17:28:54 +01:00
|
|
|
{
|
2017-05-17 22:31:56 +02:00
|
|
|
double parallel_divisor = get_parallel_divisor(&path->jpath.path);
|
2017-03-15 17:28:54 +01:00
|
|
|
|
|
|
|
path->jpath.path.rows =
|
|
|
|
clamp_row_est(path->jpath.path.rows / parallel_divisor);
|
|
|
|
}
|
2017-01-13 19:29:31 +01:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
|
|
|
* We could include disable_cost in the preliminary estimate, but that
|
|
|
|
* would amount to optimizing for the case where the join method is
|
|
|
|
* disabled, which doesn't seem like the way to bet.
|
|
|
|
*/
|
|
|
|
if (!enable_hashjoin)
|
|
|
|
startup_cost += disable_cost;
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2009-03-26 18:15:35 +01:00
|
|
|
/* mark the path with estimated # of batches */
|
|
|
|
path->num_batches = numbatches;
|
2002-12-30 16:21:23 +01:00
|
|
|
|
Add parallel-aware hash joins.
Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel
Hash Join with Parallel Hash. While hash joins could already appear in
parallel queries, they were previously always parallel-oblivious and had a
partial subplan only on the outer side, meaning that the work of the inner
subplan was duplicated in every worker.
After this commit, the planner will consider using a partial subplan on the
inner side too, using the Parallel Hash node to divide the work over the
available CPU cores and combine its results in shared memory. If the join
needs to be split into multiple batches in order to respect work_mem, then
workers process different batches as much as possible and then work together
on the remaining batches.
The advantages of a parallel-aware hash join over a parallel-oblivious hash
join used in a parallel query are that it:
* avoids wasting memory on duplicated hash tables
* avoids wasting disk space on duplicated batch files
* divides the work of building the hash table over the CPUs
One disadvantage is that there is some communication between the participating
CPUs which might outweigh the benefits of parallelism in the case of small
hash tables. This is avoided by the planner's existing reluctance to supply
partial plans for small scans, but it may be necessary to estimate
synchronization costs in future if that situation changes. Another is that
outer batch 0 must be written to disk if multiple batches are required.
A potential future advantage of parallel-aware hash joins is that right and
full outer joins could be supported, since there is a single set of matched
bits for each hashtable, but that is not yet implemented.
A new GUC enable_parallel_hash is defined to control the feature, defaulting
to on.
Author: Thomas Munro
Reviewed-By: Andres Freund, Robert Haas
Tested-By: Rafia Sabih, Prabhat Sahu
Discussion:
https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com
https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
2017-12-21 08:39:21 +01:00
|
|
|
/* store the total number of tuples (sum of partial row estimates) */
|
|
|
|
path->inner_rows_total = inner_path_rows_total;
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/* and compute the number of "virtual" buckets in the whole join */
|
2017-06-21 20:39:04 +02:00
|
|
|
virtualbuckets = (double) numbuckets * (double) numbatches;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
2001-06-05 07:26:05 +02:00
|
|
|
/*
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
* Determine bucketsize fraction and MCV frequency for the inner relation.
|
|
|
|
* We use the smallest bucketsize or MCV frequency estimated for any
|
|
|
|
* individual hashclause; this is undoubtedly conservative.
|
2003-01-28 23:13:41 +01:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* BUT: if inner relation has been unique-ified, we can assume it's good
|
|
|
|
* for hashing. This is important both because it's the right answer, and
|
2005-10-15 04:49:52 +02:00
|
|
|
* because we avoid contaminating the cache with a value that's wrong for
|
|
|
|
* non-unique-ified paths.
|
2001-06-05 07:26:05 +02:00
|
|
|
*/
|
2003-01-28 23:13:41 +01:00
|
|
|
if (IsA(inner_path, UniquePath))
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
{
|
2003-01-28 23:13:41 +01:00
|
|
|
innerbucketsize = 1.0 / virtualbuckets;
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
innermcvfreq = 0.0;
|
|
|
|
}
|
2003-01-28 23:13:41 +01:00
|
|
|
else
|
2001-06-05 07:26:05 +02:00
|
|
|
{
|
2003-01-28 23:13:41 +01:00
|
|
|
innerbucketsize = 1.0;
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
innermcvfreq = 1.0;
|
2003-01-28 23:13:41 +01:00
|
|
|
foreach(hcl, hashclauses)
|
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
|
2003-01-28 23:13:41 +01:00
|
|
|
Selectivity thisbucketsize;
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
Selectivity thismcvfreq;
|
2002-11-30 01:08:22 +01:00
|
|
|
|
2003-01-28 23:13:41 +01:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* First we have to figure out which side of the hashjoin clause
|
|
|
|
* is the inner side.
|
2003-01-28 23:13:41 +01:00
|
|
|
*
|
|
|
|
* Since we tend to visit the same clauses over and over when
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
* planning a large query, we cache the bucket stats estimates in
|
|
|
|
* the RestrictInfo node to avoid repeated lookups of statistics.
|
2003-01-28 23:13:41 +01:00
|
|
|
*/
|
2003-02-08 21:20:55 +01:00
|
|
|
if (bms_is_subset(restrictinfo->right_relids,
|
|
|
|
inner_path->parent->relids))
|
2002-11-30 01:08:22 +01:00
|
|
|
{
|
2003-01-28 23:13:41 +01:00
|
|
|
/* righthand side is inner */
|
|
|
|
thisbucketsize = restrictinfo->right_bucketsize;
|
|
|
|
if (thisbucketsize < 0)
|
|
|
|
{
|
|
|
|
/* not cached yet */
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
estimate_hash_bucket_stats(root,
|
|
|
|
get_rightop(restrictinfo->clause),
|
|
|
|
virtualbuckets,
|
|
|
|
&restrictinfo->right_mcvfreq,
|
|
|
|
&restrictinfo->right_bucketsize);
|
|
|
|
thisbucketsize = restrictinfo->right_bucketsize;
|
2003-01-28 23:13:41 +01:00
|
|
|
}
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
thismcvfreq = restrictinfo->right_mcvfreq;
|
2002-11-30 01:08:22 +01:00
|
|
|
}
|
2003-01-28 23:13:41 +01:00
|
|
|
else
|
2002-11-30 01:08:22 +01:00
|
|
|
{
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(bms_is_subset(restrictinfo->left_relids,
|
|
|
|
inner_path->parent->relids));
|
2003-01-28 23:13:41 +01:00
|
|
|
/* lefthand side is inner */
|
|
|
|
thisbucketsize = restrictinfo->left_bucketsize;
|
|
|
|
if (thisbucketsize < 0)
|
|
|
|
{
|
|
|
|
/* not cached yet */
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
estimate_hash_bucket_stats(root,
|
|
|
|
get_leftop(restrictinfo->clause),
|
|
|
|
virtualbuckets,
|
|
|
|
&restrictinfo->left_mcvfreq,
|
|
|
|
&restrictinfo->left_bucketsize);
|
|
|
|
thisbucketsize = restrictinfo->left_bucketsize;
|
2003-01-28 23:13:41 +01:00
|
|
|
}
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
thismcvfreq = restrictinfo->left_mcvfreq;
|
2002-11-30 01:08:22 +01:00
|
|
|
}
|
|
|
|
|
2003-01-28 23:13:41 +01:00
|
|
|
if (innerbucketsize > thisbucketsize)
|
|
|
|
innerbucketsize = thisbucketsize;
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
if (innermcvfreq > thismcvfreq)
|
|
|
|
innermcvfreq = thismcvfreq;
|
2003-01-28 23:13:41 +01:00
|
|
|
}
|
2001-06-05 07:26:05 +02:00
|
|
|
}
|
|
|
|
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
/*
|
2020-07-29 23:14:58 +02:00
|
|
|
* If the bucket holding the inner MCV would exceed hash_mem, we don't
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
* want to hash unless there is really no other alternative, so apply
|
|
|
|
* disable_cost. (The executor normally copes with excessive memory usage
|
|
|
|
* by splitting batches, but obviously it cannot separate equal values
|
2020-07-29 23:14:58 +02:00
|
|
|
* that way, so it will be unable to drive the batch size below hash_mem
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
* when this is true.)
|
|
|
|
*/
|
2020-07-29 23:14:58 +02:00
|
|
|
hash_mem = get_hash_mem();
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
|
|
|
|
inner_path->pathtarget->width) >
|
2020-07-29 23:14:58 +02:00
|
|
|
(hash_mem * 1024L))
|
Avoid out-of-memory in a hash join with many duplicate inner keys.
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
2017-08-15 20:05:46 +02:00
|
|
|
startup_cost += disable_cost;
|
|
|
|
|
1999-05-25 18:15:34 +02:00
|
|
|
/*
|
2012-01-28 01:26:38 +01:00
|
|
|
* Compute cost of the hashquals and qpquals (other restriction clauses)
|
|
|
|
* separately.
|
1999-08-06 06:00:17 +02:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
cost_qual_eval(&hash_qual_cost, hashclauses, root);
|
|
|
|
cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
|
|
|
|
qp_qual_cost.startup -= hash_qual_cost.startup;
|
|
|
|
qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;
|
1999-08-06 06:00:17 +02:00
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
/* CPU costs */
|
|
|
|
|
2017-04-08 04:20:03 +02:00
|
|
|
if (path->jpath.jointype == JOIN_SEMI ||
|
|
|
|
path->jpath.jointype == JOIN_ANTI ||
|
|
|
|
extra->inner_unique)
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
|
|
|
double outer_matched_rows;
|
2009-06-11 16:49:15 +02:00
|
|
|
Selectivity inner_scan_frac;
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
/*
|
2017-04-08 04:20:03 +02:00
|
|
|
* With a SEMI or ANTI join, or if the innerrel is known unique, the
|
|
|
|
* executor will stop after the first match.
|
2009-05-10 00:51:41 +02:00
|
|
|
*
|
|
|
|
* For an outer-rel row that has at least one match, we can expect the
|
|
|
|
* bucket scan to stop after a fraction 1/(match_count+1) of the
|
|
|
|
* bucket's rows, if the matches are evenly distributed. Since they
|
|
|
|
* probably aren't quite evenly distributed, we apply a fuzz factor of
|
|
|
|
* 2.0 to that fraction. (If we used a larger fuzz factor, we'd have
|
|
|
|
* to clamp inner_scan_frac to at most 1.0; but since match_count is
|
|
|
|
* at least 1, no such clamp is needed now.)
|
|
|
|
*/
|
2017-04-08 04:20:03 +02:00
|
|
|
outer_matched_rows = rint(outer_path_rows * extra->semifactors.outer_match_frac);
|
|
|
|
inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
startup_cost += hash_qual_cost.startup;
|
|
|
|
run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
|
|
|
|
clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For unmatched outer-rel rows, the picture is quite a lot different.
|
|
|
|
* In the first place, there is no reason to assume that these rows
|
|
|
|
* preferentially hit heavily-populated buckets; instead assume they
|
|
|
|
* are uncorrelated with the inner distribution and so they see an
|
|
|
|
* average bucket size of inner_path_rows / virtualbuckets. In the
|
2009-06-11 16:49:15 +02:00
|
|
|
* second place, it seems likely that they will have few if any exact
|
|
|
|
* hash-code matches and so very few of the tuples in the bucket will
|
|
|
|
* actually require eval of the hash quals. We don't have any good
|
|
|
|
* way to estimate how many will, but for the moment assume that the
|
|
|
|
* effective cost per bucket entry is one-tenth what it is for
|
|
|
|
* matchable tuples.
|
2009-05-10 00:51:41 +02:00
|
|
|
*/
|
|
|
|
run_cost += hash_qual_cost.per_tuple *
|
|
|
|
(outer_path_rows - outer_matched_rows) *
|
|
|
|
clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;
|
|
|
|
|
|
|
|
/* Get # of tuples that will pass the basic join */
|
2018-07-14 17:59:12 +02:00
|
|
|
if (path->jpath.jointype == JOIN_ANTI)
|
2009-05-10 00:51:41 +02:00
|
|
|
hashjointuples = outer_path_rows - outer_matched_rows;
|
2018-07-14 17:59:12 +02:00
|
|
|
else
|
|
|
|
hashjointuples = outer_matched_rows;
|
2009-05-10 00:51:41 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The number of tuple comparisons needed is the number of outer
|
|
|
|
* tuples times the typical number of tuples in a hash bucket, which
|
|
|
|
* is the inner relation size times its bucketsize fraction. At each
|
|
|
|
* one, we need to evaluate the hashjoin quals. But actually,
|
|
|
|
* charging the full qual eval cost at each tuple is pessimistic,
|
|
|
|
* since we don't evaluate the quals unless the hash values match
|
|
|
|
* exactly. For lack of a better idea, halve the cost estimate to
|
|
|
|
* allow for that.
|
|
|
|
*/
|
|
|
|
startup_cost += hash_qual_cost.startup;
|
|
|
|
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
|
|
|
|
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get approx # tuples passing the hashquals. We use
|
|
|
|
* approx_tuple_count here because we need an estimate done with
|
|
|
|
* JOIN_INNER semantics.
|
|
|
|
*/
|
|
|
|
hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
|
|
|
|
}
|
2003-01-27 21:51:54 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For each tuple that gets through the hashjoin proper, we charge
|
|
|
|
* cpu_tuple_cost plus the cost of evaluating additional restriction
|
2014-05-06 18:12:18 +02:00
|
|
|
* clauses that are to be applied at the join. (This is pessimistic since
|
2005-10-15 04:49:52 +02:00
|
|
|
* not all of the quals may get evaluated at each tuple.)
|
2003-01-27 21:51:54 +01:00
|
|
|
*/
|
|
|
|
startup_cost += qp_qual_cost.startup;
|
|
|
|
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
|
2008-08-16 02:01:38 +02:00
|
|
|
run_cost += cpu_per_tuple * hashjointuples;
|
2003-01-27 21:51:54 +01:00
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* tlist eval costs are paid per output row, not per tuple scanned */
|
|
|
|
startup_cost += path->jpath.path.pathtarget->cost.startup;
|
|
|
|
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;
|
|
|
|
|
2003-01-27 21:51:54 +01:00
|
|
|
path->jpath.path.startup_cost = startup_cost;
|
|
|
|
path->jpath.path.total_cost = startup_cost + run_cost;
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2008-08-22 02:16:04 +02:00
|
|
|
/*
|
|
|
|
* cost_subplan
|
|
|
|
* Figure the costs for a SubPlan (or initplan).
|
|
|
|
*
|
|
|
|
* Note: we could dig the subplan's Plan out of the root list, but in practice
|
|
|
|
* all callers have it handy already, so we make them pass it.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)
|
|
|
|
{
|
|
|
|
QualCost sp_cost;
|
|
|
|
|
|
|
|
/* Figure any cost for evaluating the testexpr */
|
|
|
|
cost_qual_eval(&sp_cost,
|
|
|
|
make_ands_implicit((Expr *) subplan->testexpr),
|
|
|
|
root);
|
|
|
|
|
|
|
|
if (subplan->useHashTable)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If we are using a hash table for the subquery outputs, then the
|
|
|
|
* cost of evaluating the query is a one-time cost. We charge one
|
|
|
|
* cpu_operator_cost per tuple for the work of loading the hashtable,
|
|
|
|
* too.
|
|
|
|
*/
|
|
|
|
sp_cost.startup += plan->total_cost +
|
|
|
|
cpu_operator_cost * plan->plan_rows;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The per-tuple costs include the cost of evaluating the lefthand
|
|
|
|
* expressions, plus the cost of probing the hashtable. We already
|
2009-06-11 16:49:15 +02:00
|
|
|
* accounted for the lefthand expressions as part of the testexpr, and
|
|
|
|
* will also have counted one cpu_operator_cost for each comparison
|
|
|
|
* operator. That is probably too low for the probing cost, but it's
|
|
|
|
* hard to make a better estimate, so live with it for now.
|
2008-08-22 02:16:04 +02:00
|
|
|
*/
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Otherwise we will be rescanning the subplan output on each
|
2014-05-06 18:12:18 +02:00
|
|
|
* evaluation. We need to estimate how much of the output we will
|
2008-08-22 02:16:04 +02:00
|
|
|
* actually need to scan. NOTE: this logic should agree with the
|
|
|
|
* tuple_fraction estimates used by make_subplan() in
|
|
|
|
* plan/subselect.c.
|
|
|
|
*/
|
|
|
|
Cost plan_run_cost = plan->total_cost - plan->startup_cost;
|
|
|
|
|
|
|
|
if (subplan->subLinkType == EXISTS_SUBLINK)
|
|
|
|
{
|
2016-03-26 17:03:12 +01:00
|
|
|
/* we only need to fetch 1 tuple; clamp to avoid zero divide */
|
|
|
|
sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
|
2008-08-22 02:16:04 +02:00
|
|
|
}
|
|
|
|
else if (subplan->subLinkType == ALL_SUBLINK ||
|
|
|
|
subplan->subLinkType == ANY_SUBLINK)
|
|
|
|
{
|
|
|
|
/* assume we need 50% of the tuples */
|
|
|
|
sp_cost.per_tuple += 0.50 * plan_run_cost;
|
|
|
|
/* also charge a cpu_operator_cost per row examined */
|
|
|
|
sp_cost.per_tuple += 0.50 * plan->plan_rows * cpu_operator_cost;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* assume we need all tuples */
|
|
|
|
sp_cost.per_tuple += plan_run_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Also account for subplan's startup cost. If the subplan is
|
2009-09-13 00:12:09 +02:00
|
|
|
* uncorrelated or undirect correlated, AND its topmost node is one
|
|
|
|
* that materializes its output, assume that we'll only need to pay
|
|
|
|
* its startup cost once; otherwise assume we pay the startup cost
|
|
|
|
* every time.
|
2008-08-22 02:16:04 +02:00
|
|
|
*/
|
|
|
|
if (subplan->parParam == NIL &&
|
2009-09-13 00:12:09 +02:00
|
|
|
ExecMaterializesOutput(nodeTag(plan)))
|
2008-08-22 02:16:04 +02:00
|
|
|
sp_cost.startup += plan->startup_cost;
|
|
|
|
else
|
|
|
|
sp_cost.per_tuple += plan->startup_cost;
|
|
|
|
}
|
|
|
|
|
|
|
|
subplan->startup_cost = sp_cost.startup;
|
|
|
|
subplan->per_call_cost = sp_cost.per_tuple;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
|
|
|
* cost_rescan
|
|
|
|
* Given a finished Path, estimate the costs of rescanning it after
|
2014-05-06 18:12:18 +02:00
|
|
|
* having done so the first time. For some Path types a rescan is
|
2009-09-13 00:12:09 +02:00
|
|
|
* cheaper than an original scan (if no parameters change), and this
|
|
|
|
* function embodies knowledge about that. The default is to return
|
2014-05-06 18:12:18 +02:00
|
|
|
* the same costs stored in the Path. (Note that the cost estimates
|
2009-09-13 00:12:09 +02:00
|
|
|
* actually stored in Paths are always for first scans.)
|
|
|
|
*
|
|
|
|
* This function is not currently intended to model effects such as rescans
|
|
|
|
* being cheaper due to disk block caching; what we are concerned with is
|
|
|
|
* plan types wherein the executor caches results explicitly, or doesn't
|
|
|
|
* redo startup calculations, etc.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
cost_rescan(PlannerInfo *root, Path *path,
|
2010-02-26 03:01:40 +01:00
|
|
|
Cost *rescan_startup_cost, /* output parameters */
|
2009-09-13 00:12:09 +02:00
|
|
|
Cost *rescan_total_cost)
|
|
|
|
{
|
|
|
|
switch (path->pathtype)
|
|
|
|
{
|
|
|
|
case T_FunctionScan:
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* Currently, nodeFunctionscan.c always executes the function to
|
|
|
|
* completion before returning any rows, and caches the results in
|
|
|
|
* a tuplestore. So the function eval cost is all startup cost
|
|
|
|
* and isn't paid over again on rescans. However, all run costs
|
|
|
|
* will be paid over again.
|
2009-09-13 00:12:09 +02:00
|
|
|
*/
|
|
|
|
*rescan_startup_cost = 0;
|
|
|
|
*rescan_total_cost = path->total_cost - path->startup_cost;
|
|
|
|
break;
|
|
|
|
case T_HashJoin:
|
2010-02-26 03:01:40 +01:00
|
|
|
|
2009-09-13 00:12:09 +02:00
|
|
|
/*
|
2016-07-27 23:44:34 +02:00
|
|
|
* If it's a single-batch join, we don't need to rebuild the hash
|
|
|
|
* table during a rescan.
|
2009-09-13 00:12:09 +02:00
|
|
|
*/
|
2016-07-27 23:44:34 +02:00
|
|
|
if (((HashPath *) path)->num_batches == 1)
|
|
|
|
{
|
|
|
|
/* Startup cost is exactly the cost of hash table building */
|
|
|
|
*rescan_startup_cost = 0;
|
|
|
|
*rescan_total_cost = path->total_cost - path->startup_cost;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Otherwise, no special treatment */
|
|
|
|
*rescan_startup_cost = path->startup_cost;
|
|
|
|
*rescan_total_cost = path->total_cost;
|
|
|
|
}
|
2009-09-13 00:12:09 +02:00
|
|
|
break;
|
|
|
|
case T_CteScan:
|
|
|
|
case T_WorkTableScan:
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* These plan types materialize their final result in a
|
2014-05-06 18:12:18 +02:00
|
|
|
* tuplestore or tuplesort object. So the rescan cost is only
|
2009-09-13 00:12:09 +02:00
|
|
|
* cpu_tuple_cost per tuple, unless the result is large enough
|
|
|
|
* to spill to disk.
|
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
Cost run_cost = cpu_tuple_cost * path->rows;
|
|
|
|
double nbytes = relation_byte_size(path->rows,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
path->pathtarget->width);
|
2010-02-26 03:01:40 +01:00
|
|
|
long work_mem_bytes = work_mem * 1024L;
|
2009-09-13 00:12:09 +02:00
|
|
|
|
|
|
|
if (nbytes > work_mem_bytes)
|
|
|
|
{
|
|
|
|
/* It will spill, so account for re-read cost */
|
|
|
|
double npages = ceil(nbytes / BLCKSZ);
|
|
|
|
|
|
|
|
run_cost += seq_page_cost * npages;
|
|
|
|
}
|
|
|
|
*rescan_startup_cost = 0;
|
|
|
|
*rescan_total_cost = run_cost;
|
|
|
|
}
|
|
|
|
break;
|
2010-02-19 22:49:10 +01:00
|
|
|
case T_Material:
|
|
|
|
case T_Sort:
|
|
|
|
{
|
|
|
|
/*
|
2010-02-26 03:01:40 +01:00
|
|
|
* These plan types not only materialize their results, but do
|
2014-05-06 18:12:18 +02:00
|
|
|
* not implement qual filtering or projection. So they are
|
|
|
|
* even cheaper to rescan than the ones above. We charge only
|
2010-02-26 03:01:40 +01:00
|
|
|
* cpu_operator_cost per tuple. (Note: keep that in sync with
|
|
|
|
* the run_cost charge in cost_sort, and also see comments in
|
|
|
|
* cost_material before you change it.)
|
2010-02-19 22:49:10 +01:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
Cost run_cost = cpu_operator_cost * path->rows;
|
|
|
|
double nbytes = relation_byte_size(path->rows,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
path->pathtarget->width);
|
2010-02-26 03:01:40 +01:00
|
|
|
long work_mem_bytes = work_mem * 1024L;
|
2010-02-19 22:49:10 +01:00
|
|
|
|
|
|
|
if (nbytes > work_mem_bytes)
|
|
|
|
{
|
|
|
|
/* It will spill, so account for re-read cost */
|
|
|
|
double npages = ceil(nbytes / BLCKSZ);
|
|
|
|
|
|
|
|
run_cost += seq_page_cost * npages;
|
|
|
|
}
|
|
|
|
*rescan_startup_cost = 0;
|
|
|
|
*rescan_total_cost = run_cost;
|
|
|
|
}
|
|
|
|
break;
|
2009-09-13 00:12:09 +02:00
|
|
|
default:
|
|
|
|
*rescan_startup_cost = path->startup_cost;
|
|
|
|
*rescan_total_cost = path->total_cost;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/*
|
|
|
|
* cost_qual_eval
|
2003-01-12 23:35:29 +01:00
|
|
|
* Estimate the CPU costs of evaluating a WHERE clause.
|
2000-02-15 21:49:31 +01:00
|
|
|
* The input can be either an implicitly-ANDed list of boolean
|
2007-01-22 02:35:23 +01:00
|
|
|
* expressions, or a list of RestrictInfo nodes. (The latter is
|
|
|
|
* preferred since it allows caching of the results.)
|
2003-01-12 23:35:29 +01:00
|
|
|
* The result includes both a one-time (startup) component,
|
|
|
|
* and a per-evaluation component.
|
2000-02-15 21:49:31 +01:00
|
|
|
*/
|
2003-01-12 23:35:29 +01:00
|
|
|
void
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root)
|
2000-02-15 21:49:31 +01:00
|
|
|
{
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_context context;
|
2004-05-26 06:41:50 +02:00
|
|
|
ListCell *l;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2007-02-22 23:00:26 +01:00
|
|
|
context.root = root;
|
|
|
|
context.total.startup = 0;
|
|
|
|
context.total.per_tuple = 0;
|
2003-01-12 23:35:29 +01:00
|
|
|
|
2000-12-13 00:33:34 +01:00
|
|
|
/* We don't charge any cost for the implicit ANDing at top level ... */
|
|
|
|
|
|
|
|
foreach(l, quals)
|
|
|
|
{
|
2001-03-22 05:01:46 +01:00
|
|
|
Node *qual = (Node *) lfirst(l);
|
2000-12-13 00:33:34 +01:00
|
|
|
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_walker(qual, &context);
|
2000-12-13 00:33:34 +01:00
|
|
|
}
|
2007-02-22 23:00:26 +01:00
|
|
|
|
|
|
|
*cost = context.total;
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
/*
|
|
|
|
* cost_qual_eval_node
|
|
|
|
* As above, for a single RestrictInfo or expression.
|
|
|
|
*/
|
|
|
|
void
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root)
|
2007-01-22 02:35:23 +01:00
|
|
|
{
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_context context;
|
|
|
|
|
|
|
|
context.root = root;
|
|
|
|
context.total.startup = 0;
|
|
|
|
context.total.per_tuple = 0;
|
|
|
|
|
|
|
|
cost_qual_eval_walker(qual, &context);
|
|
|
|
|
|
|
|
*cost = context.total;
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
static bool
|
2007-11-15 23:25:18 +01:00
|
|
|
cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
|
2000-02-15 21:49:31 +01:00
|
|
|
{
|
|
|
|
if (node == NULL)
|
|
|
|
return false;
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
/*
|
2007-01-22 02:35:23 +01:00
|
|
|
* RestrictInfo nodes contain an eval_cost field reserved for this
|
2007-11-15 22:14:46 +01:00
|
|
|
* routine's use, so that it's not necessary to evaluate the qual clause's
|
|
|
|
* cost more than once. If the clause's cost hasn't been computed yet,
|
|
|
|
* the field's startup value will contain -1.
|
2007-01-22 02:35:23 +01:00
|
|
|
*/
|
|
|
|
if (IsA(node, RestrictInfo))
|
|
|
|
{
|
|
|
|
RestrictInfo *rinfo = (RestrictInfo *) node;
|
|
|
|
|
|
|
|
if (rinfo->eval_cost.startup < 0)
|
|
|
|
{
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_context locContext;
|
|
|
|
|
|
|
|
locContext.root = context->root;
|
|
|
|
locContext.total.startup = 0;
|
|
|
|
locContext.total.per_tuple = 0;
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* For an OR clause, recurse into the marked-up tree so that we
|
|
|
|
* set the eval_cost for contained RestrictInfos too.
|
2007-01-22 02:35:23 +01:00
|
|
|
*/
|
|
|
|
if (rinfo->orclause)
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_walker((Node *) rinfo->orclause, &locContext);
|
2007-01-22 02:35:23 +01:00
|
|
|
else
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval_walker((Node *) rinfo->clause, &locContext);
|
2007-11-15 22:14:46 +01:00
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
/*
|
|
|
|
* If the RestrictInfo is marked pseudoconstant, it will be tested
|
|
|
|
* only once, so treat its cost as all startup cost.
|
|
|
|
*/
|
|
|
|
if (rinfo->pseudoconstant)
|
|
|
|
{
|
|
|
|
/* count one execution during startup */
|
2007-02-22 23:00:26 +01:00
|
|
|
locContext.total.startup += locContext.total.per_tuple;
|
|
|
|
locContext.total.per_tuple = 0;
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
2007-02-22 23:00:26 +01:00
|
|
|
rinfo->eval_cost = locContext.total;
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
2007-02-22 23:00:26 +01:00
|
|
|
context->total.startup += rinfo->eval_cost.startup;
|
|
|
|
context->total.per_tuple += rinfo->eval_cost.per_tuple;
|
2007-01-22 02:35:23 +01:00
|
|
|
/* do NOT recurse into children */
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For each operator or function node in the given tree, we charge the
|
2007-11-15 22:14:46 +01:00
|
|
|
* estimated execution cost given by pg_proc.procost (remember to multiply
|
|
|
|
* this by cpu_operator_cost).
|
2007-01-22 02:35:23 +01:00
|
|
|
*
|
|
|
|
* Vars and Consts are charged zero, and so are boolean operators (AND,
|
|
|
|
* OR, NOT). Simplistic, but a lot better than no model at all.
|
2000-02-15 21:49:31 +01:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Should we try to account for the possibility of short-circuit
|
2007-01-22 02:35:23 +01:00
|
|
|
* evaluation of AND/OR? Probably *not*, because that would make the
|
|
|
|
* results depend on the clause ordering, and we are not in any position
|
|
|
|
* to expect that the current ordering of the clauses is the one that's
|
2014-05-06 18:12:18 +02:00
|
|
|
* going to end up being used. The above per-RestrictInfo caching would
|
2011-04-24 22:55:20 +02:00
|
|
|
* not mix well with trying to re-order clauses anyway.
|
2012-07-21 23:45:07 +02:00
|
|
|
*
|
|
|
|
* Another issue that is entirely ignored here is that if a set-returning
|
|
|
|
* function is below top level in the tree, the functions/operators above
|
|
|
|
* it will need to be evaluated multiple times. In practical use, such
|
|
|
|
* cases arise so seldom as to not be worth the added complexity needed;
|
|
|
|
* moreover, since our rowcount estimates for functions tend to be pretty
|
|
|
|
* phony, the results would also be pretty phony.
|
1999-04-05 04:07:07 +02:00
|
|
|
*/
|
2007-01-22 02:35:23 +01:00
|
|
|
if (IsA(node, FuncExpr))
|
|
|
|
{
|
2019-02-10 00:32:23 +01:00
|
|
|
add_function_cost(context->root, ((FuncExpr *) node)->funcid, node,
|
|
|
|
&context->total);
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
|
|
|
else if (IsA(node, OpExpr) ||
|
|
|
|
IsA(node, DistinctExpr) ||
|
|
|
|
IsA(node, NullIfExpr))
|
|
|
|
{
|
|
|
|
/* rely on struct equivalence to treat these all alike */
|
|
|
|
set_opfuncid((OpExpr *) node);
|
2019-02-10 00:32:23 +01:00
|
|
|
add_function_cost(context->root, ((OpExpr *) node)->opfuncid, node,
|
|
|
|
&context->total);
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
2003-06-29 02:33:44 +02:00
|
|
|
else if (IsA(node, ScalarArrayOpExpr))
|
|
|
|
{
|
2005-11-26 23:14:57 +01:00
|
|
|
/*
|
|
|
|
* Estimate that the operator will be applied to about half of the
|
|
|
|
* array elements before the answer is determined.
|
|
|
|
*/
|
|
|
|
ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) node;
|
2006-10-04 02:30:14 +02:00
|
|
|
Node *arraynode = (Node *) lsecond(saop->args);
|
2019-02-10 00:32:23 +01:00
|
|
|
QualCost sacosts;
|
2005-11-26 23:14:57 +01:00
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
set_sa_opfuncid(saop);
|
2019-02-10 00:32:23 +01:00
|
|
|
sacosts.startup = sacosts.per_tuple = 0;
|
|
|
|
add_function_cost(context->root, saop->opfuncid, NULL,
|
|
|
|
&sacosts);
|
|
|
|
context->total.startup += sacosts.startup;
|
|
|
|
context->total.per_tuple += sacosts.per_tuple *
|
|
|
|
estimate_array_length(arraynode) * 0.5;
|
2003-06-29 02:33:44 +02:00
|
|
|
}
|
2011-04-24 22:55:20 +02:00
|
|
|
else if (IsA(node, Aggref) ||
|
|
|
|
IsA(node, WindowFunc))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Aggref and WindowFunc nodes are (and should be) treated like Vars,
|
|
|
|
* ie, zero execution cost in the current model, because they behave
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure for to later reduce the per executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work has already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which caches
ExprStates, the old set could stick around longer. The behavior
around might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
|
|
|
* essentially like Vars at execution. We disregard the costs of
|
2011-04-24 22:55:20 +02:00
|
|
|
* their input expressions for the same reason. The actual execution
|
|
|
|
* costs of the aggregate/window functions and their arguments have to
|
|
|
|
* be factored into plan-node-specific costing of the Agg or WindowAgg
|
|
|
|
* plan node.
|
|
|
|
*/
|
|
|
|
return false; /* don't recurse into children */
|
|
|
|
}
|
2007-06-05 23:31:09 +02:00
|
|
|
else if (IsA(node, CoerceViaIO))
|
|
|
|
{
|
|
|
|
CoerceViaIO *iocoerce = (CoerceViaIO *) node;
|
2007-11-15 22:14:46 +01:00
|
|
|
Oid iofunc;
|
|
|
|
Oid typioparam;
|
|
|
|
bool typisvarlena;
|
2007-06-05 23:31:09 +02:00
|
|
|
|
|
|
|
/* check the result type's input function */
|
|
|
|
getTypeInputInfo(iocoerce->resulttype,
|
|
|
|
&iofunc, &typioparam);
|
2019-02-10 00:32:23 +01:00
|
|
|
add_function_cost(context->root, iofunc, NULL,
|
|
|
|
&context->total);
|
2007-06-05 23:31:09 +02:00
|
|
|
/* check the input type's output function */
|
|
|
|
getTypeOutputInfo(exprType((Node *) iocoerce->arg),
|
|
|
|
&iofunc, &typisvarlena);
|
2019-02-10 00:32:23 +01:00
|
|
|
add_function_cost(context->root, iofunc, NULL,
|
|
|
|
&context->total);
|
2007-06-05 23:31:09 +02:00
|
|
|
}
|
2007-03-28 01:21:12 +02:00
|
|
|
else if (IsA(node, ArrayCoerceExpr))
|
|
|
|
{
|
|
|
|
ArrayCoerceExpr *acoerce = (ArrayCoerceExpr *) node;
|
Support arrays over domains.
Allowing arrays with a domain type as their element type was left un-done
in the original domain patch, but not for any very good reason. This
omission leads to such surprising results as array_agg() not working on
a domain column, because the parser can't identify a suitable output type
for the polymorphic aggregate.
In order to fix this, first clean up the APIs of coerce_to_domain() and
some internal functions in parse_coerce.c so that we consistently pass
around a CoercionContext along with CoercionForm. Previously, we sometimes
passed an "isExplicit" boolean flag instead, which is strictly less
information; and coerce_to_domain() didn't even get that, but instead had
to reverse-engineer isExplicit from CoercionForm. That's contrary to the
documentation in primnodes.h that says that CoercionForm only affects
display and not semantics. I don't think this change fixes any live bugs,
but it makes things more consistent. The main reason for doing it though
is that now build_coercion_expression() receives ccontext, which it needs
in order to be able to recursively invoke coerce_to_target_type().
Next, reimplement ArrayCoerceExpr so that the node does not directly know
any details of what has to be done to the individual array elements while
performing the array coercion. Instead, the per-element processing is
represented by a sub-expression whose input is a source array element and
whose output is a target array element. This simplifies life in
parse_coerce.c, because it can build that sub-expression by a recursive
invocation of coerce_to_target_type(). The executor now handles the
per-element processing as a compiled expression instead of hard-wired code.
The main advantage of this is that we can use a single ArrayCoerceExpr to
handle as many as three successive steps per element: base type conversion,
typmod coercion, and domain constraint checking. The old code used two
stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
inefficient, and adding yet another array deconstruction to do domain
constraint checking seemed very unappetizing.
In the case where we just need a single, very simple coercion function,
doing this straightforwardly leads to a noticeable increase in the
per-array-element runtime cost. Hence, add an additional shortcut evalfunc
in execExprInterp.c that skips unnecessary overhead for that specific form
of expression. The runtime speed of simple cases is within 1% or so of
where it was before, while cases that previously required two levels of
array processing are significantly faster.
Finally, create an implicit array type for every domain type, as we do for
base types, enums, etc. Everything except the array-coercion case seems
to just work without further effort.
Tom Lane, reviewed by Andrew Dunstan
Discussion: https://postgr.es/m/9852.1499791473@sss.pgh.pa.us
2017-09-30 19:40:56 +02:00
|
|
|
QualCost perelemcost;
|
|
|
|
|
|
|
|
cost_qual_eval_node(&perelemcost, (Node *) acoerce->elemexpr,
|
|
|
|
context->root);
|
|
|
|
context->total.startup += perelemcost.startup;
|
|
|
|
if (perelemcost.per_tuple > 0)
|
|
|
|
context->total.per_tuple += perelemcost.per_tuple *
|
|
|
|
estimate_array_length((Node *) acoerce->arg);
|
2007-03-28 01:21:12 +02:00
|
|
|
}
|
2005-12-28 02:30:02 +01:00
|
|
|
else if (IsA(node, RowCompareExpr))
|
|
|
|
{
|
|
|
|
/* Conservatively assume we will check all the columns */
|
|
|
|
RowCompareExpr *rcexpr = (RowCompareExpr *) node;
|
2007-01-22 02:35:23 +01:00
|
|
|
ListCell *lc;
|
2005-12-28 02:30:02 +01:00
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
foreach(lc, rcexpr->opnos)
|
|
|
|
{
|
2007-11-15 22:14:46 +01:00
|
|
|
Oid opid = lfirst_oid(lc);
|
2007-01-22 02:35:23 +01:00
|
|
|
|
2019-02-10 00:32:23 +01:00
|
|
|
add_function_cost(context->root, get_opcode(opid), NULL,
|
|
|
|
&context->total);
|
2007-01-22 02:35:23 +01:00
|
|
|
}
|
2005-12-28 02:30:02 +01:00
|
|
|
}
|
Code review for NextValueExpr expression node type.
Add missing infrastructure for this node type, notably in ruleutils.c where
its lack could demonstrably cause EXPLAIN to fail. Add outfuncs/readfuncs
support. (outfuncs support is useful today for debugging purposes. The
readfuncs support may never be needed, since at present it would only
matter for parallel query and NextValueExpr should never appear in a
parallelizable query; but it seems like a bad idea to have a primnode type
that isn't fully supported here.) Teach planner infrastructure that
NextValueExpr is a volatile, parallel-unsafe, non-leaky expression node
with cost cpu_operator_cost. Given its limited scope of usage, there
*might* be no live bug today from the lack of that knowledge, but it's
certainly going to bite us on the rear someday. Teach pg_stat_statements
about the new node type, too.
While at it, also teach cost_qual_eval() that MinMaxExpr, SQLValueFunction,
XmlExpr, and CoerceToDomain should be charged as cpu_operator_cost.
Failing to do this for SQLValueFunction was an oversight in my commit
0bb51aa96. The others are longer-standing oversights, but no time like the
present to fix them. (In principle, CoerceToDomain could have cost much
higher than this, but it doesn't presently seem worth trying to examine the
domain's constraints here.)
Modify execExprInterp.c to execute NextValueExpr as an out-of-line
function; it seems quite unlikely to me that it's worth insisting that
it be inlined in all expression eval methods. Besides, providing the
out-of-line function doesn't stop anyone from inlining if they want to.
Adjust some places where NextValueExpr support had been inserted with the
aid of a dartboard rather than keeping it in the same order as elsewhere.
Discussion: https://postgr.es/m/23862.1499981661@sss.pgh.pa.us
2017-07-14 21:25:43 +02:00
|
|
|
else if (IsA(node, MinMaxExpr) ||
|
|
|
|
IsA(node, SQLValueFunction) ||
|
|
|
|
IsA(node, XmlExpr) ||
|
|
|
|
IsA(node, CoerceToDomain) ||
|
|
|
|
IsA(node, NextValueExpr))
|
|
|
|
{
|
|
|
|
/* Treat all these as having cost 1 */
|
|
|
|
context->total.per_tuple += cpu_operator_cost;
|
|
|
|
}
|
2007-06-11 03:16:30 +02:00
|
|
|
else if (IsA(node, CurrentOfExpr))
|
|
|
|
{
|
2007-10-24 20:37:09 +02:00
|
|
|
/* Report high cost to prevent selection of anything but TID scan */
|
|
|
|
context->total.startup += disable_cost;
|
2007-06-11 03:16:30 +02:00
|
|
|
}
|
2003-01-12 23:35:29 +01:00
|
|
|
else if (IsA(node, SubLink))
|
|
|
|
{
|
|
|
|
/* This routine should not be applied to un-planned expressions */
|
2003-07-25 02:01:09 +02:00
|
|
|
elog(ERROR, "cannot handle unplanned sub-select");
|
2003-01-12 23:35:29 +01:00
|
|
|
}
|
2002-12-14 01:17:59 +01:00
|
|
|
else if (IsA(node, SubPlan))
|
2000-02-15 21:49:31 +01:00
|
|
|
{
|
2002-12-12 16:49:42 +01:00
|
|
|
/*
|
2003-01-12 23:35:29 +01:00
|
|
|
* A subplan node in an expression typically indicates that the
|
2005-10-15 04:49:52 +02:00
|
|
|
* subplan will be executed on each evaluation, so charge accordingly.
|
|
|
|
* (Sub-selects that can be executed as InitPlans have already been
|
|
|
|
* removed from the expression.)
|
2002-12-12 16:49:42 +01:00
|
|
|
*/
|
2003-08-04 02:43:34 +02:00
|
|
|
SubPlan *subplan = (SubPlan *) node;
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2008-08-22 02:16:04 +02:00
|
|
|
context->total.startup += subplan->startup_cost;
|
|
|
|
context->total.per_tuple += subplan->per_call_cost;
|
2003-01-12 23:35:29 +01:00
|
|
|
|
2008-08-22 02:16:04 +02:00
|
|
|
/*
|
|
|
|
* We don't want to recurse into the testexpr, because it was already
|
|
|
|
* counted in the SubPlan node's costs. So we're done.
|
|
|
|
*/
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
else if (IsA(node, AlternativeSubPlan))
|
|
|
|
{
|
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Arbitrarily use the first alternative plan for costing. (We should
|
2008-08-22 02:16:04 +02:00
|
|
|
* certainly only include one alternative, and we don't yet have
|
2009-06-11 16:49:15 +02:00
|
|
|
* enough information to know which one the executor is most likely to
|
|
|
|
* use.)
|
2008-08-22 02:16:04 +02:00
|
|
|
*/
|
|
|
|
AlternativeSubPlan *asplan = (AlternativeSubPlan *) node;
|
2003-08-04 02:43:34 +02:00
|
|
|
|
2008-08-22 02:16:04 +02:00
|
|
|
return cost_qual_eval_walker((Node *) linitial(asplan->subplans),
|
|
|
|
context);
|
2000-02-15 21:49:31 +01:00
|
|
|
}
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
else if (IsA(node, PlaceHolderVar))
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* A PlaceHolderVar should be given cost zero when considering general
|
|
|
|
* expression evaluation costs. The expense of doing the contained
|
|
|
|
* expression is charged as part of the tlist eval costs of the scan
|
|
|
|
* or join where the PHV is first computed (see set_rel_width and
|
|
|
|
* add_placeholders_to_joinrel). If we charged it again here, we'd be
|
|
|
|
* double-counting the cost for each level of plan that the PHV
|
|
|
|
* bubbles up through. Hence, return without recursing into the
|
|
|
|
* phexpr.
|
|
|
|
*/
|
|
|
|
return false;
|
|
|
|
}
|
2002-12-12 16:49:42 +01:00
|
|
|
|
2007-01-22 02:35:23 +01:00
|
|
|
/* recurse into children */
|
2000-02-15 21:49:31 +01:00
|
|
|
return expression_tree_walker(node, cost_qual_eval_walker,
|
2007-02-22 23:00:26 +01:00
|
|
|
(void *) context);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/*
|
|
|
|
* get_restriction_qual_cost
|
|
|
|
* Compute evaluation costs of a baserel's restriction quals, plus any
|
|
|
|
* movable join quals that have been pushed down to the scan.
|
|
|
|
* Results are returned into *qpqual_cost.
|
|
|
|
*
|
|
|
|
* This is a convenience subroutine that works for seqscans and other cases
|
|
|
|
* where all the given quals will be evaluated the hard way. It's not useful
|
|
|
|
* for cost_index(), for example, where the index machinery takes care of
|
|
|
|
* some of the quals. We assume baserestrictcost was previously set by
|
|
|
|
* set_baserel_size_estimates().
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
get_restriction_qual_cost(PlannerInfo *root, RelOptInfo *baserel,
|
|
|
|
ParamPathInfo *param_info,
|
|
|
|
QualCost *qpqual_cost)
|
|
|
|
{
|
|
|
|
if (param_info)
|
|
|
|
{
|
|
|
|
/* Include costs of pushed-down clauses */
|
|
|
|
cost_qual_eval(qpqual_cost, param_info->ppi_clauses, root);
|
|
|
|
|
|
|
|
qpqual_cost->startup += baserel->baserestrictcost.startup;
|
|
|
|
qpqual_cost->per_tuple += baserel->baserestrictcost.per_tuple;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
*qpqual_cost = baserel->baserestrictcost;
|
|
|
|
}
|
|
|
|
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2009-05-10 00:51:41 +02:00
|
|
|
/*
|
2012-01-28 01:26:38 +01:00
|
|
|
* compute_semi_anti_join_factors
|
2017-04-08 04:20:03 +02:00
|
|
|
* Estimate how much of the inner input a SEMI, ANTI, or inner_unique join
|
2009-05-10 00:51:41 +02:00
|
|
|
* can be expected to scan.
|
|
|
|
*
|
|
|
|
* In a hash or nestloop SEMI/ANTI join, the executor will stop scanning
|
|
|
|
* inner rows as soon as it finds a match to the current outer row.
|
2017-04-08 04:20:03 +02:00
|
|
|
* The same happens if we have detected the inner rel is unique.
|
2009-05-10 00:51:41 +02:00
|
|
|
* We should therefore adjust some of the cost components for this effect.
|
|
|
|
* This function computes some estimates needed for these adjustments.
|
2012-01-28 01:26:38 +01:00
|
|
|
* These estimates will be the same regardless of the particular paths used
|
|
|
|
* for the outer and inner relation, so we compute these once and then pass
|
|
|
|
* them to all the join cost estimation functions.
|
|
|
|
*
|
|
|
|
* Input parameters:
|
2018-04-20 22:00:47 +02:00
|
|
|
* joinrel: join relation under consideration
|
2012-01-28 01:26:38 +01:00
|
|
|
* outerrel: outer relation under consideration
|
|
|
|
* innerrel: inner relation under consideration
|
2017-04-08 04:20:03 +02:00
|
|
|
* jointype: if not JOIN_SEMI or JOIN_ANTI, we assume it's inner_unique
|
2012-01-28 01:26:38 +01:00
|
|
|
* sjinfo: SpecialJoinInfo relevant to this join
|
2012-06-10 21:20:04 +02:00
|
|
|
* restrictlist: join quals
|
2012-01-28 01:26:38 +01:00
|
|
|
* Output parameters:
|
2019-01-29 22:49:25 +01:00
|
|
|
* *semifactors is filled in (see pathnodes.h for field definitions)
|
2009-05-10 00:51:41 +02:00
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
void
|
|
|
|
compute_semi_anti_join_factors(PlannerInfo *root,
|
2018-04-20 22:00:47 +02:00
|
|
|
RelOptInfo *joinrel,
|
2012-01-28 01:26:38 +01:00
|
|
|
RelOptInfo *outerrel,
|
|
|
|
RelOptInfo *innerrel,
|
|
|
|
JoinType jointype,
|
|
|
|
SpecialJoinInfo *sjinfo,
|
|
|
|
List *restrictlist,
|
|
|
|
SemiAntiJoinFactors *semifactors)
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
|
|
|
Selectivity jselec;
|
|
|
|
Selectivity nselec;
|
|
|
|
Selectivity avgmatch;
|
|
|
|
SpecialJoinInfo norm_sjinfo;
|
|
|
|
List *joinquals;
|
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* In an ANTI join, we must ignore clauses that are "pushed down", since
|
|
|
|
* those won't affect the match logic. In a SEMI join, we do not
|
2009-05-10 00:51:41 +02:00
|
|
|
* distinguish joinquals from "pushed down" quals, so just use the whole
|
2017-04-08 04:20:03 +02:00
|
|
|
* restrictinfo list. For other outer join types, we should consider only
|
|
|
|
* non-pushed-down quals, so that this devolves to an IS_OUTER_JOIN check.
|
2009-05-10 00:51:41 +02:00
|
|
|
*/
|
2017-04-08 04:20:03 +02:00
|
|
|
if (IS_OUTER_JOIN(jointype))
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
|
|
|
joinquals = NIL;
|
2012-01-28 01:26:38 +01:00
|
|
|
foreach(l, restrictlist)
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
|
2009-05-10 00:51:41 +02:00
|
|
|
|
2018-04-20 22:00:47 +02:00
|
|
|
if (!RINFO_IS_PUSHED_DOWN(rinfo, joinrel->relids))
|
2009-05-10 00:51:41 +02:00
|
|
|
joinquals = lappend(joinquals, rinfo);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
2012-01-28 01:26:38 +01:00
|
|
|
joinquals = restrictlist;
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Get the JOIN_SEMI or JOIN_ANTI selectivity of the join clauses.
|
|
|
|
*/
|
|
|
|
jselec = clauselist_selectivity(root,
|
|
|
|
joinquals,
|
|
|
|
0,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
(jointype == JOIN_ANTI) ? JOIN_ANTI : JOIN_SEMI,
|
2017-04-07 01:10:51 +02:00
|
|
|
sjinfo);
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Also get the normal inner-join selectivity of the join clauses.
|
|
|
|
*/
|
|
|
|
norm_sjinfo.type = T_SpecialJoinInfo;
|
2012-01-28 01:26:38 +01:00
|
|
|
norm_sjinfo.min_lefthand = outerrel->relids;
|
|
|
|
norm_sjinfo.min_righthand = innerrel->relids;
|
|
|
|
norm_sjinfo.syn_lefthand = outerrel->relids;
|
|
|
|
norm_sjinfo.syn_righthand = innerrel->relids;
|
2009-05-10 00:51:41 +02:00
|
|
|
norm_sjinfo.jointype = JOIN_INNER;
|
|
|
|
/* we don't bother trying to make the remaining fields valid */
|
|
|
|
norm_sjinfo.lhs_strict = false;
|
|
|
|
norm_sjinfo.delay_upper_joins = false;
|
Improve planner's cost estimation in the presence of semijoins.
If we have a semijoin, say
SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
and we're estimating the cost of a parameterized indexscan on x, the number
of repetitions of the indexscan should not be taken as the size of y; it'll
really only be the number of distinct values of y1, because the only valid
plan with y on the outside of a nestloop would require y to be unique-ified
before joining it to x. Most of the time this doesn't make that much
difference, but sometimes it can lead to drastically underestimating the
cost of the indexscan and hence choosing a bad plan, as pointed out by
David Kubečka.
Fixing this is a bit difficult because parameterized indexscans are costed
out quite early in the planning process, before we have the information
that would be needed to call estimate_num_groups() and thereby estimate the
number of distinct values of the join column(s). However we can move the
code that extracts a semijoin RHS's unique-ification columns, so that it's
done in initsplan.c rather than on-the-fly in create_unique_path(). That
shouldn't make any difference speed-wise and it's really a bit cleaner too.
The other bit of information we need is the size of the semijoin RHS,
which is easy if it's a single relation (we make those estimates before
considering indexscan costs) but problematic if it's a join relation.
The solution adopted here is just to use the product of the sizes of the
join component rels. That will generally be an overestimate, but since
estimate_num_groups() only uses this input as a clamp, an overestimate
shouldn't hurt us too badly. In any case we don't allow this new logic
to produce a value larger than we would have chosen before, so that at
worst an overestimate leaves us no wiser than we were before.
2015-03-12 02:21:00 +01:00
|
|
|
norm_sjinfo.semi_can_btree = false;
|
|
|
|
norm_sjinfo.semi_can_hash = false;
|
|
|
|
norm_sjinfo.semi_operators = NIL;
|
|
|
|
norm_sjinfo.semi_rhs_exprs = NIL;
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
nselec = clauselist_selectivity(root,
|
|
|
|
joinquals,
|
|
|
|
0,
|
|
|
|
JOIN_INNER,
|
2017-04-07 01:10:51 +02:00
|
|
|
&norm_sjinfo);
|
2009-05-10 00:51:41 +02:00
|
|
|
|
|
|
|
/* Avoid leaking a lot of ListCells */
|
2017-04-08 04:20:03 +02:00
|
|
|
if (IS_OUTER_JOIN(jointype))
|
2009-05-10 00:51:41 +02:00
|
|
|
list_free(joinquals);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* jselec can be interpreted as the fraction of outer-rel rows that have
|
2009-06-11 16:49:15 +02:00
|
|
|
* any matches (this is true for both SEMI and ANTI cases). And nselec is
|
2014-05-06 18:12:18 +02:00
|
|
|
* the fraction of the Cartesian product that matches. So, the average
|
2009-06-11 16:49:15 +02:00
|
|
|
* number of matches for each outer-rel row that has at least one match is
|
|
|
|
* nselec * inner_rows / jselec.
|
2009-05-10 00:51:41 +02:00
|
|
|
*
|
2012-01-28 01:26:38 +01:00
|
|
|
* Note: it is correct to use the inner rel's "rows" count here, even
|
|
|
|
* though we might later be considering a parameterized inner path with
|
2014-05-06 18:12:18 +02:00
|
|
|
* fewer rows. This is because we have included all the join clauses in
|
2012-06-10 21:20:04 +02:00
|
|
|
* the selectivity estimate.
|
2009-05-10 00:51:41 +02:00
|
|
|
*/
|
|
|
|
if (jselec > 0) /* protect against zero divide */
|
|
|
|
{
|
2012-01-28 01:26:38 +01:00
|
|
|
avgmatch = nselec * innerrel->rows / jselec;
|
2009-05-10 00:51:41 +02:00
|
|
|
/* Clamp to sane range */
|
|
|
|
avgmatch = Max(1.0, avgmatch);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
avgmatch = 1.0;
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
semifactors->outer_match_frac = jselec;
|
|
|
|
semifactors->match_count = avgmatch;
|
|
|
|
}
|
2009-05-10 00:51:41 +02:00
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
/*
|
|
|
|
* has_indexed_join_quals
|
|
|
|
* Check whether all the joinquals of a nestloop join are used as
|
|
|
|
* inner index quals.
|
|
|
|
*
|
|
|
|
* If the inner path of a SEMI/ANTI join is an indexscan (including bitmap
|
|
|
|
* indexscan) that uses all the joinquals as indexquals, we can assume that an
|
|
|
|
* unmatched outer tuple is cheap to process, whereas otherwise it's probably
|
|
|
|
* expensive.
|
|
|
|
*/
|
|
|
|
static bool
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
has_indexed_join_quals(NestPath *joinpath)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
Relids joinrelids = joinpath->path.parent->relids;
|
|
|
|
Path *innerpath = joinpath->innerjoinpath;
|
|
|
|
List *indexclauses;
|
|
|
|
bool found_one;
|
|
|
|
ListCell *lc;
|
|
|
|
|
|
|
|
/* If join still has quals to evaluate, it's not fast */
|
|
|
|
if (joinpath->joinrestrictinfo != NIL)
|
|
|
|
return false;
|
|
|
|
/* Nor if the inner path isn't parameterized at all */
|
|
|
|
if (innerpath->param_info == NULL)
|
|
|
|
return false;
|
2012-01-28 01:26:38 +01:00
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/* Find the indexclauses list for the inner scan */
|
|
|
|
switch (innerpath->pathtype)
|
2009-05-10 00:51:41 +02:00
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
case T_IndexScan:
|
|
|
|
case T_IndexOnlyScan:
|
|
|
|
indexclauses = ((IndexPath *) innerpath)->indexclauses;
|
|
|
|
break;
|
|
|
|
case T_BitmapHeapScan:
|
2012-06-10 21:20:04 +02:00
|
|
|
{
|
|
|
|
/* Accept only a simple bitmap scan, not AND/OR cases */
|
|
|
|
Path *bmqual = ((BitmapHeapPath *) innerpath)->bitmapqual;
|
|
|
|
|
|
|
|
if (IsA(bmqual, IndexPath))
|
|
|
|
indexclauses = ((IndexPath *) bmqual)->indexclauses;
|
|
|
|
else
|
|
|
|
return false;
|
|
|
|
break;
|
|
|
|
}
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
default:
|
2012-06-10 21:20:04 +02:00
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/*
|
|
|
|
* If it's not a simple indexscan, it probably doesn't run quickly
|
|
|
|
* for zero rows out, even if it's a parameterized path using all
|
|
|
|
* the joinquals.
|
|
|
|
*/
|
2012-01-28 01:26:38 +01:00
|
|
|
return false;
|
2009-05-10 00:51:41 +02:00
|
|
|
}
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Examine the inner path's param clauses. Any that are from the outer
|
|
|
|
* path must be found in the indexclauses list, either exactly or in an
|
2012-06-10 21:20:04 +02:00
|
|
|
* equivalent form generated by equivclass.c. Also, we must find at least
|
|
|
|
* one such clause, else it's a clauseless join which isn't fast.
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
*/
|
|
|
|
found_one = false;
|
|
|
|
foreach(lc, innerpath->param_info->ppi_clauses)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
|
|
|
|
|
|
|
|
if (join_clause_is_movable_into(rinfo,
|
|
|
|
innerpath->parent->relids,
|
|
|
|
joinrelids))
|
|
|
|
{
|
Refactor the representation of indexable clauses in IndexPaths.
In place of three separate but interrelated lists (indexclauses,
indexquals, and indexqualcols), an IndexPath now has one list
"indexclauses" of IndexClause nodes. This holds basically the same
information as before, but in a more useful format: in particular, there
is now a clear connection between an indexclause (an original restriction
clause from WHERE or JOIN/ON) and the indexquals (directly usable index
conditions) derived from it.
We also change the ground rules a bit by mandating that clause commutation,
if needed, be done up-front so that what is stored in the indexquals list
is always directly usable as an index condition. This gets rid of repeated
re-determination of which side of the clause is the indexkey during costing
and plan generation, as well as repeated lookups of the commutator
operator. To minimize the added up-front cost, the typical case of
commuting a plain OpExpr is handled by a new special-purpose function
commute_restrictinfo(). For RowCompareExprs, generating the new clause
properly commuted to begin with is not really any more complex than before,
it's just different --- and we can save doing that work twice, as the
pretty-klugy original implementation did.
Tracking the connection between original and derived clauses lets us
also track explicitly whether the derived clauses are an exact or lossy
translation of the original. This provides a cheap solution to getting
rid of unnecessary rechecks of boolean index clauses, which previously
seemed like it'd be more expensive than it was worth.
Another pleasant (IMO) side-effect is that EXPLAIN now always shows
index clauses with the indexkey on the left; this seems less confusing.
This commit leaves expand_indexqual_conditions() and some related
functions in a slightly messy state. I didn't bother to change them
any more than minimally necessary to work with the new data structure,
because all that code is going to be refactored out of existence in
a follow-on patch.
Discussion: https://postgr.es/m/22182.1549124950@sss.pgh.pa.us
2019-02-09 23:30:43 +01:00
|
|
|
if (!is_redundant_with_indexclauses(rinfo, indexclauses))
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
return false;
|
|
|
|
found_one = true;
|
|
|
|
}
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
return found_one;
|
2009-05-10 00:51:41 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2001-06-05 07:26:05 +02:00
|
|
|
/*
|
2008-08-16 02:01:38 +02:00
|
|
|
* approx_tuple_count
|
|
|
|
* Quick-and-dirty estimation of the number of join rows passing
|
|
|
|
* a set of qual conditions.
|
2001-06-05 07:26:05 +02:00
|
|
|
*
|
2008-08-16 02:01:38 +02:00
|
|
|
* The quals can be either an implicitly-ANDed list of boolean expressions,
|
|
|
|
* or a list of RestrictInfo nodes (typically the latter).
|
2008-08-14 20:48:00 +02:00
|
|
|
*
|
2009-02-07 00:43:24 +01:00
|
|
|
* We intentionally compute the selectivity under JOIN_INNER rules, even
|
|
|
|
* if it's some type of outer join. This is appropriate because we are
|
|
|
|
* trying to figure out how many tuples pass the initial merge or hash
|
|
|
|
* join step.
|
|
|
|
*
|
2004-01-04 04:51:52 +01:00
|
|
|
* This is quick-and-dirty because we bypass clauselist_selectivity, and
|
|
|
|
* simply multiply the independent clause selectivities together. Now
|
2002-03-01 07:01:20 +01:00
|
|
|
* clauselist_selectivity often can't do any better than that anyhow, but
|
2004-01-04 04:51:52 +01:00
|
|
|
* for some situations (such as range constraints) it is smarter. However,
|
|
|
|
* we can't effectively cache the results of clauselist_selectivity, whereas
|
|
|
|
* the individual clause selectivities can be and are cached.
|
2001-06-05 07:26:05 +02:00
|
|
|
*
|
|
|
|
* Since we are only using the results to estimate how many potential
|
|
|
|
* output tuples are generated and passed through qpqual checking, it
|
|
|
|
* seems OK to live with the approximation.
|
|
|
|
*/
|
2008-08-16 02:01:38 +02:00
|
|
|
static double
|
2009-02-07 00:43:24 +01:00
|
|
|
approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
|
2001-06-05 07:26:05 +02:00
|
|
|
{
|
2008-08-16 02:01:38 +02:00
|
|
|
double tuples;
|
2012-01-28 01:26:38 +01:00
|
|
|
double outer_tuples = path->outerjoinpath->rows;
|
|
|
|
double inner_tuples = path->innerjoinpath->rows;
|
2009-02-07 00:43:24 +01:00
|
|
|
SpecialJoinInfo sjinfo;
|
2008-08-16 02:01:38 +02:00
|
|
|
Selectivity selec = 1.0;
|
2004-05-26 06:41:50 +02:00
|
|
|
ListCell *l;
|
2001-06-05 07:26:05 +02:00
|
|
|
|
2009-02-07 00:43:24 +01:00
|
|
|
/*
|
|
|
|
* Make up a SpecialJoinInfo for JOIN_INNER semantics.
|
|
|
|
*/
|
|
|
|
sjinfo.type = T_SpecialJoinInfo;
|
|
|
|
sjinfo.min_lefthand = path->outerjoinpath->parent->relids;
|
|
|
|
sjinfo.min_righthand = path->innerjoinpath->parent->relids;
|
|
|
|
sjinfo.syn_lefthand = path->outerjoinpath->parent->relids;
|
|
|
|
sjinfo.syn_righthand = path->innerjoinpath->parent->relids;
|
|
|
|
sjinfo.jointype = JOIN_INNER;
|
|
|
|
/* we don't bother trying to make the remaining fields valid */
|
|
|
|
sjinfo.lhs_strict = false;
|
|
|
|
sjinfo.delay_upper_joins = false;
|
Improve planner's cost estimation in the presence of semijoins.
If we have a semijoin, say
SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
and we're estimating the cost of a parameterized indexscan on x, the number
of repetitions of the indexscan should not be taken as the size of y; it'll
really only be the number of distinct values of y1, because the only valid
plan with y on the outside of a nestloop would require y to be unique-ified
before joining it to x. Most of the time this doesn't make that much
difference, but sometimes it can lead to drastically underestimating the
cost of the indexscan and hence choosing a bad plan, as pointed out by
David Kubečka.
Fixing this is a bit difficult because parameterized indexscans are costed
out quite early in the planning process, before we have the information
that would be needed to call estimate_num_groups() and thereby estimate the
number of distinct values of the join column(s). However we can move the
code that extracts a semijoin RHS's unique-ification columns, so that it's
done in initsplan.c rather than on-the-fly in create_unique_path(). That
shouldn't make any difference speed-wise and it's really a bit cleaner too.
The other bit of information we need is the size of the semijoin RHS,
which is easy if it's a single relation (we make those estimates before
considering indexscan costs) but problematic if it's a join relation.
The solution adopted here is just to use the product of the sizes of the
join component rels. That will generally be an overestimate, but since
estimate_num_groups() only uses this input as a clamp, an overestimate
shouldn't hurt us too badly. In any case we don't allow this new logic
to produce a value larger than we would have chosen before, so that at
worst an overestimate leaves us no wiser than we were before.
2015-03-12 02:21:00 +01:00
|
|
|
sjinfo.semi_can_btree = false;
|
|
|
|
sjinfo.semi_can_hash = false;
|
|
|
|
sjinfo.semi_operators = NIL;
|
|
|
|
sjinfo.semi_rhs_exprs = NIL;
|
2009-02-07 00:43:24 +01:00
|
|
|
|
2008-08-16 02:01:38 +02:00
|
|
|
/* Get the approximate selectivity */
|
2001-06-05 07:26:05 +02:00
|
|
|
foreach(l, quals)
|
|
|
|
{
|
|
|
|
Node *qual = (Node *) lfirst(l);
|
|
|
|
|
2004-01-04 04:51:52 +01:00
|
|
|
/* Note that clause_selectivity will be able to cache its result */
|
2017-04-07 01:10:51 +02:00
|
|
|
selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo);
|
2001-06-05 07:26:05 +02:00
|
|
|
}
|
2008-08-16 02:01:38 +02:00
|
|
|
|
2009-02-07 00:43:24 +01:00
|
|
|
/* Apply it to the input relation sizes */
|
|
|
|
tuples = selec * outer_tuples * inner_tuples;
|
2008-08-16 02:01:38 +02:00
|
|
|
|
|
|
|
return clamp_row_est(tuples);
|
2001-06-05 07:26:05 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2000-02-07 05:41:04 +01:00
|
|
|
* set_baserel_size_estimates
|
|
|
|
* Set the size estimates for the given base relation.
|
1997-09-07 07:04:48 +02:00
|
|
|
*
|
2000-02-07 05:41:04 +01:00
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
2010-11-19 23:31:50 +01:00
|
|
|
* already, and rel->tuples must be set.
|
2000-02-07 05:41:04 +01:00
|
|
|
*
|
|
|
|
* We set the following fields of the rel node:
|
|
|
|
* rows: the estimated number of output tuples (after applying
|
2000-04-12 19:17:23 +02:00
|
|
|
* restriction clauses).
|
2000-02-07 05:41:04 +01:00
|
|
|
* width: the estimated average output tuple width in bytes.
|
2000-02-15 21:49:31 +01:00
|
|
|
* baserestrictcost: estimated cost of evaluating baserestrictinfo clauses.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-01-09 01:26:47 +01:00
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2004-01-05 06:07:36 +01:00
|
|
|
double nrows;
|
2002-12-13 18:29:25 +01:00
|
|
|
|
2000-01-09 01:26:47 +01:00
|
|
|
/* Should only be applied to base relations */
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(rel->relid > 0);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2004-01-05 06:07:36 +01:00
|
|
|
nrows = rel->tuples *
|
2004-01-04 04:51:52 +01:00
|
|
|
clauselist_selectivity(root,
|
|
|
|
rel->baserestrictinfo,
|
|
|
|
0,
|
2008-08-14 20:48:00 +02:00
|
|
|
JOIN_INNER,
|
2017-04-07 01:10:51 +02:00
|
|
|
NULL);
|
2000-04-12 19:17:23 +02:00
|
|
|
|
2004-01-05 06:07:36 +01:00
|
|
|
rel->rows = clamp_row_est(nrows);
|
2000-02-15 21:49:31 +01:00
|
|
|
|
2007-02-22 23:00:26 +01:00
|
|
|
cost_qual_eval(&rel->baserestrictcost, rel->baserestrictinfo, root);
|
2000-01-09 01:26:47 +01:00
|
|
|
|
|
|
|
set_rel_width(root, rel);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
/*
|
|
|
|
* get_parameterized_baserel_size
|
|
|
|
* Make a size estimate for a parameterized scan of a base relation.
|
|
|
|
*
|
|
|
|
* 'param_clauses' lists the additional join clauses to be used.
|
|
|
|
*
|
|
|
|
* set_baserel_size_estimates must have been applied already.
|
|
|
|
*/
|
|
|
|
double
|
|
|
|
get_parameterized_baserel_size(PlannerInfo *root, RelOptInfo *rel,
|
|
|
|
List *param_clauses)
|
|
|
|
{
|
|
|
|
List *allclauses;
|
|
|
|
double nrows;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate the number of rows returned by the parameterized scan, knowing
|
|
|
|
* that it will apply all the extra join clauses as well as the rel's own
|
|
|
|
* restriction clauses. Note that we force the clauses to be treated as
|
|
|
|
* non-join clauses during selectivity estimation.
|
|
|
|
*/
|
Rationalize use of list_concat + list_copy combinations.
In the wake of commit 1cff1b95a, the result of list_concat no longer
shares the ListCells of the second input. Therefore, we can replace
"list_concat(x, list_copy(y))" with just "list_concat(x, y)".
To improve call sites that were list_copy'ing the first argument,
or both arguments, invent "list_concat_copy()" which produces a new
list sharing no ListCells with either input. (This is a bit faster
than "list_concat(list_copy(x), y)" because it makes the result list
the right size to start with.)
In call sites that were not list_copy'ing the second argument, the new
semantics mean that we are usually leaking the second List's storage,
since typically there is no remaining pointer to it. We considered
inventing another list_copy variant that would list_free the second
input, but concluded that for most call sites it isn't worth worrying
about, given the relative compactness of the new List representation.
(Note that in cases where such leakage would happen, the old code
already leaked the second List's header; so we're only discussing
the size of the leak not whether there is one. I did adjust two or
three places that had been troubling to free that header so that
they manually free the whole second List.)
Patch by me; thanks to David Rowley for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-08-12 17:20:18 +02:00
|
|
|
allclauses = list_concat_copy(param_clauses, rel->baserestrictinfo);
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
nrows = rel->tuples *
|
|
|
|
clauselist_selectivity(root,
|
|
|
|
allclauses,
|
Phase 2 of pgindent updates.
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:18:54 +02:00
|
|
|
rel->relid, /* do not use 0! */
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
JOIN_INNER,
|
2017-04-07 01:10:51 +02:00
|
|
|
NULL);
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
nrows = clamp_row_est(nrows);
|
|
|
|
/* For safety, make sure result is not more than the base estimate */
|
|
|
|
if (nrows > rel->rows)
|
|
|
|
nrows = rel->rows;
|
|
|
|
return nrows;
|
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2000-02-07 05:41:04 +01:00
|
|
|
* set_joinrel_size_estimates
|
|
|
|
* Set the size estimates for the given join relation.
|
|
|
|
*
|
|
|
|
* The rel's targetlist must have been constructed already, and a
|
|
|
|
* restriction clause list that matches the given component rels must
|
|
|
|
* be provided.
|
|
|
|
*
|
|
|
|
* Since there is more than one way to make a joinrel for more than two
|
|
|
|
* base relations, the results we get here could depend on which component
|
|
|
|
* rel pair is provided. In theory we should get the same answers no matter
|
|
|
|
* which pair is provided; in practice, since the selectivity estimation
|
|
|
|
* routines don't handle all cases equally well, we might not. But there's
|
|
|
|
* not much to be done about it. (Would it make sense to repeat the
|
|
|
|
* calculations for each pair of input rels that's encountered, and somehow
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* average the results? Probably way more trouble than it's worth, and
|
|
|
|
* anyway we must keep the rowcount estimate the same for all paths for the
|
|
|
|
* joinrel.)
|
2000-02-07 05:41:04 +01:00
|
|
|
*
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
* We set only the rows field here. The reltarget field was already set by
|
2004-01-05 06:07:36 +01:00
|
|
|
* build_joinrel_tlist, and baserestrictcost is not used for join rels.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-01-09 01:26:47 +01:00
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
|
2000-02-07 05:41:04 +01:00
|
|
|
RelOptInfo *outer_rel,
|
|
|
|
RelOptInfo *inner_rel,
|
2008-08-14 20:48:00 +02:00
|
|
|
SpecialJoinInfo *sjinfo,
|
2000-02-07 05:41:04 +01:00
|
|
|
List *restrictlist)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
|
|
|
rel->rows = calc_joinrel_size_estimate(root,
|
2018-04-20 21:19:16 +02:00
|
|
|
rel,
|
2016-06-18 21:22:34 +02:00
|
|
|
outer_rel,
|
|
|
|
inner_rel,
|
2012-01-28 01:26:38 +01:00
|
|
|
outer_rel->rows,
|
|
|
|
inner_rel->rows,
|
|
|
|
sjinfo,
|
|
|
|
restrictlist);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* get_parameterized_joinrel_size
|
|
|
|
* Make a size estimate for a parameterized scan of a join relation.
|
|
|
|
*
|
|
|
|
* 'rel' is the joinrel under consideration.
|
2016-06-18 21:22:34 +02:00
|
|
|
* 'outer_path', 'inner_path' are (probably also parameterized) Paths that
|
|
|
|
* produce the relations being joined.
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* 'sjinfo' is any SpecialJoinInfo relevant to this join.
|
|
|
|
* 'restrict_clauses' lists the join clauses that need to be applied at the
|
|
|
|
* join node (including any movable clauses that were moved down to this join,
|
|
|
|
* and not including any movable clauses that were pushed down into the
|
|
|
|
* child paths).
|
|
|
|
*
|
|
|
|
* set_joinrel_size_estimates must have been applied already.
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
double
|
|
|
|
get_parameterized_joinrel_size(PlannerInfo *root, RelOptInfo *rel,
|
2016-06-18 21:22:34 +02:00
|
|
|
Path *outer_path,
|
|
|
|
Path *inner_path,
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
SpecialJoinInfo *sjinfo,
|
|
|
|
List *restrict_clauses)
|
2012-01-28 01:26:38 +01:00
|
|
|
{
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
double nrows;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate the number of rows returned by the parameterized join as the
|
|
|
|
* sizes of the input paths times the selectivity of the clauses that have
|
|
|
|
* ended up at this join node.
|
|
|
|
*
|
|
|
|
* As with set_joinrel_size_estimates, the rowcount estimate could depend
|
|
|
|
* on the pair of input paths provided, though ideally we'd get the same
|
|
|
|
* estimate for any pair with the same parameterization.
|
|
|
|
*/
|
|
|
|
nrows = calc_joinrel_size_estimate(root,
|
2018-04-20 21:19:16 +02:00
|
|
|
rel,
|
2016-06-18 21:22:34 +02:00
|
|
|
outer_path->parent,
|
|
|
|
inner_path->parent,
|
|
|
|
outer_path->rows,
|
|
|
|
inner_path->rows,
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
sjinfo,
|
|
|
|
restrict_clauses);
|
|
|
|
/* For safety, make sure result is not more than the base estimate */
|
|
|
|
if (nrows > rel->rows)
|
|
|
|
nrows = rel->rows;
|
|
|
|
return nrows;
|
2012-01-28 01:26:38 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* calc_joinrel_size_estimate
|
Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 21:52:46 +02:00
|
|
|
* Workhorse for set_joinrel_size_estimates and
|
|
|
|
* get_parameterized_joinrel_size.
|
2016-06-18 21:22:34 +02:00
|
|
|
*
|
|
|
|
* outer_rel/inner_rel are the relations being joined, but they should be
|
|
|
|
* assumed to have sizes outer_rows/inner_rows; those numbers might be less
|
|
|
|
* than what rel->rows says, when we are considering parameterized paths.
|
2012-01-28 01:26:38 +01:00
|
|
|
*/
|
|
|
|
static double
|
|
|
|
calc_joinrel_size_estimate(PlannerInfo *root,
|
2018-04-20 21:19:16 +02:00
|
|
|
RelOptInfo *joinrel,
|
2016-06-18 21:22:34 +02:00
|
|
|
RelOptInfo *outer_rel,
|
|
|
|
RelOptInfo *inner_rel,
|
2012-01-28 01:26:38 +01:00
|
|
|
double outer_rows,
|
|
|
|
double inner_rows,
|
|
|
|
SpecialJoinInfo *sjinfo,
|
2016-06-30 01:07:19 +02:00
|
|
|
List *restrictlist_in)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2016-06-30 01:07:19 +02:00
|
|
|
/* This apparently-useless variable dodges a compiler bug in VS2013: */
|
|
|
|
List *restrictlist = restrictlist_in;
|
2008-08-14 20:48:00 +02:00
|
|
|
JoinType jointype = sjinfo->jointype;
|
2016-06-18 21:22:34 +02:00
|
|
|
Selectivity fkselec;
|
2006-11-10 02:21:41 +01:00
|
|
|
Selectivity jselec;
|
|
|
|
Selectivity pselec;
|
2004-01-05 06:07:36 +01:00
|
|
|
double nrows;
|
2000-01-09 01:26:47 +01:00
|
|
|
|
2000-02-07 05:41:04 +01:00
|
|
|
/*
|
2014-05-06 18:12:18 +02:00
|
|
|
* Compute joinclause selectivity. Note that we are only considering
|
2005-10-15 04:49:52 +02:00
|
|
|
* clauses that become restriction clauses at this join level; we are not
|
|
|
|
* double-counting them because they were not considered in estimating the
|
|
|
|
* sizes of the component rels.
|
2006-11-10 02:21:41 +01:00
|
|
|
*
|
2016-06-18 21:22:34 +02:00
|
|
|
* First, see whether any of the joinclauses can be matched to known FK
|
|
|
|
* constraints. If so, drop those clauses from the restrictlist, and
|
|
|
|
* instead estimate their selectivity using FK semantics. (We do this
|
|
|
|
* without regard to whether said clauses are local or "pushed down".
|
|
|
|
* Probably, an FK-matching clause could never be seen as pushed down at
|
|
|
|
* an outer join, since it would be strict and hence would be grounds for
|
|
|
|
* join strength reduction.) fkselec gets the net selectivity for
|
|
|
|
* FK-matching clauses, or 1.0 if there are none.
|
|
|
|
*/
|
|
|
|
fkselec = get_foreign_key_join_selectivity(root,
|
|
|
|
outer_rel->relids,
|
|
|
|
inner_rel->relids,
|
|
|
|
sjinfo,
|
|
|
|
&restrictlist);
|
|
|
|
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* For an outer join, we have to distinguish the selectivity of the join's
|
|
|
|
* own clauses (JOIN/ON conditions) from any clauses that were "pushed
|
|
|
|
* down". For inner joins we just count them all as joinclauses.
|
2000-02-07 05:41:04 +01:00
|
|
|
*/
|
2006-11-10 02:21:41 +01:00
|
|
|
if (IS_OUTER_JOIN(jointype))
|
|
|
|
{
|
|
|
|
List *joinquals = NIL;
|
|
|
|
List *pushedquals = NIL;
|
|
|
|
ListCell *l;
|
|
|
|
|
|
|
|
/* Grovel through the clauses to separate into two lists */
|
|
|
|
foreach(l, restrictlist)
|
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
|
2006-11-10 02:21:41 +01:00
|
|
|
|
2018-04-20 21:19:16 +02:00
|
|
|
if (RINFO_IS_PUSHED_DOWN(rinfo, joinrel->relids))
|
2006-11-10 02:21:41 +01:00
|
|
|
pushedquals = lappend(pushedquals, rinfo);
|
|
|
|
else
|
|
|
|
joinquals = lappend(joinquals, rinfo);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Get the separate selectivities */
|
2016-06-07 23:21:17 +02:00
|
|
|
jselec = clauselist_selectivity(root,
|
|
|
|
joinquals,
|
|
|
|
0,
|
|
|
|
jointype,
|
2017-04-07 01:10:51 +02:00
|
|
|
sjinfo);
|
2006-11-10 02:21:41 +01:00
|
|
|
pselec = clauselist_selectivity(root,
|
|
|
|
pushedquals,
|
|
|
|
0,
|
2008-08-14 20:48:00 +02:00
|
|
|
jointype,
|
2017-04-07 01:10:51 +02:00
|
|
|
sjinfo);
|
2006-11-10 02:21:41 +01:00
|
|
|
|
|
|
|
/* Avoid leaking a lot of ListCells */
|
|
|
|
list_free(joinquals);
|
|
|
|
list_free(pushedquals);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2016-06-07 23:21:17 +02:00
|
|
|
jselec = clauselist_selectivity(root,
|
|
|
|
restrictlist,
|
|
|
|
0,
|
|
|
|
jointype,
|
2017-04-07 01:10:51 +02:00
|
|
|
sjinfo);
|
2006-11-10 02:21:41 +01:00
|
|
|
pselec = 0.0; /* not used, keep compiler quiet */
|
|
|
|
}
|
2000-01-09 01:26:47 +01:00
|
|
|
|
2001-02-16 01:03:08 +01:00
|
|
|
/*
|
2003-01-27 21:51:54 +01:00
|
|
|
* Basically, we multiply size of Cartesian product by selectivity.
|
2003-01-20 19:55:07 +01:00
|
|
|
*
|
2006-11-10 02:21:41 +01:00
|
|
|
* If we are doing an outer join, take that into account: the joinqual
|
|
|
|
* selectivity has to be clamped using the knowledge that the output must
|
2014-05-06 18:12:18 +02:00
|
|
|
* be at least as large as the non-nullable input. However, any
|
2006-11-10 02:21:41 +01:00
|
|
|
* pushed-down quals are applied after the outer join, so their
|
|
|
|
* selectivity applies fully.
|
2003-01-27 21:51:54 +01:00
|
|
|
*
|
2008-08-16 02:01:38 +02:00
|
|
|
* For JOIN_SEMI and JOIN_ANTI, the selectivity is defined as the fraction
|
|
|
|
* of LHS rows that have matches, and we apply that straightforwardly.
|
2001-02-16 01:03:08 +01:00
|
|
|
*/
|
|
|
|
switch (jointype)
|
|
|
|
{
|
|
|
|
case JOIN_INNER:
|
2016-06-18 21:22:34 +02:00
|
|
|
nrows = outer_rows * inner_rows * fkselec * jselec;
|
|
|
|
/* pselec not used */
|
2001-02-16 01:03:08 +01:00
|
|
|
break;
|
|
|
|
case JOIN_LEFT:
|
2016-06-18 21:22:34 +02:00
|
|
|
nrows = outer_rows * inner_rows * fkselec * jselec;
|
2012-01-28 01:26:38 +01:00
|
|
|
if (nrows < outer_rows)
|
|
|
|
nrows = outer_rows;
|
2006-11-10 02:21:41 +01:00
|
|
|
nrows *= pselec;
|
2001-02-16 01:03:08 +01:00
|
|
|
break;
|
|
|
|
case JOIN_FULL:
|
2016-06-18 21:22:34 +02:00
|
|
|
nrows = outer_rows * inner_rows * fkselec * jselec;
|
2012-01-28 01:26:38 +01:00
|
|
|
if (nrows < outer_rows)
|
|
|
|
nrows = outer_rows;
|
|
|
|
if (nrows < inner_rows)
|
|
|
|
nrows = inner_rows;
|
2006-11-10 02:21:41 +01:00
|
|
|
nrows *= pselec;
|
2001-02-16 01:03:08 +01:00
|
|
|
break;
|
2008-08-14 20:48:00 +02:00
|
|
|
case JOIN_SEMI:
|
2016-06-18 21:22:34 +02:00
|
|
|
nrows = outer_rows * fkselec * jselec;
|
2008-11-22 23:47:06 +01:00
|
|
|
/* pselec not used */
|
2003-01-20 19:55:07 +01:00
|
|
|
break;
|
2008-08-14 20:48:00 +02:00
|
|
|
case JOIN_ANTI:
|
2016-06-18 21:22:34 +02:00
|
|
|
nrows = outer_rows * (1.0 - fkselec * jselec);
|
2008-08-14 20:48:00 +02:00
|
|
|
nrows *= pselec;
|
2003-01-20 19:55:07 +01:00
|
|
|
break;
|
2001-02-16 01:03:08 +01:00
|
|
|
default:
|
2008-08-14 20:48:00 +02:00
|
|
|
/* other values not expected here */
|
2003-07-25 02:01:09 +02:00
|
|
|
elog(ERROR, "unrecognized join type: %d", (int) jointype);
|
2004-01-05 06:07:36 +01:00
|
|
|
nrows = 0; /* keep compiler quiet */
|
2001-02-16 01:03:08 +01:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2012-01-28 01:26:38 +01:00
|
|
|
return clamp_row_est(nrows);
|
2004-01-05 06:07:36 +01:00
|
|
|
}
|
|
|
|
|
2016-06-18 21:22:34 +02:00
|
|
|
/*
|
|
|
|
* get_foreign_key_join_selectivity
|
|
|
|
* Estimate join selectivity for foreign-key-related clauses.
|
|
|
|
*
|
|
|
|
* Remove any clauses that can be matched to FK constraints from *restrictlist,
|
|
|
|
* and return a substitute estimate of their selectivity. 1.0 is returned
|
|
|
|
* when there are no such clauses.
|
|
|
|
*
|
|
|
|
* The reason for treating such clauses specially is that we can get better
|
|
|
|
* estimates this way than by relying on clauselist_selectivity(), especially
|
|
|
|
* for multi-column FKs where that function's assumption that the clauses are
|
|
|
|
* independent falls down badly. But even with single-column FKs, we may be
|
|
|
|
* able to get a better answer when the pg_statistic stats are missing or out
|
|
|
|
* of date.
|
|
|
|
*/
|
|
|
|
static Selectivity
|
|
|
|
get_foreign_key_join_selectivity(PlannerInfo *root,
|
|
|
|
Relids outer_relids,
|
|
|
|
Relids inner_relids,
|
|
|
|
SpecialJoinInfo *sjinfo,
|
|
|
|
List **restrictlist)
|
|
|
|
{
|
|
|
|
Selectivity fkselec = 1.0;
|
|
|
|
JoinType jointype = sjinfo->jointype;
|
|
|
|
List *worklist = *restrictlist;
|
|
|
|
ListCell *lc;
|
|
|
|
|
|
|
|
/* Consider each FK constraint that is known to match the query */
|
|
|
|
foreach(lc, root->fkey_list)
|
|
|
|
{
|
|
|
|
ForeignKeyOptInfo *fkinfo = (ForeignKeyOptInfo *) lfirst(lc);
|
|
|
|
bool ref_is_outer;
|
|
|
|
List *removedlist;
|
|
|
|
ListCell *cell;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This FK is not relevant unless it connects a baserel on one side of
|
|
|
|
* this join to a baserel on the other side.
|
|
|
|
*/
|
|
|
|
if (bms_is_member(fkinfo->con_relid, outer_relids) &&
|
|
|
|
bms_is_member(fkinfo->ref_relid, inner_relids))
|
|
|
|
ref_is_outer = false;
|
|
|
|
else if (bms_is_member(fkinfo->ref_relid, outer_relids) &&
|
|
|
|
bms_is_member(fkinfo->con_relid, inner_relids))
|
|
|
|
ref_is_outer = true;
|
|
|
|
else
|
|
|
|
continue;
|
|
|
|
|
2017-06-19 21:33:41 +02:00
|
|
|
/*
|
|
|
|
* If we're dealing with a semi/anti join, and the FK's referenced
|
|
|
|
* relation is on the outside, then knowledge of the FK doesn't help
|
|
|
|
* us figure out what we need to know (which is the fraction of outer
|
|
|
|
* rows that have matches). On the other hand, if the referenced rel
|
|
|
|
* is on the inside, then all outer rows must have matches in the
|
|
|
|
* referenced table (ignoring nulls). But any restriction or join
|
|
|
|
* clauses that filter that table will reduce the fraction of matches.
|
|
|
|
* We can account for restriction clauses, but it's too hard to guess
|
|
|
|
* how many table rows would get through a join that's inside the RHS.
|
|
|
|
* Hence, if either case applies, punt and ignore the FK.
|
|
|
|
*/
|
|
|
|
if ((jointype == JOIN_SEMI || jointype == JOIN_ANTI) &&
|
|
|
|
(ref_is_outer || bms_membership(inner_relids) != BMS_SINGLETON))
|
|
|
|
continue;
|
|
|
|
|
2016-06-18 21:22:34 +02:00
|
|
|
/*
|
|
|
|
* Modify the restrictlist by removing clauses that match the FK (and
|
|
|
|
* putting them into removedlist instead). It seems unsafe to modify
|
|
|
|
* the originally-passed List structure, so we make a shallow copy the
|
|
|
|
* first time through.
|
|
|
|
*/
|
|
|
|
if (worklist == *restrictlist)
|
|
|
|
worklist = list_copy(worklist);
|
|
|
|
|
|
|
|
removedlist = NIL;
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
foreach(cell, worklist)
|
2016-06-18 21:22:34 +02:00
|
|
|
{
|
|
|
|
RestrictInfo *rinfo = (RestrictInfo *) lfirst(cell);
|
|
|
|
bool remove_it = false;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* Drop this clause if it matches any column of the FK */
|
|
|
|
for (i = 0; i < fkinfo->nkeys; i++)
|
|
|
|
{
|
|
|
|
if (rinfo->parent_ec)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* EC-derived clauses can only match by EC. It is okay to
|
|
|
|
* consider any clause derived from the same EC as
|
|
|
|
* matching the FK: even if equivclass.c chose to generate
|
|
|
|
* a clause equating some other pair of Vars, it could
|
|
|
|
* have generated one equating the FK's Vars. So for
|
|
|
|
* purposes of estimation, we can act as though it did so.
|
|
|
|
*
|
|
|
|
* Note: checking parent_ec is a bit of a cheat because
|
|
|
|
* there are EC-derived clauses that don't have parent_ec
|
|
|
|
* set; but such clauses must compare expressions that
|
|
|
|
* aren't just Vars, so they cannot match the FK anyway.
|
|
|
|
*/
|
|
|
|
if (fkinfo->eclass[i] == rinfo->parent_ec)
|
|
|
|
{
|
|
|
|
remove_it = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Otherwise, see if rinfo was previously matched to FK as
|
|
|
|
* a "loose" clause.
|
|
|
|
*/
|
|
|
|
if (list_member_ptr(fkinfo->rinfos[i], rinfo))
|
|
|
|
{
|
|
|
|
remove_it = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (remove_it)
|
|
|
|
{
|
Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
|
|
|
worklist = foreach_delete_current(worklist, cell);
|
2016-06-18 21:22:34 +02:00
|
|
|
removedlist = lappend(removedlist, rinfo);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we failed to remove all the matching clauses we expected to
|
|
|
|
* find, chicken out and ignore this FK; applying its selectivity
|
|
|
|
* might result in double-counting. Put any clauses we did manage to
|
|
|
|
* remove back into the worklist.
|
|
|
|
*
|
|
|
|
* Since the matching clauses are known not outerjoin-delayed, they
|
Fix foreign-key selectivity estimation in the presence of constants.
get_foreign_key_join_selectivity() looks for join clauses that equate
the two sides of the FK constraint. However, if we have a query like
"WHERE fktab.a = pktab.a and fktab.a = 1", it won't find any such join
clause, because equivclass.c replaces the given clauses with "fktab.a
= 1 and pktab.a = 1", which can be enforced at the scan level, leaving
nothing to be done for column "a" at the join level.
We can fix that expectation without much trouble, but then a new problem
arises: applying the foreign-key-based selectivity rule produces a
rowcount underestimate, because we're effectively double-counting the
selectivity of the "fktab.a = 1" clause. So we have to cancel that
selectivity out of the estimate.
To fix, refactor process_implied_equality() so that it can pass back the
new RestrictInfo to its callers in equivclass.c, allowing the generated
"fktab.a = 1" clause to be saved in the EquivalenceClass's ec_derives
list. Then it's not much trouble to dig out the relevant RestrictInfo
when we need to adjust an FK selectivity estimate. (While at it, we
can also remove the expensive use of initialize_mergeclause_eclasses()
to set up the new RestrictInfo's left_ec and right_ec pointers.
The equivclass.c code can set those basically for free.)
This seems like clearly a bug fix, but I'm hesitant to back-patch it,
first because there's some API/ABI risk for extensions and second because
we're usually loath to destabilize plan choices in stable branches.
Per report from Sigrid Ehrenreich.
Discussion: https://postgr.es/m/1019549.1603770457@sss.pgh.pa.us
Discussion: https://postgr.es/m/AM6PR02MB5287A0ADD936C1FA80973E72AB190@AM6PR02MB5287.eurprd02.prod.outlook.com
2020-10-28 16:15:47 +01:00
|
|
|
* would normally have appeared in the initial joinclause list. If we
|
|
|
|
* didn't find them, there are two possibilities:
|
|
|
|
*
|
|
|
|
* 1. If the FK match is based on an EC that is ec_has_const, it won't
|
|
|
|
* have generated any join clauses at all. We discount such ECs while
|
|
|
|
* checking to see if we have "all" the clauses. (Below, we'll adjust
|
|
|
|
* the selectivity estimate for this case.)
|
|
|
|
*
|
|
|
|
* 2. The clauses were matched to some other FK in a previous
|
|
|
|
* iteration of this loop, and thus removed from worklist. (A likely
|
2016-06-18 21:22:34 +02:00
|
|
|
* case is that two FKs are matched to the same EC; there will be only
|
|
|
|
* one EC-derived clause in the initial list, so the first FK will
|
|
|
|
* consume it.) Applying both FKs' selectivity independently risks
|
|
|
|
* underestimating the join size; in particular, this would undo one
|
|
|
|
* of the main things that ECs were invented for, namely to avoid
|
|
|
|
* double-counting the selectivity of redundant equality conditions.
|
|
|
|
* Later we might think of a reasonable way to combine the estimates,
|
|
|
|
* but for now, just punt, since this is a fairly uncommon situation.
|
|
|
|
*/
|
Fix foreign-key selectivity estimation in the presence of constants.
get_foreign_key_join_selectivity() looks for join clauses that equate
the two sides of the FK constraint. However, if we have a query like
"WHERE fktab.a = pktab.a and fktab.a = 1", it won't find any such join
clause, because equivclass.c replaces the given clauses with "fktab.a
= 1 and pktab.a = 1", which can be enforced at the scan level, leaving
nothing to be done for column "a" at the join level.
We can fix that expectation without much trouble, but then a new problem
arises: applying the foreign-key-based selectivity rule produces a
rowcount underestimate, because we're effectively double-counting the
selectivity of the "fktab.a = 1" clause. So we have to cancel that
selectivity out of the estimate.
To fix, refactor process_implied_equality() so that it can pass back the
new RestrictInfo to its callers in equivclass.c, allowing the generated
"fktab.a = 1" clause to be saved in the EquivalenceClass's ec_derives
list. Then it's not much trouble to dig out the relevant RestrictInfo
when we need to adjust an FK selectivity estimate. (While at it, we
can also remove the expensive use of initialize_mergeclause_eclasses()
to set up the new RestrictInfo's left_ec and right_ec pointers.
The equivclass.c code can set those basically for free.)
This seems like clearly a bug fix, but I'm hesitant to back-patch it,
first because there's some API/ABI risk for extensions and second because
we're usually loath to destabilize plan choices in stable branches.
Per report from Sigrid Ehrenreich.
Discussion: https://postgr.es/m/1019549.1603770457@sss.pgh.pa.us
Discussion: https://postgr.es/m/AM6PR02MB5287A0ADD936C1FA80973E72AB190@AM6PR02MB5287.eurprd02.prod.outlook.com
2020-10-28 16:15:47 +01:00
|
|
|
if (removedlist == NIL ||
|
|
|
|
list_length(removedlist) !=
|
|
|
|
(fkinfo->nmatched_ec - fkinfo->nconst_ec + fkinfo->nmatched_ri))
|
2016-06-18 21:22:34 +02:00
|
|
|
{
|
|
|
|
worklist = list_concat(worklist, removedlist);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Finally we get to the payoff: estimate selectivity using the
|
|
|
|
* knowledge that each referencing row will match exactly one row in
|
|
|
|
* the referenced table.
|
|
|
|
*
|
|
|
|
* XXX that's not true in the presence of nulls in the referencing
|
|
|
|
* column(s), so in principle we should derate the estimate for those.
|
|
|
|
* However (1) if there are any strict restriction clauses for the
|
|
|
|
* referencing column(s) elsewhere in the query, derating here would
|
|
|
|
* be double-counting the null fraction, and (2) it's not very clear
|
2017-06-19 21:33:41 +02:00
|
|
|
* how to combine null fractions for multiple referencing columns. So
|
2016-12-17 21:28:54 +01:00
|
|
|
* we do nothing for now about correcting for nulls.
|
2016-06-18 21:22:34 +02:00
|
|
|
*
|
|
|
|
* XXX another point here is that if either side of an FK constraint
|
|
|
|
* is an inheritance parent, we estimate as though the constraint
|
|
|
|
* covers all its children as well. This is not an unreasonable
|
|
|
|
* assumption for a referencing table, ie the user probably applied
|
|
|
|
* identical constraints to all child tables (though perhaps we ought
|
|
|
|
* to check that). But it's not possible to have done that for a
|
|
|
|
* referenced table. Fortunately, precisely because that doesn't
|
|
|
|
* work, it is uncommon in practice to have an FK referencing a parent
|
|
|
|
* table. So, at least for now, disregard inheritance here.
|
|
|
|
*/
|
2017-06-19 21:33:41 +02:00
|
|
|
if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
|
2016-06-18 21:22:34 +02:00
|
|
|
{
|
|
|
|
/*
|
2017-06-19 21:33:41 +02:00
|
|
|
* For JOIN_SEMI and JOIN_ANTI, we only get here when the FK's
|
|
|
|
* referenced table is exactly the inside of the join. The join
|
|
|
|
* selectivity is defined as the fraction of LHS rows that have
|
|
|
|
* matches. The FK implies that every LHS row has a match *in the
|
|
|
|
* referenced table*; but any restriction clauses on it will
|
|
|
|
* reduce the number of matches. Hence we take the join
|
|
|
|
* selectivity as equal to the selectivity of the table's
|
|
|
|
* restriction clauses, which is rows / tuples; but we must guard
|
|
|
|
* against tuples == 0.
|
2016-06-18 21:22:34 +02:00
|
|
|
*/
|
2017-06-19 21:33:41 +02:00
|
|
|
RelOptInfo *ref_rel = find_base_rel(root, fkinfo->ref_relid);
|
|
|
|
double ref_tuples = Max(ref_rel->tuples, 1.0);
|
2016-12-17 21:28:54 +01:00
|
|
|
|
2017-06-19 21:33:41 +02:00
|
|
|
fkselec *= ref_rel->rows / ref_tuples;
|
2016-06-18 21:22:34 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Otherwise, selectivity is exactly 1/referenced-table-size; but
|
|
|
|
* guard against tuples == 0. Note we should use the raw table
|
|
|
|
* tuple count, not any estimate of its filtered or joined size.
|
|
|
|
*/
|
|
|
|
RelOptInfo *ref_rel = find_base_rel(root, fkinfo->ref_relid);
|
|
|
|
double ref_tuples = Max(ref_rel->tuples, 1.0);
|
|
|
|
|
|
|
|
fkselec *= 1.0 / ref_tuples;
|
|
|
|
}
|
Fix foreign-key selectivity estimation in the presence of constants.
get_foreign_key_join_selectivity() looks for join clauses that equate
the two sides of the FK constraint. However, if we have a query like
"WHERE fktab.a = pktab.a and fktab.a = 1", it won't find any such join
clause, because equivclass.c replaces the given clauses with "fktab.a
= 1 and pktab.a = 1", which can be enforced at the scan level, leaving
nothing to be done for column "a" at the join level.
We can fix that expectation without much trouble, but then a new problem
arises: applying the foreign-key-based selectivity rule produces a
rowcount underestimate, because we're effectively double-counting the
selectivity of the "fktab.a = 1" clause. So we have to cancel that
selectivity out of the estimate.
To fix, refactor process_implied_equality() so that it can pass back the
new RestrictInfo to its callers in equivclass.c, allowing the generated
"fktab.a = 1" clause to be saved in the EquivalenceClass's ec_derives
list. Then it's not much trouble to dig out the relevant RestrictInfo
when we need to adjust an FK selectivity estimate. (While at it, we
can also remove the expensive use of initialize_mergeclause_eclasses()
to set up the new RestrictInfo's left_ec and right_ec pointers.
The equivclass.c code can set those basically for free.)
This seems like clearly a bug fix, but I'm hesitant to back-patch it,
first because there's some API/ABI risk for extensions and second because
we're usually loath to destabilize plan choices in stable branches.
Per report from Sigrid Ehrenreich.
Discussion: https://postgr.es/m/1019549.1603770457@sss.pgh.pa.us
Discussion: https://postgr.es/m/AM6PR02MB5287A0ADD936C1FA80973E72AB190@AM6PR02MB5287.eurprd02.prod.outlook.com
2020-10-28 16:15:47 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If any of the FK columns participated in ec_has_const ECs, then
|
|
|
|
* equivclass.c will have generated "var = const" restrictions for
|
|
|
|
* each side of the join, thus reducing the sizes of both input
|
|
|
|
* relations. Taking the fkselec at face value would amount to
|
|
|
|
* double-counting the selectivity of the constant restriction for the
|
|
|
|
* referencing Var. Hence, look for the restriction clause(s) that
|
|
|
|
* were applied to the referencing Var(s), and divide out their
|
|
|
|
* selectivity to correct for this.
|
|
|
|
*/
|
|
|
|
if (fkinfo->nconst_ec > 0)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < fkinfo->nkeys; i++)
|
|
|
|
{
|
|
|
|
EquivalenceClass *ec = fkinfo->eclass[i];
|
|
|
|
|
|
|
|
if (ec && ec->ec_has_const)
|
|
|
|
{
|
|
|
|
EquivalenceMember *em = fkinfo->fk_eclass_member[i];
|
|
|
|
RestrictInfo *rinfo = find_derived_clause_for_ec_member(ec,
|
|
|
|
em);
|
|
|
|
|
|
|
|
if (rinfo)
|
|
|
|
{
|
|
|
|
Selectivity s0;
|
|
|
|
|
|
|
|
s0 = clause_selectivity(root,
|
|
|
|
(Node *) rinfo,
|
|
|
|
0,
|
|
|
|
jointype,
|
|
|
|
sjinfo);
|
|
|
|
if (s0 > 0)
|
|
|
|
fkselec /= s0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2016-06-18 21:22:34 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
*restrictlist = worklist;
|
Fix foreign-key selectivity estimation in the presence of constants.
get_foreign_key_join_selectivity() looks for join clauses that equate
the two sides of the FK constraint. However, if we have a query like
"WHERE fktab.a = pktab.a and fktab.a = 1", it won't find any such join
clause, because equivclass.c replaces the given clauses with "fktab.a
= 1 and pktab.a = 1", which can be enforced at the scan level, leaving
nothing to be done for column "a" at the join level.
We can fix that expectation without much trouble, but then a new problem
arises: applying the foreign-key-based selectivity rule produces a
rowcount underestimate, because we're effectively double-counting the
selectivity of the "fktab.a = 1" clause. So we have to cancel that
selectivity out of the estimate.
To fix, refactor process_implied_equality() so that it can pass back the
new RestrictInfo to its callers in equivclass.c, allowing the generated
"fktab.a = 1" clause to be saved in the EquivalenceClass's ec_derives
list. Then it's not much trouble to dig out the relevant RestrictInfo
when we need to adjust an FK selectivity estimate. (While at it, we
can also remove the expensive use of initialize_mergeclause_eclasses()
to set up the new RestrictInfo's left_ec and right_ec pointers.
The equivclass.c code can set those basically for free.)
This seems like clearly a bug fix, but I'm hesitant to back-patch it,
first because there's some API/ABI risk for extensions and second because
we're usually loath to destabilize plan choices in stable branches.
Per report from Sigrid Ehrenreich.
Discussion: https://postgr.es/m/1019549.1603770457@sss.pgh.pa.us
Discussion: https://postgr.es/m/AM6PR02MB5287A0ADD936C1FA80973E72AB190@AM6PR02MB5287.eurprd02.prod.outlook.com
2020-10-28 16:15:47 +01:00
|
|
|
CLAMP_PROBABILITY(fkselec);
|
2016-06-18 21:22:34 +02:00
|
|
|
return fkselec;
|
|
|
|
}
|
|
|
|
|
2010-11-19 23:31:50 +01:00
|
|
|
/*
|
|
|
|
* set_subquery_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a subquery.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
* already, and the Paths for the subquery must have been completed.
|
|
|
|
* We look at the subquery's PlannerInfo to extract data.
|
2010-11-19 23:31:50 +01:00
|
|
|
*
|
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
2011-09-03 21:35:12 +02:00
|
|
|
set_subquery_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
2010-11-19 23:31:50 +01:00
|
|
|
{
|
2011-09-03 21:35:12 +02:00
|
|
|
PlannerInfo *subroot = rel->subroot;
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
RelOptInfo *sub_final_rel;
|
2010-11-19 23:31:50 +01:00
|
|
|
ListCell *lc;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are subqueries */
|
|
|
|
Assert(rel->relid > 0);
|
2017-09-21 14:41:14 +02:00
|
|
|
Assert(planner_rt_fetch(rel->relid, root)->rtekind == RTE_SUBQUERY);
|
2010-11-19 23:31:50 +01:00
|
|
|
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
/*
|
|
|
|
* Copy raw number of output rows from subquery. All of its paths should
|
|
|
|
* have the same output rowcount, so just look at cheapest-total.
|
|
|
|
*/
|
|
|
|
sub_final_rel = fetch_upper_rel(subroot, UPPERREL_FINAL, NULL);
|
|
|
|
rel->tuples = sub_final_rel->cheapest_total_path->rows;
|
2010-11-19 23:31:50 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute per-output-column width estimates by examining the subquery's
|
2014-05-06 18:12:18 +02:00
|
|
|
* targetlist. For any output that is a plain Var, get the width estimate
|
2011-08-23 23:11:41 +02:00
|
|
|
* that was made while planning the subquery. Otherwise, we leave it to
|
|
|
|
* set_rel_width to fill in a datatype-based default estimate.
|
2010-11-19 23:31:50 +01:00
|
|
|
*/
|
|
|
|
foreach(lc, subroot->parse->targetList)
|
|
|
|
{
|
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
|
|
|
TargetEntry *te = lfirst_node(TargetEntry, lc);
|
2010-11-19 23:31:50 +01:00
|
|
|
Node *texpr = (Node *) te->expr;
|
2011-08-23 23:11:41 +02:00
|
|
|
int32 item_width = 0;
|
2010-11-19 23:31:50 +01:00
|
|
|
|
|
|
|
/* junk columns aren't visible to upper query */
|
|
|
|
if (te->resjunk)
|
|
|
|
continue;
|
|
|
|
|
2013-04-01 00:32:54 +02:00
|
|
|
/*
|
|
|
|
* The subquery could be an expansion of a view that's had columns
|
|
|
|
* added to it since the current query was parsed, so that there are
|
|
|
|
* non-junk tlist columns in it that don't correspond to any column
|
2014-05-06 18:12:18 +02:00
|
|
|
* visible at our query level. Ignore such columns.
|
2013-04-01 00:32:54 +02:00
|
|
|
*/
|
|
|
|
if (te->resno < rel->min_attr || te->resno > rel->max_attr)
|
|
|
|
continue;
|
|
|
|
|
2010-11-19 23:31:50 +01:00
|
|
|
/*
|
|
|
|
* XXX This currently doesn't work for subqueries containing set
|
|
|
|
* operations, because the Vars in their tlists are bogus references
|
|
|
|
* to the first leaf subquery, which wouldn't give the right answer
|
2011-08-23 23:11:41 +02:00
|
|
|
* even if we could still get to its PlannerInfo.
|
|
|
|
*
|
|
|
|
* Also, the subquery could be an appendrel for which all branches are
|
|
|
|
* known empty due to constraint exclusion, in which case
|
|
|
|
* set_append_rel_pathlist will have left the attr_widths set to zero.
|
|
|
|
*
|
|
|
|
* In either case, we just leave the width estimate zero until
|
|
|
|
* set_rel_width fixes it.
|
2010-11-19 23:31:50 +01:00
|
|
|
*/
|
|
|
|
if (IsA(texpr, Var) &&
|
|
|
|
subroot->parse->setOperations == NULL)
|
|
|
|
{
|
2011-04-10 17:42:00 +02:00
|
|
|
Var *var = (Var *) texpr;
|
2010-11-19 23:31:50 +01:00
|
|
|
RelOptInfo *subrel = find_base_rel(subroot, var->varno);
|
|
|
|
|
|
|
|
item_width = subrel->attr_widths[var->varattno - subrel->min_attr];
|
|
|
|
}
|
|
|
|
rel->attr_widths[te->resno - rel->min_attr] = item_width;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
/*
|
|
|
|
* set_function_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a function call.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*
|
2004-01-05 06:07:36 +01:00
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
2002-05-12 22:10:05 +02:00
|
|
|
*/
|
|
|
|
void
|
2005-06-06 00:32:58 +02:00
|
|
|
set_function_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
2002-05-12 22:10:05 +02:00
|
|
|
{
|
2005-10-05 19:19:19 +02:00
|
|
|
RangeTblEntry *rte;
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
ListCell *lc;
|
2005-10-05 19:19:19 +02:00
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
/* Should only be applied to base relations that are functions */
|
2003-02-08 21:20:55 +01:00
|
|
|
Assert(rel->relid > 0);
|
2007-04-21 23:01:45 +02:00
|
|
|
rte = planner_rt_fetch(rel->relid, root);
|
2005-10-05 19:19:19 +02:00
|
|
|
Assert(rte->rtekind == RTE_FUNCTION);
|
2002-05-12 22:10:05 +02:00
|
|
|
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
/*
|
|
|
|
* Estimate number of rows the functions will return. The rowcount of the
|
|
|
|
* node is that of the largest function result.
|
|
|
|
*/
|
|
|
|
rel->tuples = 0;
|
|
|
|
foreach(lc, rte->functions)
|
|
|
|
{
|
|
|
|
RangeTblFunction *rtfunc = (RangeTblFunction *) lfirst(lc);
|
2019-02-10 00:32:23 +01:00
|
|
|
double ntup = expression_returns_set_rows(root, rtfunc->funcexpr);
|
Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry. The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others. This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.
This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.
Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).
The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does. There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST(). After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.
Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-22 01:37:02 +01:00
|
|
|
|
|
|
|
if (ntup > rel->tuples)
|
|
|
|
rel->tuples = ntup;
|
|
|
|
}
|
2002-05-12 22:10:05 +02:00
|
|
|
|
2004-01-05 06:07:36 +01:00
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
2002-05-12 22:10:05 +02:00
|
|
|
}
|
|
|
|
|
2017-03-08 16:39:37 +01:00
|
|
|
/*
|
|
|
|
* set_function_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a function call.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*
|
|
|
|
* We set the same fields as set_tablefunc_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
set_tablefunc_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
|
|
|
{
|
|
|
|
/* Should only be applied to base relations that are functions */
|
|
|
|
Assert(rel->relid > 0);
|
2017-09-21 14:41:14 +02:00
|
|
|
Assert(planner_rt_fetch(rel->relid, root)->rtekind == RTE_TABLEFUNC);
|
2017-03-08 16:39:37 +01:00
|
|
|
|
|
|
|
rel->tuples = 100;
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
2006-08-02 03:59:48 +02:00
|
|
|
/*
|
|
|
|
* set_values_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a values list.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*
|
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
set_values_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
|
|
|
{
|
|
|
|
RangeTblEntry *rte;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are values lists */
|
|
|
|
Assert(rel->relid > 0);
|
2007-04-21 23:01:45 +02:00
|
|
|
rte = planner_rt_fetch(rel->relid, root);
|
2006-08-02 03:59:48 +02:00
|
|
|
Assert(rte->rtekind == RTE_VALUES);
|
|
|
|
|
|
|
|
/*
|
2006-10-04 02:30:14 +02:00
|
|
|
* Estimate number of rows the values list will return. We know this
|
|
|
|
* precisely based on the list length (well, barring set-returning
|
|
|
|
* functions in list items, but that's a refinement not catered for
|
|
|
|
* anywhere else either).
|
2006-08-02 03:59:48 +02:00
|
|
|
*/
|
|
|
|
rel->tuples = list_length(rte->values_lists);
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
2008-10-04 23:56:55 +02:00
|
|
|
/*
|
|
|
|
* set_cte_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a CTE reference.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
* already, and we need an estimate of the number of rows returned by the CTE
|
|
|
|
* (if a regular CTE) or the non-recursive term (if a self-reference).
|
2008-10-04 23:56:55 +02:00
|
|
|
*
|
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
set_cte_size_estimates(PlannerInfo *root, RelOptInfo *rel, double cte_rows)
|
2008-10-04 23:56:55 +02:00
|
|
|
{
|
|
|
|
RangeTblEntry *rte;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are CTE references */
|
|
|
|
Assert(rel->relid > 0);
|
|
|
|
rte = planner_rt_fetch(rel->relid, root);
|
|
|
|
Assert(rte->rtekind == RTE_CTE);
|
|
|
|
|
|
|
|
if (rte->self_reference)
|
|
|
|
{
|
|
|
|
/*
|
2009-06-11 16:49:15 +02:00
|
|
|
* In a self-reference, arbitrarily assume the average worktable size
|
|
|
|
* is about 10 times the nonrecursive term's size.
|
2008-10-04 23:56:55 +02:00
|
|
|
*/
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
rel->tuples = 10 * cte_rows;
|
2008-10-04 23:56:55 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
/* Otherwise just believe the CTE's rowcount estimate */
|
|
|
|
rel->tuples = cte_rows;
|
2008-10-04 23:56:55 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
2017-04-01 06:17:18 +02:00
|
|
|
/*
|
|
|
|
* set_namedtuplestore_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a tuplestore reference.
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*
|
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
set_namedtuplestore_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
|
|
|
{
|
|
|
|
RangeTblEntry *rte;
|
|
|
|
|
|
|
|
/* Should only be applied to base relations that are tuplestore references */
|
|
|
|
Assert(rel->relid > 0);
|
|
|
|
rte = planner_rt_fetch(rel->relid, root);
|
|
|
|
Assert(rte->rtekind == RTE_NAMEDTUPLESTORE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use the estimate provided by the code which is generating the named
|
|
|
|
* tuplestore. In some cases, the actual number might be available; in
|
|
|
|
* others the same plan will be re-used, so a "typical" value might be
|
|
|
|
* estimated and used.
|
|
|
|
*/
|
|
|
|
rel->tuples = rte->enrtuples;
|
|
|
|
if (rel->tuples < 0)
|
|
|
|
rel->tuples = 1000;
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
In the planner, replace an empty FROM clause with a dummy RTE.
The fact that "SELECT expression" has no base relations has long been a
thorn in the side of the planner. It makes it hard to flatten a sub-query
that looks like that, or is a trivial VALUES() item, because the planner
generally uses relid sets to identify sub-relations, and such a sub-query
would have an empty relid set if we flattened it. prepjointree.c contains
some baroque logic that works around this in certain special cases --- but
there is a much better answer. We can replace an empty FROM clause with a
dummy RTE that acts like a table of one row and no columns, and then there
are no such corner cases to worry about. Instead we need some logic to
get rid of useless dummy RTEs, but that's simpler and covers more cases
than what was there before.
For really trivial cases, where the query is just "SELECT expression" and
nothing else, there's a hazard that adding the extra RTE makes for a
noticeable slowdown; even though it's not much processing, there's not
that much for the planner to do overall. However testing says that the
penalty is very small, close to the noise level. In more complex queries,
this is able to find optimizations that we could not find before.
The new RTE type is called RTE_RESULT, since the "scan" plan type it
gives rise to is a Result node (the same plan we produced for a "SELECT
expression" query before). To avoid confusion, rename the old ResultPath
path type to GroupResultPath, reflecting that it's only used in degenerate
grouping cases where we know the query produces just one grouped row.
(It wouldn't work to unify the two cases, because there are different
rules about where the associated quals live during query_planner.)
Note: although this touches readfuncs.c, I don't think a catversion
bump is required, because the added case can't occur in stored rules,
only plans.
Patch by me, reviewed by David Rowley and Mark Dilger
Discussion: https://postgr.es/m/15944.1521127664@sss.pgh.pa.us
2019-01-28 23:54:10 +01:00
|
|
|
/*
|
|
|
|
* set_result_size_estimates
|
|
|
|
* Set the size estimates for an RTE_RESULT base relation
|
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*
|
|
|
|
* We set the same fields as set_baserel_size_estimates.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
set_result_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
|
|
|
{
|
|
|
|
/* Should only be applied to RTE_RESULT base relations */
|
|
|
|
Assert(rel->relid > 0);
|
|
|
|
Assert(planner_rt_fetch(rel->relid, root)->rtekind == RTE_RESULT);
|
|
|
|
|
|
|
|
/* RTE_RESULT always generates a single row, natively */
|
|
|
|
rel->tuples = 1;
|
|
|
|
|
|
|
|
/* Now estimate number of output rows, etc */
|
|
|
|
set_baserel_size_estimates(root, rel);
|
|
|
|
}
|
|
|
|
|
2011-02-20 06:17:18 +01:00
|
|
|
/*
|
|
|
|
* set_foreign_size_estimates
|
|
|
|
* Set the size estimates for a base relation that is a foreign table.
|
|
|
|
*
|
|
|
|
* There is not a whole lot that we can do here; the foreign-data wrapper
|
|
|
|
* is responsible for producing useful estimates. We can do a decent job
|
|
|
|
* of estimating baserestrictcost, so we set that, and we also set up width
|
|
|
|
* using what will be purely datatype-driven estimates from the targetlist.
|
|
|
|
* There is no way to do anything sane with the rows value, so we just put
|
2014-05-06 18:12:18 +02:00
|
|
|
* a default estimate and hope that the wrapper can improve on it. The
|
Revise FDW planning API, again.
Further reflection shows that a single callback isn't very workable if we
desire to let FDWs generate multiple Paths, because that forces the FDW to
do all work necessary to generate a valid Plan node for each Path. Instead
split the former PlanForeignScan API into three steps: GetForeignRelSize,
GetForeignPaths, GetForeignPlan. We had already bit the bullet of breaking
the 9.1 FDW API for 9.2, so this shouldn't cause very much additional pain,
and it's substantially more flexible for complex FDWs.
Add an fdw_private field to RelOptInfo so that the new functions can save
state there rather than possibly having to recalculate information two or
three times.
In addition, we'd not thought through what would be needed to allow an FDW
to set up subexpressions of its choice for runtime execution. We could
treat ForeignScan.fdw_private as an executable expression but that seems
likely to break existing FDWs unnecessarily (in particular, it would
restrict the set of node types allowable in fdw_private to those supported
by expression_tree_walker). Instead, invent a separate field fdw_exprs
which will receive the postprocessing appropriate for expression trees.
(One field is enough since it can be a list of expressions; also, we assume
the corresponding expression state tree(s) will be held within fdw_state,
so we don't need to add anything to ForeignScanState.)
Per review of Hanada Shigeru's pgsql_fdw patch. We may need to tweak this
further as we continue to work on that patch, but to me it feels a lot
closer to being right now.
2012-03-09 18:48:48 +01:00
|
|
|
* wrapper's GetForeignRelSize function will be called momentarily.
|
2011-02-20 06:17:18 +01:00
|
|
|
*
|
|
|
|
* The rel's targetlist and restrictinfo list must have been constructed
|
|
|
|
* already.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel)
|
|
|
|
{
|
|
|
|
/* Should only be applied to base relations */
|
|
|
|
Assert(rel->relid > 0);
|
|
|
|
|
|
|
|
rel->rows = 1000; /* entirely bogus default estimate */
|
|
|
|
|
|
|
|
cost_qual_eval(&rel->baserestrictcost, rel->baserestrictinfo, root);
|
|
|
|
|
|
|
|
set_rel_width(root, rel);
|
|
|
|
}
|
|
|
|
|
2002-05-12 22:10:05 +02:00
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
2000-01-09 01:26:47 +01:00
|
|
|
* set_rel_width
|
2003-06-30 01:05:05 +02:00
|
|
|
* Set the estimated output width of a base relation.
|
2001-05-09 02:35:09 +02:00
|
|
|
*
|
2010-11-19 23:31:50 +01:00
|
|
|
* The estimated output width is the sum of the per-attribute width estimates
|
|
|
|
* for the actually-referenced columns, plus any PHVs or other expressions
|
|
|
|
* that have to be calculated at this relation. This is the amount of data
|
|
|
|
* we'd need to pass upwards in case of a sort, hash, etc.
|
|
|
|
*
|
2016-03-14 21:59:59 +01:00
|
|
|
* This function also sets reltarget->cost, so it's a bit misnamed now.
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
*
|
2003-06-30 01:05:05 +02:00
|
|
|
* NB: this works best on plain relations because it prefers to look at
|
2010-11-19 23:31:50 +01:00
|
|
|
* real Vars. For subqueries, set_subquery_size_estimates will already have
|
|
|
|
* copied up whatever per-column estimates were made within the subquery,
|
|
|
|
* and for other types of rels there isn't much we can do anyway. We fall
|
|
|
|
* back on (fairly stupid) datatype-based width estimates if we can't get
|
|
|
|
* any better number.
|
2003-06-30 01:05:05 +02:00
|
|
|
*
|
|
|
|
* The per-attribute width estimates are cached for possible re-use while
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
* building join relations or post-scan/join pathtargets.
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-01-09 01:26:47 +01:00
|
|
|
static void
|
2005-06-06 00:32:58 +02:00
|
|
|
set_rel_width(PlannerInfo *root, RelOptInfo *rel)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2008-10-17 22:27:24 +02:00
|
|
|
Oid reloid = planner_rt_fetch(rel->relid, root)->relid;
|
2001-05-09 02:35:09 +02:00
|
|
|
int32 tuple_width = 0;
|
2010-11-19 23:31:50 +01:00
|
|
|
bool have_wholerow_var = false;
|
2008-10-21 22:42:53 +02:00
|
|
|
ListCell *lc;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* Vars are assumed to have cost zero, but other exprs do not */
|
2016-03-14 21:59:59 +01:00
|
|
|
rel->reltarget->cost.startup = 0;
|
|
|
|
rel->reltarget->cost.per_tuple = 0;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
|
2016-03-14 21:59:59 +01:00
|
|
|
foreach(lc, rel->reltarget->exprs)
|
2001-05-09 02:35:09 +02:00
|
|
|
{
|
2008-10-21 22:42:53 +02:00
|
|
|
Node *node = (Node *) lfirst(lc);
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2012-08-27 04:48:55 +02:00
|
|
|
/*
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
* Ordinarily, a Var in a rel's targetlist must belong to that rel;
|
2013-08-18 02:22:37 +02:00
|
|
|
* but there are corner cases involving LATERAL references where that
|
|
|
|
* isn't so. If the Var has the wrong varno, fall through to the
|
|
|
|
* generic case (it doesn't seem worth the trouble to be any smarter).
|
2012-08-27 04:48:55 +02:00
|
|
|
*/
|
|
|
|
if (IsA(node, Var) &&
|
|
|
|
((Var *) node)->varno == rel->relid)
|
2004-06-05 03:55:05 +02:00
|
|
|
{
|
2008-10-21 22:42:53 +02:00
|
|
|
Var *var = (Var *) node;
|
|
|
|
int ndx;
|
|
|
|
int32 item_width;
|
2008-10-17 22:27:24 +02:00
|
|
|
|
2008-10-21 22:42:53 +02:00
|
|
|
Assert(var->varattno >= rel->min_attr);
|
|
|
|
Assert(var->varattno <= rel->max_attr);
|
2003-06-30 01:05:05 +02:00
|
|
|
|
2008-10-21 22:42:53 +02:00
|
|
|
ndx = var->varattno - rel->min_attr;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2008-10-21 22:42:53 +02:00
|
|
|
/*
|
2011-04-10 17:42:00 +02:00
|
|
|
* If it's a whole-row Var, we'll deal with it below after we have
|
|
|
|
* already cached as many attr widths as possible.
|
2010-11-19 23:31:50 +01:00
|
|
|
*/
|
|
|
|
if (var->varattno == 0)
|
|
|
|
{
|
|
|
|
have_wholerow_var = true;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2011-04-10 17:42:00 +02:00
|
|
|
* The width may have been cached already (especially if it's a
|
|
|
|
* subquery), so don't duplicate effort.
|
2008-10-21 22:42:53 +02:00
|
|
|
*/
|
|
|
|
if (rel->attr_widths[ndx] > 0)
|
2001-05-09 02:35:09 +02:00
|
|
|
{
|
2008-10-21 22:42:53 +02:00
|
|
|
tuple_width += rel->attr_widths[ndx];
|
2003-06-30 01:05:05 +02:00
|
|
|
continue;
|
2001-05-09 02:35:09 +02:00
|
|
|
}
|
2008-10-21 22:42:53 +02:00
|
|
|
|
|
|
|
/* Try to get column width from statistics */
|
2010-11-19 23:31:50 +01:00
|
|
|
if (reloid != InvalidOid && var->varattno > 0)
|
2008-10-21 22:42:53 +02:00
|
|
|
{
|
|
|
|
item_width = get_attavgwidth(reloid, var->varattno);
|
|
|
|
if (item_width > 0)
|
|
|
|
{
|
|
|
|
rel->attr_widths[ndx] = item_width;
|
|
|
|
tuple_width += item_width;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not a plain relation, or can't find statistics for it. Estimate
|
|
|
|
* using just the type info.
|
|
|
|
*/
|
|
|
|
item_width = get_typavgwidth(var->vartype, var->vartypmod);
|
|
|
|
Assert(item_width > 0);
|
|
|
|
rel->attr_widths[ndx] = item_width;
|
|
|
|
tuple_width += item_width;
|
2001-05-09 02:35:09 +02:00
|
|
|
}
|
2008-10-21 22:42:53 +02:00
|
|
|
else if (IsA(node, PlaceHolderVar))
|
|
|
|
{
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/*
|
|
|
|
* We will need to evaluate the PHV's contained expression while
|
2016-03-14 21:59:59 +01:00
|
|
|
* scanning this rel, so be sure to include it in reltarget->cost.
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
*/
|
2008-10-21 22:42:53 +02:00
|
|
|
PlaceHolderVar *phv = (PlaceHolderVar *) node;
|
2011-08-09 06:48:51 +02:00
|
|
|
PlaceHolderInfo *phinfo = find_placeholder_info(root, phv, false);
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
QualCost cost;
|
2001-10-25 07:50:21 +02:00
|
|
|
|
2008-10-21 22:42:53 +02:00
|
|
|
tuple_width += phinfo->ph_width;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
cost_qual_eval_node(&cost, (Node *) phv->phexpr, root);
|
2016-03-14 21:59:59 +01:00
|
|
|
rel->reltarget->cost.startup += cost.startup;
|
|
|
|
rel->reltarget->cost.per_tuple += cost.per_tuple;
|
2008-10-21 22:42:53 +02:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
2009-07-11 06:09:33 +02:00
|
|
|
/*
|
|
|
|
* We could be looking at an expression pulled up from a subquery,
|
2014-05-06 18:12:18 +02:00
|
|
|
* or a ROW() representing a whole-row child Var, etc. Do what we
|
2010-02-26 03:01:40 +01:00
|
|
|
* can using the expression type information.
|
2009-07-11 06:09:33 +02:00
|
|
|
*/
|
|
|
|
int32 item_width;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
QualCost cost;
|
2009-07-11 06:09:33 +02:00
|
|
|
|
|
|
|
item_width = get_typavgwidth(exprType(node), exprTypmod(node));
|
|
|
|
Assert(item_width > 0);
|
|
|
|
tuple_width += item_width;
|
Add an explicit representation of the output targetlist to Paths.
Up to now, there's been an assumption that all Paths for a given relation
compute the same output column set (targetlist). However, there are good
reasons to remove that assumption. For example, an indexscan on an
expression index might be able to return the value of an expensive function
"for free". While we have the ability to generate such a plan today in
simple cases, we don't have a way to model that it's cheaper than a plan
that computes the function from scratch, nor a way to create such a plan
in join cases (where the function computation would normally happen at
the topmost join node). Also, we need this so that we can have Paths
representing post-scan/join steps, where the targetlist may well change
from one step to the next. Therefore, invent a "struct PathTarget"
representing the columns we expect a plan step to emit. It's convenient
to include the output tuple width and tlist evaluation cost in this struct,
and there will likely be additional fields in future.
While Path nodes that actually do have custom outputs will need their own
PathTargets, it will still be true that most Paths for a given relation
will compute the same tlist. To reduce the overhead added by this patch,
keep a "default PathTarget" in RelOptInfo, and allow Paths that compute
that column set to just point to their parent RelOptInfo's reltarget.
(In the patch as committed, actually every Path is like that, since we
do not yet have any cases of custom PathTargets.)
I took this opportunity to provide some more-honest costing of
PlaceHolderVar evaluation. Up to now, the assumption that "scan/join
reltargetlists have cost zero" was applied not only to Vars, where it's
reasonable, but also PlaceHolderVars where it isn't. Now, we add the eval
cost of a PlaceHolderVar's expression to the first plan level where it can
be computed, by including it in the PathTarget cost field and adding that
to the cost estimates for Paths. This isn't perfect yet but it's much
better than before, and there is a way forward to improve it more. This
costing change affects the join order chosen for a couple of the regression
tests, changing expected row ordering.
2016-02-19 02:01:49 +01:00
|
|
|
/* Not entirely clear if we need to account for cost, but do so */
|
|
|
|
cost_qual_eval_node(&cost, node, root);
|
2016-03-14 21:59:59 +01:00
|
|
|
rel->reltarget->cost.startup += cost.startup;
|
|
|
|
rel->reltarget->cost.per_tuple += cost.per_tuple;
|
2008-10-21 22:42:53 +02:00
|
|
|
}
|
2001-05-09 02:35:09 +02:00
|
|
|
}
|
2010-11-19 23:31:50 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have a whole-row reference, estimate its width as the sum of
|
2015-02-21 21:13:06 +01:00
|
|
|
* per-column widths plus heap tuple header overhead.
|
2010-11-19 23:31:50 +01:00
|
|
|
*/
|
|
|
|
if (have_wholerow_var)
|
|
|
|
{
|
2015-02-21 21:13:06 +01:00
|
|
|
int32 wholerow_width = MAXALIGN(SizeofHeapTupleHeader);
|
2010-11-19 23:31:50 +01:00
|
|
|
|
|
|
|
if (reloid != InvalidOid)
|
|
|
|
{
|
|
|
|
/* Real relation, so estimate true tuple width */
|
|
|
|
wholerow_width += get_relation_data_width(reloid,
|
Phase 3 of pgindent updates.
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
2017-06-21 21:35:54 +02:00
|
|
|
rel->attr_widths - rel->min_attr);
|
2010-11-19 23:31:50 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* Do what we can with info for a phony rel */
|
|
|
|
AttrNumber i;
|
|
|
|
|
|
|
|
for (i = 1; i <= rel->max_attr; i++)
|
|
|
|
wholerow_width += rel->attr_widths[i - rel->min_attr];
|
|
|
|
}
|
|
|
|
|
|
|
|
rel->attr_widths[0 - rel->min_attr] = wholerow_width;
|
|
|
|
|
|
|
|
/*
|
2011-04-10 17:42:00 +02:00
|
|
|
* Include the whole-row Var as part of the output tuple. Yes, that
|
|
|
|
* really is what happens at runtime.
|
2010-11-19 23:31:50 +01:00
|
|
|
*/
|
|
|
|
tuple_width += wholerow_width;
|
|
|
|
}
|
|
|
|
|
2001-05-09 02:35:09 +02:00
|
|
|
Assert(tuple_width >= 0);
|
2016-03-14 21:59:59 +01:00
|
|
|
rel->reltarget->width = tuple_width;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
|
|
|
|
Make the upper part of the planner work by generating and comparing Paths.
I've been saying we needed to do this for more than five years, and here it
finally is. This patch removes the ever-growing tangle of spaghetti logic
that grouping_planner() used to use to try to identify the best plan for
post-scan/join query steps. Now, there is (nearly) independent
consideration of each execution step, and entirely separate construction of
Paths to represent each of the possible ways to do that step. We choose
the best Path or set of Paths using the same add_path() logic that's been
used inside query_planner() for years.
In addition, this patch removes the old restriction that subquery_planner()
could return only a single Plan. It now returns a RelOptInfo containing a
set of Paths, just as query_planner() does, and the parent query level can
use each of those Paths as the basis of a SubqueryScanPath at its level.
This allows finding some optimizations that we missed before, wherein a
subquery was capable of returning presorted data and thereby avoiding a
sort in the parent level, making the overall cost cheaper even though
delivering sorted output was not the cheapest plan for the subquery in
isolation. (A couple of regression test outputs change in consequence of
that. However, there is very little change in visible planner behavior
overall, because the point of this patch is not to get immediate planning
benefits but to create the infrastructure for future improvements.)
There is a great deal left to do here. This patch unblocks a lot of
planner work that was basically impractical in the old code structure,
such as allowing FDWs to implement remote aggregation, or rewriting
plan_set_operations() to allow consideration of multiple implementation
orders for set operations. (The latter will likely require a full
rewrite of plan_set_operations(); what I've done here is only to fix it
to return Paths not Plans.) I have also left unfinished some localized
refactoring in createplan.c and planner.c, because it was not necessary
to get this patch to a working state.
Thanks to Robert Haas, David Rowley, and Amit Kapila for review.
2016-03-07 21:58:22 +01:00
|
|
|
/*
|
|
|
|
* set_pathtarget_cost_width
|
|
|
|
* Set the estimated eval cost and output width of a PathTarget tlist.
|
|
|
|
*
|
|
|
|
* As a notational convenience, returns the same PathTarget pointer passed in.
|
|
|
|
*
|
|
|
|
* Most, though not quite all, uses of this function occur after we've run
|
|
|
|
* set_rel_width() for base relations; so we can usually obtain cached width
|
|
|
|
* estimates for Vars. If we can't, fall back on datatype-based width
|
|
|
|
* estimates. Present early-planning uses of PathTargets don't need accurate
|
|
|
|
* widths badly enough to justify going to the catalogs for better data.
|
|
|
|
*/
|
|
|
|
PathTarget *
|
|
|
|
set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target)
|
|
|
|
{
|
|
|
|
int32 tuple_width = 0;
|
|
|
|
ListCell *lc;
|
|
|
|
|
|
|
|
/* Vars are assumed to have cost zero, but other exprs do not */
|
|
|
|
target->cost.startup = 0;
|
|
|
|
target->cost.per_tuple = 0;
|
|
|
|
|
|
|
|
foreach(lc, target->exprs)
|
|
|
|
{
|
|
|
|
Node *node = (Node *) lfirst(lc);
|
|
|
|
|
|
|
|
if (IsA(node, Var))
|
|
|
|
{
|
|
|
|
Var *var = (Var *) node;
|
|
|
|
int32 item_width;
|
|
|
|
|
|
|
|
/* We should not see any upper-level Vars here */
|
|
|
|
Assert(var->varlevelsup == 0);
|
|
|
|
|
|
|
|
/* Try to get data from RelOptInfo cache */
|
|
|
|
if (var->varno < root->simple_rel_array_size)
|
|
|
|
{
|
|
|
|
RelOptInfo *rel = root->simple_rel_array[var->varno];
|
|
|
|
|
|
|
|
if (rel != NULL &&
|
|
|
|
var->varattno >= rel->min_attr &&
|
|
|
|
var->varattno <= rel->max_attr)
|
|
|
|
{
|
|
|
|
int ndx = var->varattno - rel->min_attr;
|
|
|
|
|
|
|
|
if (rel->attr_widths[ndx] > 0)
|
|
|
|
{
|
|
|
|
tuple_width += rel->attr_widths[ndx];
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No cached data available, so estimate using just the type info.
|
|
|
|
*/
|
|
|
|
item_width = get_typavgwidth(var->vartype, var->vartypmod);
|
|
|
|
Assert(item_width > 0);
|
|
|
|
tuple_width += item_width;
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Handle general expressions using type info.
|
|
|
|
*/
|
|
|
|
int32 item_width;
|
|
|
|
QualCost cost;
|
|
|
|
|
|
|
|
item_width = get_typavgwidth(exprType(node), exprTypmod(node));
|
|
|
|
Assert(item_width > 0);
|
|
|
|
tuple_width += item_width;
|
|
|
|
|
|
|
|
/* Account for cost, too */
|
|
|
|
cost_qual_eval_node(&cost, node, root);
|
|
|
|
target->cost.startup += cost.startup;
|
|
|
|
target->cost.per_tuple += cost.per_tuple;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Assert(tuple_width >= 0);
|
|
|
|
target->width = tuple_width;
|
|
|
|
|
|
|
|
return target;
|
|
|
|
}
|
|
|
|
|
1999-04-05 04:07:07 +02:00
|
|
|
/*
|
|
|
|
* relation_byte_size
|
1999-05-25 18:15:34 +02:00
|
|
|
* Estimate the storage space in bytes for a given number of tuples
|
|
|
|
* of a given width (size in bytes).
|
1999-04-05 04:07:07 +02:00
|
|
|
*/
|
|
|
|
static double
|
2000-01-09 01:26:47 +01:00
|
|
|
relation_byte_size(double tuples, int width)
|
1999-04-05 04:07:07 +02:00
|
|
|
{
|
2015-02-21 21:13:06 +01:00
|
|
|
return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
|
1999-04-05 04:07:07 +02:00
|
|
|
}
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
1999-02-14 00:22:53 +01:00
|
|
|
* page_size
|
1997-09-07 07:04:48 +02:00
|
|
|
* Returns an estimate of the number of pages covered by a given
|
|
|
|
* number of tuples of a given width (size in bytes).
|
1996-07-09 08:22:35 +02:00
|
|
|
*/
|
2000-01-09 01:26:47 +01:00
|
|
|
static double
|
|
|
|
page_size(double tuples, int width)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2000-01-09 01:26:47 +01:00
|
|
|
return ceil(relation_byte_size(tuples, width) / BLCKSZ);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
|
2017-01-13 19:29:31 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate the fraction of the work that each worker will do given the
|
|
|
|
* number of workers budgeted for the path.
|
|
|
|
*/
|
|
|
|
static double
|
|
|
|
get_parallel_divisor(Path *path)
|
|
|
|
{
|
|
|
|
double parallel_divisor = path->parallel_workers;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Early experience with parallel query suggests that when there is only
|
|
|
|
* one worker, the leader often makes a very substantial contribution to
|
|
|
|
* executing the parallel portion of the plan, but as more workers are
|
|
|
|
* added, it does less and less, because it's busy reading tuples from the
|
|
|
|
* workers and doing whatever non-parallel post-processing is needed. By
|
|
|
|
* the time we reach 4 workers, the leader no longer makes a meaningful
|
|
|
|
* contribution. Thus, for now, estimate that the leader spends 30% of
|
|
|
|
* its time servicing each worker, and the remainder executing the
|
|
|
|
* parallel plan.
|
|
|
|
*/
|
2017-11-15 14:17:29 +01:00
|
|
|
if (parallel_leader_participation)
|
|
|
|
{
|
|
|
|
double leader_contribution;
|
|
|
|
|
|
|
|
leader_contribution = 1.0 - (0.3 * path->parallel_workers);
|
|
|
|
if (leader_contribution > 0)
|
|
|
|
parallel_divisor += leader_contribution;
|
|
|
|
}
|
2017-01-13 19:29:31 +01:00
|
|
|
|
|
|
|
return parallel_divisor;
|
|
|
|
}
|
2017-01-27 22:22:11 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* compute_bitmap_pages
|
|
|
|
*
|
|
|
|
* compute number of pages fetched from heap in bitmap heap scan.
|
|
|
|
*/
|
|
|
|
double
|
|
|
|
compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel, Path *bitmapqual,
|
|
|
|
int loop_count, Cost *cost, double *tuple)
|
|
|
|
{
|
|
|
|
Cost indexTotalCost;
|
|
|
|
Selectivity indexSelectivity;
|
|
|
|
double T;
|
|
|
|
double pages_fetched;
|
|
|
|
double tuples_fetched;
|
2017-11-10 22:50:50 +01:00
|
|
|
double heap_pages;
|
|
|
|
long maxentries;
|
2017-01-27 22:22:11 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fetch total cost of obtaining the bitmap, as well as its total
|
|
|
|
* selectivity.
|
|
|
|
*/
|
|
|
|
cost_bitmap_tree_node(bitmapqual, &indexTotalCost, &indexSelectivity);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Estimate number of main-table pages fetched.
|
|
|
|
*/
|
|
|
|
tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
|
|
|
|
|
|
|
|
T = (baserel->pages > 1) ? (double) baserel->pages : 1.0;
|
|
|
|
|
2017-11-10 22:50:50 +01:00
|
|
|
/*
|
|
|
|
* For a single scan, the number of heap pages that need to be fetched is
|
|
|
|
* the same as the Mackert and Lohman formula for the case T <= b (ie, no
|
|
|
|
* re-reads needed).
|
|
|
|
*/
|
|
|
|
pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the number of pages fetched from the heap. Then based on
|
|
|
|
* current work_mem estimate get the estimated maxentries in the bitmap.
|
|
|
|
* (Note that we always do this calculation based on the number of pages
|
|
|
|
* that would be fetched in a single iteration, even if loop_count > 1.
|
|
|
|
* That's correct, because only that number of entries will be stored in
|
|
|
|
* the bitmap at one time.)
|
|
|
|
*/
|
|
|
|
heap_pages = Min(pages_fetched, baserel->pages);
|
|
|
|
maxentries = tbm_calculate_entries(work_mem * 1024L);
|
|
|
|
|
2017-01-27 22:22:11 +01:00
|
|
|
if (loop_count > 1)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* For repeated bitmap scans, scale up the number of tuples fetched in
|
|
|
|
* the Mackert and Lohman formula by the number of scans, so that we
|
|
|
|
* estimate the number of pages fetched by all the scans. Then
|
|
|
|
* pro-rate for one scan.
|
|
|
|
*/
|
|
|
|
pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
|
|
|
|
baserel->pages,
|
|
|
|
get_indexpath_pages(bitmapqual),
|
|
|
|
root);
|
|
|
|
pages_fetched /= loop_count;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pages_fetched >= T)
|
|
|
|
pages_fetched = T;
|
|
|
|
else
|
|
|
|
pages_fetched = ceil(pages_fetched);
|
|
|
|
|
2017-11-10 22:50:50 +01:00
|
|
|
if (maxentries < heap_pages)
|
|
|
|
{
|
|
|
|
double exact_pages;
|
|
|
|
double lossy_pages;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Crude approximation of the number of lossy pages. Because of the
|
|
|
|
* way tbm_lossify() is coded, the number of lossy pages increases
|
|
|
|
* very sharply as soon as we run short of memory; this formula has
|
|
|
|
* that property and seems to perform adequately in testing, but it's
|
|
|
|
* possible we could do better somehow.
|
|
|
|
*/
|
|
|
|
lossy_pages = Max(0, heap_pages - maxentries / 2);
|
|
|
|
exact_pages = heap_pages - lossy_pages;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are lossy pages then recompute the number of tuples
|
|
|
|
* processed by the bitmap heap node. We assume here that the chance
|
|
|
|
* of a given tuple coming from an exact page is the same as the
|
|
|
|
* chance that a given page is exact. This might not be true, but
|
|
|
|
* it's not clear how we can do any better.
|
|
|
|
*/
|
|
|
|
if (lossy_pages > 0)
|
|
|
|
tuples_fetched =
|
|
|
|
clamp_row_est(indexSelectivity *
|
|
|
|
(exact_pages / heap_pages) * baserel->tuples +
|
|
|
|
(lossy_pages / heap_pages) * baserel->tuples);
|
|
|
|
}
|
|
|
|
|
2017-01-27 22:22:11 +01:00
|
|
|
if (cost)
|
|
|
|
*cost = indexTotalCost;
|
|
|
|
if (tuple)
|
|
|
|
*tuple = tuples_fetched;
|
|
|
|
|
|
|
|
return pages_fetched;
|
|
|
|
}
|