Improve planner's handling of set-returning functions in grouping columns.

Improve query_is_distinct_for() to accept SRFs in the targetlist when
we can prove distinctness from a DISTINCT clause.  In that case the
de-duplication will surely happen after SRF expansion, so the proof
still works.  Continue to punt in the case where we'd try to prove
distinctness from GROUP BY (or, in the future, source relations).
To do that, we'd have to determine whether the SRFs were in the
grouping columns or elsewhere in the tlist, and it still doesn't
seem worth the trouble.  But this trivial change allows us to
recognize that "SELECT DISTINCT unnest(foo) FROM ..." produces
unique-ified output, which seems worth having.

Also, fix estimate_num_groups() to consider the possibility of SRFs in
the grouping columns.  Its failure to do so was masked before v10 because
grouping_planner() scaled up plan rowcount estimates by the estimated SRF
multiplier after performing grouping.  That doesn't happen anymore, which
is more correct, but it means we need an adjustment in the estimate for
the number of groups.  Failure to do this leads to an underestimate for
the number of output rows of subqueries like "SELECT DISTINCT unnest(foo)"
compared to what 9.6 and earlier estimated, thus breaking plan choices
in some cases.

Per report from Dmitry Shalashov.  Back-patch to v10 to avoid degraded
plan choices compared to previous releases.

Discussion: https://postgr.es/m/CAKPeCUGAeHgoh5O=SvcQxREVkoX7UdeJUMj1F5=aBNvoTa+O8w@mail.gmail.com
This commit is contained in:
Tom Lane 2017-11-25 11:48:09 -05:00
parent b10967eddf
commit df3a66e282
2 changed files with 41 additions and 14 deletions

View File

@ -744,8 +744,8 @@ rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List *clause_list)
bool
query_supports_distinctness(Query *query)
{
/* we don't cope with SRFs, see comment below */
if (query->hasTargetSRFs)
/* SRFs break distinctness except with DISTINCT, see below */
if (query->hasTargetSRFs && query->distinctClause == NIL)
return false;
/* check for features we can prove distinctness with */
@ -786,21 +786,11 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
Assert(list_length(colnos) == list_length(opids));
/*
* A set-returning function in the query's targetlist can result in
* returning duplicate rows, if the SRF is evaluated after the
* de-duplication step; so we play it safe and say "no" if there are any
* SRFs. (We could be certain that it's okay if SRFs appear only in the
* specified columns, since those must be evaluated before de-duplication;
* but it doesn't presently seem worth the complication to check that.)
*/
if (query->hasTargetSRFs)
return false;
/*
* DISTINCT (including DISTINCT ON) guarantees uniqueness if all the
* columns in the DISTINCT clause appear in colnos and operator semantics
* match.
* match. This is true even if there are SRFs in the DISTINCT columns or
* elsewhere in the tlist.
*/
if (query->distinctClause)
{
@ -819,6 +809,16 @@ query_is_distinct_for(Query *query, List *colnos, List *opids)
return true;
}
/*
* Otherwise, a set-returning function in the query's targetlist can
* result in returning duplicate rows, despite any grouping that might
* occur before tlist evaluation. (If all tlist SRFs are within GROUP BY
* columns, it would be safe because they'd be expanded before grouping.
* But it doesn't currently seem worth the effort to check for that.)
*/
if (query->hasTargetSRFs)
return false;
/*
* Similarly, GROUP BY without GROUPING SETS guarantees uniqueness if all
* the grouped columns appear in colnos and operator semantics match.

View File

@ -3361,6 +3361,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
List **pgset)
{
List *varinfos = NIL;
double srf_multiplier = 1.0;
double numdistinct;
ListCell *l;
int i;
@ -3394,6 +3395,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
foreach(l, groupExprs)
{
Node *groupexpr = (Node *) lfirst(l);
double this_srf_multiplier;
VariableStatData vardata;
List *varshere;
ListCell *l2;
@ -3402,6 +3404,21 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
if (pgset && !list_member_int(*pgset, i++))
continue;
/*
* Set-returning functions in grouping columns are a bit problematic.
* The code below will effectively ignore their SRF nature and come up
* with a numdistinct estimate as though they were scalar functions.
* We compensate by scaling up the end result by the largest SRF
* rowcount estimate. (This will be an overestimate if the SRF
* produces multiple copies of any output value, but it seems best to
* assume the SRF's outputs are distinct. In any case, it's probably
* pointless to worry too much about this without much better
* estimates for SRF output rowcounts than we have today.)
*/
this_srf_multiplier = expression_returns_set_rows(groupexpr);
if (srf_multiplier < this_srf_multiplier)
srf_multiplier = this_srf_multiplier;
/* Short-circuit for expressions returning boolean */
if (exprType(groupexpr) == BOOLOID)
{
@ -3467,9 +3484,15 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
*/
if (varinfos == NIL)
{
/* Apply SRF multiplier as we would do in the long path */
numdistinct *= srf_multiplier;
/* Round off */
numdistinct = ceil(numdistinct);
/* Guard against out-of-range answers */
if (numdistinct > input_rows)
numdistinct = input_rows;
if (numdistinct < 1.0)
numdistinct = 1.0;
return numdistinct;
}
@ -3638,6 +3661,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
varinfos = newvarinfos;
} while (varinfos != NIL);
/* Now we can account for the effects of any SRFs */
numdistinct *= srf_multiplier;
/* Round off */
numdistinct = ceil(numdistinct);
/* Guard against out-of-range answers */