Collect and use multi-column dependency stats

Follow-on patch in the multivariate statistics patch series.

CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
ANALYZE;
will collect dependency stats on (a, b) and then use the measured
dependency in subsequent query planning.

Commit 7b504eb282 added
CREATE STATISTICS with n-distinct coefficients. These are now
specified using the mutually exclusive option WITH (ndistinct).

Author: Tomas Vondra, David Rowley
Reviewed-by: Kyotaro HORIGUCHI, Álvaro Herrera, Dean Rasheed, Robert Haas
and many other comments and contributions
Discussion: https://postgr.es/m/56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com
Simon Riggs 2017-04-05 18:00:42 -04:00
parent 00b6b6feb1
commit 2686ee1b7c
31 changed files with 2035 additions and 79 deletions


@ -1013,6 +1013,7 @@ estimate_size(PlannerInfo *root, RelOptInfo *baserel,
baserel->baserestrictinfo,
0,
JOIN_INNER,
NULL,
NULL);
nrows = clamp_row_est(nrows);


@ -591,6 +591,7 @@ postgresGetForeignRelSize(PlannerInfo *root,
fpinfo->local_conds,
baserel->relid,
JOIN_INNER,
NULL,
NULL);
cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root);
@ -2572,6 +2573,7 @@ estimate_path_cost_size(PlannerInfo *root,
local_param_join_conds,
foreignrel->relid,
JOIN_INNER,
NULL,
NULL);
local_sel *= fpinfo->local_conds_sel;
@ -4455,6 +4457,7 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
fpinfo->local_conds,
0,
JOIN_INNER,
NULL,
NULL);
cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root);
@ -4465,7 +4468,7 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
if (!fpinfo->use_remote_estimate)
fpinfo->joinclause_sel = clauselist_selectivity(root, fpinfo->joinclauses,
0, fpinfo->jointype,
extra->sjinfo);
extra->sjinfo, NULL);
/* Estimate costs for bare join relation */
estimate_path_cost_size(root, joinrel, NIL, NIL, &rows,


@ -4339,6 +4339,15 @@
</entry>
</row>
<row>
<entry><structfield>stadependencies</structfield></entry>
<entry><type>pg_dependencies</type></entry>
<entry></entry>
<entry>
Functional dependencies, serialized as <type>pg_dependencies</> type.
</entry>
</row>
</tbody>
</tgroup>
</table>


@ -446,6 +446,160 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
<sect2 id="functional-dependencies">
<title>Functional Dependencies</title>
<para>
The simplest type of extended statistics are functional dependencies, a
concept used in definitions of database normal forms. Put simply, saying
that <literal>b</> is functionally dependent on <literal>a</> means that
knowledge of the value of <literal>a</> is sufficient to determine the
value of <literal>b</>.
</para>
<para>
In normalized databases, only functional dependencies on primary keys
and superkeys are allowed. However, in practice, many data sets are not
fully normalized, for example, due to intentional denormalization for
performance reasons.
</para>
<para>
Functional dependencies directly affect the accuracy of estimates:
conditions on the dependent column(s) do not further restrict the result
set, so combining their selectivities as if they were independent results
in underestimates.
</para>
<para>
To inform the planner about functional dependencies, dependency
measurements are collected during <command>ANALYZE</>. Assessing the
dependency between all sets of columns would be prohibitively
expensive, so the search is limited to the potential dependencies defined
using the <command>CREATE STATISTICS</> command.
<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
ANALYZE t;
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.367 ms
Execution time: 3.380 ms
(5 rows)
</programlisting>
The planner is now aware of the functional dependencies and considers
them when computing the selectivity of the second condition. Running
the query without the statistics would lead to quite different estimates.
<programlisting>
DROP STATISTICS s1;
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
Planning time: 0.000 ms
Execution time: 6.379 ms
(5 rows)
</programlisting>
</para>
<para>
If no dependency exists, the collected statistics do not influence the
query plan; the only effect is to slow down <command>ANALYZE</>. Should
partial dependencies exist, these will also be stored and applied during
planning.
</para>
<para>
Similarly to per-column statistics, extended statistics are stored in
a system catalog called <structname>pg_statistic_ext</structname>, but
there is also a more convenient view <structname>pg_stats_ext</structname>.
To inspect the statistics <literal>s1</literal> defined above,
you may do this:
<programlisting>
SELECT tablename, staname, attnums, depsbytes
FROM pg_stats_ext WHERE staname = 's1';
tablename | staname | attnums | depsbytes
-----------+---------+---------+-----------
t | s1 | 1 2 | 40
(1 row)
</programlisting>
This shows that the statistics are defined on table <structname>t</>,
that <structfield>attnums</structfield> lists the attribute numbers of the
covered columns (referencing <structname>pg_attribute</structname>), and
that <structfield>depsbytes</structfield> gives the length in bytes of the
functional dependencies found by <command>ANALYZE</>, serialized into a
<literal>bytea</> column.
</para>
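<para>
The serialized dependencies can also be inspected directly by casting the
<structfield>stadependencies</structfield> column to <type>text</>
(a sketch; the exact textual output of <type>pg_dependencies</> is not
shown here):
<programlisting>
SELECT stadependencies::text FROM pg_statistic_ext WHERE staname = 's1';
</programlisting>
The output lists each discovered dependency together with its degree of
validity, a number between 0 and 1.
</para>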
<para>
When computing the selectivity, the planner inspects all conditions and
attempts to identify which conditions are already implied by other
conditions. Redundant conditions are then excluded from the selectivity
computation. In the example query above, either of the two conditions may
be identified as redundant and eliminated, thus improving the overall
estimate.
</para>
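<para>
To see the arithmetic behind the example above (assuming the dependency
degree found by <command>ANALYZE</> is 1.0, i.e. a perfect dependency,
which is what this data implies): the two conditions are combined as
P(a=1 AND b=1) = P(a=1) * (degree + (1 - degree) * P(b=1)) = P(a=1) = 0.01,
i.e. 100 of the 10000 rows, matching the first plan. Without the statistics
the conditions are treated as independent, giving 0.01 * 0.01 = 0.0001,
i.e. the single-row estimate in the second plan. The formula is described
in <filename>src/backend/statistics/README.dependencies</>.
</para>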
<sect3 id="functional-dependencies-limitations">
<title>Limitations of functional dependencies</title>
<para>
Functional dependencies are a very simple type of statistics, and
as such have several limitations. The first limitation is that they
only work with simple equality conditions comparing a column to a constant
value. It is not possible to use them to eliminate equality conditions
comparing two columns or a column to an expression, nor range clauses,
<literal>LIKE</>, or any other type of condition.
</para>
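<para>
For example, the dependency statistics are not applied to the following
queries, whose conditions are estimated in the standard way (a sketch,
reusing the table <structname>t</> defined above):
<programlisting>
EXPLAIN SELECT * FROM t WHERE a = 1 AND b &lt; 10;   -- range clause on b
EXPLAIN SELECT * FROM t WHERE a = b;                -- comparing two columns
</programlisting>
</para>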
<para>
When eliminating the implied conditions, the planner assumes that the
conditions are compatible. Consider the following example, violating
this assumption:
<programlisting>
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
Filter: ((a = 1) AND (b = 10))
Rows Removed by Filter: 10000
Planning time: 0.232 ms
Execution time: 3.033 ms
(5 rows)
</programlisting>
While there are no rows with such a combination of values, the planner
is unable to verify whether the values match; it only knows that the
columns are functionally dependent.
</para>
<para>
This assumption concerns the queries executed against the database rather
than the data itself; in many cases it is actually satisfied (e.g. when a
GUI only allows selecting compatible values). But if that's not the case,
functional dependencies may not be a viable option.
</para>
<para>
For additional information about functional dependencies, see
<filename>src/backend/statistics/README.dependencies</>.
</para>
</sect3>
</sect2>
</sect1>
</chapter>


@ -21,8 +21,9 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_name</replaceable> ON (
<replaceable class="PARAMETER">column_name</replaceable>, <replaceable class="PARAMETER">column_name</replaceable> [, ...])
CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_name</replaceable>
WITH ( <replaceable class="PARAMETER">option</replaceable> [= <replaceable class="PARAMETER">value</replaceable>] [, ... ] )
ON ( <replaceable class="PARAMETER">column_name</replaceable>, <replaceable class="PARAMETER">column_name</replaceable> [, ...])
FROM <replaceable class="PARAMETER">table_name</replaceable>
</synopsis>
@ -94,6 +95,41 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="PARAMETER">statistics_na
</variablelist>
<refsect2 id="SQL-CREATESTATISTICS-parameters">
<title id="SQL-CREATESTATISTICS-parameters-title">Parameters</title>
<indexterm zone="sql-createstatistics-parameters">
<primary>statistics parameters</primary>
</indexterm>
<para>
The <literal>WITH</> clause can specify <firstterm>options</>
for the statistics. Available options are listed below.
</para>
<variablelist>
<varlistentry>
<term><literal>dependencies</> (<type>boolean</>)</term>
<listitem>
<para>
Enables functional dependencies for the statistics.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>ndistinct</> (<type>boolean</>)</term>
<listitem>
<para>
Enables ndistinct coefficients for the statistics.
</para>
</listitem>
</varlistentry>
</variablelist>
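<para>
For example, both statistics types may be requested explicitly (a
hypothetical statistics object; omitting the <literal>WITH</> clause has
the same effect, as all supported types are built by default):
<programlisting>
CREATE STATISTICS s2 WITH (ndistinct, dependencies) ON (a, b) FROM t1;
</programlisting>
</para>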
</refsect2>
</refsect1>
<refsect1>
@ -122,7 +158,7 @@ CREATE TABLE t1 (
INSERT INTO t1 SELECT i/100, i/500
FROM generate_series(1,1000000) s(i);
CREATE STATISTICS s1 ON (a, b) FROM t1;
CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t1;
ANALYZE t1;


@ -192,7 +192,8 @@ CREATE VIEW pg_stats_ext AS
C.relname AS tablename,
S.staname AS staname,
S.stakeys AS attnums,
length(s.standistinct) AS ndistbytes
length(s.standistinct::bytea) AS ndistbytes,
length(S.stadependencies::bytea) AS depsbytes
FROM (pg_statistic_ext S JOIN pg_class C ON (C.oid = S.starelid))
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace);


@ -62,10 +62,11 @@ CreateStatistics(CreateStatsStmt *stmt)
Oid relid;
ObjectAddress parentobject,
childobject;
Datum types[1]; /* only ndistinct defined now */
Datum types[2]; /* one for each possible type of statistics */
int ntypes;
ArrayType *staenabled;
bool build_ndistinct;
bool build_dependencies;
bool requested_type = false;
Assert(IsA(stmt, CreateStatsStmt));
@ -159,7 +160,7 @@ CreateStatistics(CreateStatsStmt *stmt)
errmsg("statistics require at least 2 columns")));
/*
* Sort the attnums, which makes detecting duplicies somewhat easier, and
* Sort the attnums, which makes detecting duplicities somewhat easier, and
* it does not hurt (it does not affect the efficiency, unlike for
* indexes, for example).
*/
@ -182,6 +183,7 @@ CreateStatistics(CreateStatsStmt *stmt)
* recognized.
*/
build_ndistinct = false;
build_dependencies = false;
foreach(l, stmt->options)
{
DefElem *opt = (DefElem *) lfirst(l);
@ -191,6 +193,11 @@ CreateStatistics(CreateStatsStmt *stmt)
build_ndistinct = defGetBoolean(opt);
requested_type = true;
}
else if (strcmp(opt->defname, "dependencies") == 0)
{
build_dependencies = defGetBoolean(opt);
requested_type = true;
}
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@ -199,12 +206,17 @@ CreateStatistics(CreateStatsStmt *stmt)
}
/* If no statistic type was specified, build them all. */
if (!requested_type)
{
build_ndistinct = true;
build_dependencies = true;
}
/* construct the char array of enabled statistic types */
ntypes = 0;
if (build_ndistinct)
types[ntypes++] = CharGetDatum(STATS_EXT_NDISTINCT);
if (build_dependencies)
types[ntypes++] = CharGetDatum(STATS_EXT_DEPENDENCIES);
Assert(ntypes > 0);
staenabled = construct_array(types, ntypes, CHAROID, 1, true, 'c');
@ -222,6 +234,7 @@ CreateStatistics(CreateStatsStmt *stmt)
/* no statistics build yet */
nulls[Anum_pg_statistic_ext_standistinct - 1] = true;
nulls[Anum_pg_statistic_ext_stadependencies - 1] = true;
/* insert it into pg_statistic_ext */
statrel = heap_open(StatisticExtRelationId, RowExclusiveLock);


@ -22,6 +22,7 @@
#include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
#include "statistics/statistics.h"
/*
@ -60,23 +61,30 @@ static void addRangeClause(RangeQueryClause **rqlist, Node *clause,
* subclauses. However, that's only right if the subclauses have independent
* probabilities, and in reality they are often NOT independent. So,
* we want to be smarter where we can.
* Currently, the only extra smarts we have is to recognize "range queries",
* such as "x > 34 AND x < 42". Clauses are recognized as possible range
* query components if they are restriction opclauses whose operators have
* scalarltsel() or scalargtsel() as their restriction selectivity estimator.
* We pair up clauses of this form that refer to the same variable. An
* unpairable clause of this kind is simply multiplied into the selectivity
* product in the normal way. But when we find a pair, we know that the
* selectivities represent the relative positions of the low and high bounds
* within the column's range, so instead of figuring the selectivity as
* hisel * losel, we can figure it as hisel + losel - 1. (To visualize this,
* see that hisel is the fraction of the range below the high bound, while
* losel is the fraction above the low bound; so hisel can be interpreted
* directly as a 0..1 value but we need to convert losel to 1-losel before
* interpreting it as a value. Then the available range is 1-losel to hisel.
* However, this calculation double-excludes nulls, so really we need
* hisel + losel + null_frac - 1.)
*
* When 'rel' is not null and rtekind = RTE_RELATION, we'll try to apply
* selectivity estimates using any extended statistics on 'rel'.
*
* If we identify such extended statistics exist, we try to apply them.
* Currently we only have (soft) functional dependencies, so apply these in as
* many cases as possible, and fall back on normal estimates for remaining
* clauses.
*
* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
* opclauses whose operators have scalarltsel() or scalargtsel() as their
* restriction selectivity estimator. We pair up clauses of this form that
* refer to the same variable. An unpairable clause of this kind is simply
* multiplied into the selectivity product in the normal way. But when we
* find a pair, we know that the selectivities represent the relative
* positions of the low and high bounds within the column's range, so instead
* of figuring the selectivity as hisel * losel, we can figure it as hisel +
* losel - 1. (To visualize this, see that hisel is the fraction of the range
* below the high bound, while losel is the fraction above the low bound; so
* hisel can be interpreted directly as a 0..1 value but we need to convert
* losel to 1-losel before interpreting it as a value. Then the available
* range is 1-losel to hisel. However, this calculation double-excludes
* nulls, so really we need hisel + losel + null_frac - 1.)
*
* If either selectivity is exactly DEFAULT_INEQ_SEL, we forget this equation
* and instead use DEFAULT_RANGE_INEQ_SEL. The same applies if the equation
@ -93,33 +101,70 @@ clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
SpecialJoinInfo *sjinfo,
RelOptInfo *rel)
{
Selectivity s1 = 1.0;
RangeQueryClause *rqlist = NULL;
ListCell *l;
Bitmapset *estimatedclauses = NULL;
int listidx;
/*
* If there's exactly one clause, then no use in trying to match up pairs,
* so just go directly to clause_selectivity().
* If there's exactly one clause, then extended statistics are futile at
* this level (we might be able to apply them later if it's an AND/OR
* clause). So just go directly to clause_selectivity().
*/
if (list_length(clauses) == 1)
return clause_selectivity(root, (Node *) linitial(clauses),
varRelid, jointype, sjinfo);
varRelid, jointype, sjinfo, rel);
/*
* Initial scan over clauses. Anything that doesn't look like a potential
* rangequery clause gets multiplied into s1 and forgotten. Anything that
* does gets inserted into an rqlist entry.
* When a relation of RTE_RELATION is given as 'rel', we'll try to
* perform selectivity estimation using extended statistics.
*/
if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
{
/*
* Perform selectivity estimations on any clauses found applicable by
* dependencies_clauselist_selectivity. The 0-based list position of
* estimated clauses will be populated in 'estimatedclauses'.
*/
s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
jointype, sjinfo, rel, &estimatedclauses);
/*
* This would be the place to apply any other types of extended
* statistics selectivity estimations for remaining clauses.
*/
}
/*
* Apply normal selectivity estimates for remaining clauses. We'll be
* careful to skip any clauses which were already estimated above.
*
* Anything that doesn't look like a potential rangequery clause gets
* multiplied into s1 and forgotten. Anything that does gets inserted into
* an rqlist entry.
*/
listidx = -1;
foreach(l, clauses)
{
Node *clause = (Node *) lfirst(l);
RestrictInfo *rinfo;
Selectivity s2;
listidx++;
/*
* Skip this clause if it's already been estimated by some other
* statistics above.
*/
if (bms_is_member(listidx, estimatedclauses))
continue;
/* Always compute the selectivity using clause_selectivity */
s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo);
s2 = clause_selectivity(root, clause, varRelid, jointype, sjinfo, rel);
/*
* Check for being passed a RestrictInfo.
@ -484,7 +529,8 @@ clause_selectivity(PlannerInfo *root,
Node *clause,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
SpecialJoinInfo *sjinfo,
RelOptInfo *rel)
{
Selectivity s1 = 0.5; /* default for any unhandled clause type */
RestrictInfo *rinfo = NULL;
@ -604,7 +650,8 @@ clause_selectivity(PlannerInfo *root,
(Node *) get_notclausearg((Expr *) clause),
varRelid,
jointype,
sjinfo);
sjinfo,
rel);
}
else if (and_clause(clause))
{
@ -613,7 +660,8 @@ clause_selectivity(PlannerInfo *root,
((BoolExpr *) clause)->args,
varRelid,
jointype,
sjinfo);
sjinfo,
rel);
}
else if (or_clause(clause))
{
@ -632,7 +680,8 @@ clause_selectivity(PlannerInfo *root,
(Node *) lfirst(arg),
varRelid,
jointype,
sjinfo);
sjinfo,
rel);
s1 = s1 + s2 - s1 * s2;
}
@ -725,7 +774,8 @@ clause_selectivity(PlannerInfo *root,
(Node *) ((RelabelType *) clause)->arg,
varRelid,
jointype,
sjinfo);
sjinfo,
rel);
}
else if (IsA(clause, CoerceToDomain))
{
@ -734,7 +784,8 @@ clause_selectivity(PlannerInfo *root,
(Node *) ((CoerceToDomain *) clause)->arg,
varRelid,
jointype,
sjinfo);
sjinfo,
rel);
}
else
{


@ -3750,7 +3750,8 @@ compute_semi_anti_join_factors(PlannerInfo *root,
joinquals,
0,
jointype,
sjinfo);
sjinfo,
NULL);
/*
* Also get the normal inner-join selectivity of the join clauses.
@ -3773,7 +3774,8 @@ compute_semi_anti_join_factors(PlannerInfo *root,
joinquals,
0,
JOIN_INNER,
&norm_sjinfo);
&norm_sjinfo,
NULL);
/* Avoid leaking a lot of ListCells */
if (jointype == JOIN_ANTI)
@ -3940,7 +3942,7 @@ approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
Node *qual = (Node *) lfirst(l);
/* Note that clause_selectivity will be able to cache its result */
selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo);
selec *= clause_selectivity(root, qual, 0, JOIN_INNER, &sjinfo, NULL);
}
/* Apply it to the input relation sizes */
@ -3976,7 +3978,8 @@ set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel)
rel->baserestrictinfo,
0,
JOIN_INNER,
NULL);
NULL,
rel);
rel->rows = clamp_row_est(nrows);
@ -4013,7 +4016,8 @@ get_parameterized_baserel_size(PlannerInfo *root, RelOptInfo *rel,
allclauses,
rel->relid, /* do not use 0! */
JOIN_INNER,
NULL);
NULL,
rel);
nrows = clamp_row_est(nrows);
/* For safety, make sure result is not more than the base estimate */
if (nrows > rel->rows)
@ -4179,12 +4183,14 @@ calc_joinrel_size_estimate(PlannerInfo *root,
joinquals,
0,
jointype,
sjinfo);
sjinfo,
NULL);
pselec = clauselist_selectivity(root,
pushedquals,
0,
jointype,
sjinfo);
sjinfo,
NULL);
/* Avoid leaking a lot of ListCells */
list_free(joinquals);
@ -4196,7 +4202,8 @@ calc_joinrel_size_estimate(PlannerInfo *root,
restrictlist,
0,
jointype,
sjinfo);
sjinfo,
NULL);
pselec = 0.0; /* not used, keep compiler quiet */
}
@ -4491,7 +4498,7 @@ get_foreign_key_join_selectivity(PlannerInfo *root,
Selectivity csel;
csel = clause_selectivity(root, (Node *) rinfo,
0, jointype, sjinfo);
0, jointype, sjinfo, NULL);
thisfksel = Min(thisfksel, csel);
}
fkselec *= thisfksel;


@ -280,7 +280,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
* saving work later.)
*/
or_selec = clause_selectivity(root, (Node *) or_rinfo,
0, JOIN_INNER, NULL);
0, JOIN_INNER, NULL, rel);
/*
* The clause is only worth adding to the query if it rejects a useful
@ -344,7 +344,7 @@ consider_new_or_clause(PlannerInfo *root, RelOptInfo *rel,
/* Compute inner-join size */
orig_selec = clause_selectivity(root, (Node *) join_or_rinfo,
0, JOIN_INNER, &sjinfo);
0, JOIN_INNER, &sjinfo, NULL);
/* And hack cached selectivity so join size remains the same */
join_or_rinfo->norm_selec = orig_selec / or_selec;


@ -1308,6 +1308,18 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
stainfos = lcons(info, stainfos);
}
if (statext_is_kind_built(htup, STATS_EXT_DEPENDENCIES))
{
StatisticExtInfo *info = makeNode(StatisticExtInfo);
info->statOid = statOid;
info->rel = rel;
info->kind = STATS_EXT_DEPENDENCIES;
info->keys = bms_copy(keys);
stainfos = lcons(info, stainfos);
}
ReleaseSysCache(htup);
bms_free(keys);
}


@ -12,6 +12,6 @@ subdir = src/backend/statistics
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = extended_stats.o mvdistinct.o
OBJS = extended_stats.o dependencies.o mvdistinct.o
include $(top_srcdir)/src/backend/common.mk


@ -8,10 +8,72 @@ not true, resulting in estimation errors.
Extended statistics track different types of dependencies between the columns,
hopefully improving the estimates and producing better plans.
Currently we only have one type of extended statistics - ndistinct
coefficients, and we use it to improve estimates of grouping queries. See
README.ndistinct for details.
Types of statistics
-------------------
There are two kinds of extended statistics:
(a) ndistinct coefficients
(b) soft functional dependencies (README.dependencies)
Compatible clause types
-----------------------
Each type of statistics may be used to estimate some subset of clause types.
(a) functional dependencies - equality clauses (AND), possibly IS NULL
Currently, only OpExprs in the form Var op Const or Const op Var are
supported; however, it's feasible to extend the code later to also estimate
selectivities for clauses such as Var op Var.
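For illustration (column names are hypothetical):
    WHERE a = 1 AND 2 = b     -- compatible (Var op Const, Const op Var)
    WHERE a = b               -- not compatible (Var op Var)
    WHERE a = 1 AND b < 10    -- the inequality is not compatible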
Complex clauses
---------------
We also support estimating more complex clauses - essentially AND/OR clauses
with (Var op Const) as leaves, as long as all the referenced attributes are
covered by a single statistics.
For example this condition
(a=1) AND ((b=2) OR ((c=3) AND (d=4)))
may be estimated using statistics on (a,b,c,d). If we only have statistics on
(b,c,d) we may estimate the second part, and estimate (a=1) using simple stats.
If we only have statistics on (a,b,c) we can't apply them at all at this
point, but it's worth pointing out that clauselist_selectivity() works
recursively: when handling the second part (the OR clause), we'll be able to
apply the statistics.
Note: The multi-statistics estimation patch also makes it possible to pass some
clauses as 'conditions' into the deeper parts of the expression tree.
Selectivity estimation
----------------------
Throughout the planner, clauselist_selectivity() remains in charge of
most selectivity estimate requests. clauselist_selectivity() can be
instructed to try to make use of any extended statistics on the given
RelOptInfo, which it will do if:
(a) An actual valid RelOptInfo was given (join relations are passed in as
NULL and are therefore skipped).
(b) The given relation actually has some extended statistics defined which
have also been built.
When the above conditions are met, clauselist_selectivity() first attempts to
pass the clause list off to the extended statistics selectivity estimation
functions. These functions may not find any clauses they can perform
estimation on; such clauses are simply left for the standard estimators. When
a function does perform estimation, it is expected to mark the clauses it has
estimated, so that any other function performing estimation knows which
clauses to skip.
Size of sample in ANALYZE
-------------------------


@ -0,0 +1,119 @@
Soft functional dependencies
============================
Functional dependencies are a concept well described in relational theory,
particularly in the definition of normalization and "normal forms". Wikipedia
has a nice definition of a functional dependency [1]:
In a given table, an attribute Y is said to have a functional dependency
on a set of attributes X (written X -> Y) if and only if each X value is
associated with precisely one Y value. For example, in an "Employee"
table that includes the attributes "Employee ID" and "Employee Date of
Birth", the functional dependency
{Employee ID} -> {Employee Date of Birth}
would hold. It follows from the previous two sentences that each
{Employee ID} is associated with precisely one {Employee Date of Birth}.
[1] https://en.wikipedia.org/wiki/Functional_dependency
In practical terms, functional dependencies mean that a value in one column
determines values in some other column. Consider for example this trivial
table with two integer columns:
CREATE TABLE t (a INT, b INT)
AS SELECT i, i/10 FROM generate_series(1,100000) s(i);
Clearly, knowledge of the value in column 'a' is sufficient to determine the
value in column 'b', as it's simply (a/10). A more practical example may be
addresses, where knowledge of a ZIP code (usually) determines the city. Larger
cities may have multiple ZIP codes, so the dependency can't be reversed.
Many datasets might be normalized not to contain such dependencies, but often
it's not practical for various reasons. In some cases, it's actually a conscious
design choice to model the dataset in a denormalized way, either because of
performance or to make querying easier.
Soft dependencies
-----------------
Real-world data sets often contain data errors, either because of data entry
mistakes (user mistyping the ZIP code) or perhaps issues in generating the
data (e.g. a ZIP code mistakenly assigned to two cities in different states).
A strict implementation would either ignore dependencies in such cases,
rendering the approach mostly useless even for slightly noisy data sets, or
result in sudden changes in behavior depending on minor differences between
samples provided to ANALYZE.
For this reason, the statistics implement "soft" functional dependencies,
associating each functional dependency with a degree of validity (a number
between 0 and 1). This degree is then used to combine selectivities in a
smooth manner.
Mining dependencies (ANALYZE)
-----------------------------
The current algorithm is fairly simple - generate all possible functional
dependencies, and for each one count the number of rows consistent with it.
Then use the fraction of rows (supporting/total) as the degree.
To count the rows consistent with the dependency (a => b):
(a) Sort the data lexicographically, i.e. first by 'a' then 'b'.
(b) For each group of rows with the same 'a' value, count the number of
distinct values in 'b'.
(c) If there's a single distinct value in 'b', the rows are consistent with
the functional dependency, otherwise they contradict it.
The algorithm also requires a minimum group size to consider a group
consistent (currently 3 rows in the sample); small groups are likely to
appear consistent purely by chance, which would otherwise inflate the
degree.
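The counting logic can be sketched in SQL (an illustration only; the actual
implementation is C code operating on the rows sampled by ANALYZE, not on
the whole table):
    SELECT coalesce(sum(cnt) FILTER (WHERE ndist = 1 AND cnt >= 3), 0)::float8
           / sum(cnt) AS degree
    FROM (SELECT a, count(DISTINCT b) AS ndist, count(*) AS cnt
          FROM t GROUP BY a) g;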
Clause reduction (planner/optimizer)
------------------------------------
Applying the functional dependencies is fairly simple - given a list of
equality clauses, we compute selectivities of each clause and then use the
degree to combine them using this formula
P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?))
Where 'd' is the degree of functional dependence (a=>b).
With more than two equality clauses, this process happens recursively. For
example for (a,b,c) we first use (a,b=>c) to break the computation into
P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(c=?))
and then apply (a=>b) the same way on P(a=?,b=?).
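A worked example with made-up numbers: suppose the degree of (a=>b) is
d = 0.8, with P(a=?) = 0.01 and P(b=?) = 0.02. Then
    P(a=?,b=?) = 0.01 * (0.8 + (1 - 0.8) * 0.02) = 0.01 * 0.804 = 0.00804
while assuming independence would give 0.01 * 0.02 = 0.0002, roughly a 40x
underestimate if the columns are indeed strongly dependent.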
Consistency of clauses
----------------------
Functional dependencies only express general dependencies between columns,
without referencing particular values. This assumes that the equality clauses
are in fact consistent with the functional dependency, i.e. that given a
dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
If that's not the case, the clauses are "inconsistent" with the functional
dependency and the result will be over-estimation.
This may happen, for example, when using conditions on the ZIP code and city
name with mismatching values (ZIP code for a different city), etc. In such a
case, the result set will be empty, but we'll estimate the selectivity using
the ZIP code condition.
In this case, the default estimation based on the AVIA principle (the
attribute value independence assumption) happens to work better, but mostly
by chance.
This issue is the price for the simplicity of functional dependencies. If the
application frequently constructs queries with clauses inconsistent with
functional dependencies present in the data, the best solution is not to
use functional dependencies, but one of the more complex types of statistics.

File diff suppressed because it is too large


@ -47,7 +47,7 @@ static List *fetch_statentries_for_relation(Relation pg_statext, Oid relid);
static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
int natts, VacAttrStats **vacattrstats);
static void statext_store(Relation pg_stext, Oid relid,
MVNDistinct *ndistinct,
MVNDistinct *ndistinct, MVDependencies *dependencies,
VacAttrStats **stats);
@ -74,6 +74,7 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
{
StatExtEntry *stat = (StatExtEntry *) lfirst(lc);
MVNDistinct *ndistinct = NULL;
MVDependencies *dependencies = NULL;
VacAttrStats **stats;
ListCell *lc2;
@ -93,10 +94,13 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
if (t == STATS_EXT_NDISTINCT)
ndistinct = statext_ndistinct_build(totalrows, numrows, rows,
stat->columns, stats);
else if (t == STATS_EXT_DEPENDENCIES)
dependencies = statext_dependencies_build(numrows, rows,
stat->columns, stats);
}
/* store the statistics in the catalog */
statext_store(pg_stext, stat->statOid, ndistinct, stats);
statext_store(pg_stext, stat->statOid, ndistinct, dependencies, stats);
}
heap_close(pg_stext, RowExclusiveLock);
@ -117,6 +121,10 @@ statext_is_kind_built(HeapTuple htup, char type)
attnum = Anum_pg_statistic_ext_standistinct;
break;
case STATS_EXT_DEPENDENCIES:
attnum = Anum_pg_statistic_ext_stadependencies;
break;
default:
elog(ERROR, "unexpected statistics type requested: %d", type);
}
@ -178,7 +186,8 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
enabled = (char *) ARR_DATA_PTR(arr);
for (i = 0; i < ARR_DIMS(arr)[0]; i++)
{
Assert(enabled[i] == STATS_EXT_NDISTINCT);
Assert((enabled[i] == STATS_EXT_NDISTINCT) ||
(enabled[i] == STATS_EXT_DEPENDENCIES));
entry->types = lappend_int(entry->types, (int) enabled[i]);
}
@ -256,7 +265,7 @@ lookup_var_attr_stats(Relation rel, Bitmapset *attrs, int natts,
*/
static void
statext_store(Relation pg_stext, Oid statOid,
MVNDistinct *ndistinct,
MVNDistinct *ndistinct, MVDependencies *dependencies,
VacAttrStats **stats)
{
HeapTuple stup,
@ -280,8 +289,17 @@ statext_store(Relation pg_stext, Oid statOid,
values[Anum_pg_statistic_ext_standistinct - 1] = PointerGetDatum(data);
}
if (dependencies != NULL)
{
bytea *data = statext_dependencies_serialize(dependencies);
nulls[Anum_pg_statistic_ext_stadependencies - 1] = (data == NULL);
values[Anum_pg_statistic_ext_stadependencies - 1] = PointerGetDatum(data);
}
/* always replace the value (either by bytea or NULL) */
replaces[Anum_pg_statistic_ext_standistinct - 1] = true;
replaces[Anum_pg_statistic_ext_stadependencies - 1] = true;
/* there should already be a pg_statistic_ext tuple */
oldtup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
@ -387,3 +405,82 @@ multi_sort_compare_dims(int start, int end,
return 0;
}
/*
* has_stats_of_kind
* Check whether the list contains statistics of a given kind
*/
bool
has_stats_of_kind(List *stats, char requiredkind)
{
ListCell *l;
foreach(l, stats)
{
StatisticExtInfo *stat = (StatisticExtInfo *) lfirst(l);
if (stat->kind == requiredkind)
return true;
}
return false;
}
/*
* choose_best_statistics
* Look for statistics with the specified 'requiredkind' which have keys
* that match at least two attnums.
*
* The current selection criterion is very simple: we choose the statistics
* matching the most attributes, preferring fewer keys on a tie.
*
* XXX if multiple statistics of the same size match the same number of
* attributes, then which statistics get chosen depends on the order they
* appear in the stats list. Perhaps this needs to be more definitive.
*/
StatisticExtInfo *
choose_best_statistics(List *stats, Bitmapset *attnums, char requiredkind)
{
ListCell *lc;
StatisticExtInfo *best_match = NULL;
int best_num_matched = 2; /* goal #1: maximize */
int best_match_keys = (STATS_MAX_DIMENSIONS + 1); /* goal #2: minimize */
foreach(lc, stats)
{
StatisticExtInfo *info = (StatisticExtInfo *) lfirst(lc);
int num_matched;
int numkeys;
Bitmapset *matched;
/* skip statistics that are not the correct type */
if (info->kind != requiredkind)
continue;
/* determine how many attributes of these stats can be matched to */
matched = bms_intersect(attnums, info->keys);
num_matched = bms_num_members(matched);
bms_free(matched);
/*
* save the actual number of keys in the stats so that we can choose
* the narrowest stats with the most matching keys.
*/
numkeys = bms_num_members(info->keys);
/*
* Use these statistics when it increases the number of matched
* clauses or when it matches the same number of attributes but these
* stats have fewer keys than any previous match.
*/
if (num_matched > best_num_matched ||
(num_matched == best_num_matched && numkeys < best_match_keys))
{
best_match = info;
best_num_matched = num_matched;
best_match_keys = numkeys;
}
}
return best_match;
}


@ -1452,6 +1452,13 @@ pg_get_statisticsext_worker(Oid statextid, bool missing_ok)
StringInfoData buf;
int colno;
char *nsp;
ArrayType *arr;
char *enabled;
Datum datum;
bool isnull;
bool ndistinct_enabled;
bool dependencies_enabled;
int i;
statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
@ -1467,10 +1474,55 @@ pg_get_statisticsext_worker(Oid statextid, bool missing_ok)
initStringInfo(&buf);
nsp = get_namespace_name(statextrec->stanamespace);
appendStringInfo(&buf, "CREATE STATISTICS %s ON (",
appendStringInfo(&buf, "CREATE STATISTICS %s",
quote_qualified_identifier(nsp,
NameStr(statextrec->staname)));
/*
* Lookup the staenabled column so that we know how to handle the WITH
* clause.
*/
datum = SysCacheGetAttr(STATEXTOID, statexttup,
Anum_pg_statistic_ext_staenabled, &isnull);
Assert(!isnull);
arr = DatumGetArrayTypeP(datum);
if (ARR_NDIM(arr) != 1 ||
ARR_HASNULL(arr) ||
ARR_ELEMTYPE(arr) != CHAROID)
elog(ERROR, "staenabled is not a 1-D char array");
enabled = (char *) ARR_DATA_PTR(arr);
ndistinct_enabled = false;
dependencies_enabled = false;
for (i = 0; i < ARR_DIMS(arr)[0]; i++)
{
if (enabled[i] == STATS_EXT_NDISTINCT)
ndistinct_enabled = true;
if (enabled[i] == STATS_EXT_DEPENDENCIES)
dependencies_enabled = true;
}
/*
* If any option is disabled, then we'll need to append a WITH clause to
* show which options are enabled. We omit the WITH clause on purpose
* when all options are enabled, so a pg_dump/pg_restore will create all
* statistics types on a newer postgres version, if the statistics had all
* options enabled on the original version.
*/
if (!ndistinct_enabled || !dependencies_enabled)
{
appendStringInfoString(&buf, " WITH (");
if (ndistinct_enabled)
appendStringInfoString(&buf, "ndistinct");
else if (dependencies_enabled)
appendStringInfoString(&buf, "dependencies");
appendStringInfoChar(&buf, ')');
}
appendStringInfoString(&buf, " ON (");
for (colno = 0; colno < statextrec->stakeys.dim1; colno++)
{
AttrNumber attnum = statextrec->stakeys.values[colno];


@ -1633,13 +1633,17 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
case IS_NOT_FALSE:
selec = (double) clause_selectivity(root, arg,
varRelid,
jointype, sjinfo);
jointype,
sjinfo,
NULL);
break;
case IS_FALSE:
case IS_NOT_TRUE:
selec = 1.0 - (double) clause_selectivity(root, arg,
varRelid,
jointype, sjinfo);
jointype,
sjinfo,
NULL);
break;
default:
elog(ERROR, "unrecognized booltesttype: %d",
@ -6436,7 +6440,8 @@ genericcostestimate(PlannerInfo *root,
indexSelectivity = clauselist_selectivity(root, selectivityQuals,
index->rel->relid,
JOIN_INNER,
NULL);
NULL,
index->rel);
/*
* If caller didn't give us an estimate, estimate the number of index
@ -6757,7 +6762,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
index->rel->relid,
JOIN_INNER,
NULL);
NULL,
index->rel);
numIndexTuples = btreeSelectivity * index->rel->tuples;
/*
@ -7516,7 +7522,8 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*indexSelectivity = clauselist_selectivity(root, selectivityQuals,
index->rel->relid,
JOIN_INNER,
NULL);
NULL,
index->rel);
/* fetch estimated page cost for tablespace containing index */
get_tablespace_page_costs(index->reltablespace,
@ -7748,7 +7755,8 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
*indexSelectivity =
clauselist_selectivity(root, indexQuals,
path->indexinfo->rel->relid,
JOIN_INNER, NULL);
JOIN_INNER, NULL,
path->indexinfo->rel);
*indexCorrelation = 1;
/*


@ -2331,7 +2331,8 @@ describeOneTableDetails(const char *schemaname,
" FROM ((SELECT pg_catalog.unnest(stakeys) AS attnum) s\n"
" JOIN pg_catalog.pg_attribute a ON (starelid = a.attrelid AND\n"
"a.attnum = s.attnum AND not attisdropped))) AS columns,\n"
" (staenabled::char[] @> '{d}'::char[]) AS ndist_enabled\n"
" (staenabled::char[] @> '{d}'::char[]) AS ndist_enabled,\n"
" (staenabled::char[] @> '{f}'::char[]) AS deps_enabled\n"
"FROM pg_catalog.pg_statistic_ext stat WHERE starelid = '%s'\n"
"ORDER BY 1;",
oid);
@ -2348,7 +2349,7 @@ describeOneTableDetails(const char *schemaname,
for (i = 0; i < tuples; i++)
{
int cnt = 0;
bool gotone = false;
printfPQExpBuffer(&buf, " ");
@ -2361,7 +2362,12 @@ describeOneTableDetails(const char *schemaname,
if (strcmp(PQgetvalue(result, i, 5), "t") == 0)
{
appendPQExpBufferStr(&buf, "ndistinct");
cnt++;
gotone = true;
}
if (strcmp(PQgetvalue(result, i, 6), "t") == 0)
{
appendPQExpBuffer(&buf, "%sdependencies", gotone ? ", " : "");
}
appendPQExpBuffer(&buf, ") ON (%s)",


@ -258,6 +258,10 @@ DATA(insert ( 194 25 0 i b ));
DATA(insert ( 3361 17 0 i b ));
DATA(insert ( 3361 25 0 i i ));
/* pg_dependencies can be coerced to, but not from, bytea and text */
DATA(insert ( 3402 17 0 i b ));
DATA(insert ( 3402 25 0 i i ));
/*
* Datetime category
*/


@ -2775,6 +2775,15 @@ DESCR("I/O");
DATA(insert OID = 3358 ( pg_ndistinct_send PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 17 "3361" _null_ _null_ _null_ _null_ _null_ pg_ndistinct_send _null_ _null_ _null_ ));
DESCR("I/O");
DATA(insert OID = 3404 ( pg_dependencies_in PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 3402 "2275" _null_ _null_ _null_ _null_ _null_ pg_dependencies_in _null_ _null_ _null_ ));
DESCR("I/O");
DATA(insert OID = 3405 ( pg_dependencies_out PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 2275 "3402" _null_ _null_ _null_ _null_ _null_ pg_dependencies_out _null_ _null_ _null_ ));
DESCR("I/O");
DATA(insert OID = 3406 ( pg_dependencies_recv PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 3402 "2281" _null_ _null_ _null_ _null_ _null_ pg_dependencies_recv _null_ _null_ _null_ ));
DESCR("I/O");
DATA(insert OID = 3407 ( pg_dependencies_send PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 17 "3402" _null_ _null_ _null_ _null_ _null_ pg_dependencies_send _null_ _null_ _null_ ));
DESCR("I/O");
DATA(insert OID = 1928 ( pg_stat_get_numscans PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_numscans _null_ _null_ _null_ ));
DESCR("statistics: number of scans done for table/index");
DATA(insert OID = 1929 ( pg_stat_get_tuples_returned PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_tuples_returned _null_ _null_ _null_ ));


@ -46,6 +46,7 @@ CATALOG(pg_statistic_ext,3381)
char staenabled[1] BKI_FORCE_NOT_NULL; /* statistic types
* requested to build */
pg_ndistinct standistinct; /* ndistinct coefficients (serialized) */
pg_dependencies stadependencies; /* dependencies (serialized) */
#endif
} FormData_pg_statistic_ext;
@ -61,7 +62,7 @@ typedef FormData_pg_statistic_ext *Form_pg_statistic_ext;
* compiler constants for pg_statistic_ext
* ----------------
*/
#define Natts_pg_statistic_ext 7
#define Natts_pg_statistic_ext 8
#define Anum_pg_statistic_ext_starelid 1
#define Anum_pg_statistic_ext_staname 2
#define Anum_pg_statistic_ext_stanamespace 3
@ -69,7 +70,9 @@ typedef FormData_pg_statistic_ext *Form_pg_statistic_ext;
#define Anum_pg_statistic_ext_stakeys 5
#define Anum_pg_statistic_ext_staenabled 6
#define Anum_pg_statistic_ext_standistinct 7
#define Anum_pg_statistic_ext_stadependencies 8
#define STATS_EXT_NDISTINCT 'd'
#define STATS_EXT_NDISTINCT 'd'
#define STATS_EXT_DEPENDENCIES 'f'
#endif /* PG_STATISTIC_EXT_H */


@ -368,6 +368,10 @@ DATA(insert OID = 3361 ( pg_ndistinct PGNSP PGUID -1 f b S f t \054 0 0 0 pg_nd
DESCR("multivariate ndistinct coefficients");
#define PGNDISTINCTOID 3361
DATA(insert OID = 3402 ( pg_dependencies PGNSP PGUID -1 f b S f t \054 0 0 0 pg_dependencies_in pg_dependencies_out pg_dependencies_recv pg_dependencies_send - - - i x f 0 -1 0 100 _null_ _null_ _null_ ));
DESCR("multivariate dependencies");
#define PGDEPENDENCIESOID 3402
DATA(insert OID = 32 ( pg_ddl_command PGNSP PGUID SIZEOF_POINTER t p P f t \054 0 0 0 pg_ddl_command_in pg_ddl_command_out pg_ddl_command_recv pg_ddl_command_send - - - ALIGNOF_POINTER p f 0 -1 0 0 _null_ _null_ _null_ ));
DESCR("internal type for passing CollectedCommand");
#define PGDDLCOMMANDOID 32


@ -203,12 +203,14 @@ extern Selectivity clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo);
SpecialJoinInfo *sjinfo,
RelOptInfo *rel);
extern Selectivity clause_selectivity(PlannerInfo *root,
Node *clause,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo);
SpecialJoinInfo *sjinfo,
RelOptInfo *rel);
extern void cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
RelOptInfo *rel, ParamPathInfo *param_info,
Cost input_startup_cost, Cost input_total_cost,


@ -52,6 +52,11 @@ extern MVNDistinct *statext_ndistinct_build(double totalrows,
extern bytea *statext_ndistinct_serialize(MVNDistinct *ndistinct);
extern MVNDistinct *statext_ndistinct_deserialize(bytea *data);
extern MVDependencies *statext_dependencies_build(int numrows, HeapTuple *rows,
Bitmapset *attrs, VacAttrStats **stats);
extern bytea *statext_dependencies_serialize(MVDependencies *dependencies);
extern MVDependencies *statext_dependencies_deserialize(bytea *data);
extern MultiSortSupport multi_sort_init(int ndims);
extern void multi_sort_add_dimension(MultiSortSupport mss, int sortdim,
Oid oper);


@ -14,6 +14,7 @@
#define STATISTICS_H
#include "commands/vacuum.h"
#include "nodes/relation.h"
#define STATS_MAX_DIMENSIONS 8 /* max number of attributes */
@ -44,11 +45,54 @@ typedef struct MVNDistinct
#define SizeOfMVNDistinct (offsetof(MVNDistinct, nitems) + sizeof(uint32))
/* size of the struct excluding the items array */
#define SizeOfMVNDistinct (offsetof(MVNDistinct, nitems) + sizeof(uint32))
#define STATS_DEPS_MAGIC 0xB4549A2C /* marks serialized bytea */
#define STATS_DEPS_TYPE_BASIC 1 /* basic dependencies type */
/*
* Functional dependencies, tracking column-level relationships (values
* in one column determine values in another one).
*/
typedef struct MVDependency
{
double degree; /* degree of validity (0-1) */
AttrNumber nattributes; /* number of attributes */
AttrNumber attributes[FLEXIBLE_ARRAY_MEMBER]; /* attribute numbers */
} MVDependency;
/* size of the struct excluding the deps array */
#define SizeOfDependency \
(offsetof(MVDependency, nattributes) + sizeof(AttrNumber))
typedef struct MVDependencies
{
uint32 magic; /* magic constant marker */
uint32 type; /* type of MV Dependencies (BASIC) */
uint32 ndeps; /* number of dependencies */
MVDependency *deps[FLEXIBLE_ARRAY_MEMBER]; /* dependencies */
} MVDependencies;
/* size of the struct excluding the deps array */
#define SizeOfDependencies (offsetof(MVDependencies, ndeps) + sizeof(uint32))
extern MVNDistinct *statext_ndistinct_load(Oid mvoid);
extern MVDependencies *staext_dependencies_load(Oid mvoid);
extern void BuildRelationExtStatistics(Relation onerel, double totalrows,
int numrows, HeapTuple *rows,
int natts, VacAttrStats **vacattrstats);
extern bool statext_is_kind_built(HeapTuple htup, char kind);
extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
Bitmapset **estimatedclauses);
extern bool has_stats_of_kind(List *stats, char requiredkind);
extern StatisticExtInfo *choose_best_statistics(List *stats,
Bitmapset *attnums, char requiredkind);
#endif /* STATISTICS_H */


@ -824,11 +824,12 @@ WHERE c.castmethod = 'b' AND
character varying | character | 0 | i
pg_node_tree | text | 0 | i
pg_ndistinct | bytea | 0 | i
pg_dependencies | bytea | 0 | i
cidr | inet | 0 | i
xml | text | 0 | a
xml | character varying | 0 | a
xml | character | 0 | a
(8 rows)
(9 rows)
-- **************** pg_conversion ****************
-- Look for illegal values in pg_conversion fields.


@ -2192,7 +2192,8 @@ pg_stats_ext| SELECT n.nspname AS schemaname,
c.relname AS tablename,
s.staname,
s.stakeys AS attnums,
length((s.standistinct)::text) AS ndistbytes
length((s.standistinct)::bytea) AS ndistbytes,
length((s.stadependencies)::bytea) AS depsbytes
FROM ((pg_statistic_ext s
JOIN pg_class c ON ((c.oid = s.starelid)))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));


@ -31,7 +31,7 @@ ALTER TABLE ab1 DROP COLUMN a;
b | integer | | |
c | integer | | |
Statistics:
"public.ab1_b_c_stats" WITH (ndistinct) ON (b, c)
"public.ab1_b_c_stats" WITH (ndistinct, dependencies) ON (b, c)
DROP TABLE ab1;
-- Ensure things work sanely with SET STATISTICS 0
@ -135,7 +135,7 @@ SELECT staenabled, standistinct
FROM pg_statistic_ext WHERE starelid = 'ndistinct'::regclass;
staenabled | standistinct
------------+------------------------------------------------------------------------------------------------
{d} | [{(b 3 4), 301.000000}, {(b 3 6), 301.000000}, {(b 4 6), 301.000000}, {(b 3 4 6), 301.000000}]
{d,f} | [{(b 3 4), 301.000000}, {(b 3 6), 301.000000}, {(b 4 6), 301.000000}, {(b 3 4 6), 301.000000}]
(1 row)
-- Hash Aggregate, thanks to estimates improved by the statistic
@ -201,7 +201,7 @@ SELECT staenabled, standistinct
FROM pg_statistic_ext WHERE starelid = 'ndistinct'::regclass;
staenabled | standistinct
------------+----------------------------------------------------------------------------------------------------
{d} | [{(b 3 4), 2550.000000}, {(b 3 6), 800.000000}, {(b 4 6), 1632.000000}, {(b 3 4 6), 10000.000000}]
{d,f} | [{(b 3 4), 2550.000000}, {(b 3 6), 800.000000}, {(b 4 6), 1632.000000}, {(b 3 4 6), 10000.000000}]
(1 row)
-- plans using Group Aggregate, thanks to using correct estimates
@ -311,3 +311,107 @@ EXPLAIN (COSTS off)
(3 rows)
DROP TABLE ndistinct;
-- functional dependencies tests
CREATE TABLE functional_dependencies (
filler1 TEXT,
filler2 NUMERIC,
a INT,
b TEXT,
filler3 DATE,
c INT,
d TEXT
);
SET random_page_cost = 1.2;
CREATE INDEX fdeps_ab_idx ON functional_dependencies (a, b);
CREATE INDEX fdeps_abc_idx ON functional_dependencies (a, b, c);
-- random data (no functional dependencies)
INSERT INTO functional_dependencies (a, b, c, filler1)
SELECT mod(i, 23), mod(i, 29), mod(i, 31), i FROM generate_series(1,5000) s(i);
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on functional_dependencies
Recheck Cond: ((a = 1) AND (b = '1'::text))
-> Bitmap Index Scan on fdeps_abc_idx
Index Cond: ((a = 1) AND (b = '1'::text))
(4 rows)
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
QUERY PLAN
-----------------------------------------------------------
Index Scan using fdeps_abc_idx on functional_dependencies
Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
(2 rows)
-- create statistics
CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on functional_dependencies
Recheck Cond: ((a = 1) AND (b = '1'::text))
-> Bitmap Index Scan on fdeps_abc_idx
Index Cond: ((a = 1) AND (b = '1'::text))
(4 rows)
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
QUERY PLAN
-----------------------------------------------------------
Index Scan using fdeps_abc_idx on functional_dependencies
Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
(2 rows)
-- a => b, a => c, b => c
TRUNCATE functional_dependencies;
DROP STATISTICS func_deps_stat;
INSERT INTO functional_dependencies (a, b, c, filler1)
SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
QUERY PLAN
-----------------------------------------------------------
Index Scan using fdeps_abc_idx on functional_dependencies
Index Cond: ((a = 1) AND (b = '1'::text))
(2 rows)
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
QUERY PLAN
-----------------------------------------------------------
Index Scan using fdeps_abc_idx on functional_dependencies
Index Cond: ((a = 1) AND (b = '1'::text) AND (c = 1))
(2 rows)
-- create statistics
CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on functional_dependencies
Recheck Cond: ((a = 1) AND (b = '1'::text))
-> Bitmap Index Scan on fdeps_abc_idx
Index Cond: ((a = 1) AND (b = '1'::text))
(4 rows)
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
QUERY PLAN
---------------------------------------------------
Bitmap Heap Scan on functional_dependencies
Recheck Cond: ((a = 1) AND (b = '1'::text))
Filter: (c = 1)
-> Bitmap Index Scan on fdeps_ab_idx
Index Cond: ((a = 1) AND (b = '1'::text))
(5 rows)
RESET random_page_cost;
DROP TABLE functional_dependencies;


@ -67,12 +67,13 @@ WHERE p1.typtype not in ('c','d','p') AND p1.typname NOT LIKE E'\\_%'
(SELECT 1 FROM pg_type as p2
WHERE p2.typname = ('_' || p1.typname)::name AND
p2.typelem = p1.oid and p1.typarray = p2.oid);
oid | typname
------+--------------
oid | typname
------+-----------------
194 | pg_node_tree
3361 | pg_ndistinct
3402 | pg_dependencies
210 | smgr
(3 rows)
(4 rows)
-- Make sure typarray points to a varlena array type of our own base
SELECT p1.oid, p1.typname as basetype, p2.typname as arraytype,


@ -163,3 +163,71 @@ EXPLAIN (COSTS off)
SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
DROP TABLE ndistinct;
-- functional dependencies tests
CREATE TABLE functional_dependencies (
filler1 TEXT,
filler2 NUMERIC,
a INT,
b TEXT,
filler3 DATE,
c INT,
d TEXT
);
SET random_page_cost = 1.2;
CREATE INDEX fdeps_ab_idx ON functional_dependencies (a, b);
CREATE INDEX fdeps_abc_idx ON functional_dependencies (a, b, c);
-- random data (no functional dependencies)
INSERT INTO functional_dependencies (a, b, c, filler1)
SELECT mod(i, 23), mod(i, 29), mod(i, 31), i FROM generate_series(1,5000) s(i);
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
-- create statistics
CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
-- a => b, a => c, b => c
TRUNCATE functional_dependencies;
DROP STATISTICS func_deps_stat;
INSERT INTO functional_dependencies (a, b, c, filler1)
SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
-- create statistics
CREATE STATISTICS func_deps_stat WITH (dependencies) ON (a, b, c) FROM functional_dependencies;
ANALYZE functional_dependencies;
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1';
EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
RESET random_page_cost;
DROP TABLE functional_dependencies;