Add support for multivariate MCV lists

Introduce a third extended statistic type, supported by the CREATE
STATISTICS command - MCV lists, a generalization of the statistic
already built and used for individual columns.

Compared to the already supported types (n-distinct coefficients and
functional dependencies), MCV lists are more complex, include column
values and allow estimation of much wider range of common clauses
(equality and inequality conditions, IS NULL, IS NOT NULL etc.).
Similarly to the other types, a new pseudo-type (pg_mcv_list) is used.

Author: Tomas Vondra
Reviewed-by: Dean Rasheed, David Rowley, Mark Dilger, Alvaro Herrera
Discussion: https://postgr.es/m/dfdac334-9cf2-2597-fb27-f0fb3753f435@2ndquadrant.com
This commit is contained in:
Tomas Vondra 2019-03-27 18:32:18 +01:00
parent 333ed246c6
commit 7300a69950
32 changed files with 3597 additions and 134 deletions

View File

@ -6562,7 +6562,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
An array containing codes for the enabled statistic kinds;
valid values are:
<literal>d</literal> for n-distinct statistics,
<literal>f</literal> for functional dependency statistics
<literal>f</literal> for functional dependency statistics, and
<literal>m</literal> for most common values (MCV) list statistics
</entry>
</row>
@ -6585,6 +6586,16 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
</entry>
</row>
<row>
<entry><structfield>stxmcv</structfield></entry>
<entry><type>pg_mcv_list</type></entry>
<entry></entry>
<entry>
MCV (most-common values) list statistics, serialized as
<structname>pg_mcv_list</structname> type.
</entry>
</row>
</tbody>
</tgroup>
</table>

View File

@ -22174,4 +22174,86 @@ CREATE EVENT TRIGGER test_table_rewrite_oid
</sect2>
</sect1>
<sect1 id="functions-statistics">
<title>Statistics Information Functions</title>
<indexterm zone="functions-statistics">
<primary>function</primary>
<secondary>statistics</secondary>
</indexterm>
<para>
<productname>PostgreSQL</productname> provides a function to inspect complex
statistics defined using the <command>CREATE STATISTICS</command> command.
</para>
<sect2 id="functions-statistics-mcv">
<title>Inspecting MCV lists</title>
<indexterm>
<primary>pg_mcv_list_items</primary>
<secondary>pg_mcv_list</secondary>
</indexterm>
<para>
<function>pg_mcv_list_items</function> returns a list of all items
stored in a multi-column <acronym>MCV</acronym> list, and returns the
following columns:
<informaltable>
<tgroup cols="3">
<thead>
<row>
<entry>Name</entry>
<entry>Type</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>index</literal></entry>
<entry><type>int</type></entry>
<entry>index of the item in the <acronym>MCV</acronym> list</entry>
</row>
<row>
<entry><literal>values</literal></entry>
<entry><type>text[]</type></entry>
<entry>values stored in the MCV item</entry>
</row>
<row>
<entry><literal>nulls</literal></entry>
<entry><type>boolean[]</type></entry>
<entry>flags identifying <literal>NULL</literal> values</entry>
</row>
<row>
<entry><literal>frequency</literal></entry>
<entry><type>double precision</type></entry>
<entry>frequency of this <acronym>MCV</acronym> item</entry>
</row>
<row>
<entry><literal>base_frequency</literal></entry>
<entry><type>double precision</type></entry>
<entry>base frequency of this <acronym>MCV</acronym> item</entry>
</row>
</tbody>
</tgroup>
</informaltable>
</para>
<para>
The <function>pg_mcv_list_items</function> function can be used like this:
<programlisting>
SELECT m.* FROM pg_statistic_ext,
pg_mcv_list_items(stxmcv) m WHERE stxname = 'stts';
</programlisting>
Values of the <type>pg_mcv_list</type> can be obtained only from the
<literal>pg_statistic_ext.stxmcv</literal> column.
</para>
</sect2>
</sect1>
</chapter>

View File

@ -1285,6 +1285,72 @@ nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
plans. Otherwise, the <command>ANALYZE</command> cycles are just wasted.
</para>
</sect3>
<sect3>
<title>Multivariate MCV lists</title>
<para>
Another type of statistics stored for each column are most-common value
lists. This allows very accurate estimates for individual columns, but
may result in significant misestimates for queries with conditions on
multiple columns.
</para>
<para>
To improve such estimates, <command>ANALYZE</command> can collect MCV
lists on combinations of columns. Similarly to functional dependencies
and n-distinct coefficients, it's impractical to do this for every
possible column grouping. Even more so in this case, as the MCV list
(unlike functional dependencies and n-distinct coefficients) does store
the common column values. So data is collected only for those groups
of columns appearing together in a statistics object defined with the
<literal>mcv</literal> option.
</para>
<para>
Continuing the previous example, the MCV list for a table of ZIP codes
might look like the following (unlike for simpler types of statistics,
a function is required for inspection of MCV contents):
<programlisting>
CREATE STATISTICS stts3 (mcv) ON state, city FROM zipcodes;
ANALYZE zipcodes;
SELECT m.* FROM pg_statistic_ext,
pg_mcv_list_items(stxmcv) m WHERE stxname = 'stts3';
index | values | nulls | frequency | base_frequency
-------+------------------------+-------+-----------+----------------
0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05
1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05
2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133
3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113
4 | {New York, NY} | {f,f} | 0.001967 | 0.000114
5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05
6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05
7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05
8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05
9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05
...
(99 rows)
</programlisting>
This indicates that the most common combination of city and state is
Washington in DC, with actual frequency (in the sample) about 0.35%.
The base frequency of the combination (as computed from the simple
per-column frequencies) is only 0.0027%, resulting in two orders of
magnitude under-estimates.
</para>
<para>
It's advisable to create <acronym>MCV</acronym> statistics objects only
on combinations of columns that are actually used in conditions together,
and for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the <command>ANALYZE</command> and planning cycles
are just wasted.
</para>
</sect3>
</sect2>
</sect1>

View File

@ -455,7 +455,7 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
<secondary>multivariate</secondary>
</indexterm>
<sect2>
<sect2 id="functional-dependencies">
<title>Functional Dependencies</title>
<para>
@ -540,7 +540,7 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
</para>
</sect2>
<sect2>
<sect2 id="multivariate-ndistinct-counts">
<title>Multivariate N-Distinct Counts</title>
<para>
@ -585,6 +585,118 @@ EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
</para>
</sect2>
<sect2 id="mcv-lists">
<title>MCV lists</title>
<para>
As explained in <xref linkend="functional-dependencies"/>, functional
dependencies are very cheap and efficient type of statistics, but their
main limitation is their global nature (only tracking dependencies at
the column level, not between individual column values).
</para>
<para>
This section introduces multivariate variant of <acronym>MCV</acronym>
(most-common values) lists, a straightforward extension of the per-column
statistics described in <xref linkend="row-estimation-examples"/>. These
statistics address the limitation by storing individual values, but it is
naturally more expensive, both in terms of building the statistics in
<command>ANALYZE</command>, storage and planning time.
</para>
<para>
Let's look at the query from <xref linkend="functional-dependencies"/>
again, but this time with a <acronym>MCV</acronym> list created on the
same set of columns (be sure to drop the functional dependencies, to
make sure the planner uses the newly created statistics).
<programlisting>
DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
Filter: ((a = 1) AND (b = 1))
Rows Removed by Filter: 9900
</programlisting>
The estimate is as accurate as with the functional dependencies, mostly
thanks to the table being fairly small and having a simple distribution
with a low number of distinct values. Before looking at the second query,
which was not handled by functional dependencies particularly well,
let's inspect the <acronym>MCV</acronym> list a bit.
</para>
<para>
Inspecting the <acronym>MCV</acronym> list is possible using
<function>pg_mcv_list_items</function> set-returning function.
<programlisting>
SELECT m.* FROM pg_statistic_ext,
pg_mcv_list_items(stxmcv) m WHERE stxname = 'stts2';
index | values | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
0 | {0, 0} | {f,f} | 0.01 | 0.0001
1 | {1, 1} | {f,f} | 0.01 | 0.0001
...
49 | {49, 49} | {f,f} | 0.01 | 0.0001
50 | {50, 50} | {f,f} | 0.01 | 0.0001
...
97 | {97, 97} | {f,f} | 0.01 | 0.0001
98 | {98, 98} | {f,f} | 0.01 | 0.0001
99 | {99, 99} | {f,f} | 0.01 | 0.0001
(100 rows)
</programlisting>
This confirms there are 100 distinct combinations in the two columns, and
all of them are about equally likely (1% frequency for each one). The
base frequency is the frequency computed from per-column statistics, as if
there were no multi-column statistics. Had there been any null values in
either of the columns, this would be identified in the
<structfield>nulls</structfield> column.
</para>
<para>
When estimating the selectivity, the planner applies all the conditions
on items in the <acronym>MCV</acronym> list, and then sums the frequencies
of the matching ones. See <function>mcv_clauselist_selectivity</function>
in <filename>src/backend/statistics/mcv.c</filename> for details.
</para>
<para>
Compared to functional dependencies, <acronym>MCV</acronym> lists have two
major advantages. Firstly, the list stores actual values, making it possible
to decide which combinations are compatible.
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
Filter: ((a = 1) AND (b = 10))
Rows Removed by Filter: 10000
</programlisting>
Secondly, <acronym>MCV</acronym> lists handle a wider range of clause types,
not just equality clauses like functional dependencies. See for example the
example range query, presented earlier:
<programlisting>
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a &lt;= 49 AND b &gt; 49;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
Filter: ((a &lt;= 49) AND (b &gt; 49))
Rows Removed by Filter: 10000
</programlisting>
</para>
</sect2>
</sect1>
<sect1 id="planner-stats-security">

View File

@ -81,9 +81,10 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
<para>
A statistics kind to be computed in this statistics object.
Currently supported kinds are
<literal>ndistinct</literal>, which enables n-distinct statistics, and
<literal>ndistinct</literal>, which enables n-distinct statistics,
<literal>dependencies</literal>, which enables functional
dependency statistics.
dependency statistics, and <literal>mcv</literal> which enables
most-common values lists.
If this clause is omitted, all supported statistics kinds are
included in the statistics object.
For more information, see <xref linkend="planner-stats-extended"/>
@ -164,6 +165,36 @@ EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
conditions are redundant and does not underestimate the row count.
</para>
<para>
Create table <structname>t2</structname> with two perfectly correlated columns
(containing identical data), and a MCV list on those columns:
<programlisting>
CREATE TABLE t2 (
a int,
b int
);
INSERT INTO t2 SELECT mod(i,100), mod(i,100)
FROM generate_series(1,1000000) s(i);
CREATE STATISTICS s2 (mcv) ON (a, b) FROM t2;
ANALYZE t2;
-- valid combination (found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
-- invalid combination (not found in MCV)
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
</programlisting>
The MCV list gives the planner more detailed information about the
specific values that commonly appear in the table, as well as an upper
bound on the selectivities of combinations of values that do not appear
in the table, allowing it to generate better estimates in both cases.
</para>
</refsect1>
<refsect1>

View File

@ -73,11 +73,12 @@ CreateStatistics(CreateStatsStmt *stmt)
Oid relid;
ObjectAddress parentobject,
myself;
Datum types[2]; /* one for each possible type of statistic */
Datum types[3]; /* one for each possible type of statistic */
int ntypes;
ArrayType *stxkind;
bool build_ndistinct;
bool build_dependencies;
bool build_mcv;
bool requested_type = false;
int i;
ListCell *cell;
@ -272,6 +273,7 @@ CreateStatistics(CreateStatsStmt *stmt)
*/
build_ndistinct = false;
build_dependencies = false;
build_mcv = false;
foreach(cell, stmt->stat_types)
{
char *type = strVal((Value *) lfirst(cell));
@ -286,6 +288,11 @@ CreateStatistics(CreateStatsStmt *stmt)
build_dependencies = true;
requested_type = true;
}
else if (strcmp(type, "mcv") == 0)
{
build_mcv = true;
requested_type = true;
}
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@ -297,6 +304,7 @@ CreateStatistics(CreateStatsStmt *stmt)
{
build_ndistinct = true;
build_dependencies = true;
build_mcv = true;
}
/* construct the char array of enabled statistic types */
@ -305,6 +313,8 @@ CreateStatistics(CreateStatsStmt *stmt)
types[ntypes++] = CharGetDatum(STATS_EXT_NDISTINCT);
if (build_dependencies)
types[ntypes++] = CharGetDatum(STATS_EXT_DEPENDENCIES);
if (build_mcv)
types[ntypes++] = CharGetDatum(STATS_EXT_MCV);
Assert(ntypes > 0 && ntypes <= lengthof(types));
stxkind = construct_array(types, ntypes, CHAROID, 1, true, 'c');
@ -329,6 +339,7 @@ CreateStatistics(CreateStatsStmt *stmt)
/* no statistics built yet */
nulls[Anum_pg_statistic_ext_stxndistinct - 1] = true;
nulls[Anum_pg_statistic_ext_stxdependencies - 1] = true;
nulls[Anum_pg_statistic_ext_stxmcv - 1] = true;
/* insert it into pg_statistic_ext */
htup = heap_form_tuple(statrel->rd_att, values, nulls);
@ -424,23 +435,72 @@ RemoveStatisticsById(Oid statsOid)
* null until the next ANALYZE. (Note that the type change hasn't actually
* happened yet, so one option that's *not* on the table is to recompute
* immediately.)
*
* For both ndistinct and functional-dependencies stats, the on-disk
* representation is independent of the source column data types, and it is
* plausible to assume that the old statistic values will still be good for
* the new column contents. (Obviously, if the ALTER COLUMN TYPE has a USING
* expression that substantially alters the semantic meaning of the column
* values, this assumption could fail. But that seems like a corner case
* that doesn't justify zapping the stats in common cases.)
*
* For MCV lists that's not the case, as those statistics store the datums
* internally. In this case we simply reset the statistics value to NULL.
*/
void
UpdateStatisticsForTypeChange(Oid statsOid, Oid relationOid, int attnum,
Oid oldColumnType, Oid newColumnType)
{
HeapTuple stup,
oldtup;
Relation rel;
Datum values[Natts_pg_statistic_ext];
bool nulls[Natts_pg_statistic_ext];
bool replaces[Natts_pg_statistic_ext];
oldtup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statsOid));
if (!oldtup)
elog(ERROR, "cache lookup failed for statistics object %u", statsOid);
/*
* Currently, we don't actually need to do anything here. For both
* ndistinct and functional-dependencies stats, the on-disk representation
* is independent of the source column data types, and it is plausible to
* assume that the old statistic values will still be good for the new
* column contents. (Obviously, if the ALTER COLUMN TYPE has a USING
* expression that substantially alters the semantic meaning of the column
* values, this assumption could fail. But that seems like a corner case
* that doesn't justify zapping the stats in common cases.)
*
* Future types of extended stats will likely require us to work harder.
* When none of the defined statistics types contain datum values
* from the table's columns then there's no need to reset the stats.
* Functional dependencies and ndistinct stats should still hold true.
*/
if (!statext_is_kind_built(oldtup, STATS_EXT_MCV))
{
ReleaseSysCache(oldtup);
return;
}
/*
* OK, we need to reset some statistics. So let's build the new tuple,
* replacing the affected statistics types with NULL.
*/
memset(nulls, 0, Natts_pg_statistic_ext * sizeof(bool));
memset(replaces, 0, Natts_pg_statistic_ext * sizeof(bool));
memset(values, 0, Natts_pg_statistic_ext * sizeof(Datum));
replaces[Anum_pg_statistic_ext_stxmcv - 1] = true;
nulls[Anum_pg_statistic_ext_stxmcv - 1] = true;
rel = heap_open(StatisticExtRelationId, RowExclusiveLock);
/* replace the old tuple */
stup = heap_modify_tuple(oldtup,
RelationGetDescr(rel),
values,
nulls,
replaces);
ReleaseSysCache(oldtup);
CatalogTupleUpdate(rel, &stup->t_self, stup);
heap_freetuple(stup);
heap_close(rel, RowExclusiveLock);
}
/*

View File

@ -443,6 +443,50 @@ bms_is_member(int x, const Bitmapset *a)
return false;
}
/*
* bms_member_index
* determine 0-based index of member x in the bitmap
*
* Returns (-1) when x is not a member.
*/
int
bms_member_index(Bitmapset *a, int x)
{
int i;
int bitnum;
int wordnum;
int result = 0;
bitmapword mask;
/* return -1 if not a member of the bitmap */
if (!bms_is_member(x, a))
return -1;
wordnum = WORDNUM(x);
bitnum = BITNUM(x);
/* count bits in preceding words */
for (i = 0; i < wordnum; i++)
{
bitmapword w = a->words[i];
/* No need to count the bits in a zero word */
if (w != 0)
result += bmw_popcount(w);
}
/*
* Now add bits of the last word, but only those before the item.
* We can do that by applying a mask and then using popcount again.
* To get 0-based index, we want to count only preceding bits, not
* the item itself, so we subtract 1.
*/
mask = ((bitmapword) 1 << bitnum) - 1;
result += bmw_popcount(a->words[wordnum] & mask);
return result;
}
/*
* bms_overlap - do sets overlap (ie, have a nonempty intersection)?
*/

View File

@ -60,17 +60,67 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
*
* See clause_selectivity() for the meaning of the additional parameters.
*
* The basic approach is to apply extended statistics first, on as many
* clauses as possible, in order to capture cross-column dependencies etc.
* The remaining clauses are then estimated using regular statistics tracked
* for individual columns. This is done by simply passing the clauses to
* clauselist_selectivity_simple.
*/
Selectivity
clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
{
Selectivity s1 = 1.0;
RelOptInfo *rel;
Bitmapset *estimatedclauses = NULL;
/*
* Determine if these clauses reference a single relation. If so, and if
* it has extended statistics, try to apply those.
*/
rel = find_single_rel_for_clauses(root, clauses);
if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
{
/*
* Estimate as many clauses as possible using extended statistics.
*
* 'estimatedclauses' tracks the 0-based list position index of
* clauses that we've estimated using extended statistics, and that
* should be ignored.
*/
s1 *= statext_clauselist_selectivity(root, clauses, varRelid,
jointype, sjinfo, rel,
&estimatedclauses);
}
/*
* Apply normal selectivity estimates for the remaining clauses, passing
* 'estimatedclauses' so that it skips already estimated ones.
*/
return s1 * clauselist_selectivity_simple(root, clauses, varRelid,
jointype, sjinfo,
estimatedclauses);
}
/*
* clauselist_selectivity_simple -
* Compute the selectivity of an implicitly-ANDed list of boolean
* expression clauses. The list can be empty, in which case 1.0
* must be returned. List elements may be either RestrictInfos
* or bare expression clauses --- the former is preferred since
* it allows caching of results. The estimatedclauses bitmap tracks
* clauses that have already been estimated by other means.
*
* See clause_selectivity() for the meaning of the additional parameters.
*
* Our basic approach is to take the product of the selectivities of the
* subclauses. However, that's only right if the subclauses have independent
* probabilities, and in reality they are often NOT independent. So,
* we want to be smarter where we can.
*
* If the clauses taken together refer to just one relation, we'll try to
* apply selectivity estimates using any extended statistics for that rel.
* Currently we only have (soft) functional dependencies, so apply these in as
* many cases as possible, and fall back on normal estimates for remaining
* clauses.
*
* We also recognize "range queries", such as "x > 34 AND x < 42". Clauses
* are recognized as possible range query components if they are restriction
* opclauses whose operators have scalarltsel or a related function as their
@ -98,54 +148,29 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root,
* selectivity functions; perhaps some day we can generalize the approach.
*/
Selectivity
clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo)
clauselist_selectivity_simple(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo,
Bitmapset *estimatedclauses)
{
Selectivity s1 = 1.0;
RelOptInfo *rel;
Bitmapset *estimatedclauses = NULL;
RangeQueryClause *rqlist = NULL;
ListCell *l;
int listidx;
/*
* If there's exactly one clause, just go directly to
* clause_selectivity(). None of what we might do below is relevant.
* If there's exactly one clause (and it was not estimated yet), just
* go directly to clause_selectivity(). None of what we might do below
* is relevant.
*/
if (list_length(clauses) == 1)
if ((list_length(clauses) == 1) &&
bms_num_members(estimatedclauses) == 0)
return clause_selectivity(root, (Node *) linitial(clauses),
varRelid, jointype, sjinfo);
/*
* Determine if these clauses reference a single relation. If so, and if
* it has extended statistics, try to apply those.
*/
rel = find_single_rel_for_clauses(root, clauses);
if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL)
{
/*
* Perform selectivity estimations on any clauses found applicable by
* dependencies_clauselist_selectivity. 'estimatedclauses' will be
* filled with the 0-based list positions of clauses used that way, so
* that we can ignore them below.
*/
s1 *= dependencies_clauselist_selectivity(root, clauses, varRelid,
jointype, sjinfo, rel,
&estimatedclauses);
/*
* This would be the place to apply any other types of extended
* statistics selectivity estimations for remaining clauses.
*/
}
/*
* Apply normal selectivity estimates for remaining clauses. We'll be
* careful to skip any clauses which were already estimated above.
*
* Anything that doesn't look like a potential rangequery clause gets
* multiplied into s1 and forgotten. Anything that does gets inserted into
* an rqlist entry.

View File

@ -1363,6 +1363,18 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
stainfos = lcons(info, stainfos);
}
if (statext_is_kind_built(htup, STATS_EXT_MCV))
{
StatisticExtInfo *info = makeNode(StatisticExtInfo);
info->statOid = statOid;
info->rel = rel;
info->kind = STATS_EXT_MCV;
info->keys = bms_copy(keys);
stainfos = lcons(info, stainfos);
}
ReleaseSysCache(htup);
bms_free(keys);
}

View File

@ -1652,6 +1652,8 @@ generateClonedExtStatsStmt(RangeVar *heapRel, Oid heapRelid,
stat_types = lappend(stat_types, makeString("ndistinct"));
else if (enabled[i] == STATS_EXT_DEPENDENCIES)
stat_types = lappend(stat_types, makeString("dependencies"));
else if (enabled[i] == STATS_EXT_MCV)
stat_types = lappend(stat_types, makeString("mcv"));
else
elog(ERROR, "unrecognized statistics kind %c", enabled[i]);
}

View File

@ -12,6 +12,6 @@ subdir = src/backend/statistics
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = extended_stats.o dependencies.o mvdistinct.o
OBJS = extended_stats.o dependencies.o mcv.o mvdistinct.o
include $(top_srcdir)/src/backend/common.mk

View File

@ -18,6 +18,8 @@ There are currently two kinds of extended statistics:
(b) soft functional dependencies (README.dependencies)
(c) MCV lists (README.mcv)
Compatible clause types
-----------------------
@ -26,6 +28,8 @@ Each type of statistics may be used to estimate some subset of clause types.
(a) functional dependencies - equality clauses (AND), possibly IS NULL
(b) MCV lists - equality and inequality clauses (AND, OR, NOT), IS [NOT] NULL
Currently, only OpExprs in the form Var op Const, or Const op Var are
supported, however it's feasible to expand the code later to also estimate the
selectivities on clauses such as Var op Var.

View File

@ -0,0 +1,100 @@
MCV lists
=========
Multivariate MCV (most-common values) lists are a straightforward extension of
regular MCV list, tracking most frequent combinations of values for a group of
attributes.
This works particularly well for columns with a small number of distinct values,
as the list may include all the combinations and approximate the distribution
very accurately.
For columns with a large number of distinct values (e.g. those with continuous
domains), the list will only track the most frequent combinations. If the
distribution is mostly uniform (all combinations about equally frequent), the
MCV list will be empty.
Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
than when using histograms.
Also, MCV lists don't necessarily require sorting of the values (the fact that
we use sorting when building them is implementation detail), but even more
importantly the ordering is not built into the approximation (while histograms
are built on ordering). So MCV lists work well even for attributes where the
ordering of the data type is disconnected from the meaning of the data. For
example we know how to sort strings, but it's unlikely to make much sense for
city names (or other label-like attributes).
Selectivity estimation
----------------------
The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
in principle - we need to identify MCV items matching all the clauses and sum
frequencies of all those items.
Currently MCV lists support estimation of the following clause types:
(a) equality clauses WHERE (a = 1) AND (b = 2)
(b) inequality clauses WHERE (a < 1) AND (b >= 2)
(c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
(d) OR clauses WHERE (a < 1) OR (b >= 2)
It's possible to add support for additional clauses, for example:
(e) multi-var clauses WHERE (a > b)
and possibly others. These are tasks for the future, not yet implemented.
Hashed MCV (not yet implemented)
--------------------------------
Regular MCV lists have to include actual values for each item, so if those items
are large the list may be quite large. This is especially true for multivariate
MCV lists, although the current implementation partially mitigates this by
performing de-duplicating the values before storing them on disk.
It's possible to only store hashes (32-bit values) instead of the actual values,
significantly reducing the space requirements. Obviously, this would only make
the MCV lists useful for estimating equality conditions (assuming the 32-bit
hashes make the collisions rare enough).
This might also complicate matching the columns to available stats.
TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead
of the actual values. This type of MCV list will be useful only for
estimating equality clauses, and will reduce space requirements for large
varlena types (in such cases we usually only want equality anyway).
Inspecting the MCV list
-----------------------
Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
to select the columns from pg_stats. The data is encoded as anyarrays, and
all the items have the same data type, so anyarray provides a simple way to
get a text representation.
With multivariate MCV lists the columns may use different data types, making
it impossible to use anyarrays. It might be possible to produce a similar
array-like representation, but that would complicate further processing and
analysis of the MCV list.
So instead the MCV lists are stored in a custom data type (pg_mcv_list),
which however makes it more difficult to inspect the contents. To make that
easier, there's a SRF returning detailed information about the MCV lists.
SELECT m.* FROM pg_statistic_ext,
pg_mcv_list_items(stxmcv) m WHERE stxname = 'stts2';
It accepts one parameter - a pg_mcv_list value (which can only be obtained
from pg_statistic_ext catalog, to defend against malicious input), and
returns these columns:
- item index (0, ..., (nitems-1))
- values (string array)
- nulls only (boolean array)
- frequency (double precision)
- base_frequency (double precision)

View File

@ -202,13 +202,12 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
VacAttrStats **stats, Bitmapset *attrs)
{
int i,
j;
int nvalues = numrows * k;
nitems;
MultiSortSupport mss;
SortItem *items;
Datum *values;
bool *isnull;
int *attnums;
AttrNumber *attnums;
AttrNumber *attnums_dep;
int numattrs;
/* counters valid within a group */
int group_size = 0;
@ -223,26 +222,16 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
/* sort info for all attributes columns */
mss = multi_sort_init(k);
/* data for the sort */
items = (SortItem *) palloc(numrows * sizeof(SortItem));
values = (Datum *) palloc(sizeof(Datum) * nvalues);
isnull = (bool *) palloc(sizeof(bool) * nvalues);
/* fix the pointers to values/isnull */
for (i = 0; i < numrows; i++)
{
items[i].values = &values[i * k];
items[i].isnull = &isnull[i * k];
}
/*
* Transform the bms into an array, to make accessing i-th member easier.
* Transform the attrs from bitmap to an array to make accessing the i-th
* member easier, and then construct a filtered version with only attnums
* referenced by the dependency we validate.
*/
attnums = (int *) palloc(sizeof(int) * bms_num_members(attrs));
i = 0;
j = -1;
while ((j = bms_next_member(attrs, j)) >= 0)
attnums[i++] = j;
attnums = build_attnums_array(attrs, &numattrs);
attnums_dep = (AttrNumber *) palloc(k * sizeof(AttrNumber));
for (i = 0; i < k; i++)
attnums_dep[i] = attnums[dependency[i]];
/*
* Verify the dependency (a,b,...)->z, using a rather simple algorithm:
@ -257,7 +246,7 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
* perhaps at some point it'd be worth using column-specific collations?
*/
/* prepare the sort function for the first dimension, and SortItem array */
/* prepare the sort function for the dimensions */
for (i = 0; i < k; i++)
{
VacAttrStats *colstat = stats[dependency[i]];
@ -270,19 +259,17 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
/* prepare the sort function for this dimension */
multi_sort_add_dimension(mss, i, type->lt_opr, type->typcollation);
/* accumulate all the data for both columns into an array and sort it */
for (j = 0; j < numrows; j++)
{
items[j].values[i] =
heap_getattr(rows[j], attnums[dependency[i]],
stats[i]->tupDesc, &items[j].isnull[i]);
}
}
/* sort the items so that we can detect the groups */
qsort_arg((void *) items, numrows, sizeof(SortItem),
multi_sort_compare, mss);
/*
* build an array of SortItem(s) sorted using the multi-sort support
*
* XXX This relies on all stats entries pointing to the same tuple
* descriptor. For now that assumption holds, but it might change in
* the future for example if we support statistics on multiple tables.
*/
items = build_sorted_items(numrows, &nitems, rows, stats[0]->tupDesc,
mss, k, attnums_dep);
/*
* Walk through the sorted array, split it into rows according to the
@ -295,14 +282,14 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
group_size = 1;
/* loop 1 beyond the end of the array so that we count the final group */
for (i = 1; i <= numrows; i++)
for (i = 1; i <= nitems; i++)
{
/*
* Check if the group ended, which may be either because we processed
* all the items (i==numrows), or because the i-th item is not equal
* all the items (i==nitems), or because the i-th item is not equal
* to the preceding one.
*/
if (i == numrows ||
if (i == nitems ||
multi_sort_compare_dims(0, k - 2, &items[i - 1], &items[i], mss) != 0)
{
/*
@ -324,10 +311,12 @@ dependency_degree(int numrows, HeapTuple *rows, int k, AttrNumber *dependency,
group_size++;
}
pfree(items);
pfree(values);
pfree(isnull);
if (items)
pfree(items);
pfree(mss);
pfree(attnums);
pfree(attnums_dep);
/* Compute the 'degree of validity' as (supporting/total). */
return (n_supporting_rows * 1.0 / numrows);
@ -354,24 +343,17 @@ statext_dependencies_build(int numrows, HeapTuple *rows, Bitmapset *attrs,
VacAttrStats **stats)
{
int i,
j,
k;
int numattrs;
int *attnums;
AttrNumber *attnums;
/* result */
MVDependencies *dependencies = NULL;
numattrs = bms_num_members(attrs);
/*
* Transform the bms into an array, to make accessing i-th member easier.
*/
attnums = palloc(sizeof(int) * bms_num_members(attrs));
i = 0;
j = -1;
while ((j = bms_next_member(attrs, j)) >= 0)
attnums[i++] = j;
attnums = build_attnums_array(attrs, &numattrs);
Assert(numattrs >= 2);
@ -918,9 +900,9 @@ find_strongest_dependency(StatisticExtInfo *stats, MVDependencies *dependencies,
* using functional dependency statistics, or 1.0 if no useful functional
* dependency statistic exists.
*
* 'estimatedclauses' is an output argument that gets a bit set corresponding
* to the (zero-based) list index of each clause that is included in the
* estimated selectivity.
* 'estimatedclauses' is an input/output argument that gets a bit set
* corresponding to the (zero-based) list index of each clause that is included
* in the estimated selectivity.
*
* Given equality clauses on attributes (a,b) we find the strongest dependency
* between them, i.e. either (a=>b) or (b=>a). Assuming (a=>b) is the selected
@ -955,9 +937,6 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
AttrNumber *list_attnums;
int listidx;
/* initialize output argument */
*estimatedclauses = NULL;
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
return 1.0;
@ -972,6 +951,9 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
* the attnums for each clause in a list which we'll reference later so we
* don't need to repeat the same work again. We'll also keep track of all
* attnums seen.
*
* We also skip clauses that we already estimated using different types of
* statistics (we treat them as incompatible).
*/
listidx = 0;
foreach(l, clauses)
@ -979,7 +961,8 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
Node *clause = (Node *) lfirst(l);
AttrNumber attnum;
if (dependency_is_compatible_clause(clause, rel->relid, &attnum))
if (!bms_is_member(listidx, *estimatedclauses) &&
dependency_is_compatible_clause(clause, rel->relid, &attnum))
{
list_attnums[listidx] = attnum;
clauses_attnums = bms_add_member(clauses_attnums, attnum);
@ -1049,8 +1032,7 @@ dependencies_clauselist_selectivity(PlannerInfo *root,
/*
* Skip incompatible clauses, and ones we've already estimated on.
*/
if (list_attnums[listidx] == InvalidAttrNumber ||
bms_is_member(listidx, *estimatedclauses))
if (list_attnums[listidx] == InvalidAttrNumber)
continue;
/*

View File

@ -19,9 +19,13 @@
#include "access/genam.h"
#include "access/htup_details.h"
#include "access/table.h"
#include "access/tuptoaster.h"
#include "catalog/indexing.h"
#include "catalog/pg_collation.h"
#include "catalog/pg_statistic_ext.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
#include "optimizer/optimizer.h"
#include "postmaster/autovacuum.h"
#include "statistics/extended_stats_internal.h"
#include "statistics/statistics.h"
@ -30,8 +34,19 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/selfuncs.h"
#include "utils/syscache.h"
/*
* To avoid consuming too much memory during analysis and/or too much space
* in the resulting pg_statistic rows, we ignore varlena datums that are wider
* than WIDTH_THRESHOLD (after detoasting!). This is legitimate for MCV
* and distinct-value calculations since a wide value is unlikely to be
* duplicated at all, much less be a most-common value. For the same reason,
* ignoring wide values will not affect our estimates of histogram bin
* boundaries very much.
*/
#define WIDTH_THRESHOLD 1024
/*
* Used internally to refer to an individual statistics object, i.e.,
@ -52,7 +67,7 @@ static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
int nvacatts, VacAttrStats **vacatts);
static void statext_store(Relation pg_stext, Oid relid,
MVNDistinct *ndistinct, MVDependencies *dependencies,
VacAttrStats **stats);
MCVList * mcvlist, VacAttrStats **stats);
/*
@ -86,6 +101,7 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
StatExtEntry *stat = (StatExtEntry *) lfirst(lc);
MVNDistinct *ndistinct = NULL;
MVDependencies *dependencies = NULL;
MCVList *mcv = NULL;
VacAttrStats **stats;
ListCell *lc2;
@ -123,10 +139,13 @@ BuildRelationExtStatistics(Relation onerel, double totalrows,
else if (t == STATS_EXT_DEPENDENCIES)
dependencies = statext_dependencies_build(numrows, rows,
stat->columns, stats);
else if (t == STATS_EXT_MCV)
mcv = statext_mcv_build(numrows, rows, stat->columns, stats,
totalrows);
}
/* store the statistics in the catalog */
statext_store(pg_stext, stat->statOid, ndistinct, dependencies, stats);
statext_store(pg_stext, stat->statOid, ndistinct, dependencies, mcv, stats);
}
table_close(pg_stext, RowExclusiveLock);
@ -154,6 +173,10 @@ statext_is_kind_built(HeapTuple htup, char type)
attnum = Anum_pg_statistic_ext_stxdependencies;
break;
case STATS_EXT_MCV:
attnum = Anum_pg_statistic_ext_stxmcv;
break;
default:
elog(ERROR, "unexpected statistics type requested: %d", type);
}
@ -218,7 +241,8 @@ fetch_statentries_for_relation(Relation pg_statext, Oid relid)
for (i = 0; i < ARR_DIMS(arr)[0]; i++)
{
Assert((enabled[i] == STATS_EXT_NDISTINCT) ||
(enabled[i] == STATS_EXT_DEPENDENCIES));
(enabled[i] == STATS_EXT_DEPENDENCIES) ||
(enabled[i] == STATS_EXT_MCV));
entry->types = lappend_int(entry->types, (int) enabled[i]);
}
@ -293,7 +317,7 @@ lookup_var_attr_stats(Relation rel, Bitmapset *attrs,
static void
statext_store(Relation pg_stext, Oid statOid,
MVNDistinct *ndistinct, MVDependencies *dependencies,
VacAttrStats **stats)
MCVList * mcv, VacAttrStats **stats)
{
HeapTuple stup,
oldtup;
@ -324,9 +348,18 @@ statext_store(Relation pg_stext, Oid statOid,
values[Anum_pg_statistic_ext_stxdependencies - 1] = PointerGetDatum(data);
}
if (mcv != NULL)
{
bytea *data = statext_mcv_serialize(mcv, stats);
nulls[Anum_pg_statistic_ext_stxmcv - 1] = (data == NULL);
values[Anum_pg_statistic_ext_stxmcv - 1] = PointerGetDatum(data);
}
/* always replace the value (either by bytea or NULL) */
replaces[Anum_pg_statistic_ext_stxndistinct - 1] = true;
replaces[Anum_pg_statistic_ext_stxdependencies - 1] = true;
replaces[Anum_pg_statistic_ext_stxmcv - 1] = true;
/* there should already be a pg_statistic_ext tuple */
oldtup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statOid));
@ -433,6 +466,202 @@ multi_sort_compare_dims(int start, int end,
return 0;
}
int
compare_scalars_simple(const void *a, const void *b, void *arg)
{
return compare_datums_simple(*(Datum *) a,
*(Datum *) b,
(SortSupport) arg);
}
int
compare_datums_simple(Datum a, Datum b, SortSupport ssup)
{
return ApplySortComparator(a, false, b, false, ssup);
}
/* simple counterpart to qsort_arg */
void *
bsearch_arg(const void *key, const void *base, size_t nmemb, size_t size,
int (*compar) (const void *, const void *, void *),
void *arg)
{
size_t l,
u,
idx;
const void *p;
int comparison;
l = 0;
u = nmemb;
while (l < u)
{
idx = (l + u) / 2;
p = (void *) (((const char *) base) + (idx * size));
comparison = (*compar) (key, p, arg);
if (comparison < 0)
u = idx;
else if (comparison > 0)
l = idx + 1;
else
return (void *) p;
}
return NULL;
}
/*
* build_attnums_array
* Transforms a bitmap into an array of AttrNumber values.
*
* This is used for extended statistics only, so all the attribute must be
* user-defined. That means offsetting by FirstLowInvalidHeapAttributeNumber
* is not necessary here (and when querying the bitmap).
*/
AttrNumber *
build_attnums_array(Bitmapset *attrs, int *numattrs)
{
int i,
j;
AttrNumber *attnums;
int num = bms_num_members(attrs);
if (numattrs)
*numattrs = num;
/* build attnums from the bitmapset */
attnums = (AttrNumber *) palloc(sizeof(AttrNumber) * num);
i = 0;
j = -1;
while ((j = bms_next_member(attrs, j)) >= 0)
{
/*
* Make sure the bitmap contains only user-defined attributes. As
* bitmaps can't contain negative values, this can be violated in
* two ways. Firstly, the bitmap might contain 0 as a member, and
* secondly the integer value might be larger than MaxAttrNumber.
*/
Assert(AttrNumberIsForUserDefinedAttr(j));
Assert(j <= MaxAttrNumber);
attnums[i++] = (AttrNumber) j;
/* protect against overflows */
Assert(i <= num);
}
return attnums;
}
/*
* build_sorted_items
* build a sorted array of SortItem with values from rows
*
* Note: All the memory is allocated in a single chunk, so that the caller
* can simply pfree the return value to release all of it.
*/
SortItem *
build_sorted_items(int numrows, int *nitems, HeapTuple *rows, TupleDesc tdesc,
MultiSortSupport mss, int numattrs, AttrNumber *attnums)
{
int i,
j,
len,
idx;
int nvalues = numrows * numattrs;
SortItem *items;
Datum *values;
bool *isnull;
char *ptr;
/* Compute the total amount of memory we need (both items and values). */
len = numrows * sizeof(SortItem) + nvalues * (sizeof(Datum) + sizeof(bool));
/* Allocate the memory and split it into the pieces. */
ptr = palloc0(len);
/* items to sort */
items = (SortItem *) ptr;
ptr += numrows * sizeof(SortItem);
/* values and null flags */
values = (Datum *) ptr;
ptr += nvalues * sizeof(Datum);
isnull = (bool *) ptr;
ptr += nvalues * sizeof(bool);
/* make sure we consumed the whole buffer exactly */
Assert((ptr - (char *) items) == len);
/* fix the pointers to Datum and bool arrays */
idx = 0;
for (i = 0; i < numrows; i++)
{
bool toowide = false;
items[idx].values = &values[idx * numattrs];
items[idx].isnull = &isnull[idx * numattrs];
/* load the values/null flags from sample rows */
for (j = 0; j < numattrs; j++)
{
Datum value;
bool isnull;
value = heap_getattr(rows[i], attnums[j], tdesc, &isnull);
/*
* If this is a varlena value, check if it's too wide and if yes
* then skip the whole item. Otherwise detoast the value.
*
* XXX It may happen that we've already detoasted some preceding
* values for the current item. We don't bother to cleanup those
* on the assumption that those are small (below WIDTH_THRESHOLD)
* and will be discarded at the end of analyze.
*/
if ((!isnull) &&
(TupleDescAttr(tdesc, attnums[j] - 1)->attlen == -1))
{
if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
{
toowide = true;
break;
}
value = PointerGetDatum(PG_DETOAST_DATUM(value));
}
items[idx].values[j] = value;
items[idx].isnull[j] = isnull;
}
if (toowide)
continue;
idx++;
}
/* store the actual number of items (ignoring the too-wide ones) */
*nitems = idx;
/* all items were too wide */
if (idx == 0)
{
/* everything is allocated as a single chunk */
pfree(items);
return NULL;
}
/* do the sort, using the multi-sort */
qsort_arg((void *) items, idx, sizeof(SortItem),
multi_sort_compare, mss);
return items;
}
/*
* has_stats_of_kind
* Check whether the list contains statistic of a given kind
@ -463,7 +692,7 @@ has_stats_of_kind(List *stats, char requiredkind)
* object referencing the most of the requested attributes, breaking ties
* in favor of objects with fewer keys overall.
*
* XXX if multiple statistics objects tie on both criteria, then which object
* XXX If multiple statistics objects tie on both criteria, then which object
* is chosen depends on the order that they appear in the stats list. Perhaps
* further tiebreakers are needed.
*/
@ -513,3 +742,391 @@ choose_best_statistics(List *stats, Bitmapset *attnums, char requiredkind)
return best_match;
}
/*
* statext_is_compatible_clause_internal
* Determines if the clause is compatible with MCV lists.
*
* Does the heavy lifting of actually inspecting the clauses for
* statext_is_compatible_clause. It needs to be split like this because
* of recursion. The attnums bitmap is an input/output parameter collecting
* attribute numbers from all compatible clauses (recursively).
*/
static bool
statext_is_compatible_clause_internal(Node *clause, Index relid, Bitmapset **attnums)
{
/* Look inside any binary-compatible relabeling (as in examine_variable) */
if (IsA(clause, RelabelType))
clause = (Node *) ((RelabelType *) clause)->arg;
/* plain Var references (boolean Vars or recursive checks) */
if (IsA(clause, Var))
{
Var *var = (Var *) clause;
/* Ensure var is from the correct relation */
if (var->varno != relid)
return false;
/* we also better ensure the Var is from the current level */
if (var->varlevelsup > 0)
return false;
/* Also skip system attributes (we don't allow stats on those). */
if (!AttrNumberIsForUserDefinedAttr(var->varattno))
return false;
*attnums = bms_add_member(*attnums, var->varattno);
return true;
}
/* (Var op Const) or (Const op Var) */
if (is_opclause(clause))
{
OpExpr *expr = (OpExpr *) clause;
Var *var;
bool varonleft = true;
bool ok;
/* Only expressions with two arguments are considered compatible. */
if (list_length(expr->args) != 2)
return false;
/* see if it actually has the right shape (one Var, one Const) */
ok = (NumRelids((Node *) expr) == 1) &&
(is_pseudo_constant_clause(lsecond(expr->args)) ||
(varonleft = false,
is_pseudo_constant_clause(linitial(expr->args))));
/* unsupported structure (two variables or so) */
if (!ok)
return false;
/*
* If it's not one of the supported operators ("=", "<", ">", etc.),
* just ignore the clause, as it's not compatible with MCV lists.
*
* This uses the function for estimating selectivity, not the operator
* directly (a bit awkward, but well ...).
*/
switch (get_oprrest(expr->opno))
{
case F_EQSEL:
case F_NEQSEL:
case F_SCALARLTSEL:
case F_SCALARLESEL:
case F_SCALARGTSEL:
case F_SCALARGESEL:
/* supported, will continue with inspection of the Var */
break;
default:
/* other estimators are considered unknown/unsupported */
return false;
}
var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
return statext_is_compatible_clause_internal((Node *) var, relid, attnums);
}
/* AND/OR/NOT clause */
if (is_andclause(clause) ||
is_orclause(clause) ||
is_notclause(clause))
{
/*
* AND/OR/NOT-clauses are supported if all sub-clauses are supported
*
* Perhaps we could improve this by handling mixed cases, when some of
* the clauses are supported and some are not. Selectivity for the
* supported subclauses would be computed using extended statistics,
* and the remaining clauses would be estimated using the traditional
* algorithm (product of selectivities).
*
* It however seems overly complex, and in a way we already do that
* because if we reject the whole clause as unsupported here, it will
* be eventually passed to clauselist_selectivity() which does exactly
* this (split into supported/unsupported clauses etc).
*/
BoolExpr *expr = (BoolExpr *) clause;
ListCell *lc;
foreach(lc, expr->args)
{
/*
* Had we found incompatible clause in the arguments, treat the
* whole clause as incompatible.
*/
if (!statext_is_compatible_clause_internal((Node *) lfirst(lc),
relid, attnums))
return false;
}
return true;
}
/* Var IS NULL */
if (IsA(clause, NullTest))
{
NullTest *nt = (NullTest *) clause;
/*
* Only simple (Var IS NULL) expressions supported for now. Maybe we
* could use examine_variable to fix this?
*/
if (!IsA(nt->arg, Var))
return false;
return statext_is_compatible_clause_internal((Node *) (nt->arg), relid, attnums);
}
return false;
}
/*
* statext_is_compatible_clause
* Determines if the clause is compatible with MCV lists.
*
* Currently, we only support three types of clauses:
*
* (a) OpExprs of the form (Var op Const), or (Const op Var), where the op
* is one of ("=", "<", ">", ">=", "<=")
*
* (b) (Var IS [NOT] NULL)
*
* (c) combinations using AND/OR/NOT
*
* In the future, the range of supported clauses may be expanded to more
* complex cases, for example (Var op Var).
*/
static bool
statext_is_compatible_clause(Node *clause, Index relid, Bitmapset **attnums)
{
RestrictInfo *rinfo = (RestrictInfo *) clause;
if (!IsA(rinfo, RestrictInfo))
return false;
/* Pseudoconstants are not really interesting here. */
if (rinfo->pseudoconstant)
return false;
/* clauses referencing multiple varnos are incompatible */
if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
return false;
return statext_is_compatible_clause_internal((Node *) rinfo->clause,
relid, attnums);
}
/*
* statext_mcv_clauselist_selectivity
* Estimate clauses using the best multi-column statistics.
*
* Selects the best extended (multi-column) statistic on a table (measured by
* the number of attributes extracted from the clauses and covered by it), and
* computes the selectivity for the supplied clauses.
*
* One of the main challenges with using MCV lists is how to extrapolate the
* estimate to the data not covered by the MCV list. To do that, we compute
* not only the "MCV selectivity" (selectivities for MCV items matching the
* supplied clauses), but also a couple of derived selectivities:
*
* - simple selectivity: Computed without extended statistic, i.e. as if the
* columns/clauses were independent
*
* - base selectivity: Similar to simple selectivity, but is computed using
* the extended statistic by adding up the base frequencies (that we compute
* and store for each MCV item) of matching MCV items.
*
* - total selectivity: Selectivity covered by the whole MCV list.
*
* - other selectivity: A selectivity estimate for data not covered by the MCV
* list (i.e. satisfying the clauses, but not common enough to make it into
* the MCV list)
*
* Note: While simple and base selectivities are defined in a quite similar
* way, the values are computed differently and are not therefore equal. The
* simple selectivity is computed as a product of per-clause estimates, while
* the base selectivity is computed by adding up base frequencies of matching
* items of the multi-column MCV list. So the values may differ for two main
* reasons - (a) the MCV list may not cover 100% of the data and (b) some of
* the MCV items did not match the estimated clauses.
*
* As both (a) and (b) reduce the base selectivity value, it generally holds
* that (simple_selectivity >= base_selectivity). If the MCV list covers all
* the data, the values may be equal.
*
* So, (simple_selectivity - base_selectivity) is an estimate for the part
* not covered by the MCV list, and (mcv_selectivity - base_selectivity) may
* be seen as a correction for the part covered by the MCV list. Those two
* statements are actually equivalent.
*
* Note: Due to rounding errors and minor differences in how the estimates
* are computed, the inequality may not always hold. Which is why we clamp
* the selectivities to prevent strange estimate (negative etc.).
*
* 'estimatedclauses' is an input/output parameter. We set bits for the
* 0-based 'clauses' indexes we estimate for and also skip clause items that
* already have a bit set.
*
* XXX If we were to use multiple statistics, this is where it would happen.
* We would simply repeat this on a loop on the "remaining" clauses, possibly
* using the already estimated clauses as conditions (and combining the values
* using conditional probability formula).
*/
static Selectivity
statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
JoinType jointype, SpecialJoinInfo *sjinfo,
RelOptInfo *rel, Bitmapset **estimatedclauses)
{
ListCell *l;
Bitmapset *clauses_attnums = NULL;
Bitmapset **list_attnums;
int listidx;
StatisticExtInfo *stat;
List *stat_clauses;
Selectivity simple_sel,
mcv_sel,
mcv_basesel,
mcv_totalsel,
other_sel,
sel;
/* check if there's any stats that might be useful for us. */
if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
return 1.0;
list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
list_length(clauses));
/*
* Pre-process the clauses list to extract the attnums seen in each item.
* We need to determine if there's any clauses which will be useful for
* selectivity estimations with extended stats. Along the way we'll record
* all of the attnums for each clause in a list which we'll reference later
* so we don't need to repeat the same work again. We'll also keep track of
* all attnums seen.
*
* We also skip clauses that we already estimated using different types of
* statistics (we treat them as incompatible).
*/
listidx = 0;
foreach(l, clauses)
{
Node *clause = (Node *) lfirst(l);
Bitmapset *attnums = NULL;
if (!bms_is_member(listidx, *estimatedclauses) &&
statext_is_compatible_clause(clause, rel->relid, &attnums))
{
list_attnums[listidx] = attnums;
clauses_attnums = bms_add_members(clauses_attnums, attnums);
}
else
list_attnums[listidx] = NULL;
listidx++;
}
/* We need at least two attributes for multivariate statistics. */
if (bms_membership(clauses_attnums) != BMS_MULTIPLE)
return 1.0;
/* find the best suited statistics object for these attnums */
stat = choose_best_statistics(rel->statlist, clauses_attnums, STATS_EXT_MCV);
/* if no matching stats could be found then we've nothing to do */
if (!stat)
return 1.0;
/* Ensure choose_best_statistics produced an expected stats type. */
Assert(stat->kind == STATS_EXT_MCV);
/* now filter the clauses to be estimated using the selected MCV */
stat_clauses = NIL;
listidx = 0;
foreach(l, clauses)
{
/*
* If the clause is compatible with the selected statistics, mark it
* as estimated and add it to the list to estimate.
*/
if (list_attnums[listidx] != NULL &&
bms_is_subset(list_attnums[listidx], stat->keys))
{
stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
}
listidx++;
}
/*
* First compute "simple" selectivity, i.e. without the extended statistics,
* and essentially assuming independence of the columns/clauses. We'll then
* use the various selectivities computed from MCV list to improve it.
*/
simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
jointype, sjinfo, NULL);
/*
* Now compute the multi-column estimate from the MCV list, along with the
* other selectivities (base & total selectivity).
*/
mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
jointype, sjinfo, rel,
&mcv_basesel, &mcv_totalsel);
/* Estimated selectivity of values not covered by MCV matches */
other_sel = simple_sel - mcv_basesel;
CLAMP_PROBABILITY(other_sel);
/* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
if (other_sel > 1.0 - mcv_totalsel)
other_sel = 1.0 - mcv_totalsel;
/* Overall selectivity is the combination of MCV and non-MCV estimates. */
sel = mcv_sel + other_sel;
CLAMP_PROBABILITY(sel);
return sel;
}
/*
* statext_clauselist_selectivity
* Estimate clauses using the best multi-column statistics.
*/
Selectivity
statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
JoinType jointype, SpecialJoinInfo *sjinfo,
RelOptInfo *rel, Bitmapset **estimatedclauses)
{
Selectivity sel;
/* First, try estimating clauses using a multivariate MCV list. */
sel = statext_mcv_clauselist_selectivity(root, clauses, varRelid, jointype,
sjinfo, rel, estimatedclauses);
/*
* Then, apply functional dependencies on the remaining clauses by
* calling dependencies_clauselist_selectivity. Pass 'estimatedclauses'
* so the function can properly skip clauses already estimated above.
*
* The reasoning for applying dependencies last is that the more complex
* stats can track more complex correlations between the attributes, and
* so may be considered more reliable.
*
* For example, MCV list can give us an exact selectivity for values in
* two columns, while functional dependencies can only provide information
* about the overall strength of the dependency.
*/
sel *= dependencies_clauselist_selectivity(root, clauses, varRelid,
jointype, sjinfo, rel,
estimatedclauses);
return sel;
}

1695
src/backend/statistics/mcv.c Normal file

File diff suppressed because it is too large Load Diff

View File

@ -1509,6 +1509,7 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
bool isnull;
bool ndistinct_enabled;
bool dependencies_enabled;
bool mcv_enabled;
int i;
statexttup = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(statextid));
@ -1544,6 +1545,7 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
ndistinct_enabled = false;
dependencies_enabled = false;
mcv_enabled = false;
for (i = 0; i < ARR_DIMS(arr)[0]; i++)
{
@ -1551,6 +1553,8 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
ndistinct_enabled = true;
if (enabled[i] == STATS_EXT_DEPENDENCIES)
dependencies_enabled = true;
if (enabled[i] == STATS_EXT_MCV)
mcv_enabled = true;
}
/*
@ -1560,13 +1564,27 @@ pg_get_statisticsobj_worker(Oid statextid, bool missing_ok)
* statistics types on a newer postgres version, if the statistics had all
* options enabled on the original version.
*/
if (!ndistinct_enabled || !dependencies_enabled)
if (!ndistinct_enabled || !dependencies_enabled || !mcv_enabled)
{
bool gotone = false;
appendStringInfoString(&buf, " (");
if (ndistinct_enabled)
{
appendStringInfoString(&buf, "ndistinct");
else if (dependencies_enabled)
appendStringInfoString(&buf, "dependencies");
gotone = true;
}
if (dependencies_enabled)
{
appendStringInfo(&buf, "%sdependencies", gotone ? ", " : "");
gotone = true;
}
if (mcv_enabled)
appendStringInfo(&buf, "%smcv", gotone ? ", " : "");
appendStringInfoChar(&buf, ')');
}

View File

@ -2629,7 +2629,8 @@ describeOneTableDetails(const char *schemaname,
" JOIN pg_catalog.pg_attribute a ON (stxrelid = a.attrelid AND\n"
" a.attnum = s.attnum AND NOT attisdropped)) AS columns,\n"
" 'd' = any(stxkind) AS ndist_enabled,\n"
" 'f' = any(stxkind) AS deps_enabled\n"
" 'f' = any(stxkind) AS deps_enabled,\n"
" 'm' = any(stxkind) AS mcv_enabled\n"
"FROM pg_catalog.pg_statistic_ext stat "
"WHERE stxrelid = '%s'\n"
"ORDER BY 1;",
@ -2666,6 +2667,12 @@ describeOneTableDetails(const char *schemaname,
if (strcmp(PQgetvalue(result, i, 6), "t") == 0)
{
appendPQExpBuffer(&buf, "%sdependencies", gotone ? ", " : "");
gotone = true;
}
if (strcmp(PQgetvalue(result, i, 7), "t") == 0)
{
appendPQExpBuffer(&buf, "%smcv", gotone ? ", " : "");
}
appendPQExpBuffer(&buf, ") ON %s FROM %s",

View File

@ -2432,7 +2432,7 @@ psql_completion(const char *text, int start, int end)
else if (Matches("CREATE", "STATISTICS", MatchAny))
COMPLETE_WITH("(", "ON");
else if (Matches("CREATE", "STATISTICS", MatchAny, "("))
COMPLETE_WITH("ndistinct", "dependencies");
COMPLETE_WITH("ndistinct", "dependencies", "mcv");
else if (Matches("CREATE", "STATISTICS", MatchAny, "(*)"))
COMPLETE_WITH("ON");
else if (HeadMatches("CREATE", "STATISTICS", MatchAny) &&

View File

@ -324,6 +324,12 @@
{ castsource => 'pg_dependencies', casttarget => 'text', castfunc => '0',
castcontext => 'i', castmethod => 'i' },
# pg_mcv_list can be coerced to, but not from, bytea and text
{ castsource => 'pg_mcv_list', casttarget => 'bytea', castfunc => '0',
castcontext => 'i', castmethod => 'b' },
{ castsource => 'pg_mcv_list', casttarget => 'text', castfunc => '0',
castcontext => 'i', castmethod => 'i' },
# Datetime category
{ castsource => 'date', casttarget => 'timestamp',
castfunc => 'timestamp(date)', castcontext => 'i', castmethod => 'f' },

View File

@ -4999,6 +4999,30 @@
proname => 'pg_dependencies_send', provolatile => 's', prorettype => 'bytea',
proargtypes => 'pg_dependencies', prosrc => 'pg_dependencies_send' },
{ oid => '5018', descr => 'I/O',
proname => 'pg_mcv_list_in', prorettype => 'pg_mcv_list',
proargtypes => 'cstring', prosrc => 'pg_mcv_list_in' },
{ oid => '5019', descr => 'I/O',
proname => 'pg_mcv_list_out', prorettype => 'cstring',
proargtypes => 'pg_mcv_list', prosrc => 'pg_mcv_list_out' },
{ oid => '5020', descr => 'I/O',
proname => 'pg_mcv_list_recv', provolatile => 's',
prorettype => 'pg_mcv_list', proargtypes => 'internal',
prosrc => 'pg_mcv_list_recv' },
{ oid => '5021', descr => 'I/O',
proname => 'pg_mcv_list_send', provolatile => 's', prorettype => 'bytea',
proargtypes => 'pg_mcv_list', prosrc => 'pg_mcv_list_send' },
{ oid => '3427',
descr => 'details about MCV list items',
proname => 'pg_mcv_list_items', prorows => '1000', proisstrict => 't',
proretset => 't', provolatile => 's', prorettype => 'record',
proargtypes => 'pg_mcv_list',
proallargtypes => '{pg_mcv_list,int4,text,_bool,float8,float8}',
proargmodes => '{i,o,o,o,o,o}',
proargnames => '{mcv_list,index,values,nulls,frequency,base_frequency}',
prosrc => 'pg_stats_ext_mcvlist_items' },
{ oid => '1928', descr => 'statistics: number of scans done for table/index',
proname => 'pg_stat_get_numscans', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => 'oid',

View File

@ -49,6 +49,7 @@ CATALOG(pg_statistic_ext,3381,StatisticExtRelationId)
* to build */
pg_ndistinct stxndistinct; /* ndistinct coefficients (serialized) */
pg_dependencies stxdependencies; /* dependencies (serialized) */
pg_mcv_list stxmcv; /* MCV (serialized) */
#endif
} FormData_pg_statistic_ext;
@ -64,6 +65,7 @@ typedef FormData_pg_statistic_ext *Form_pg_statistic_ext;
#define STATS_EXT_NDISTINCT 'd'
#define STATS_EXT_DEPENDENCIES 'f'
#define STATS_EXT_MCV 'm'
#endif /* EXPOSE_TO_CLIENT_CODE */

View File

@ -165,6 +165,13 @@
typoutput => 'pg_dependencies_out', typreceive => 'pg_dependencies_recv',
typsend => 'pg_dependencies_send', typalign => 'i', typstorage => 'x',
typcollation => 'default' },
{ oid => '5017', oid_symbol => 'PGMCVLISTOID',
descr => 'multivariate MCV list',
typname => 'pg_mcv_list', typlen => '-1', typbyval => 'f',
typcategory => 'S', typinput => 'pg_mcv_list_in',
typoutput => 'pg_mcv_list_out', typreceive => 'pg_mcv_list_recv',
typsend => 'pg_mcv_list_send', typalign => 'i', typstorage => 'x',
typcollation => 'default' },
{ oid => '32', oid_symbol => 'PGDDLCOMMANDOID',
descr => 'internal type for passing CollectedCommand',
typname => 'pg_ddl_command', typlen => 'SIZEOF_POINTER', typbyval => 't',

View File

@ -87,6 +87,7 @@ extern Bitmapset *bms_difference(const Bitmapset *a, const Bitmapset *b);
extern bool bms_is_subset(const Bitmapset *a, const Bitmapset *b);
extern BMS_Comparison bms_subset_compare(const Bitmapset *a, const Bitmapset *b);
extern bool bms_is_member(int x, const Bitmapset *a);
extern int bms_member_index(Bitmapset *a, int x);
extern bool bms_overlap(const Bitmapset *a, const Bitmapset *b);
extern bool bms_overlap_list(const Bitmapset *a, const struct List *b);
extern bool bms_nonempty_difference(const Bitmapset *a, const Bitmapset *b);

View File

@ -58,6 +58,12 @@ extern Selectivity clause_selectivity(PlannerInfo *root,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo);
extern Selectivity clauselist_selectivity_simple(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo,
Bitmapset *estimatedclauses);
extern Selectivity clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,

View File

@ -31,6 +31,15 @@ typedef struct
int tupno; /* position index for tuple it came from */
} ScalarItem;
/* (de)serialization info */
typedef struct DimensionInfo
{
int nvalues; /* number of deduplicated values */
int nbytes; /* number of bytes (serialized) */
int typlen; /* pg_type.typlen */
bool typbyval; /* pg_type.typbyval */
} DimensionInfo;
/* multi-sort */
typedef struct MultiSortSupportData
{
@ -44,6 +53,7 @@ typedef struct SortItem
{
Datum *values;
bool *isnull;
int count;
} SortItem;
extern MVNDistinct *statext_ndistinct_build(double totalrows,
@ -57,6 +67,12 @@ extern MVDependencies *statext_dependencies_build(int numrows, HeapTuple *rows,
extern bytea *statext_dependencies_serialize(MVDependencies *dependencies);
extern MVDependencies *statext_dependencies_deserialize(bytea *data);
extern MCVList *statext_mcv_build(int numrows, HeapTuple *rows,
Bitmapset *attrs, VacAttrStats **stats,
double totalrows);
extern bytea *statext_mcv_serialize(MCVList * mcv, VacAttrStats **stats);
extern MCVList * statext_mcv_deserialize(bytea *data);
extern MultiSortSupport multi_sort_init(int ndims);
extern void multi_sort_add_dimension(MultiSortSupport mss, int sortdim,
Oid oper, Oid collation);
@ -65,5 +81,29 @@ extern int multi_sort_compare_dim(int dim, const SortItem *a,
const SortItem *b, MultiSortSupport mss);
extern int multi_sort_compare_dims(int start, int end, const SortItem *a,
const SortItem *b, MultiSortSupport mss);
extern int compare_scalars_simple(const void *a, const void *b, void *arg);
extern int compare_datums_simple(Datum a, Datum b, SortSupport ssup);
extern void *bsearch_arg(const void *key, const void *base,
size_t nmemb, size_t size,
int (*compar) (const void *, const void *, void *),
void *arg);
extern AttrNumber *build_attnums_array(Bitmapset *attrs, int *numattrs);
extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
TupleDesc tdesc, MultiSortSupport mss,
int numattrs, AttrNumber *attnums);
extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
StatisticExtInfo *stat,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
Selectivity *basesel,
Selectivity *totalsel);
#endif /* EXTENDED_STATS_INTERNAL_H */

View File

@ -78,8 +78,41 @@ typedef struct MVDependencies
/* size of the struct excluding the deps array */
#define SizeOfDependencies (offsetof(MVDependencies, ndeps) + sizeof(uint32))
/* used to flag stats serialized to bytea */
#define STATS_MCV_MAGIC 0xE1A651C2 /* marks serialized bytea */
#define STATS_MCV_TYPE_BASIC 1 /* basic MCV list type */
/* max items in MCV list (mostly arbitrary number) */
#define STATS_MCVLIST_MAX_ITEMS 8192
/*
* Multivariate MCV (most-common value) lists
*
* A straightforward extension of MCV items - i.e. a list (array) of
* combinations of attribute values, together with a frequency and null flags.
*/
typedef struct MCVItem
{
double frequency; /* frequency of this combination */
double base_frequency; /* frequency if independent */
bool *isnull; /* NULL flags */
Datum *values; /* item values */
} MCVItem;
/* multivariate MCV list - essentally an array of MCV items */
typedef struct MCVList
{
uint32 magic; /* magic constant marker */
uint32 type; /* type of MCV list (BASIC) */
uint32 nitems; /* number of MCV items in the array */
AttrNumber ndimensions; /* number of dimensions */
Oid types[STATS_MAX_DIMENSIONS]; /* OIDs of data types */
MCVItem **items; /* array of MCV items */
} MCVList;
extern MVNDistinct *statext_ndistinct_load(Oid mvoid);
extern MVDependencies *statext_dependencies_load(Oid mvoid);
extern MCVList *statext_mcv_load(Oid mvoid);
extern void BuildRelationExtStatistics(Relation onerel, double totalrows,
int numrows, HeapTuple *rows,
@ -92,6 +125,13 @@ extern Selectivity dependencies_clauselist_selectivity(PlannerInfo *root,
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
Bitmapset **estimatedclauses);
extern Selectivity statext_clauselist_selectivity(PlannerInfo *root,
List *clauses,
int varRelid,
JoinType jointype,
SpecialJoinInfo *sjinfo,
RelOptInfo *rel,
Bitmapset **estimatedclauses);
extern bool has_stats_of_kind(List *stats, char requiredkind);
extern StatisticExtInfo *choose_best_statistics(List *stats,
Bitmapset *attnums, char requiredkind);

View File

@ -243,7 +243,7 @@ Indexes:
Check constraints:
"ctlt1_a_check" CHECK (length(a) > 2)
Statistics objects:
"public"."ctlt_all_a_b_stat" (ndistinct, dependencies) ON a, b FROM ctlt_all
"public"."ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b FROM ctlt_all
SELECT c.relname, objsubid, description FROM pg_description, pg_index i, pg_class c WHERE classoid = 'pg_class'::regclass AND objoid = i.indexrelid AND c.oid = i.indexrelid AND i.indrelid = 'ctlt_all'::regclass ORDER BY c.relname, objsubid;
relname | objsubid | description

View File

@ -892,11 +892,12 @@ WHERE c.castmethod = 'b' AND
pg_node_tree | text | 0 | i
pg_ndistinct | bytea | 0 | i
pg_dependencies | bytea | 0 | i
pg_mcv_list | bytea | 0 | i
cidr | inet | 0 | i
xml | text | 0 | a
xml | character varying | 0 | a
xml | character | 0 | a
(9 rows)
(10 rows)
-- **************** pg_conversion ****************
-- Look for illegal values in pg_conversion fields.

View File

@ -58,7 +58,7 @@ ALTER TABLE ab1 DROP COLUMN a;
b | integer | | |
c | integer | | |
Statistics objects:
"public"."ab1_b_c_stats" (ndistinct, dependencies) ON b, c FROM ab1
"public"."ab1_b_c_stats" (ndistinct, dependencies, mcv) ON b, c FROM ab1
-- Ensure statistics are dropped when table is
SELECT stxname FROM pg_statistic_ext WHERE stxname LIKE 'ab1%';
@ -211,7 +211,7 @@ SELECT stxkind, stxndistinct
FROM pg_statistic_ext WHERE stxrelid = 'ndistinct'::regclass;
stxkind | stxndistinct
---------+---------------------------------------------------------
{d,f} | {"3, 4": 301, "3, 6": 301, "4, 6": 301, "3, 4, 6": 301}
{d,f,m} | {"3, 4": 301, "3, 6": 301, "4, 6": 301, "3, 4, 6": 301}
(1 row)
-- Hash Aggregate, thanks to estimates improved by the statistic
@ -277,7 +277,7 @@ SELECT stxkind, stxndistinct
FROM pg_statistic_ext WHERE stxrelid = 'ndistinct'::regclass;
stxkind | stxndistinct
---------+-------------------------------------------------------------
{d,f} | {"3, 4": 2550, "3, 6": 800, "4, 6": 1632, "3, 4, 6": 10000}
{d,f,m} | {"3, 4": 2550, "3, 6": 800, "4, 6": 1632, "3, 4, 6": 10000}
(1 row)
-- plans using Group Aggregate, thanks to using correct esimates
@ -514,3 +514,284 @@ EXPLAIN (COSTS OFF)
(5 rows)
RESET random_page_cost;
-- check the number of estimated/actual rows in the top node
create function check_estimated_rows(text) returns table (estimated int, actual int)
language plpgsql as
$$
declare
ln text;
tmp text[];
first_row bool := true;
begin
for ln in
execute format('explain analyze %s', $1)
loop
if first_row then
first_row := false;
tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
return query select tmp[1]::int, tmp[2]::int;
end if;
end loop;
end;
$$;
-- MCV lists
CREATE TABLE mcv_lists (
filler1 TEXT,
filler2 NUMERIC,
a INT,
b VARCHAR,
filler3 DATE,
c INT,
d TEXT
);
-- random data (no MCV list)
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT mod(i,37), mod(i,41), mod(i,43), mod(i,47) FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
3 | 4
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
estimated | actual
-----------+--------
1 | 1
(1 row)
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
3 | 4
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
estimated | actual
-----------+--------
1 | 1
(1 row)
-- 100 distinct combinations, all in the MCV list
TRUNCATE mcv_lists;
DROP STATISTICS mcv_lists_stats;
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
estimated | actual
-----------+--------
1 | 50
(1 row)
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
estimated | actual
-----------+--------
50 | 50
(1 row)
-- check change of unrelated column type does not reset the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
SELECT stxmcv IS NOT NULL FROM pg_statistic_ext WHERE stxname = 'mcv_lists_stats';
?column?
----------
t
(1 row)
-- check change of column type resets the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN c TYPE numeric;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
1 | 50
(1 row)
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
estimated | actual
-----------+--------
50 | 50
(1 row)
-- 100 distinct combinations with NULL values, all in the MCV list
TRUNCATE mcv_lists;
DROP STATISTICS mcv_lists_stats;
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT
(CASE WHEN mod(i,100) = 1 THEN NULL ELSE mod(i,100) END),
(CASE WHEN mod(i,50) = 1 THEN NULL ELSE mod(i,50) END),
(CASE WHEN mod(i,25) = 1 THEN NULL ELSE mod(i,25) END),
i
FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL');
estimated | actual
-----------+--------
1 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
estimated | actual
-----------+--------
1 | 50
(1 row)
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL');
estimated | actual
-----------+--------
50 | 50
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
estimated | actual
-----------+--------
50 | 50
(1 row)
RESET random_page_cost;
-- mcv with arrays
CREATE TABLE mcv_lists_arrays (
a TEXT[],
b NUMERIC[],
c INT[]
);
INSERT INTO mcv_lists_arrays (a, b, c)
SELECT
ARRAY[md5((i/100)::text), md5((i/100-1)::text), md5((i/100+1)::text)],
ARRAY[(i/100-1)::numeric/1000, (i/100)::numeric/1000, (i/100+1)::numeric/1000],
ARRAY[(i/100-1), i/100, (i/100+1)]
FROM generate_series(1,5000) s(i);
CREATE STATISTICS mcv_lists_arrays_stats (mcv) ON a, b, c
FROM mcv_lists_arrays;
ANALYZE mcv_lists_arrays;
-- mcv with bool
CREATE TABLE mcv_lists_bool (
a BOOL,
b BOOL,
c BOOL
);
INSERT INTO mcv_lists_bool (a, b, c)
SELECT
(mod(i,2) = 0), (mod(i,4) = 0), (mod(i,8) = 0)
FROM generate_series(1,10000) s(i);
ANALYZE mcv_lists_bool;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE a AND b AND c');
estimated | actual
-----------+--------
156 | 1250
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND c');
estimated | actual
-----------+--------
156 | 0
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND NOT b AND c');
estimated | actual
-----------+--------
469 | 0
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
estimated | actual
-----------+--------
1094 | 0
(1 row)
CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
FROM mcv_lists_bool;
ANALYZE mcv_lists_bool;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE a AND b AND c');
estimated | actual
-----------+--------
1250 | 1250
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND c');
estimated | actual
-----------+--------
1 | 0
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND NOT b AND c');
estimated | actual
-----------+--------
1 | 0
(1 row)
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
estimated | actual
-----------+--------
1 | 0
(1 row)

View File

@ -72,7 +72,8 @@ WHERE p1.typtype not in ('c','d','p') AND p1.typname NOT LIKE E'\\_%'
194 | pg_node_tree
3361 | pg_ndistinct
3402 | pg_dependencies
(3 rows)
5017 | pg_mcv_list
(4 rows)
-- Make sure typarray points to a varlena array type of our own base
SELECT p1.oid, p1.typname as basetype, p2.typname as arraytype,

View File

@ -280,3 +280,189 @@ EXPLAIN (COSTS OFF)
SELECT * FROM functional_dependencies WHERE a = 1 AND b = '1' AND c = 1;
RESET random_page_cost;
-- check the number of estimated/actual rows in the top node
create function check_estimated_rows(text) returns table (estimated int, actual int)
language plpgsql as
$$
declare
ln text;
tmp text[];
first_row bool := true;
begin
for ln in
execute format('explain analyze %s', $1)
loop
if first_row then
first_row := false;
tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
return query select tmp[1]::int, tmp[2]::int;
end if;
end loop;
end;
$$;
-- MCV lists
CREATE TABLE mcv_lists (
filler1 TEXT,
filler2 NUMERIC,
a INT,
b VARCHAR,
filler3 DATE,
c INT,
d TEXT
);
-- random data (no MCV list)
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT mod(i,37), mod(i,41), mod(i,43), mod(i,47) FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
-- 100 distinct combinations, all in the MCV list
TRUNCATE mcv_lists;
DROP STATISTICS mcv_lists_stats;
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT mod(i,100), mod(i,50), mod(i,25), i FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 1 AND b < ''1''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 0 AND b <= ''0''');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'' AND c = 1');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < 5 AND b < ''1'' AND c < 5');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <= ''0'' AND c <= 4');
-- check change of unrelated column type does not reset the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64);
SELECT stxmcv IS NOT NULL FROM pg_statistic_ext WHERE stxname = 'mcv_lists_stats';
-- check change of column type resets the MCV statistics
ALTER TABLE mcv_lists ALTER COLUMN c TYPE numeric;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1''');
-- 100 distinct combinations with NULL values, all in the MCV list
TRUNCATE mcv_lists;
DROP STATISTICS mcv_lists_stats;
INSERT INTO mcv_lists (a, b, c, filler1)
SELECT
(CASE WHEN mod(i,100) = 1 THEN NULL ELSE mod(i,100) END),
(CASE WHEN mod(i,50) = 1 THEN NULL ELSE mod(i,50) END),
(CASE WHEN mod(i,25) = 1 THEN NULL ELSE mod(i,25) END),
i
FROM generate_series(1,5000) s(i);
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
-- create statistics
CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
ANALYZE mcv_lists;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IS NULL AND b IS NULL AND c IS NULL');
RESET random_page_cost;
-- mcv with arrays
CREATE TABLE mcv_lists_arrays (
a TEXT[],
b NUMERIC[],
c INT[]
);
INSERT INTO mcv_lists_arrays (a, b, c)
SELECT
ARRAY[md5((i/100)::text), md5((i/100-1)::text), md5((i/100+1)::text)],
ARRAY[(i/100-1)::numeric/1000, (i/100)::numeric/1000, (i/100+1)::numeric/1000],
ARRAY[(i/100-1), i/100, (i/100+1)]
FROM generate_series(1,5000) s(i);
CREATE STATISTICS mcv_lists_arrays_stats (mcv) ON a, b, c
FROM mcv_lists_arrays;
ANALYZE mcv_lists_arrays;
-- mcv with bool
CREATE TABLE mcv_lists_bool (
a BOOL,
b BOOL,
c BOOL
);
INSERT INTO mcv_lists_bool (a, b, c)
SELECT
(mod(i,2) = 0), (mod(i,4) = 0), (mod(i,8) = 0)
FROM generate_series(1,10000) s(i);
ANALYZE mcv_lists_bool;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE a AND b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND NOT b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
CREATE STATISTICS mcv_lists_bool_stats (mcv) ON a, b, c
FROM mcv_lists_bool;
ANALYZE mcv_lists_bool;
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE a AND b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND NOT b AND c');
SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');