324 lines
12 KiB
Plaintext
324 lines
12 KiB
Plaintext
<!--
|
|
doc/src/sgml/ref/create_statistics.sgml
|
|
PostgreSQL documentation
|
|
-->
|
|
|
|
<refentry id="sql-createstatistics">
|
|
<indexterm zone="sql-createstatistics">
|
|
<primary>CREATE STATISTICS</primary>
|
|
</indexterm>
|
|
|
|
<refmeta>
|
|
<refentrytitle>CREATE STATISTICS</refentrytitle>
|
|
<manvolnum>7</manvolnum>
|
|
<refmiscinfo>SQL - Language Statements</refmiscinfo>
|
|
</refmeta>
|
|
|
|
<refnamediv>
|
|
<refname>CREATE STATISTICS</refname>
|
|
<refpurpose>define extended statistics</refpurpose>
|
|
</refnamediv>
|
|
|
|
<refsynopsisdiv>
|
|
<synopsis>
|
|
CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
|
|
ON ( <replaceable class="parameter">expression</replaceable> )
|
|
FROM <replaceable class="parameter">table_name</replaceable>
|
|
|
|
CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
|
|
[ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
|
|
ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
|
|
FROM <replaceable class="parameter">table_name</replaceable>
|
|
</synopsis>
|
|
|
|
</refsynopsisdiv>
|
|
|
|
<refsect1 id="sql-createstatistics-description">
|
|
<title>Description</title>
|
|
|
|
<para>
|
|
<command>CREATE STATISTICS</command> will create a new extended statistics
|
|
object tracking data about the specified table, foreign table or
|
|
materialized view. The statistics object will be created in the current
|
|
database and will be owned by the user issuing the command.
|
|
</para>
|
|
|
|
<para>
|
|
The <command>CREATE STATISTICS</command> command has two basic forms. The
|
|
first form allows univariate statistics for a single expression to be
|
|
collected, providing benefits similar to an expression index without the
|
|
overhead of index maintenance. This form does not allow the statistics
|
|
kind to be specified, since the various statistics kinds refer only to
|
|
multivariate statistics. The second form of the command allows
|
|
multivariate statistics on multiple columns and/or expressions to be
|
|
collected, optionally specifying which statistics kinds to include. This
|
|
form will also automatically cause univariate statistics to be collected on
|
|
any expressions included in the list.
|
|
</para>
|
|
|
|
<para>
|
|
If a schema name is given (for example, <literal>CREATE STATISTICS
|
|
myschema.mystat ...</literal>) then the statistics object is created in the
|
|
specified schema. Otherwise it is created in the current schema.
|
|
The name of the statistics object must be distinct from the name of any
|
|
other statistics object in the same schema.
|
|
</para>
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>Parameters</title>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term><literal>IF NOT EXISTS</literal></term>
|
|
<listitem>
|
|
<para>
|
|
Do not throw an error if a statistics object with the same name already
|
|
exists. A notice is issued in this case. Note that only the name of
|
|
the statistics object is considered here, not the details of its
|
|
definition.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><replaceable class="parameter">statistics_name</replaceable></term>
|
|
<listitem>
|
|
<para>
|
|
The name (optionally schema-qualified) of the statistics object to be
|
|
created.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><replaceable class="parameter">statistics_kind</replaceable></term>
|
|
<listitem>
|
|
<para>
|
|
A multivariate statistics kind to be computed in this statistics object.
|
|
Currently supported kinds are
|
|
<literal>ndistinct</literal>, which enables n-distinct statistics,
|
|
<literal>dependencies</literal>, which enables functional
|
|
dependency statistics, and <literal>mcv</literal> which enables
|
|
most-common values lists.
|
|
If this clause is omitted, all supported statistics kinds are
|
|
included in the statistics object. Univariate expression statistics are
|
|
built automatically if the statistics definition includes any complex
|
|
expressions rather than just simple column references.
|
|
For more information, see <xref linkend="planner-stats-extended"/>
|
|
and <xref linkend="multivariate-statistics-examples"/>.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><replaceable class="parameter">column_name</replaceable></term>
|
|
<listitem>
|
|
<para>
|
|
The name of a table column to be covered by the computed statistics.
|
|
This is only allowed when building multivariate statistics. At least
|
|
two column names or expressions must be specified, and their order is
|
|
not significant.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><replaceable class="parameter">expression</replaceable></term>
|
|
<listitem>
|
|
<para>
|
|
An expression to be covered by the computed statistics. This may be
|
|
used to build univariate statistics on a single expression, or as part
|
|
of a list of multiple column names and/or expressions to build
|
|
multivariate statistics. In the latter case, separate univariate
|
|
statistics are built automatically for each expression in the list.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term><replaceable class="parameter">table_name</replaceable></term>
|
|
<listitem>
|
|
<para>
|
|
The name (optionally schema-qualified) of the table containing the
|
|
column(s) the statistics are computed on.
|
|
</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>Notes</title>
|
|
|
|
<para>
|
|
You must be the owner of a table to create a statistics object
|
|
reading it. Once created, however, the ownership of the statistics
|
|
object is independent of the underlying table(s).
|
|
</para>
|
|
|
|
<para>
|
|
Expression statistics are per-expression and are similar to creating an
|
|
index on the expression, except that they avoid the overhead of index
|
|
maintenance. Expression statistics are built automatically for each
|
|
expression in the statistics object definition.
|
|
</para>
|
|
</refsect1>
|
|
|
|
<refsect1 id="sql-createstatistics-examples">
|
|
<title>Examples</title>
|
|
|
|
<para>
|
|
Create table <structname>t1</structname> with two functionally dependent columns, i.e.,
|
|
knowledge of a value in the first column is sufficient for determining the
|
|
value in the other column. Then functional dependency statistics are built
|
|
on those columns:
|
|
|
|
<programlisting>
|
|
CREATE TABLE t1 (
|
|
a int,
|
|
b int
|
|
);
|
|
|
|
INSERT INTO t1 SELECT i/100, i/500
|
|
FROM generate_series(1,1000000) s(i);
|
|
|
|
ANALYZE t1;
|
|
|
|
-- the number of matching rows will be drastically underestimated:
|
|
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
|
|
|
|
CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
|
|
|
|
ANALYZE t1;
|
|
|
|
-- now the row count estimate is more accurate:
|
|
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
|
|
</programlisting>
|
|
|
|
Without functional-dependency statistics, the planner would assume
|
|
that the two <literal>WHERE</literal> conditions are independent, and would
|
|
multiply their selectivities together to arrive at a much-too-small
|
|
row count estimate.
|
|
With such statistics, the planner recognizes that the <literal>WHERE</literal>
|
|
conditions are redundant and does not underestimate the row count.
|
|
</para>
|
|
|
|
<para>
|
|
Create table <structname>t2</structname> with two perfectly correlated columns
|
|
(containing identical data), and an MCV list on those columns:
|
|
|
|
<programlisting>
|
|
CREATE TABLE t2 (
|
|
a int,
|
|
b int
|
|
);
|
|
|
|
INSERT INTO t2 SELECT mod(i,100), mod(i,100)
|
|
FROM generate_series(1,1000000) s(i);
|
|
|
|
CREATE STATISTICS s2 (mcv) ON a, b FROM t2;
|
|
|
|
ANALYZE t2;
|
|
|
|
-- valid combination (found in MCV)
|
|
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
|
|
|
|
-- invalid combination (not found in MCV)
|
|
EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
|
|
</programlisting>
|
|
|
|
The MCV list gives the planner more detailed information about the
|
|
specific values that commonly appear in the table, as well as an upper
|
|
bound on the selectivities of combinations of values that do not appear
|
|
in the table, allowing it to generate better estimates in both cases.
|
|
</para>
|
|
|
|
<para>
|
|
Create table <structname>t3</structname> with a single timestamp column,
|
|
and run queries using expressions on that column. Without extended
|
|
statistics, the planner has no information about the data distribution for
|
|
the expressions, and uses default estimates. The planner also does not
|
|
realize that the value of the date truncated to the month is fully
|
|
determined by the value of the date truncated to the day. Then expression
|
|
and ndistinct statistics are built on those two expressions:
|
|
|
|
<programlisting>
|
|
CREATE TABLE t3 (
|
|
a timestamp
|
|
);
|
|
|
|
INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
|
|
'2020-12-31'::timestamp,
|
|
'1 minute'::interval) s(i);
|
|
|
|
ANALYZE t3;
|
|
|
|
-- the number of matching rows will be drastically underestimated:
|
|
EXPLAIN ANALYZE SELECT * FROM t3
|
|
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
|
|
|
|
EXPLAIN ANALYZE SELECT * FROM t3
|
|
WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
|
|
AND '2020-06-30'::timestamp;
|
|
|
|
EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
|
|
FROM t3 GROUP BY 1, 2;
|
|
|
|
-- build ndistinct statistics on the pair of expressions (per-expression
|
|
-- statistics are built automatically)
|
|
CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
|
|
|
|
ANALYZE t3;
|
|
|
|
-- now the row count estimates are more accurate:
|
|
EXPLAIN ANALYZE SELECT * FROM t3
|
|
WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
|
|
|
|
EXPLAIN ANALYZE SELECT * FROM t3
|
|
WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
|
|
AND '2020-06-30'::timestamp;
|
|
|
|
EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
|
|
FROM t3 GROUP BY 1, 2;
|
|
</programlisting>
|
|
|
|
Without expression and ndistinct statistics, the planner has no information
|
|
about the number of distinct values for the expressions, and has to rely
|
|
on default estimates. The equality and range conditions are assumed to have
|
|
0.5% selectivity, and the number of distinct values in the expression is
|
|
assumed to be the same as for the column (i.e. unique). This results in a
|
|
significant underestimate of the row count in the first two queries. Moreover,
|
|
the planner has no information about the relationship between the expressions,
|
|
so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
|
|
conditions are independent, and multiplies their selectivities together to
|
|
arrive at a severe overestimate of the group count in the aggregate query.
|
|
This is further exacerbated by the lack of accurate statistics for the
|
|
expressions, forcing the planner to use a default ndistinct estimate for the
|
|
expression derived from ndistinct for the column. With such statistics, the
|
|
planner recognizes that the conditions are correlated, and arrives at much
|
|
more accurate estimates.
|
|
</para>
|
|
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>Compatibility</title>
|
|
|
|
<para>
|
|
There is no <command>CREATE STATISTICS</command> command in the SQL standard.
|
|
</para>
|
|
</refsect1>
|
|
|
|
<refsect1>
|
|
<title>See Also</title>
|
|
|
|
<simplelist type="inline">
|
|
<member><xref linkend="sql-alterstatistics"/></member>
|
|
<member><xref linkend="sql-dropstatistics"/></member>
|
|
</simplelist>
|
|
</refsect1>
|
|
</refentry>
|