<!-- doc/src/sgml/planstats.sgml -->
<chapter id="planner-stats-details">
<title>How the Planner Uses Statistics</title>
<para>
This chapter builds on the material covered in <xref
linkend="using-explain"> and <xref linkend="planner-stats"> to show some
additional details about how the planner uses the
system statistics to estimate the number of rows each part of a query might
return. This is a significant part of the planning process,
providing much of the raw material for cost calculation.
</para>
<para>
The intent of this chapter is not to document the code in detail,
but to present an overview of how it works.
This will perhaps ease the learning curve for someone who subsequently
wishes to read the code.
</para>
<sect1 id="row-estimation-examples">
<title>Row Estimation Examples</title>
<indexterm zone="row-estimation-examples">
<primary>row estimation</primary>
<secondary>planner</secondary>
</indexterm>
<para>
The examples shown below use tables in the <productname>PostgreSQL</>
regression test database.
The outputs shown are taken from version 8.3.
The behavior of earlier (or later) versions might vary.
Note also that since <command>ANALYZE</> uses random sampling
while producing statistics, the results will change slightly after
any new <command>ANALYZE</>.
</para>
<para>
Let's start with a very simple query:
<programlisting>
EXPLAIN SELECT * FROM tenk1;
QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
</programlisting>
How the planner determines the cardinality of <structname>tenk1</structname>
is covered in <xref linkend="planner-stats">, but is repeated here for
completeness. The number of pages and rows is looked up in
<structname>pg_class</structname>:
<programlisting>
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
relpages | reltuples
----------+-----------
358 | 10000
</programlisting>
These numbers are current as of the last <command>VACUUM</> or
<command>ANALYZE</> on the table. The planner then fetches the
actual current number of pages in the table (this is a cheap operation,
not requiring a table scan). If that is different from
<structfield>relpages</structfield> then
<structfield>reltuples</structfield> is scaled accordingly to
arrive at a current number-of-rows estimate. In the example above, the value of
<structfield>relpages</structfield> is up-to-date so the rows estimate is
the same as <structfield>reltuples</structfield>.
</para>
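<para>
If the table has grown or shrunk since the last <command>VACUUM</> or
<command>ANALYZE</>, the same scaling can be reproduced by hand. The query
below is only an illustrative sketch of that arithmetic (it assumes a
nonzero <structfield>relpages</structfield> and obtains the current page
count from <function>pg_relation_size</function>), not the planner's actual
code path:
<programlisting>
-- current pages  = relation size in bytes / block size
-- estimated rows = reltuples * current pages / relpages
SELECT c.reltuples
       * (pg_relation_size(c.oid::regclass) / current_setting('block_size')::int)
       / c.relpages AS estimated_rows
FROM pg_class c
WHERE c.relname = 'tenk1';
</programlisting>
</para>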
<para>
Let's move on to an example with a range condition in its
<literal>WHERE</literal> clause:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>
The planner examines the <literal>WHERE</literal> clause condition
and looks up the selectivity function for the operator
<literal>&lt;</literal> in <structname>pg_operator</structname>.
This is held in the column <structfield>oprrest</structfield>,
and the entry in this case is <function>scalarltsel</function>.
The <function>scalarltsel</function> function retrieves the histogram for
<structfield>unique1</structfield> from
<structname>pg_statistic</structname>. For manual queries it is more
convenient to look in the simpler <structname>pg_stats</structname>
view:
<programlisting>
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='unique1';
histogram_bounds
------------------------------------------------------
{0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
</programlisting>
Next the fraction of the histogram occupied by <quote>&lt; 1000</quote>
is worked out. This is the selectivity. The histogram divides the range
into equal frequency buckets, so all we have to do is locate the bucket
that our value is in and count <emphasis>part</emphasis> of it and
<emphasis>all</emphasis> of the ones before. The value 1000 is clearly in
the second bucket (993-1997). Assuming a linear distribution of
values inside each bucket, we can calculate the selectivity as:
<programlisting>
selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
            = (1 + (1000 - 993)/(1997 - 993))/10
            = 0.100697
</programlisting>
that is, one whole bucket plus a linear fraction of the second, divided by
the number of buckets. The estimated number of rows can now be calculated as
the product of the selectivity and the cardinality of
<structname>tenk1</structname>:
<programlisting>
rows = rel_cardinality * selectivity
     = 10000 * 0.100697
     = 1007 (rounding off)
</programlisting>
</para>
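<para>
The same arithmetic can be checked directly with a query; this is simply
the calculation above written in SQL, with the bucket boundaries copied
from the histogram shown earlier:
<programlisting>
SELECT (1 + (1000 - 993) / (1997 - 993.0)) / 10 AS selectivity,
       round(10000 * (1 + (1000 - 993) / (1997 - 993.0)) / 10) AS rows;
</programlisting>
</para>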
<para>
Next let's consider an example with an equality condition in its
<literal>WHERE</literal> clause:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
   Filter: (stringu1 = 'CRAAAA'::name)
</programlisting>
Again the planner examines the <literal>WHERE</literal> clause condition
and looks up the selectivity function for <literal>=</literal>, which is
<function>eqsel</function>. For equality estimation the histogram is
not useful; instead the list of <firstterm>most
common values</> (<acronym>MCV</acronym>s) is used to determine the
selectivity. Let's have a look at the MCVs, with some additional columns
that will be useful later:
<programlisting>
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';
null_frac         | 0
n_distinct        | 676
most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
</programlisting>
Since <literal>CRAAAA</> appears in the list of MCVs, the selectivity is
merely the corresponding entry in the list of most common frequencies
(<acronym>MCF</acronym>s):
<programlisting>
selectivity = mcf[3]
            = 0.003
</programlisting>
As before, the estimated number of rows is just the product of this with the
cardinality of <structname>tenk1</structname>:
<programlisting>
rows = 10000 * 0.003
     = 30
</programlisting>
</para>
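<para>
In effect, <function>eqsel</function> looks the comparison constant up in
the MCV list and returns the matching frequency. A hand-written equivalent
of that lookup might look like the query below; it is only an illustration
(it relies on casting the <structname>pg_stats</structname> arrays through
text, since <structfield>most_common_vals</structfield> is of type
<literal>anyarray</literal>), not the planner's code:
<programlisting>
SELECT most_common_freqs[
         array_position(most_common_vals::text::text[], 'CRAAAA')
       ] AS selectivity
FROM pg_stats
WHERE tablename = 'tenk1' AND attname = 'stringu1';
</programlisting>
</para>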
<para>
Now consider the same query, but with a constant that is not in the
<acronym>MCV</acronym> list:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
   Filter: (stringu1 = 'xxx'::name)
</programlisting>
This is quite a different problem: how to estimate the selectivity when the
value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the
<acronym>MCV</acronym>s:
<programlisting>
selectivity = (1 - sum(mcf))/(num_distinct - num_mcv)
            = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
                    0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
            = 0.0014559
</programlisting>
That is, add up all the frequencies for the <acronym>MCV</acronym>s and
subtract them from one, then
divide by the number of <emphasis>other</emphasis> distinct values.
This amounts to assuming that the fraction of the column that is not any
of the MCVs is evenly distributed among all the other distinct values.
Notice that there are no null values so we don't have to worry about those
(otherwise we'd subtract the null fraction from the numerator as well).
The estimated number of rows is then calculated as usual:
<programlisting>
rows = 10000 * 0.0014559
     = 15 (rounding off)
</programlisting>
</para>
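<para>
This estimate, too, can be derived by hand from
<structname>pg_stats</structname>. The query below is just the formula
above written in SQL (it also subtracts
<structfield>null_frac</structfield>, which happens to be zero here); it
is a sketch for illustration, not the planner's code:
<programlisting>
SELECT (1 - null_frac - (SELECT sum(f) FROM unnest(most_common_freqs) AS f))
       / (n_distinct - array_length(most_common_freqs, 1)) AS selectivity
FROM pg_stats
WHERE tablename = 'tenk1' AND attname = 'stringu1';
</programlisting>
</para>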
<para>
The previous example with <literal>unique1 &lt; 1000</> was an
oversimplification of what <function>scalarltsel</function> really does;
now that we have seen an example of the use of MCVs, we can fill in some
more detail. The example was correct as far as it went, because since
<structfield>unique1</> is a unique column it has no MCVs (obviously, no
value is any more common than any other value). For a non-unique
column, there will normally be both a histogram and an MCV list, and
<emphasis>the histogram does not include the portion of the column
population represented by the MCVs</>. We do things this way because
it allows more precise estimation. In this situation
<function>scalarltsel</function> directly applies the condition (e.g.,
<quote>&lt; 1000</>) to each value of the MCV list, and adds up the
frequencies of the MCVs for which the condition is true. This gives
an exact estimate of the selectivity within the portion of the table
that is MCVs. The histogram is then used in the same way as above
to estimate the selectivity in the portion of the table that is not
MCVs, and then the two numbers are combined to estimate the overall
selectivity. For example, consider
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';
QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
   Filter: (stringu1 &lt; 'IAAAAA'::name)
</programlisting>
We already saw the MCV information for <structfield>stringu1</>,
and here is its histogram:
<programlisting>
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';
histogram_bounds
--------------------------------------------------------------------------------
{AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
</programlisting>
Checking the MCV list, we find that the condition <literal>stringu1 &lt;
'IAAAAA'</> is satisfied by the first six entries and not the last four,
so the selectivity within the MCV part of the population is
<programlisting>
selectivity = sum(relevant mcfs)
            = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
            = 0.01833333
</programlisting>
Summing all the MCFs also tells us that the total fraction of the
population represented by MCVs is 0.03033333, and therefore the
fraction represented by the histogram is 0.96966667 (again, there
are no nulls, else we'd have to exclude them here). We can see
that the value <literal>IAAAAA</> falls nearly at the end of the
third histogram bucket. Using some rather cheesy assumptions
about the frequency of different characters, the planner arrives
at the estimate 0.298387 for the portion of the histogram population
that is less than <literal>IAAAAA</>. We then combine the estimates
for the MCV and non-MCV populations:
<programlisting>
selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
            = 0.01833333 + 0.298387 * 0.96966667
            = 0.307669

rows = 10000 * 0.307669
     = 3077 (rounding off)
</programlisting>
In this particular example, the correction from the MCV list is fairly
small, because the column distribution is actually quite flat (the
statistics showing these particular values as being more common than
others are mostly due to sampling error). In a more typical case where
some values are significantly more common than others, this complicated
process gives a useful improvement in accuracy because the selectivity
for the most common values is found exactly.
</para>
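<para>
The MCV part of this computation can be reproduced by hand as well. The
query below is only a sketch: it unnests the MCV arrays (again casting the
<literal>anyarray</literal> column through text) and sums the frequencies
of the entries satisfying the condition, alongside the total MCV fraction.
Note that it compares the values as text, which matches the ordering of
these particular values:
<programlisting>
SELECT sum(freq) FILTER (WHERE val &lt; 'IAAAAA') AS mcv_selectivity,
       sum(freq) AS mcv_fraction
FROM pg_stats,
     unnest(most_common_vals::text::text[], most_common_freqs) AS m(val, freq)
WHERE tablename = 'tenk1' AND attname = 'stringu1';
</programlisting>
</para>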
<para>
Now let's consider a case with more than one
condition in the <literal>WHERE</literal> clause:
<programlisting>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</programlisting>
The planner assumes that the two conditions are independent, so that
the individual selectivities of the clauses can be multiplied together:
<programlisting>
selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
            = 0.100697 * 0.0014559
            = 0.0001466

rows = 10000 * 0.0001466
     = 1 (rounding off)
</programlisting>
Notice that the number of rows estimated to be returned from the bitmap
index scan reflects only the condition used with the index; this is
important since it affects the cost estimate for the subsequent heap
fetches.
</para>
<para>
Finally we will examine a query that involves a join:
<programlisting>
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=4.64..456.23 rows=50 width=488)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
         Recheck Cond: (unique1 &lt; 50)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
               Index Cond: (unique1 &lt; 50)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
</programlisting>
The restriction on <structname>tenk1</structname>,
<literal>unique1 &lt; 50</literal>,
is evaluated before the nested-loop join.
This is handled analogously to the previous range example. This time the
value 50 falls into the first bucket of the
<structfield>unique1</structfield> histogram:
<programlisting>
selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
            = (0 + (50 - 0)/(993 - 0))/10
            = 0.005035

rows = 10000 * 0.005035
     = 50 (rounding off)
</programlisting>
The restriction for the join is <literal>t2.unique2 = t1.unique2</>.
The operator is just
our familiar <literal>=</literal>, however the selectivity function is
obtained from the <structfield>oprjoin</structfield> column of
<structname>pg_operator</structname>, and is <function>eqjoinsel</function>.
<function>eqjoinsel</function> looks up the statistical information for both
<structname>tenk2</structname> and <structname>tenk1</structname>:
<programlisting>
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
tenk1 | 0 | -1 |
tenk2 | 0 | -1 |
</programlisting>
In this case there is no <acronym>MCV</acronym> information for
<structfield>unique2</structfield> because all the values appear to be
unique, so we use an algorithm that relies only on the number of
distinct values for both relations together with their null fractions:
<programlisting>
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
</programlisting>
That is, subtract the null fraction from one for each of the relations,
and divide by the maximum of the numbers of distinct values.
The number of rows
that the join is likely to emit is calculated as the cardinality of the
Cartesian product of the two inputs, multiplied by the
selectivity:
<programlisting>
rows = (outer_cardinality * inner_cardinality) * selectivity
     = (50 * 10000) * 0.0001
     = 50
</programlisting>
</para>
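<para>
In the <structname>pg_stats</structname> output above,
<structfield>n_distinct</structfield> appears as <literal>-1</literal>:
negative values represent a fraction of the table's row count rather than
an absolute count, which is how the planner arrives at 10000 distinct
values here. As an illustration only (assuming the statistics shown above),
the negative form can be resolved like this:
<programlisting>
SELECT s.tablename,
       CASE WHEN s.n_distinct &lt; 0
            THEN -s.n_distinct * c.reltuples
            ELSE s.n_distinct
       END AS distinct_values
FROM pg_stats s
     JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename IN ('tenk1', 'tenk2') AND s.attname = 'unique2';
</programlisting>
</para>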
<para>
Had there been MCV lists for the two columns,
<function>eqjoinsel</function> would have used direct comparison of the MCV
lists to determine the join selectivity within the part of the column
populations represented by the MCVs. The estimate for the remainder of the
populations follows the same approach shown here.
</para>
<para>
Notice that we showed <literal>inner_cardinality</> as 10000, that is,
the unmodified size of <structname>tenk2</>. It might appear from
inspection of the <command>EXPLAIN</> output that the estimate of
join rows comes from 50 * 1, that is, the number of outer rows times
the estimated number of rows obtained by each inner index scan on
<structname>tenk2</>. But this is not the case: the join relation size
is estimated before any particular join plan has been considered. If
everything is working well then the two ways of estimating the join
size will produce about the same answer, but due to round-off error and
other factors they sometimes diverge significantly.
</para>
<para>
For those interested in further details, estimation of the size of
a table (before any <literal>WHERE</> clauses) is done in
<filename>src/backend/optimizer/util/plancat.c</filename>. The generic
logic for clause selectivities is in
<filename>src/backend/optimizer/path/clausesel.c</filename>. The
operator-specific selectivity functions are mostly found
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
</para>
<sect2 id="functional-dependencies">
<title>Functional Dependencies</title>
<para>
The simplest type of extended statistics tracks functional dependencies,
a concept used in definitions of database normal forms. In simplified terms,
saying that <literal>b</> is functionally dependent on <literal>a</> means
that knowledge of the value of <literal>a</> is sufficient to determine the
value of <literal>b</>.
</para>
<para>
In normalized databases, only functional dependencies on primary keys
and superkeys are allowed. However, in practice, many data sets are not
fully normalized, for example, due to intentional denormalization for
performance reasons.
</para>
<para>
The existence of functional dependencies directly affects the accuracy of
estimates. When a query contains conditions on both the determining and the
dependent column(s), the conditions on the dependent column(s) do not further
restrict the result set; but the planner, unaware of the dependency, treats
the conditions as independent and multiplies their selectivities, resulting
in underestimates.
</para>
<para>
To inform the planner about the functional dependencies, we collect
measurements of dependency during <command>ANALYZE</>. Assessing
dependency between all sets of columns would be prohibitively
expensive, so we limit our search to potential dependencies defined
using the <command>CREATE STATISTICS</> command.
<programlisting>
CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
ANALYZE t;
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning time: 0.367 ms
 Execution time: 3.380 ms
(5 rows)
</programlisting>
The planner is now aware of the functional dependencies and considers
them when computing the selectivity of the second condition. Running
the query without the statistics would lead to quite different estimates.
<programlisting>
DROP STATISTICS s1;
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
 Planning time: 0.000 ms
 Execution time: 6.379 ms
(5 rows)
</programlisting>
</para>
<para>
If no dependency exists, the collected statistics do not influence the
query plan. The only effect is to slow down <command>ANALYZE</>. Should
partial dependencies exist, these will also be stored and applied
during planning.
</para>
<para>
Similarly to per-column statistics, extended statistics are stored in
a system catalog called <structname>pg_statistic_ext</structname>.
To inspect the statistics <literal>s1</literal> defined above,
you may do this:
<programlisting>
SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
stxname | stxdependencies
---------+--------------------------------------------
s1 | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
(1 row)
</programlisting>
This shows the functional dependencies found by <command>ANALYZE</>: the
value of the first column of <structname>t</> (attribute number 1, that is
<structfield>a</structfield>) determines the value of the second column
(<structfield>b</structfield>), and vice versa. Each dependency is stored
with a degree between 0 and 1; here both degrees are 1.000000, meaning the
dependency held for all sampled rows. The attribute numbers refer to the
table's columns as listed in <structname>pg_attribute</structname>.
</para>
<para>
When computing the selectivity, the planner inspects all conditions and
attempts to identify which conditions are already implied by other
conditions. The selectivity estimates of any redundant conditions are then
ignored rather than multiplied in as if they were independent. In the example
query above, either of the two conditions can be treated as redundant, thus
improving the overall estimate.
</para>
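<para>
Roughly speaking, the degree of a dependency determines how strongly the
implied condition is discounted. The following is only a simplified sketch
of the estimate for the example query, using the degree <literal>d</literal>
of the dependency of <literal>b</> on <literal>a</> (see
<filename>src/backend/statistics/README.dependencies</> for the exact rules):
<programlisting>
selectivity = P(a = 1) * (d + (1 - d) * P(b = 1))
            = 0.01 * (1.0 + (1 - 1.0) * 0.01)
            = 0.01

rows = 10000 * 0.01
     = 100
</programlisting>
With <literal>d</literal> = 1.0 the condition on <literal>b</> adds nothing,
matching the <literal>rows=100</literal> estimate above; without the
statistics the two selectivities would simply be multiplied, giving the
<literal>rows=1</literal> estimate.
</para>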
<sect3 id="functional-dependencies-limitations">
<title>Limitations of functional dependencies</title>
<para>
Functional dependencies are a very simple type of statistics, and
as such have several limitations. The first limitation is that they
only work with simple equality conditions that compare a column to a constant
value. They cannot be used to eliminate equality conditions comparing two
columns or a column to an expression, nor range clauses,
<literal>LIKE</>, or any other type of condition.
</para>
<para>
When eliminating the implied conditions, the planner assumes that the
conditions are compatible. Consider the following example, violating
this assumption:
<programlisting>
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
   Filter: ((a = 1) AND (b = 10))
   Rows Removed by Filter: 10000
 Planning time: 0.232 ms
 Execution time: 3.033 ms
(5 rows)
</programlisting>
While there are no rows with such a combination of values, the planner
is unable to verify whether the values match, because it only knows that
the columns are functionally dependent.
</para>
<para>
This assumption is about the queries executed against the database rather
than about the data itself, and in many cases it is actually satisfied
(e.g. when the user interface only allows selecting compatible values).
But if that is not the case, functional dependencies may not be a viable
option.
</para>
<para>
For additional information about functional dependencies, see
<filename>src/backend/statistics/README.dependencies</>.
</para>
</sect3>
</sect2>
</sect1>
</chapter>