From 2bb6b98d5ce82ec2b94cc1788e0090bf26bc8f4f Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Tue, 16 Oct 2001 01:13:44 +0000 Subject: [PATCH] Add a section about the planner's statistics, including a description of the new pg_stats view. --- doc/src/sgml/perform.sgml | 196 +++++++++++++++++++++++++++++++++- doc/src/sgml/ref/analyze.sgml | 6 +- 2 files changed, 199 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index 61c28bd000..c10be1e34f 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -1,5 +1,5 @@ @@ -315,6 +315,200 @@ Total runtime: 30.67 msec + + Statistics used by the Planner + + + As we saw in the previous section, the query planner needs to estimate + the number of rows retrieved by a query in order to make good choices + of query plans. This section provides a quick look at the statistics + that the system uses for these estimates. + + + + One component of the statistics is the total number of entries in each + table and index, as well as the number of disk blocks occupied by each + table and index. This information is kept in + pg_class's reltuples + and relpages columns. We can look at it + with queries similar to this one: + + +regression=# select relname, relkind, reltuples, relpages from pg_class +regression-# where relname like 'tenk1%'; + relname | relkind | reltuples | relpages +---------------+---------+-----------+---------- + tenk1 | r | 10000 | 233 + tenk1_hundred | i | 10000 | 30 + tenk1_unique1 | i | 10000 | 30 + tenk1_unique2 | i | 10000 | 30 +(4 rows) + + + Here we can see that tenk1 contains 10000 + rows, as do its indexes, but the indexes are (unsurprisingly) much + smaller than the table. + + + + For efficiency reasons, reltuples + and relpages are not updated on-the-fly, + and so they usually contain only approximate values (which is good + enough for the planner's purposes). They are initialized with dummy + values (presently 1000 and 10 respectively) when a table is created. + They are updated by certain commands, presently VACUUM, + ANALYZE, and CREATE INDEX. A stand-alone + ANALYZE, that is one not part of VACUUM, + generates an approximate reltuples value + since it does not read every row of the table. + + + + Most queries retrieve only a fraction of the rows in a table, due + to having WHERE clauses that restrict the rows to be examined. + The planner thus needs to make an estimate of the + selectivity of WHERE clauses, that is, the fraction of + rows that match each clause of the WHERE condition. The information + used for this task is stored in the pg_statistic + system catalog. Entries in pg_statistic are + updated by ANALYZE and VACUUM ANALYZE commands, + and are always approximate even when freshly updated. + + + + Rather than look at pg_statistic directly, + it's better to look at its view pg_stats + when examining the statistics manually. pg_stats + is designed to be more easily readable. Furthermore, + pg_stats is readable by all, whereas + pg_statistic is only readable by the superuser. + (This prevents unprivileged users from learning something about + the contents of other people's tables from the statistics. The + pg_stats view is restricted to show only + rows about tables that the current user can read.) + For example, we might do: + + +regression=# select attname, n_distinct, most_common_vals from pg_stats where tablename = 'road'; + attname | n_distinct | most_common_vals +---------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + name | -0.467008 | {"I- 580 Ramp","I- 880 Ramp","Sp Railroad ","I- 580 ","I- 680 Ramp","I- 80 Ramp","14th St ","5th St ","Mission Blvd","I- 880 "} + thepath | 20 | {"[(-122.089,37.71),(-122.0886,37.711)]"} +(2 rows) +regression=# + + + As of Postgres 7.2 the following columns exist + in pg_stats: + + + + pg_stats Columns + + + + + Name + Type + Description + + + + + + tablename + name + Name of table containing column + + + + attname + name + Column described by this row + + + + null_frac + real + Fraction of column's entries that are NULL + + + + avg_width + integer + Average width in bytes of column's entries + + + + n_distinct + real + If greater than zero, the estimated number of distinct values + in the column. If less than zero, the negative of the number of + distinct values divided by the number of rows. (The negated form + is used when ANALYZE believes that the number of distinct values + is likely to increase as the table grows; the positive form is used + when the column seems to have a fixed number of possible values.) + For example, -1 indicates a unique column in which the number of + distinct values is the same as the number of rows. + + + + + most_common_vals + text[] + A list of the most common values in the column. (Omitted if + no values seem to be more common than any others.) + + + + most_common_freqs + real[] + A list of the frequencies of the most common values, + ie, number of occurrences of each divided by total number of rows. + + + + + histogram_bounds + text[] + A list of values that divide the column's values into + groups of approximately equal population. The + most_common_vals, if present, are omitted from the + histogram calculation. (Omitted if column datatype does not have a + < operator, or if the most_common_vals + list accounts for the entire population.) + + + + + correlation + real + Statistical correlation between physical row ordering and + logical ordering of the column values. This ranges from -1 to +1. + When the value is near -1 or +1, an indexscan on the column will + be estimated to be cheaper than when it is near zero, due to reduction + of random access to the disk. (Omitted if column datatype does + not have a < operator.) + + + + +
+ + + The maximum number of entries in the most_common_vals + and histogram_bounds arrays can be set on a + column-by-column basis using the ALTER TABLE SET STATISTICS + command. The default limit is presently 10 entries. Raising the limit + may allow more accurate planner estimates to be made, particularly for + columns with irregular data distributions, at the price of consuming + more space in pg_statistic and slightly more + time to compute the estimates. Conversely, a lower limit may be + appropriate for columns with simple data distributions. + + +
+ Controlling the Planner with Explicit JOINs diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml index 98f38aefb9..70e172790b 100644 --- a/doc/src/sgml/ref/analyze.sgml +++ b/doc/src/sgml/ref/analyze.sgml @@ -1,5 +1,5 @@ @@ -152,7 +152,9 @@ ANALYZE [ VERBOSE ] [ table [ (ANALYZE deems them uninteresting (for example, in a unique-key column, there are no common values) or if the column - datatype does not support the appropriate operators. + datatype does not support the appropriate operators. There is more + information about the statistics in the User's + Guide.