From 2bb6b98d5ce82ec2b94cc1788e0090bf26bc8f4f Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue, 16 Oct 2001 01:13:44 +0000
Subject: [PATCH] Add a section about the planner's statistics, including a
 description of the new pg_stats view.

---
 doc/src/sgml/perform.sgml     | 196 +++++++++++++++++++++++++++++++++-
 doc/src/sgml/ref/analyze.sgml |   6 +-
 2 files changed, 199 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 61c28bd000..c10be1e34f 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.12 2001/10/12 23:32:34 momjian Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.13 2001/10/16 01:13:44 tgl Exp $
 -->
 
  <chapter id="performance-tips">
@@ -315,6 +315,200 @@ Total runtime: 30.67 msec
    </para>
   </sect1>
 
+ <sect1 id="planner-stats">
+  <title>Statistics used by the Planner</title>
+
+  <para>
+   As we saw in the previous section, the query planner needs to estimate
+   the number of rows retrieved by a query in order to make good choices
+   of query plans.  This section provides a quick look at the statistics
+   that the system uses for these estimates.
+  </para>
+
+  <para>
+   One component of the statistics is the total number of entries in each
+   table and index, as well as the number of disk blocks occupied by each
+   table and index.  This information is kept in
+   <structname>pg_class</structname>'s <structfield>reltuples</structfield>
+   and <structfield>relpages</structfield> columns.  We can look at it
+   with queries similar to this one:
+
+<screen>
+regression=# select relname, relkind, reltuples, relpages from pg_class
+regression-# where relname like 'tenk1%';
+    relname    | relkind | reltuples | relpages
+---------------+---------+-----------+----------
+ tenk1         | r       |     10000 |      233
+ tenk1_hundred | i       |     10000 |       30
+ tenk1_unique1 | i       |     10000 |       30
+ tenk1_unique2 | i       |     10000 |       30
+(4 rows)
+</screen>
+
+   Here we can see that <structname>tenk1</structname> contains 10000
+   rows, as do its indexes, but the indexes are (unsurprisingly) much
+   smaller than the table.
+  </para>
+
+  <para>
+   For efficiency reasons, <structfield>reltuples</structfield> 
+   and <structfield>relpages</structfield> are not updated on-the-fly,
+   and so they usually contain only approximate values (which is good
+   enough for the planner's purposes).  They are initialized with dummy
+   values (presently 1000 and 10 respectively) when a table is created.
+   They are updated by certain commands, presently <command>VACUUM</>,
+   <command>ANALYZE</>, and <command>CREATE INDEX</>.  A stand-alone
+   <command>ANALYZE</>, that is one not part of <command>VACUUM</>,
+   generates an approximate <structfield>reltuples</structfield> value
+   since it does not read every row of the table.
+  </para>
+
+  <para>
+   Most queries retrieve only a fraction of the rows in a table, due
+   to having WHERE clauses that restrict the rows to be examined.
+   The planner thus needs to make an estimate of the
+   <firstterm>selectivity</> of WHERE clauses, that is, the fraction of
+   rows that match each clause of the WHERE condition.  The information
+   used for this task is stored in the <structname>pg_statistic</structname>
+   system catalog.  Entries in <structname>pg_statistic</structname> are
+   updated by <command>ANALYZE</> and <command>VACUUM ANALYZE</> commands,
+   and are always approximate even when freshly updated.
+  </para>
+
+  <para>
+   Rather than look at <structname>pg_statistic</structname> directly,
+   it's better to look at its view <structname>pg_stats</structname>
+   when examining the statistics manually.  <structname>pg_stats</structname>
+   is designed to be more easily readable.  Furthermore,
+   <structname>pg_stats</structname> is readable by all, whereas
+   <structname>pg_statistic</structname> is only readable by the superuser.
+   (This prevents unprivileged users from learning something about
+   the contents of other people's tables from the statistics.  The
+   <structname>pg_stats</structname> view is restricted to show only
+   rows about tables that the current user can read.)
+   For example, we might do:
+
+<screen>
+regression=# select attname, n_distinct, most_common_vals from pg_stats where tablename = 'road';
+ attname | n_distinct |                                                                                                                                                                                  most_common_vals                                                                                                                                                                                   
+---------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ name    |  -0.467008 | {"I- 580                        Ramp","I- 880                        Ramp","Sp Railroad                       ","I- 580                            ","I- 680                        Ramp","I- 80                         Ramp","14th                          St  ","5th                           St  ","Mission                       Blvd","I- 880                            "}
+ thepath |         20 | {"[(-122.089,37.71),(-122.0886,37.711)]"}
+(2 rows)
+regression=#
+</screen>
+
+   As of <productname>Postgres</productname> 7.2 the following columns exist
+   in <structname>pg_stats</structname>:
+  </para>
+
+  <table>
+   <title>pg_stats Columns</title>
+
+   <tgroup cols=3>
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry>tablename</entry>
+      <entry><type>name</type></entry>
+      <entry>Name of table containing column</entry>
+     </row>
+
+     <row>
+      <entry>attname</entry>
+      <entry><type>name</type></entry>
+      <entry>Column described by this row</entry>
+     </row>
+
+     <row>
+      <entry>null_frac</entry>
+      <entry><type>real</type></entry>
+      <entry>Fraction of column's entries that are NULL</entry>
+     </row>
+
+     <row>
+      <entry>avg_width</entry>
+      <entry><type>integer</type></entry>
+      <entry>Average width in bytes of column's entries</entry>
+     </row>
+
+     <row>
+      <entry>n_distinct</entry>
+      <entry><type>real</type></entry>
+      <entry>If greater than zero, the estimated number of distinct values
+      in the column.  If less than zero, the negative of the number of
+      distinct values divided by the number of rows.  (The negated form
+      is used when ANALYZE believes that the number of distinct values
+      is likely to increase as the table grows; the positive form is used
+      when the column seems to have a fixed number of possible values.)
+      For example, -1 indicates a unique column in which the number of
+      distinct values is the same as the number of rows.
+      </entry>
+     </row>
+
+     <row>
+      <entry>most_common_vals</entry>
+      <entry><type>text[]</type></entry>
+      <entry>A list of the most common values in the column. (Omitted if
+      no values seem to be more common than any others.)</entry>
+     </row>
+
+     <row>
+      <entry>most_common_freqs</entry>
+      <entry><type>real[]</type></entry>
+      <entry>A list of the frequencies of the most common values,
+      ie, number of occurrences of each divided by total number of rows.
+     </entry>
+     </row>
+
+     <row>
+      <entry>histogram_bounds</entry>
+      <entry><type>text[]</type></entry>
+      <entry>A list of values that divide the column's values into
+      groups of approximately equal population.  The 
+      <structfield>most_common_vals</>, if present, are omitted from the
+      histogram calculation.  (Omitted if column datatype does not have a
+      <literal>&lt;</> operator, or if the <structfield>most_common_vals</>
+      list accounts for the entire population.)
+      </entry>
+     </row>
+
+     <row>
+      <entry>correlation</entry>
+      <entry><type>real</type></entry>
+      <entry>Statistical correlation between physical row ordering and
+      logical ordering of the column values.  This ranges from -1 to +1.
+      When the value is near -1 or +1, an indexscan on the column will
+      be estimated to be cheaper than when it is near zero, due to reduction
+      of random access to the disk.  (Omitted if column datatype does
+      not have a <literal>&lt;</> operator.)
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The maximum number of entries in the <structfield>most_common_vals</>
+   and <structfield>histogram_bounds</> arrays can be set on a
+   column-by-column basis using the <command>ALTER TABLE SET STATISTICS</>
+   command.  The default limit is presently 10 entries.  Raising the limit
+   may allow more accurate planner estimates to be made, particularly for
+   columns with irregular data distributions, at the price of consuming
+   more space in <structname>pg_statistic</structname> and slightly more
+   time to compute the estimates.  Conversely, a lower limit may be
+   appropriate for columns with simple data distributions.
+  </para>
+
+ </sect1>
+
  <sect1 id="explicit-joins">
   <title>Controlling the Planner with Explicit JOINs</title>
 
diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index 98f38aefb9..70e172790b 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -1,5 +1,5 @@
 <!--
-$Header: /cvsroot/pgsql/doc/src/sgml/ref/analyze.sgml,v 1.2 2001/09/03 12:57:49 petere Exp $
+$Header: /cvsroot/pgsql/doc/src/sgml/ref/analyze.sgml,v 1.3 2001/10/16 01:13:44 tgl Exp $
 Postgres documentation
 -->
 
@@ -152,7 +152,9 @@ ANALYZE [ VERBOSE ] [ <replaceable class="PARAMETER">table</replaceable> [ (<rep
    distribution in each column.  One or both of these may be omitted if
    <command>ANALYZE</command> deems them uninteresting (for example, in
    a unique-key column, there are no common values) or if the column
-   datatype does not support the appropriate operators.
+   datatype does not support the appropriate operators.  There is more
+   information about the statistics in the <citetitle>User's
+   Guide</citetitle>.
   </para>
 
   <para>