mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-10-03 02:06:52 +02:00
5220bb7533
This expands the support for the run-time partition pruning which was added
for Append in 499be013de
to also allow unneeded subnodes of a MergeAppend
to be removed.
Author: David Rowley
Discussion: https://www.postgresql.org/message-id/CAKJS1f_F_V8D7Wu-HVdnH7zCUxhoGK8XhLLtd%3DCu85qDZzXrgg%40mail.gmail.com
1874 lines
82 KiB
Plaintext
1874 lines
82 KiB
Plaintext
<!-- doc/src/sgml/perform.sgml -->
|
|
|
|
<chapter id="performance-tips">
|
|
<title>Performance Tips</title>
|
|
|
|
<indexterm zone="performance-tips">
|
|
<primary>performance</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Query performance can be affected by many things. Some of these can
|
|
be controlled by the user, while others are fundamental to the underlying
|
|
design of the system. This chapter provides some hints about understanding
|
|
and tuning <productname>PostgreSQL</productname> performance.
|
|
</para>
|
|
|
|
<sect1 id="using-explain">
|
|
<title>Using <command>EXPLAIN</command></title>
|
|
|
|
<indexterm zone="using-explain">
|
|
<primary>EXPLAIN</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="using-explain">
|
|
<primary>query plan</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> devises a <firstterm>query
|
|
plan</firstterm> for each query it receives. Choosing the right
|
|
plan to match the query structure and the properties of the data
|
|
is absolutely critical for good performance, so the system includes
|
|
a complex <firstterm>planner</firstterm> that tries to choose good plans.
|
|
You can use the <xref linkend="sql-explain"/> command
|
|
to see what query plan the planner creates for any query.
|
|
Plan-reading is an art that requires some experience to master,
|
|
but this section attempts to cover the basics.
|
|
</para>
|
|
|
|
<para>
|
|
Examples in this section are drawn from the regression test database
|
|
after doing a <command>VACUUM ANALYZE</command>, using 9.3 development sources.
|
|
You should be able to get similar results if you try the examples
|
|
yourself, but your estimated costs and row counts might vary slightly
|
|
because <command>ANALYZE</command>'s statistics are random samples rather
|
|
than exact, and because costs are inherently somewhat platform-dependent.
|
|
</para>
|
|
|
|
<para>
|
|
The examples use <command>EXPLAIN</command>'s default <quote>text</quote> output
|
|
format, which is compact and convenient for humans to read.
|
|
If you want to feed <command>EXPLAIN</command>'s output to a program for further
|
|
analysis, you should use one of its machine-readable output formats
|
|
(XML, JSON, or YAML) instead.
|
|
</para>
|
|
|
|
<sect2 id="using-explain-basics">
|
|
<title><command>EXPLAIN</command> Basics</title>
|
|
|
|
<para>
|
|
The structure of a query plan is a tree of <firstterm>plan nodes</firstterm>.
|
|
Nodes at the bottom level of the tree are scan nodes: they return raw rows
|
|
from a table. There are different types of scan nodes for different
|
|
table access methods: sequential scans, index scans, and bitmap index
|
|
scans. There are also non-table row sources, such as <literal>VALUES</literal>
|
|
clauses and set-returning functions in <literal>FROM</literal>, which have their
|
|
own scan node types.
|
|
If the query requires joining, aggregation, sorting, or other
|
|
operations on the raw rows, then there will be additional nodes
|
|
above the scan nodes to perform these operations. Again,
|
|
there is usually more than one possible way to do these operations,
|
|
so different node types can appear here too. The output
|
|
of <command>EXPLAIN</command> has one line for each node in the plan
|
|
tree, showing the basic node type plus the cost estimates that the planner
|
|
made for the execution of that plan node. Additional lines might appear,
|
|
indented from the node's summary line,
|
|
to show additional properties of the node.
|
|
The very first line (the summary line for the topmost
|
|
node) has the estimated total execution cost for the plan; it is this
|
|
number that the planner seeks to minimize.
|
|
</para>
|
|
|
|
<para>
|
|
Here is a trivial example, just to show what the output looks like:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1;
|
|
|
|
QUERY PLAN
|
|
-------------------------------------------------------------
|
|
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
Since this query has no <literal>WHERE</literal> clause, it must scan all the
|
|
rows of the table, so the planner has chosen to use a simple sequential
|
|
scan plan. The numbers that are quoted in parentheses are (left
|
|
to right):
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Estimated start-up cost. This is the time expended before the output
|
|
phase can begin, e.g., time to do the sorting in a sort node.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Estimated total cost. This is stated on the assumption that the plan
|
|
node is run to completion, i.e., all available rows are retrieved.
|
|
In practice a node's parent node might stop short of reading all
|
|
available rows (see the <literal>LIMIT</literal> example below).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Estimated number of rows output by this plan node. Again, the node
|
|
is assumed to be run to completion.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Estimated average width of rows output by this plan node (in bytes).
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The costs are measured in arbitrary units determined by the planner's
|
|
cost parameters (see <xref linkend="runtime-config-query-constants"/>).
|
|
Traditional practice is to measure the costs in units of disk page
|
|
fetches; that is, <xref linkend="guc-seq-page-cost"/> is conventionally
|
|
set to <literal>1.0</literal> and the other cost parameters are set relative
|
|
to that. The examples in this section are run with the default cost
|
|
parameters.
|
|
</para>
|
|
|
|
<para>
|
|
It's important to understand that the cost of an upper-level node includes
|
|
the cost of all its child nodes. It's also important to realize that
|
|
the cost only reflects things that the planner cares about.
|
|
In particular, the cost does not consider the time spent transmitting
|
|
result rows to the client, which could be an important
|
|
factor in the real elapsed time; but the planner ignores it because
|
|
it cannot change it by altering the plan. (Every correct plan will
|
|
output the same row set, we trust.)
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>rows</literal> value is a little tricky because it is
|
|
not the number of rows processed or scanned by the
|
|
plan node, but rather the number emitted by the node. This is often
|
|
less than the number scanned, as a result of filtering by any
|
|
<literal>WHERE</literal>-clause conditions that are being applied at the node.
|
|
Ideally the top-level rows estimate will approximate the number of rows
|
|
actually returned, updated, or deleted by the query.
|
|
</para>
|
|
|
|
<para>
|
|
Returning to our example:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1;
|
|
|
|
QUERY PLAN
|
|
-------------------------------------------------------------
|
|
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
These numbers are derived very straightforwardly. If you do:
|
|
|
|
<programlisting>
|
|
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
|
|
</programlisting>
|
|
|
|
you will find that <classname>tenk1</classname> has 358 disk
|
|
pages and 10000 rows. The estimated cost is computed as (disk pages read *
|
|
<xref linkend="guc-seq-page-cost"/>) + (rows scanned *
|
|
<xref linkend="guc-cpu-tuple-cost"/>). By default,
|
|
<varname>seq_page_cost</varname> is 1.0 and <varname>cpu_tuple_cost</varname> is 0.01,
|
|
so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
|
|
</para>
|
|
|
|
<para>
|
|
Now let's modify the query to add a <literal>WHERE</literal> condition:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------
|
|
Seq Scan on tenk1 (cost=0.00..483.00 rows=7001 width=244)
|
|
Filter: (unique1 < 7000)
|
|
</screen>
|
|
|
|
Notice that the <command>EXPLAIN</command> output shows the <literal>WHERE</literal>
|
|
clause being applied as a <quote>filter</quote> condition attached to the Seq
|
|
Scan plan node. This means that
|
|
the plan node checks the condition for each row it scans, and outputs
|
|
only the ones that pass the condition.
|
|
The estimate of output rows has been reduced because of the
|
|
<literal>WHERE</literal> clause.
|
|
However, the scan will still have to visit all 10000 rows, so the cost
|
|
hasn't decreased; in fact it has gone up a bit (by 10000 * <xref
|
|
linkend="guc-cpu-operator-cost"/>, to be exact) to reflect the extra CPU
|
|
time spent checking the <literal>WHERE</literal> condition.
|
|
</para>
|
|
|
|
<para>
|
|
The actual number of rows this query would select is 7000, but the <literal>rows</literal>
|
|
estimate is only approximate. If you try to duplicate this experiment,
|
|
you will probably get a slightly different estimate; moreover, it can
|
|
change after each <command>ANALYZE</command> command, because the
|
|
statistics produced by <command>ANALYZE</command> are taken from a
|
|
randomized sample of the table.
|
|
</para>
|
|
|
|
<para>
|
|
Now, let's make the condition more restrictive:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------
|
|
Bitmap Heap Scan on tenk1 (cost=5.07..229.20 rows=101 width=244)
|
|
Recheck Cond: (unique1 < 100)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
|
|
Index Cond: (unique1 < 100)
|
|
</screen>
|
|
|
|
Here the planner has decided to use a two-step plan: the child plan
|
|
node visits an index to find the locations of rows matching the index
|
|
condition, and then the upper plan node actually fetches those rows
|
|
from the table itself. Fetching rows separately is much more
|
|
expensive than reading them sequentially, but because not all the pages
|
|
of the table have to be visited, this is still cheaper than a sequential
|
|
scan. (The reason for using two plan levels is that the upper plan
|
|
node sorts the row locations identified by the index into physical order
|
|
before reading them, to minimize the cost of separate fetches.
|
|
The <quote>bitmap</quote> mentioned in the node names is the mechanism that
|
|
does the sorting.)
|
|
</para>
|
|
|
|
<para>
|
|
Now let's add another condition to the <literal>WHERE</literal> clause:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND stringu1 = 'xxx';
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------
|
|
Bitmap Heap Scan on tenk1 (cost=5.04..229.43 rows=1 width=244)
|
|
Recheck Cond: (unique1 < 100)
|
|
Filter: (stringu1 = 'xxx'::name)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
|
|
Index Cond: (unique1 < 100)
|
|
</screen>
|
|
|
|
The added condition <literal>stringu1 = 'xxx'</literal> reduces the
|
|
output row count estimate, but not the cost because we still have to visit
|
|
the same set of rows. Notice that the <literal>stringu1</literal> clause
|
|
cannot be applied as an index condition, since this index is only on
|
|
the <literal>unique1</literal> column. Instead it is applied as a filter on
|
|
the rows retrieved by the index. Thus the cost has actually gone up
|
|
slightly to reflect this extra checking.
|
|
</para>
|
|
|
|
<para>
|
|
In some cases the planner will prefer a <quote>simple</quote> index scan plan:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;
|
|
|
|
QUERY PLAN
|
|
-----------------------------------------------------------------------------
|
|
Index Scan using tenk1_unique1 on tenk1 (cost=0.29..8.30 rows=1 width=244)
|
|
Index Cond: (unique1 = 42)
|
|
</screen>
|
|
|
|
In this type of plan the table rows are fetched in index order, which
|
|
makes them even more expensive to read, but there are so few that the
|
|
extra cost of sorting the row locations is not worth it. You'll most
|
|
often see this plan type for queries that fetch just a single row. It's
|
|
also often used for queries that have an <literal>ORDER BY</literal> condition
|
|
that matches the index order, because then no extra sorting step is needed
|
|
to satisfy the <literal>ORDER BY</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
If there are separate indexes on several of the columns referenced
|
|
in <literal>WHERE</literal>, the planner might choose to use an AND or OR
|
|
combination of the indexes:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
|
|
|
|
QUERY PLAN
|
|
-------------------------------------------------------------------------------------
|
|
Bitmap Heap Scan on tenk1 (cost=25.08..60.21 rows=10 width=244)
|
|
Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
|
|
-> BitmapAnd (cost=25.08..25.08 rows=10 width=0)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
|
|
Index Cond: (unique1 < 100)
|
|
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78 rows=999 width=0)
|
|
Index Cond: (unique2 > 9000)
|
|
</screen>
|
|
|
|
But this requires visiting both indexes, so it's not necessarily a win
|
|
compared to using just one index and treating the other condition as
|
|
a filter. If you vary the ranges involved you'll see the plan change
|
|
accordingly.
|
|
</para>
|
|
|
|
<para>
|
|
Here is an example showing the effects of <literal>LIMIT</literal>:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 LIMIT 2;
|
|
|
|
QUERY PLAN
|
|
-------------------------------------------------------------------------------------
|
|
Limit (cost=0.29..14.48 rows=2 width=244)
|
|
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..71.27 rows=10 width=244)
|
|
Index Cond: (unique2 > 9000)
|
|
Filter: (unique1 < 100)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
This is the same query as above, but we added a <literal>LIMIT</literal> so that
|
|
not all the rows need be retrieved, and the planner changed its mind about
|
|
what to do. Notice that the total cost and row count of the Index Scan
|
|
node are shown as if it were run to completion. However, the Limit node
|
|
is expected to stop after retrieving only a fifth of those rows, so its
|
|
total cost is only a fifth as much, and that's the actual estimated cost
|
|
of the query. This plan is preferred over adding a Limit node to the
|
|
previous plan because the Limit could not avoid paying the startup cost
|
|
of the bitmap scan, so the total cost would be something over 25 units
|
|
with that approach.
|
|
</para>
|
|
|
|
<para>
|
|
Let's try joining two tables, using the columns we have been discussing:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT *
|
|
FROM tenk1 t1, tenk2 t2
|
|
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
|
|
|
|
QUERY PLAN
|
|
--------------------------------------------------------------------------------------
|
|
Nested Loop (cost=4.65..118.62 rows=10 width=488)
|
|
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244)
|
|
Recheck Cond: (unique1 < 10)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0)
|
|
Index Cond: (unique1 < 10)
|
|
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244)
|
|
Index Cond: (unique2 = t1.unique2)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
In this plan, we have a nested-loop join node with two table scans as
|
|
inputs, or children. The indentation of the node summary lines reflects
|
|
the plan tree structure. The join's first, or <quote>outer</quote>, child
|
|
is a bitmap scan similar to those we saw before. Its cost and row count
|
|
are the same as we'd get from <literal>SELECT ... WHERE unique1 < 10</literal>
|
|
because we are
|
|
applying the <literal>WHERE</literal> clause <literal>unique1 < 10</literal>
|
|
at that node.
|
|
The <literal>t1.unique2 = t2.unique2</literal> clause is not relevant yet,
|
|
so it doesn't affect the row count of the outer scan. The nested-loop
|
|
join node will run its second,
|
|
or <quote>inner</quote> child once for each row obtained from the outer child.
|
|
Column values from the current outer row can be plugged into the inner
|
|
scan; here, the <literal>t1.unique2</literal> value from the outer row is available,
|
|
so we get a plan and costs similar to what we saw above for a simple
|
|
<literal>SELECT ... WHERE t2.unique2 = <replaceable>constant</replaceable></literal> case.
|
|
(The estimated cost is actually a bit lower than what was seen above,
|
|
as a result of caching that's expected to occur during the repeated
|
|
index scans on <literal>t2</literal>.) The
|
|
costs of the loop node are then set on the basis of the cost of the outer
|
|
scan, plus one repetition of the inner scan for each outer row (10 * 7.91,
|
|
here), plus a little CPU time for join processing.
|
|
</para>
|
|
|
|
<para>
|
|
In this example the join's output row count is the same as the product
|
|
of the two scans' row counts, but that's not true in all cases because
|
|
there can be additional <literal>WHERE</literal> clauses that mention both tables
|
|
and so can only be applied at the join point, not to either input scan.
|
|
Here's an example:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT *
|
|
FROM tenk1 t1, tenk2 t2
|
|
WHERE t1.unique1 < 10 AND t2.unique2 < 10 AND t1.hundred < t2.hundred;
|
|
|
|
QUERY PLAN
|
|
---------------------------------------------------------------------------------------------
|
|
Nested Loop (cost=4.65..49.46 rows=33 width=488)
|
|
Join Filter: (t1.hundred < t2.hundred)
|
|
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244)
|
|
Recheck Cond: (unique1 < 10)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0)
|
|
Index Cond: (unique1 < 10)
|
|
-> Materialize (cost=0.29..8.51 rows=10 width=244)
|
|
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..8.46 rows=10 width=244)
|
|
Index Cond: (unique2 < 10)
|
|
</screen>
|
|
|
|
The condition <literal>t1.hundred < t2.hundred</literal> can't be
|
|
tested in the <literal>tenk2_unique2</literal> index, so it's applied at the
|
|
join node. This reduces the estimated output row count of the join node,
|
|
but does not change either input scan.
|
|
</para>
|
|
|
|
<para>
|
|
Notice that here the planner has chosen to <quote>materialize</quote> the inner
|
|
relation of the join, by putting a Materialize plan node atop it. This
|
|
means that the <literal>t2</literal> index scan will be done just once, even
|
|
though the nested-loop join node needs to read that data ten times, once
|
|
for each row from the outer relation. The Materialize node saves the data
|
|
in memory as it's read, and then returns the data from memory on each
|
|
subsequent pass.
|
|
</para>
|
|
|
|
<para>
|
|
When dealing with outer joins, you might see join plan nodes with both
|
|
<quote>Join Filter</quote> and plain <quote>Filter</quote> conditions attached.
|
|
Join Filter conditions come from the outer join's <literal>ON</literal> clause,
|
|
so a row that fails the Join Filter condition could still get emitted as
|
|
a null-extended row. But a plain Filter condition is applied after the
|
|
outer-join rules and so acts to remove rows unconditionally. In an inner
|
|
join there is no semantic difference between these types of filters.
|
|
</para>
|
|
|
|
<para>
|
|
If we change the query's selectivity a bit, we might get a very different
|
|
join plan:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT *
|
|
FROM tenk1 t1, tenk2 t2
|
|
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------------------
|
|
Hash Join (cost=230.47..713.98 rows=101 width=488)
|
|
Hash Cond: (t2.unique2 = t1.unique2)
|
|
-> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000 width=244)
|
|
-> Hash (cost=229.20..229.20 rows=101 width=244)
|
|
-> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20 rows=101 width=244)
|
|
Recheck Cond: (unique1 < 100)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
|
|
Index Cond: (unique1 < 100)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
Here, the planner has chosen to use a hash join, in which rows of one
|
|
table are entered into an in-memory hash table, after which the other
|
|
table is scanned and the hash table is probed for matches to each row.
|
|
Again note how the indentation reflects the plan structure: the bitmap
|
|
scan on <literal>tenk1</literal> is the input to the Hash node, which constructs
|
|
the hash table. That's then returned to the Hash Join node, which reads
|
|
rows from its outer child plan and searches the hash table for each one.
|
|
</para>
|
|
|
|
<para>
|
|
Another possible type of join is a merge join, illustrated here:
|
|
|
|
<screen>
|
|
EXPLAIN SELECT *
|
|
FROM tenk1 t1, onek t2
|
|
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------------------
|
|
Merge Join (cost=198.11..268.19 rows=10 width=488)
|
|
Merge Cond: (t1.unique2 = t2.unique2)
|
|
-> Index Scan using tenk1_unique2 on tenk1 t1 (cost=0.29..656.28 rows=101 width=244)
|
|
Filter: (unique1 < 100)
|
|
-> Sort (cost=197.83..200.33 rows=1000 width=244)
|
|
Sort Key: t2.unique2
|
|
-> Seq Scan on onek t2 (cost=0.00..148.00 rows=1000 width=244)
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
Merge join requires its input data to be sorted on the join keys. In this
|
|
plan the <literal>tenk1</literal> data is sorted by using an index scan to visit
|
|
the rows in the correct order, but a sequential scan and sort is preferred
|
|
for <literal>onek</literal>, because there are many more rows to be visited in
|
|
that table.
|
|
(Sequential-scan-and-sort frequently beats an index scan for sorting many rows,
|
|
because of the nonsequential disk access required by the index scan.)
|
|
</para>
|
|
|
|
<para>
|
|
One way to look at variant plans is to force the planner to disregard
|
|
whatever strategy it thought was the cheapest, using the enable/disable
|
|
flags described in <xref linkend="runtime-config-query-enable"/>.
|
|
(This is a crude tool, but useful. See
|
|
also <xref linkend="explicit-joins"/>.)
|
|
For example, if we're unconvinced that sequential-scan-and-sort is the best way to
|
|
deal with table <literal>onek</literal> in the previous example, we could try
|
|
|
|
<screen>
|
|
SET enable_sort = off;
|
|
|
|
EXPLAIN SELECT *
|
|
FROM tenk1 t1, onek t2
|
|
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------------------
|
|
Merge Join (cost=0.56..292.65 rows=10 width=488)
|
|
Merge Cond: (t1.unique2 = t2.unique2)
|
|
-> Index Scan using tenk1_unique2 on tenk1 t1 (cost=0.29..656.28 rows=101 width=244)
|
|
Filter: (unique1 < 100)
|
|
-> Index Scan using onek_unique2 on onek t2 (cost=0.28..224.79 rows=1000 width=244)
|
|
</screen>
|
|
|
|
which shows that the planner thinks that sorting <literal>onek</literal> by
|
|
index-scanning is about 12% more expensive than sequential-scan-and-sort.
|
|
Of course, the next question is whether it's right about that.
|
|
We can investigate that using <command>EXPLAIN ANALYZE</command>, as discussed
|
|
below.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="using-explain-analyze">
|
|
<title><command>EXPLAIN ANALYZE</command></title>
|
|
|
|
<para>
|
|
It is possible to check the accuracy of the planner's estimates
|
|
by using <command>EXPLAIN</command>'s <literal>ANALYZE</literal> option. With this
|
|
option, <command>EXPLAIN</command> actually executes the query, and then displays
|
|
the true row counts and true run time accumulated within each plan node,
|
|
along with the same estimates that a plain <command>EXPLAIN</command>
|
|
shows. For example, we might get a result like this:
|
|
|
|
<screen>
|
|
EXPLAIN ANALYZE SELECT *
|
|
FROM tenk1 t1, tenk2 t2
|
|
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
|
|
|
|
QUERY PLAN
|
|
---------------------------------------------------------------------------------------------------------------------------------
|
|
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
|
|
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1)
|
|
Recheck Cond: (unique1 < 10)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
|
|
Index Cond: (unique1 < 10)
|
|
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
|
|
Index Cond: (unique2 = t1.unique2)
|
|
Planning time: 0.181 ms
|
|
Execution time: 0.501 ms
|
|
</screen>
|
|
|
|
Note that the <quote>actual time</quote> values are in milliseconds of
|
|
real time, whereas the <literal>cost</literal> estimates are expressed in
|
|
arbitrary units; so they are unlikely to match up.
|
|
The thing that's usually most important to look for is whether the
|
|
estimated row counts are reasonably close to reality. In this example
|
|
the estimates were all dead-on, but that's quite unusual in practice.
|
|
</para>
|
|
|
|
<para>
|
|
In some query plans, it is possible for a subplan node to be executed more
|
|
than once. For example, the inner index scan will be executed once per
|
|
outer row in the above nested-loop plan. In such cases, the
|
|
<literal>loops</literal> value reports the
|
|
total number of executions of the node, and the actual time and rows
|
|
values shown are averages per-execution. This is done to make the numbers
|
|
comparable with the way that the cost estimates are shown. Multiply by
|
|
the <literal>loops</literal> value to get the total time actually spent in
|
|
the node. In the above example, we spent a total of 0.220 milliseconds
|
|
executing the index scans on <literal>tenk2</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
In some cases <command>EXPLAIN ANALYZE</command> shows additional execution
|
|
statistics beyond the plan node execution times and row counts.
|
|
For example, Sort and Hash nodes provide extra information:
|
|
|
|
<screen>
|
|
EXPLAIN ANALYZE SELECT *
|
|
FROM tenk1 t1, tenk2 t2
|
|
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;
|
|
|
|
QUERY PLAN
|
|
--------------------------------------------------------------------------------------------------------------------------------------------
|
|
Sort (cost=717.34..717.59 rows=101 width=488) (actual time=7.761..7.774 rows=100 loops=1)
|
|
Sort Key: t1.fivethous
|
|
Sort Method: quicksort Memory: 77kB
|
|
-> Hash Join (cost=230.47..713.98 rows=101 width=488) (actual time=0.711..7.427 rows=100 loops=1)
|
|
Hash Cond: (t2.unique2 = t1.unique2)
|
|
-> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000 width=244) (actual time=0.007..2.583 rows=10000 loops=1)
|
|
-> Hash (cost=229.20..229.20 rows=101 width=244) (actual time=0.659..0.659 rows=100 loops=1)
|
|
Buckets: 1024 Batches: 1 Memory Usage: 28kB
|
|
-> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20 rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
|
|
Recheck Cond: (unique1 < 100)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049 rows=100 loops=1)
|
|
Index Cond: (unique1 < 100)
|
|
Planning time: 0.194 ms
|
|
Execution time: 8.008 ms
|
|
</screen>
|
|
|
|
The Sort node shows the sort method used (in particular, whether the sort
|
|
was in-memory or on-disk) and the amount of memory or disk space needed.
|
|
The Hash node shows the number of hash buckets and batches as well as the
|
|
peak amount of memory used for the hash table. (If the number of batches
|
|
exceeds one, there will also be disk space usage involved, but that is not
|
|
shown.)
|
|
</para>
|
|
|
|
<para>
|
|
Another type of extra information is the number of rows removed by a
|
|
filter condition:
|
|
|
|
<screen>
|
|
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE ten < 7;
|
|
|
|
QUERY PLAN
|
|
---------------------------------------------------------------------------------------------------------
|
|
Seq Scan on tenk1 (cost=0.00..483.00 rows=7000 width=244) (actual time=0.016..5.107 rows=7000 loops=1)
|
|
Filter: (ten < 7)
|
|
Rows Removed by Filter: 3000
|
|
Planning time: 0.083 ms
|
|
Execution time: 5.905 ms
|
|
</screen>
|
|
|
|
These counts can be particularly valuable for filter conditions applied at
|
|
join nodes. The <quote>Rows Removed</quote> line only appears when at least
|
|
one scanned row, or potential join pair in the case of a join node,
|
|
is rejected by the filter condition.
|
|
</para>
|
|
|
|
<para>
|
|
A case similar to filter conditions occurs with <quote>lossy</quote>
|
|
index scans. For example, consider this search for polygons containing a
|
|
specific point:
|
|
|
|
<screen>
|
|
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
|
|
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------------------------------
|
|
Seq Scan on polygon_tbl (cost=0.00..1.05 rows=1 width=32) (actual time=0.044..0.044 rows=0 loops=1)
|
|
Filter: (f1 @> '((0.5,2))'::polygon)
|
|
Rows Removed by Filter: 4
|
|
Planning time: 0.040 ms
|
|
Execution time: 0.083 ms
|
|
</screen>
|
|
|
|
The planner thinks (quite correctly) that this sample table is too small
|
|
to bother with an index scan, so we have a plain sequential scan in which
|
|
all the rows got rejected by the filter condition. But if we force an
|
|
index scan to be used, we see:
|
|
|
|
<screen>
|
|
SET enable_seqscan TO off;
|
|
|
|
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
|
|
|
|
QUERY PLAN
|
|
--------------------------------------------------------------------------------------------------------------------------
|
|
Index Scan using gpolygonind on polygon_tbl (cost=0.13..8.15 rows=1 width=32) (actual time=0.062..0.062 rows=0 loops=1)
|
|
Index Cond: (f1 @> '((0.5,2))'::polygon)
|
|
Rows Removed by Index Recheck: 1
|
|
Planning time: 0.034 ms
|
|
Execution time: 0.144 ms
|
|
</screen>
|
|
|
|
Here we can see that the index returned one candidate row, which was
|
|
then rejected by a recheck of the index condition. This happens because a
|
|
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
|
|
returns the rows with polygons that overlap the target, and then we have
|
|
to do the exact containment test on those rows.
|
|
</para>
|
|
|
|
<para>
|
|
<command>EXPLAIN</command> has a <literal>BUFFERS</literal> option that can be used with
|
|
<literal>ANALYZE</literal> to get even more run time statistics:
|
|
|
|
<screen>
|
|
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
|
|
|
|
QUERY PLAN
|
|
---------------------------------------------------------------------------------------------------------------------------------
|
|
Bitmap Heap Scan on tenk1 (cost=25.08..60.21 rows=10 width=244) (actual time=0.323..0.342 rows=10 loops=1)
|
|
Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
|
|
Buffers: shared hit=15
|
|
-> BitmapAnd (cost=25.08..25.08 rows=10 width=0) (actual time=0.309..0.309 rows=0 loops=1)
|
|
Buffers: shared hit=7
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
|
|
Index Cond: (unique1 < 100)
|
|
Buffers: shared hit=2
|
|
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78 rows=999 width=0) (actual time=0.227..0.227 rows=999 loops=1)
|
|
Index Cond: (unique2 > 9000)
|
|
Buffers: shared hit=5
|
|
Planning time: 0.088 ms
|
|
Execution time: 0.423 ms
|
|
</screen>
|
|
|
|
The numbers provided by <literal>BUFFERS</literal> help to identify which parts
|
|
of the query are the most I/O-intensive.
|
|
</para>
|
|
|
|
<para>
|
|
Keep in mind that because <command>EXPLAIN ANALYZE</command> actually
|
|
runs the query, any side-effects will happen as usual, even though
|
|
whatever results the query might output are discarded in favor of
|
|
printing the <command>EXPLAIN</command> data. If you want to analyze a
|
|
data-modifying query without changing your tables, you can
|
|
roll the command back afterwards, for example:
|
|
|
|
<screen>
|
|
BEGIN;
|
|
|
|
EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;
|
|
|
|
QUERY PLAN
|
|
--------------------------------------------------------------------------------------------------------------------------------
|
|
Update on tenk1 (cost=5.07..229.46 rows=101 width=250) (actual time=14.628..14.628 rows=0 loops=1)
|
|
-> Bitmap Heap Scan on tenk1 (cost=5.07..229.46 rows=101 width=250) (actual time=0.101..0.439 rows=100 loops=1)
|
|
Recheck Cond: (unique1 < 100)
|
|
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
|
|
Index Cond: (unique1 < 100)
|
|
Planning time: 0.079 ms
|
|
Execution time: 14.727 ms
|
|
|
|
ROLLBACK;
|
|
</screen>
|
|
</para>
|
|
|
|
<para>
|
|
As seen in this example, when the query is an <command>INSERT</command>,
|
|
<command>UPDATE</command>, or <command>DELETE</command> command, the actual work of
|
|
applying the table changes is done by a top-level Insert, Update,
|
|
or Delete plan node. The plan nodes underneath this node perform
|
|
the work of locating the old rows and/or computing the new data.
|
|
So above, we see the same sort of bitmap table scan we've seen already,
|
|
and its output is fed to an Update node that stores the updated rows.
|
|
It's worth noting that although the data-modifying node can take a
|
|
considerable amount of run time (here, it's consuming the lion's share
|
|
of the time), the planner does not currently add anything to the cost
|
|
estimates to account for that work. That's because the work to be done is
|
|
the same for every correct query plan, so it doesn't affect planning
|
|
decisions.
|
|
</para>
|
|
|
|
<para>
|
|
When an <command>UPDATE</command> or <command>DELETE</command> command affects an
|
|
inheritance hierarchy, the output might look like this:
|
|
|
|
<screen>
|
|
EXPLAIN UPDATE parent SET f2 = f2 + 1 WHERE f1 = 101;
|
|
QUERY PLAN
|
|
-----------------------------------------------------------------------------------
|
|
Update on parent (cost=0.00..24.53 rows=4 width=14)
|
|
Update on parent
|
|
Update on child1
|
|
Update on child2
|
|
Update on child3
|
|
-> Seq Scan on parent (cost=0.00..0.00 rows=1 width=14)
|
|
Filter: (f1 = 101)
|
|
-> Index Scan using child1_f1_key on child1 (cost=0.15..8.17 rows=1 width=14)
|
|
Index Cond: (f1 = 101)
|
|
-> Index Scan using child2_f1_key on child2 (cost=0.15..8.17 rows=1 width=14)
|
|
Index Cond: (f1 = 101)
|
|
-> Index Scan using child3_f1_key on child3 (cost=0.15..8.17 rows=1 width=14)
|
|
Index Cond: (f1 = 101)
|
|
</screen>
|
|
|
|
In this example the Update node needs to consider three child tables as
|
|
well as the originally-mentioned parent table. So there are four input
|
|
scanning subplans, one per table. For clarity, the Update node is
|
|
annotated to show the specific target tables that will be updated, in the
|
|
same order as the corresponding subplans. (These annotations are new as
|
|
of <productname>PostgreSQL</productname> 9.5; in prior versions the reader had to
|
|
intuit the target tables by inspecting the subplans.)
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>Planning time</literal> shown by <command>EXPLAIN
|
|
ANALYZE</command> is the time it took to generate the query plan from the
|
|
parsed query and optimize it. It does not include parsing or rewriting.
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>Execution time</literal> shown by <command>EXPLAIN
|
|
ANALYZE</command> includes executor start-up and shut-down time, as well
|
|
as the time to run any triggers that are fired, but it does not include
|
|
parsing, rewriting, or planning time.
|
|
Time spent executing <literal>BEFORE</literal> triggers, if any, is included in
|
|
the time for the related Insert, Update, or Delete node; but time
|
|
spent executing <literal>AFTER</literal> triggers is not counted there because
|
|
<literal>AFTER</literal> triggers are fired after completion of the whole plan.
|
|
The total time spent in each trigger
|
|
(either <literal>BEFORE</literal> or <literal>AFTER</literal>) is also shown separately.
|
|
Note that deferred constraint triggers will not be executed
|
|
until end of transaction and are thus not considered at all by
|
|
<command>EXPLAIN ANALYZE</command>.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="using-explain-caveats">
|
|
<title>Caveats</title>
|
|
|
|
<para>
|
|
There are two significant ways in which run times measured by
|
|
<command>EXPLAIN ANALYZE</command> can deviate from normal execution of
|
|
the same query. First, since no output rows are delivered to the client,
|
|
network transmission costs and I/O conversion costs are not included.
|
|
Second, the measurement overhead added by <command>EXPLAIN
|
|
ANALYZE</command> can be significant, especially on machines with slow
|
|
<function>gettimeofday()</function> operating-system calls. You can use the
|
|
<xref linkend="pgtesttiming"/> tool to measure the overhead of timing
|
|
on your system.
|
|
</para>
|
|
|
|
<para>
|
|
<command>EXPLAIN</command> results should not be extrapolated to situations
|
|
much different from the one you are actually testing; for example,
|
|
results on a toy-sized table cannot be assumed to apply to large tables.
|
|
The planner's cost estimates are not linear and so it might choose
|
|
a different plan for a larger or smaller table. An extreme example
|
|
is that on a table that only occupies one disk page, you'll nearly
|
|
always get a sequential scan plan whether indexes are available or not.
|
|
The planner realizes that it's going to take one disk page read to
|
|
process the table in any case, so there's no value in expending additional
|
|
page reads to look at an index. (We saw this happening in the
|
|
<literal>polygon_tbl</literal> example above.)
|
|
</para>
|
|
|
|
<para>
|
|
There are cases in which the actual and estimated values won't match up
|
|
well, but nothing is really wrong. One such case occurs when
|
|
plan node execution is stopped short by a <literal>LIMIT</literal> or similar
|
|
effect. For example, in the <literal>LIMIT</literal> query we used before,
|
|
|
|
<screen>
|
|
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 LIMIT 2;
|
|
|
|
QUERY PLAN
|
|
-------------------------------------------------------------------------------------------------------------------------------
|
|
Limit (cost=0.29..14.71 rows=2 width=244) (actual time=0.177..0.249 rows=2 loops=1)
|
|
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..72.42 rows=10 width=244) (actual time=0.174..0.244 rows=2 loops=1)
|
|
Index Cond: (unique2 > 9000)
|
|
Filter: (unique1 < 100)
|
|
Rows Removed by Filter: 287
|
|
Planning time: 0.096 ms
|
|
Execution time: 0.336 ms
|
|
</screen>
|
|
|
|
the estimated cost and row count for the Index Scan node are shown as
|
|
though it were run to completion. But in reality the Limit node stopped
|
|
requesting rows after it got two, so the actual row count is only 2 and
|
|
the run time is less than the cost estimate would suggest. This is not
|
|
an estimation error, only a discrepancy in the way the estimates and true
|
|
values are displayed.
|
|
</para>
|
|
|
|
<para>
|
|
Merge joins also have measurement artifacts that can confuse the unwary.
|
|
A merge join will stop reading one input if it's exhausted the other input
|
|
and the next key value in the one input is greater than the last key value
|
|
of the other input; in such a case there can be no more matches and so no
|
|
need to scan the rest of the first input. This results in not reading all
|
|
of one child, with results like those mentioned for <literal>LIMIT</literal>.
|
|
Also, if the outer (first) child contains rows with duplicate key values,
|
|
the inner (second) child is backed up and rescanned for the portion of its
|
|
rows matching that key value. <command>EXPLAIN ANALYZE</command> counts these
|
|
repeated emissions of the same inner rows as if they were real additional
|
|
rows. When there are many outer duplicates, the reported actual row count
|
|
for the inner child plan node can be significantly larger than the number
|
|
of rows that are actually in the inner relation.
|
|
</para>
|
|
|
|
<para>
|
|
BitmapAnd and BitmapOr nodes always report their actual row counts as zero,
|
|
due to implementation limitations.
|
|
</para>
|
|
|
|
<para>
|
|
Generally, the <command>EXPLAIN</command> output will display details for
|
|
every plan node which was generated by the query planner. However, there
|
|
are cases where the executor is able to determine that certain nodes are
|
|
not required; currently, the only node types to support this are the
|
|
<literal>Append</literal> and <literal>MergeAppend</literal> nodes. These
|
|
node types have the ability to discard subnodes which they are able to
|
|
determine won't contain any records required by the query. It is possible
|
|
to determine that nodes have been removed in this way by the presence of a
|
|
"Subplans Removed" property in the <command>EXPLAIN</command> output.
|
|
</para>
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="planner-stats">
|
|
<title>Statistics Used by the Planner</title>
|
|
|
|
<indexterm zone="planner-stats">
|
|
<primary>statistics</primary>
|
|
<secondary>of the planner</secondary>
|
|
</indexterm>
|
|
|
|
<sect2>
|
|
<title>Single-Column Statistics</title>
|
|
<para>
|
|
As we saw in the previous section, the query planner needs to estimate
|
|
the number of rows retrieved by a query in order to make good choices
|
|
of query plans. This section provides a quick look at the statistics
|
|
that the system uses for these estimates.
|
|
</para>
|
|
|
|
<para>
|
|
One component of the statistics is the total number of entries in
|
|
each table and index, as well as the number of disk blocks occupied
|
|
by each table and index. This information is kept in the table
|
|
<link linkend="catalog-pg-class"><structname>pg_class</structname></link>,
|
|
in the columns <structfield>reltuples</structfield> and
|
|
<structfield>relpages</structfield>. We can look at it with
|
|
queries similar to this one:
|
|
|
|
<screen>
|
|
SELECT relname, relkind, reltuples, relpages
|
|
FROM pg_class
|
|
WHERE relname LIKE 'tenk1%';
|
|
|
|
relname | relkind | reltuples | relpages
|
|
----------------------+---------+-----------+----------
|
|
tenk1 | r | 10000 | 358
|
|
tenk1_hundred | i | 10000 | 30
|
|
tenk1_thous_tenthous | i | 10000 | 30
|
|
tenk1_unique1 | i | 10000 | 30
|
|
tenk1_unique2 | i | 10000 | 30
|
|
(5 rows)
|
|
</screen>
|
|
|
|
Here we can see that <structname>tenk1</structname> contains 10000
|
|
rows, as do its indexes, but the indexes are (unsurprisingly) much
|
|
smaller than the table.
|
|
</para>
|
|
|
|
<para>
|
|
For efficiency reasons, <structfield>reltuples</structfield>
|
|
and <structfield>relpages</structfield> are not updated on-the-fly,
|
|
and so they usually contain somewhat out-of-date values.
|
|
They are updated by <command>VACUUM</command>, <command>ANALYZE</command>, and a
|
|
few DDL commands such as <command>CREATE INDEX</command>. A <command>VACUUM</command>
|
|
or <command>ANALYZE</command> operation that does not scan the entire table
|
|
(which is commonly the case) will incrementally update the
|
|
<structfield>reltuples</structfield> count on the basis of the part
|
|
of the table it did scan, resulting in an approximate value.
|
|
In any case, the planner
|
|
will scale the values it finds in <structname>pg_class</structname>
|
|
to match the current physical table size, thus obtaining a closer
|
|
approximation.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>pg_statistic</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Most queries retrieve only a fraction of the rows in a table, due
|
|
to <literal>WHERE</literal> clauses that restrict the rows to be
|
|
examined. The planner thus needs to make an estimate of the
|
|
<firstterm>selectivity</firstterm> of <literal>WHERE</literal> clauses, that is,
|
|
the fraction of rows that match each condition in the
|
|
<literal>WHERE</literal> clause. The information used for this task is
|
|
stored in the
|
|
<link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link>
|
|
system catalog. Entries in <structname>pg_statistic</structname>
|
|
are updated by the <command>ANALYZE</command> and <command>VACUUM
|
|
ANALYZE</command> commands, and are always approximate even when freshly
|
|
updated.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>pg_stats</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Rather than look at <structname>pg_statistic</structname> directly,
|
|
it's better to look at its view
|
|
<link linkend="view-pg-stats"><structname>pg_stats</structname></link>
|
|
when examining the statistics manually. <structname>pg_stats</structname>
|
|
is designed to be more easily readable. Furthermore,
|
|
<structname>pg_stats</structname> is readable by all, whereas
|
|
<structname>pg_statistic</structname> is only readable by a superuser.
|
|
(This prevents unprivileged users from learning something about
|
|
the contents of other people's tables from the statistics. The
|
|
<structname>pg_stats</structname> view is restricted to show only
|
|
rows about tables that the current user can read.)
|
|
For example, we might do:
|
|
|
|
<screen>
|
|
SELECT attname, inherited, n_distinct,
|
|
array_to_string(most_common_vals, E'\n') as most_common_vals
|
|
FROM pg_stats
|
|
WHERE tablename = 'road';
|
|
|
|
attname | inherited | n_distinct | most_common_vals
|
|
---------+-----------+------------+------------------------------------
|
|
name | f | -0.363388 | I- 580 Ramp+
|
|
| | | I- 880 Ramp+
|
|
| | | Sp Railroad +
|
|
| | | I- 580 +
|
|
| | | I- 680 Ramp
|
|
name | t | -0.284859 | I- 880 Ramp+
|
|
| | | I- 580 Ramp+
|
|
| | | I- 680 Ramp+
|
|
| | | I- 580 +
|
|
| | | State Hwy 13 Ramp
|
|
(2 rows)
|
|
</screen>
|
|
|
|
Note that two rows are displayed for the same column, one corresponding
|
|
to the complete inheritance hierarchy starting at the
|
|
<literal>road</literal> table (<literal>inherited</literal>=<literal>t</literal>),
|
|
and another one including only the <literal>road</literal> table itself
|
|
(<literal>inherited</literal>=<literal>f</literal>).
|
|
</para>
|
|
|
|
<para>
|
|
The amount of information stored in <structname>pg_statistic</structname>
|
|
by <command>ANALYZE</command>, in particular the maximum number of entries in the
|
|
<structfield>most_common_vals</structfield> and <structfield>histogram_bounds</structfield>
|
|
arrays for each column, can be set on a
|
|
column-by-column basis using the <command>ALTER TABLE SET STATISTICS</command>
|
|
command, or globally by setting the
|
|
<xref linkend="guc-default-statistics-target"/> configuration variable.
|
|
The default limit is presently 100 entries. Raising the limit
|
|
might allow more accurate planner estimates to be made, particularly for
|
|
columns with irregular data distributions, at the price of consuming
|
|
more space in <structname>pg_statistic</structname> and slightly more
|
|
time to compute the estimates. Conversely, a lower limit might be
|
|
sufficient for columns with simple data distributions.
|
|
</para>
|
|
|
|
<para>
|
|
Further details about the planner's use of statistics can be found in
|
|
<xref linkend="planner-stats-details"/>.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="planner-stats-extended">
|
|
<title>Extended Statistics</title>
|
|
|
|
<indexterm zone="planner-stats-extended">
|
|
<primary>statistics</primary>
|
|
<secondary>of the planner</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>correlation</primary>
|
|
<secondary>in the query planner</secondary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>pg_statistic_ext</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
It is common to see slow queries running bad execution plans because
|
|
multiple columns used in the query clauses are correlated.
|
|
The planner normally assumes that multiple conditions
|
|
are independent of each other,
|
|
an assumption that does not hold when column values are correlated.
|
|
Regular statistics, because of their per-individual-column nature,
|
|
cannot capture any knowledge about cross-column correlation.
|
|
However, <productname>PostgreSQL</productname> has the ability to compute
|
|
<firstterm>multivariate statistics</firstterm>, which can capture
|
|
such information.
|
|
</para>
|
|
|
|
<para>
|
|
Because the number of possible column combinations is very large,
|
|
it's impractical to compute multivariate statistics automatically.
|
|
Instead, <firstterm>extended statistics objects</firstterm>, more often
|
|
called just <firstterm>statistics objects</firstterm>, can be created to instruct
|
|
the server to obtain statistics across interesting sets of columns.
|
|
</para>
|
|
|
|
<para>
|
|
Statistics objects are created using
|
|
<xref linkend="sql-createstatistics"/>, which see for more details.
|
|
Creation of such an object merely creates a catalog entry expressing
|
|
interest in the statistics. Actual data collection is performed
|
|
by <command>ANALYZE</command> (either a manual command, or background
|
|
auto-analyze). The collected values can be examined in the
|
|
<link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
|
|
catalog.
|
|
</para>
|
|
|
|
<para>
|
|
<command>ANALYZE</command> computes extended statistics based on the same
|
|
sample of table rows that it takes for computing regular single-column
|
|
statistics. Since the sample size is increased by increasing the
|
|
statistics target for the table or any of its columns (as described in
|
|
the previous section), a larger statistics target will normally result in
|
|
more accurate extended statistics, as well as more time spent calculating
|
|
them.
|
|
</para>
|
|
|
|
<para>
|
|
The following subsections describe the kinds of extended statistics
|
|
that are currently supported.
|
|
</para>
|
|
|
|
<sect3>
|
|
<title>Functional Dependencies</title>
|
|
|
|
<para>
|
|
The simplest kind of extended statistics tracks <firstterm>functional
|
|
dependencies</firstterm>, a concept used in definitions of database normal forms.
|
|
We say that column <structfield>b</structfield> is functionally dependent on
|
|
column <structfield>a</structfield> if knowledge of the value of
|
|
<structfield>a</structfield> is sufficient to determine the value
|
|
of <structfield>b</structfield>, that is there are no two rows having the same value
|
|
of <structfield>a</structfield> but different values of <structfield>b</structfield>.
|
|
In a fully normalized database, functional dependencies should exist
|
|
only on primary keys and superkeys. However, in practice many data sets
|
|
are not fully normalized for various reasons; intentional
|
|
denormalization for performance reasons is a common example.
|
|
Even in a fully normalized database, there may be partial correlation
|
|
between some columns, which can be expressed as partial functional
|
|
dependency.
|
|
</para>
|
|
|
|
<para>
|
|
The existence of functional dependencies directly affects the accuracy
|
|
of estimates in certain queries. If a query contains conditions on
|
|
both the independent and the dependent column(s), the
|
|
conditions on the dependent columns do not further reduce the result
|
|
size; but without knowledge of the functional dependency, the query
|
|
planner will assume that the conditions are independent, resulting
|
|
in underestimating the result size.
|
|
</para>
|
|
|
|
<para>
|
|
To inform the planner about functional dependencies, <command>ANALYZE</command>
|
|
can collect measurements of cross-column dependency. Assessing the
|
|
degree of dependency between all sets of columns would be prohibitively
|
|
expensive, so data collection is limited to those groups of columns
|
|
appearing together in a statistics object defined with
|
|
the <literal>dependencies</literal> option. It is advisable to create
|
|
<literal>dependencies</literal> statistics only for column groups that are
|
|
strongly correlated, to avoid unnecessary overhead in both
|
|
<command>ANALYZE</command> and later query planning.
|
|
</para>
|
|
|
|
<para>
|
|
Here is an example of collecting functional-dependency statistics:
|
|
<programlisting>
|
|
CREATE STATISTICS stts (dependencies) ON zip, city FROM zipcodes;
|
|
|
|
ANALYZE zipcodes;
|
|
|
|
SELECT stxname, stxkeys, stxdependencies
|
|
FROM pg_statistic_ext
|
|
WHERE stxname = 'stts';
|
|
stxname | stxkeys | stxdependencies
|
|
---------+---------+------------------------------------------
|
|
stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
|
|
(1 row)
|
|
</programlisting>
|
|
Here it can be seen that column 1 (zip code) fully determines column
|
|
5 (city) so the coefficient is 1.0, while city only determines zip code
|
|
about 42% of the time, meaning that there are many cities (58%) that are
|
|
represented by more than a single ZIP code.
|
|
</para>
|
|
|
|
<para>
|
|
When computing the selectivity for a query involving functionally
|
|
dependent columns, the planner adjusts the per-condition selectivity
|
|
estimates using the dependency coefficients so as not to produce
|
|
an underestimate.
|
|
</para>
|
|
|
|
<sect4>
|
|
<title>Limitations of Functional Dependencies</title>
|
|
|
|
<para>
|
|
Functional dependencies are currently only applied when considering
|
|
simple equality conditions that compare columns to constant values.
|
|
They are not used to improve estimates for equality conditions
|
|
comparing two columns or comparing a column to an expression, nor for
|
|
range clauses, <literal>LIKE</literal> or any other type of condition.
|
|
</para>
|
|
|
|
<para>
|
|
When estimating with functional dependencies, the planner assumes that
|
|
conditions on the involved columns are compatible and hence redundant.
|
|
If they are incompatible, the correct estimate would be zero rows, but
|
|
that possibility is not considered. For example, given a query like
|
|
<programlisting>
|
|
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
|
|
</programlisting>
|
|
the planner will disregard the <structfield>city</structfield> clause as not
|
|
changing the selectivity, which is correct. However, it will make
|
|
the same assumption about
|
|
<programlisting>
|
|
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
|
|
</programlisting>
|
|
even though there will really be zero rows satisfying this query.
|
|
Functional dependency statistics do not provide enough information
|
|
to conclude that, however.
|
|
</para>
|
|
|
|
<para>
|
|
In many practical situations, this assumption is usually satisfied;
|
|
for example, there might be a GUI in the application that only allows
|
|
selecting compatible city and ZIP code values to use in a query.
|
|
But if that's not the case, functional dependencies may not be a viable
|
|
option.
|
|
</para>
|
|
</sect4>
|
|
</sect3>
|
|
|
|
<sect3>
|
|
<title>Multivariate N-Distinct Counts</title>
|
|
|
|
<para>
|
|
Single-column statistics store the number of distinct values in each
|
|
column. Estimates of the number of distinct values when combining more
|
|
than one column (for example, for <literal>GROUP BY a, b</literal>) are
|
|
frequently wrong when the planner only has single-column statistical
|
|
data, causing it to select bad plans.
|
|
</para>
|
|
|
|
<para>
|
|
To improve such estimates, <command>ANALYZE</command> can collect n-distinct
|
|
statistics for groups of columns. As before, it's impractical to do
|
|
this for every possible column grouping, so data is collected only for
|
|
those groups of columns appearing together in a statistics object
|
|
defined with the <literal>ndistinct</literal> option. Data will be collected
|
|
for each possible combination of two or more columns from the set of
|
|
listed columns.
|
|
</para>
|
|
|
|
<para>
|
|
Continuing the previous example, the n-distinct counts in a
|
|
table of ZIP codes might look like the following:
|
|
<programlisting>
|
|
CREATE STATISTICS stts2 (ndistinct) ON zip, state, city FROM zipcodes;
|
|
|
|
ANALYZE zipcodes;
|
|
|
|
SELECT stxkeys AS k, stxndistinct AS nd
|
|
FROM pg_statistic_ext
|
|
WHERE stxname = 'stts2';
|
|
-[ RECORD 1 ]--------------------------------------------------------
|
|
k | 1 2 5
|
|
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
|
|
(1 row)
|
|
</programlisting>
|
|
This indicates that there are three combinations of columns that
|
|
have 33178 distinct values: ZIP code and state; ZIP code and city;
|
|
and ZIP code, city and state (the fact that they are all equal is
|
|
expected given that ZIP code alone is unique in this table). On the
|
|
other hand, the combination of city and state has only 27435 distinct
|
|
values.
|
|
</para>
|
|
|
|
<para>
|
|
It's advisable to create <literal>ndistinct</literal> statistics objects only
|
|
on combinations of columns that are actually used for grouping, and
|
|
for which misestimation of the number of groups is resulting in bad
|
|
plans. Otherwise, the <command>ANALYZE</command> cycles are just wasted.
|
|
</para>
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="explicit-joins">
|
|
<title>Controlling the Planner with Explicit <literal>JOIN</literal> Clauses</title>
|
|
|
|
<indexterm zone="explicit-joins">
|
|
<primary>join</primary>
|
|
<secondary>controlling the order</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
It is possible
|
|
to control the query planner to some extent by using the explicit <literal>JOIN</literal>
|
|
syntax. To see why this matters, we first need some background.
|
|
</para>
|
|
|
|
<para>
|
|
In a simple join query, such as:
|
|
<programlisting>
|
|
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
|
|
</programlisting>
|
|
the planner is free to join the given tables in any order. For
|
|
example, it could generate a query plan that joins A to B, using
|
|
the <literal>WHERE</literal> condition <literal>a.id = b.id</literal>, and then
|
|
joins C to this joined table, using the other <literal>WHERE</literal>
|
|
condition. Or it could join B to C and then join A to that result.
|
|
Or it could join A to C and then join them with B — but that
|
|
would be inefficient, since the full Cartesian product of A and C
|
|
would have to be formed, there being no applicable condition in the
|
|
<literal>WHERE</literal> clause to allow optimization of the join. (All
|
|
joins in the <productname>PostgreSQL</productname> executor happen
|
|
between two input tables, so it's necessary to build up the result
|
|
in one or another of these fashions.) The important point is that
|
|
these different join possibilities give semantically equivalent
|
|
results but might have hugely different execution costs. Therefore,
|
|
the planner will explore all of them to try to find the most
|
|
efficient query plan.
|
|
</para>
|
|
|
|
<para>
|
|
When a query only involves two or three tables, there aren't many join
|
|
orders to worry about. But the number of possible join orders grows
|
|
exponentially as the number of tables expands. Beyond ten or so input
|
|
tables it's no longer practical to do an exhaustive search of all the
|
|
possibilities, and even for six or seven tables planning might take an
|
|
annoyingly long time. When there are too many input tables, the
|
|
<productname>PostgreSQL</productname> planner will switch from exhaustive
|
|
search to a <firstterm>genetic</firstterm> probabilistic search
|
|
through a limited number of possibilities. (The switch-over threshold is
|
|
set by the <xref linkend="guc-geqo-threshold"/> run-time
|
|
parameter.)
|
|
The genetic search takes less time, but it won't
|
|
necessarily find the best possible plan.
|
|
</para>
|
|
|
|
<para>
|
|
When the query involves outer joins, the planner has less freedom
|
|
than it does for plain (inner) joins. For example, consider:
|
|
<programlisting>
|
|
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
|
|
</programlisting>
|
|
Although this query's restrictions are superficially similar to the
|
|
previous example, the semantics are different because a row must be
|
|
emitted for each row of A that has no matching row in the join of B and C.
|
|
Therefore the planner has no choice of join order here: it must join
|
|
B to C and then join A to that result. Accordingly, this query takes
|
|
less time to plan than the previous query. In other cases, the planner
|
|
might be able to determine that more than one join order is safe.
|
|
For example, given:
|
|
<programlisting>
|
|
SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
|
|
</programlisting>
|
|
it is valid to join A to either B or C first. Currently, only
|
|
<literal>FULL JOIN</literal> completely constrains the join order. Most
|
|
practical cases involving <literal>LEFT JOIN</literal> or <literal>RIGHT JOIN</literal>
|
|
can be rearranged to some extent.
|
|
</para>
|
|
|
|
<para>
|
|
Explicit inner join syntax (<literal>INNER JOIN</literal>, <literal>CROSS
|
|
JOIN</literal>, or unadorned <literal>JOIN</literal>) is semantically the same as
|
|
listing the input relations in <literal>FROM</literal>, so it does not
|
|
constrain the join order.
|
|
</para>
|
|
|
|
<para>
|
|
Even though most kinds of <literal>JOIN</literal> don't completely constrain
|
|
the join order, it is possible to instruct the
|
|
<productname>PostgreSQL</productname> query planner to treat all
|
|
<literal>JOIN</literal> clauses as constraining the join order anyway.
|
|
For example, these three queries are logically equivalent:
|
|
<programlisting>
|
|
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
|
|
SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
|
|
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
|
|
</programlisting>
|
|
But if we tell the planner to honor the <literal>JOIN</literal> order,
|
|
the second and third take less time to plan than the first. This effect
|
|
is not worth worrying about for only three tables, but it can be a
|
|
lifesaver with many tables.
|
|
</para>
|
|
|
|
<para>
|
|
To force the planner to follow the join order laid out by explicit
|
|
<literal>JOIN</literal>s,
|
|
set the <xref linkend="guc-join-collapse-limit"/> run-time parameter to 1.
|
|
(Other possible values are discussed below.)
|
|
</para>
|
|
|
|
<para>
|
|
You do not need to constrain the join order completely in order to
|
|
cut search time, because it's OK to use <literal>JOIN</literal> operators
|
|
within items of a plain <literal>FROM</literal> list. For example, consider:
|
|
<programlisting>
|
|
SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
|
|
</programlisting>
|
|
With <varname>join_collapse_limit</varname> = 1, this
|
|
forces the planner to join A to B before joining them to other tables,
|
|
but doesn't constrain its choices otherwise. In this example, the
|
|
number of possible join orders is reduced by a factor of 5.
|
|
</para>
|
|
|
|
<para>
|
|
Constraining the planner's search in this way is a useful technique
|
|
both for reducing planning time and for directing the planner to a
|
|
good query plan. If the planner chooses a bad join order by default,
|
|
you can force it to choose a better order via <literal>JOIN</literal> syntax
|
|
— assuming that you know of a better order, that is. Experimentation
|
|
is recommended.
|
|
</para>
|
|
|
|
<para>
|
|
A closely related issue that affects planning time is collapsing of
|
|
subqueries into their parent query. For example, consider:
|
|
<programlisting>
|
|
SELECT *
|
|
FROM x, y,
|
|
(SELECT * FROM a, b, c WHERE something) AS ss
|
|
WHERE somethingelse;
|
|
</programlisting>
|
|
This situation might arise from use of a view that contains a join;
|
|
the view's <literal>SELECT</literal> rule will be inserted in place of the view
|
|
reference, yielding a query much like the above. Normally, the planner
|
|
will try to collapse the subquery into the parent, yielding:
|
|
<programlisting>
|
|
SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
|
|
</programlisting>
|
|
This usually results in a better plan than planning the subquery
|
|
separately. (For example, the outer <literal>WHERE</literal> conditions might be such that
|
|
joining X to A first eliminates many rows of A, thus avoiding the need to
|
|
form the full logical output of the subquery.) But at the same time,
|
|
we have increased the planning time; here, we have a five-way join
|
|
problem replacing two separate three-way join problems. Because of the
|
|
exponential growth of the number of possibilities, this makes a big
|
|
difference. The planner tries to avoid getting stuck in huge join search
|
|
problems by not collapsing a subquery if more than <varname>from_collapse_limit</varname>
|
|
<literal>FROM</literal> items would result in the parent
|
|
query. You can trade off planning time against quality of plan by
|
|
adjusting this run-time parameter up or down.
|
|
</para>
|
|
|
|
<para>
|
|
<xref linkend="guc-from-collapse-limit"/> and <xref
|
|
linkend="guc-join-collapse-limit"/>
|
|
are similarly named because they do almost the same thing: one controls
|
|
when the planner will <quote>flatten out</quote> subqueries, and the
|
|
other controls when it will flatten out explicit joins. Typically
|
|
you would either set <varname>join_collapse_limit</varname> equal to
|
|
<varname>from_collapse_limit</varname> (so that explicit joins and subqueries
|
|
act similarly) or set <varname>join_collapse_limit</varname> to 1 (if you want
|
|
to control join order with explicit joins). But you might set them
|
|
differently if you are trying to fine-tune the trade-off between planning
|
|
time and run time.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="populate">
|
|
<title>Populating a Database</title>
|
|
|
|
<para>
|
|
One might need to insert a large amount of data when first populating
|
|
a database. This section contains some suggestions on how to make
|
|
this process as efficient as possible.
|
|
</para>
|
|
|
|
<sect2 id="disable-autocommit">
|
|
<title>Disable Autocommit</title>
|
|
|
|
<indexterm>
|
|
<primary>autocommit</primary>
|
|
<secondary>bulk-loading data</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
When using multiple <command>INSERT</command>s, turn off autocommit and just do
|
|
one commit at the end. (In plain
|
|
SQL, this means issuing <command>BEGIN</command> at the start and
|
|
<command>COMMIT</command> at the end. Some client libraries might
|
|
do this behind your back, in which case you need to make sure the
|
|
library does it when you want it done.) If you allow each
|
|
insertion to be committed separately,
|
|
<productname>PostgreSQL</productname> is doing a lot of work for
|
|
each row that is added. An additional benefit of doing all
|
|
insertions in one transaction is that if the insertion of one row
|
|
were to fail then the insertion of all rows inserted up to that
|
|
point would be rolled back, so you won't be stuck with partially
|
|
loaded data.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-copy-from">
|
|
<title>Use <command>COPY</command></title>
|
|
|
|
<para>
|
|
Use <xref linkend="sql-copy"/> to load
|
|
all the rows in one command, instead of using a series of
|
|
<command>INSERT</command> commands. The <command>COPY</command>
|
|
command is optimized for loading large numbers of rows; it is less
|
|
flexible than <command>INSERT</command>, but incurs significantly
|
|
less overhead for large data loads. Since <command>COPY</command>
|
|
is a single command, there is no need to disable autocommit if you
|
|
use this method to populate a table.
|
|
</para>
|
|
|
|
<para>
|
|
If you cannot use <command>COPY</command>, it might help to use <xref
|
|
linkend="sql-prepare"/> to create a
|
|
prepared <command>INSERT</command> statement, and then use
|
|
<command>EXECUTE</command> as many times as required. This avoids
|
|
some of the overhead of repeatedly parsing and planning
|
|
<command>INSERT</command>. Different interfaces provide this facility
|
|
in different ways; look for <quote>prepared statements</quote> in the interface
|
|
documentation.
|
|
</para>
|
|
|
|
<para>
|
|
Note that loading a large number of rows using
|
|
<command>COPY</command> is almost always faster than using
|
|
<command>INSERT</command>, even if <command>PREPARE</command> is used and
|
|
multiple insertions are batched into a single transaction.
|
|
</para>
|
|
|
|
<para>
|
|
<command>COPY</command> is fastest when used within the same
|
|
transaction as an earlier <command>CREATE TABLE</command> or
|
|
<command>TRUNCATE</command> command. In such cases no WAL
|
|
needs to be written, because in case of an error, the files
|
|
containing the newly loaded data will be removed anyway.
|
|
However, this consideration only applies when
|
|
<xref linkend="guc-wal-level"/> is <literal>minimal</literal> as all commands
|
|
must write WAL otherwise.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="populate-rm-indexes">
|
|
<title>Remove Indexes</title>
|
|
|
|
<para>
|
|
If you are loading a freshly created table, the fastest method is to
|
|
create the table, bulk load the table's data using
|
|
<command>COPY</command>, then create any indexes needed for the
|
|
table. Creating an index on pre-existing data is quicker than
|
|
updating it incrementally as each row is loaded.
|
|
</para>
|
|
|
|
<para>
|
|
If you are adding large amounts of data to an existing table,
|
|
it might be a win to drop the indexes,
|
|
load the table, and then recreate the indexes. Of course, the
|
|
database performance for other users might suffer
|
|
during the time the indexes are missing. One should also think
|
|
twice before dropping a unique index, since the error checking
|
|
afforded by the unique constraint will be lost while the index is
|
|
missing.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-rm-fkeys">
|
|
<title>Remove Foreign Key Constraints</title>
|
|
|
|
<para>
|
|
Just as with indexes, a foreign key constraint can be checked
|
|
<quote>in bulk</quote> more efficiently than row-by-row. So it might be
|
|
useful to drop foreign key constraints, load data, and re-create
|
|
the constraints. Again, there is a trade-off between data load
|
|
speed and loss of error checking while the constraint is missing.
|
|
</para>
|
|
|
|
<para>
|
|
What's more, when you load data into a table with existing foreign key
|
|
constraints, each new row requires an entry in the server's list of
|
|
pending trigger events (since it is the firing of a trigger that checks
|
|
the row's foreign key constraint). Loading many millions of rows can
|
|
cause the trigger event queue to overflow available memory, leading to
|
|
intolerable swapping or even outright failure of the command. Therefore
|
|
it may be <emphasis>necessary</emphasis>, not just desirable, to drop and re-apply
|
|
foreign keys when loading large amounts of data. If temporarily removing
|
|
the constraint isn't acceptable, the only other recourse may be to split
|
|
up the load operation into smaller transactions.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-work-mem">
|
|
<title>Increase <varname>maintenance_work_mem</varname></title>
|
|
|
|
<para>
|
|
Temporarily increasing the <xref linkend="guc-maintenance-work-mem"/>
|
|
configuration variable when loading large amounts of data can
|
|
lead to improved performance. This will help to speed up <command>CREATE
|
|
INDEX</command> commands and <command>ALTER TABLE ADD FOREIGN KEY</command> commands.
|
|
It won't do much for <command>COPY</command> itself, so this advice is
|
|
only useful when you are using one or both of the above techniques.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-max-wal-size">
|
|
<title>Increase <varname>max_wal_size</varname></title>
|
|
|
|
<para>
|
|
Temporarily increasing the <xref linkend="guc-max-wal-size"/>
|
|
configuration variable can also
|
|
make large data loads faster. This is because loading a large
|
|
amount of data into <productname>PostgreSQL</productname> will
|
|
cause checkpoints to occur more often than the normal checkpoint
|
|
frequency (specified by the <varname>checkpoint_timeout</varname>
|
|
configuration variable). Whenever a checkpoint occurs, all dirty
|
|
pages must be flushed to disk. By increasing
|
|
<varname>max_wal_size</varname> temporarily during bulk
|
|
data loads, the number of checkpoints that are required can be
|
|
reduced.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-pitr">
|
|
<title>Disable WAL Archival and Streaming Replication</title>
|
|
|
|
<para>
|
|
When loading large amounts of data into an installation that uses
|
|
WAL archiving or streaming replication, it might be faster to take a
|
|
new base backup after the load has completed than to process a large
|
|
amount of incremental WAL data. To prevent incremental WAL logging
|
|
while loading, disable archiving and streaming replication, by setting
|
|
<xref linkend="guc-wal-level"/> to <literal>minimal</literal>,
|
|
<xref linkend="guc-archive-mode"/> to <literal>off</literal>, and
|
|
<xref linkend="guc-max-wal-senders"/> to zero.
|
|
But note that changing these settings requires a server restart.
|
|
</para>
|
|
|
|
<para>
|
|
Aside from avoiding the time for the archiver or WAL sender to
|
|
process the WAL data,
|
|
doing this will actually make certain commands faster, because they
|
|
are designed not to write WAL at all if <varname>wal_level</varname>
|
|
is <literal>minimal</literal>. (They can guarantee crash safety more cheaply
|
|
by doing an <function>fsync</function> at the end than by writing WAL.)
|
|
This applies to the following commands:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<command>CREATE TABLE AS SELECT</command>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<command>CREATE INDEX</command> (and variants such as
|
|
<command>ALTER TABLE ADD PRIMARY KEY</command>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<command>ALTER TABLE SET TABLESPACE</command>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<command>CLUSTER</command>
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<command>COPY FROM</command>, when the target table has been
|
|
created or truncated earlier in the same transaction
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-analyze">
|
|
<title>Run <command>ANALYZE</command> Afterwards</title>
|
|
|
|
<para>
|
|
Whenever you have significantly altered the distribution of data
|
|
within a table, running <xref linkend="sql-analyze"/> is strongly recommended. This
|
|
includes bulk loading large amounts of data into the table. Running
|
|
<command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
|
|
ensures that the planner has up-to-date statistics about the
|
|
table. With no statistics or obsolete statistics, the planner might
|
|
make poor decisions during query planning, leading to poor
|
|
performance on any tables with inaccurate or nonexistent
|
|
statistics. Note that if the autovacuum daemon is enabled, it might
|
|
run <command>ANALYZE</command> automatically; see
|
|
<xref linkend="vacuum-for-statistics"/>
|
|
and <xref linkend="autovacuum"/> for more information.
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="populate-pg-dump">
|
|
<title>Some Notes About <application>pg_dump</application></title>
|
|
|
|
<para>
|
|
Dump scripts generated by <application>pg_dump</application> automatically apply
|
|
several, but not all, of the above guidelines. To reload a
|
|
<application>pg_dump</application> dump as quickly as possible, you need to
|
|
do a few extra things manually. (Note that these points apply while
|
|
<emphasis>restoring</emphasis> a dump, not while <emphasis>creating</emphasis> it.
|
|
The same points apply whether loading a text dump with
|
|
<application>psql</application> or using <application>pg_restore</application> to load
|
|
from a <application>pg_dump</application> archive file.)
|
|
</para>
|
|
|
|
<para>
|
|
By default, <application>pg_dump</application> uses <command>COPY</command>, and when
|
|
it is generating a complete schema-and-data dump, it is careful to
|
|
load data before creating indexes and foreign keys. So in this case
|
|
several guidelines are handled automatically. What is left
|
|
for you to do is to:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Set appropriate (i.e., larger than normal) values for
|
|
<varname>maintenance_work_mem</varname> and
|
|
<varname>max_wal_size</varname>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If using WAL archiving or streaming replication, consider disabling
|
|
them during the restore. To do that, set <varname>archive_mode</varname>
|
|
to <literal>off</literal>,
|
|
<varname>wal_level</varname> to <literal>minimal</literal>, and
|
|
<varname>max_wal_senders</varname> to zero before loading the dump.
|
|
Afterwards, set them back to the right values and take a fresh
|
|
base backup.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Experiment with the parallel dump and restore modes of both
|
|
<application>pg_dump</application> and <application>pg_restore</application> and find the
|
|
optimal number of concurrent jobs to use. Dumping and restoring in
|
|
parallel by means of the <option>-j</option> option should give you a
|
|
significantly higher performance over the serial mode.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Consider whether the whole dump should be restored as a single
|
|
transaction. To do that, pass the <option>-1</option> or
|
|
<option>--single-transaction</option> command-line option to
|
|
<application>psql</application> or <application>pg_restore</application>. When using this
|
|
mode, even the smallest of errors will rollback the entire restore,
|
|
possibly discarding many hours of processing. Depending on how
|
|
interrelated the data is, that might seem preferable to manual cleanup,
|
|
or not. <command>COPY</command> commands will run fastest if you use a single
|
|
transaction and have WAL archiving turned off.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If multiple CPUs are available in the database server, consider using
|
|
<application>pg_restore</application>'s <option>--jobs</option> option. This
|
|
allows concurrent data loading and index creation.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Run <command>ANALYZE</command> afterwards.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
A data-only dump will still use <command>COPY</command>, but it does not
|
|
drop or recreate indexes, and it does not normally touch foreign
|
|
keys.
|
|
|
|
<footnote>
|
|
<para>
|
|
You can get the effect of disabling foreign keys by using
|
|
the <option>--disable-triggers</option> option — but realize that
|
|
that eliminates, rather than just postpones, foreign key
|
|
validation, and so it is possible to insert bad data if you use it.
|
|
</para>
|
|
</footnote>
|
|
|
|
So when loading a data-only dump, it is up to you to drop and recreate
|
|
indexes and foreign keys if you wish to use those techniques.
|
|
It's still useful to increase <varname>max_wal_size</varname>
|
|
while loading the data, but don't bother increasing
|
|
<varname>maintenance_work_mem</varname>; rather, you'd do that while
|
|
manually recreating indexes and foreign keys afterwards.
|
|
And don't forget to <command>ANALYZE</command> when you're done; see
|
|
<xref linkend="vacuum-for-statistics"/>
|
|
and <xref linkend="autovacuum"/> for more information.
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="non-durability">
|
|
<title>Non-Durable Settings</title>
|
|
|
|
<indexterm zone="non-durability">
|
|
<primary>non-durable</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Durability is a database feature that guarantees the recording of
|
|
committed transactions even if the server crashes or loses
|
|
power. However, durability adds significant database overhead,
|
|
so if your site does not require such a guarantee,
|
|
<productname>PostgreSQL</productname> can be configured to run
|
|
much faster. The following are configuration changes you can make
|
|
to improve performance in such cases. Except as noted below, durability
|
|
is still guaranteed in case of a crash of the database software;
|
|
only abrupt operating system stoppage creates a risk of data loss
|
|
or corruption when these settings are used.
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Place the database cluster's data directory in a memory-backed
|
|
file system (i.e. <acronym>RAM</acronym> disk). This eliminates all
|
|
database disk I/O, but limits data storage to the amount of
|
|
available memory (and perhaps swap).
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Turn off <xref linkend="guc-fsync"/>; there is no need to flush
|
|
data to disk.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Turn off <xref linkend="guc-synchronous-commit"/>; there might be no
|
|
need to force <acronym>WAL</acronym> writes to disk on every
|
|
commit. This setting does risk transaction loss (though not data
|
|
corruption) in case of a crash of the <emphasis>database</emphasis>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Turn off <xref linkend="guc-full-page-writes"/>; there is no need
|
|
to guard against partial page writes.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Increase <xref linkend="guc-max-wal-size"/> and <xref
|
|
linkend="guc-checkpoint-timeout"/>; this reduces the frequency
|
|
of checkpoints, but increases the storage requirements of
|
|
<filename>/pg_wal</filename>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Create <link linkend="sql-createtable-unlogged">unlogged
|
|
tables</link> to avoid <acronym>WAL</acronym> writes, though it
|
|
makes the tables non-crash-safe.
|
|
</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</para>
|
|
</sect1>
|
|
|
|
</chapter>
|