<!--
$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.9 2001/09/13 15:55:23 petere Exp $
-->

<chapter id="performance-tips">
<title>Performance Tips</title>

<para>
Query performance can be affected by many things. Some of these can
be manipulated by the user, while others are fundamental to the underlying
design of the system. This chapter provides some hints about understanding
and tuning <productname>Postgres</productname> performance.
</para>

<sect1 id="using-explain">
<title>Using <command>EXPLAIN</command></title>

<para>
<productname>Postgres</productname> devises a <firstterm>query
plan</firstterm> for each query it is given. Choosing the right
plan to match the query structure and the properties of the data
is absolutely critical for good performance. You can use the
<command>EXPLAIN</command> command to see what query plan the system
creates for any query.
Plan-reading is an art that deserves an extensive tutorial, which
this is not; but here is some basic information.
</para>

<para>
The numbers that are currently quoted by <command>EXPLAIN</command> are:

<itemizedlist>
<listitem>
<para>
Estimated start-up cost (time expended before the output scan can start,
e.g., time to do the sorting in a SORT node).
</para>
</listitem>

<listitem>
<para>
Estimated total cost (if all tuples are retrieved, which they may not
be --- a query with a LIMIT will stop short of paying the total cost,
for example).
</para>
</listitem>

<listitem>
<para>
Estimated number of rows output by this plan node (again, without
regard for any LIMIT).
</para>
</listitem>

<listitem>
<para>
Estimated average width (in bytes) of rows output by this plan
node.
</para>
</listitem>
</itemizedlist>
</para>

<para>
The costs are measured in units of disk page fetches. (CPU effort
estimates are converted into disk-page units using some
fairly arbitrary fudge-factors. If you want to experiment with these
factors, see the list of run-time configuration parameters in the
<citetitle>Administrator's Guide</citetitle>.)
</para>
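
<para>
For instance, you can inspect and change the cost factors on the fly with
<command>SHOW</command> and <command>SET</command> (a minimal sketch;
<varname>CPU_TUPLE_COST</varname> is one such parameter, and 0.01 is merely
its usual default):

<programlisting>
regression=# show cpu_tuple_cost;
regression=# set cpu_tuple_cost = 0.01;
</programlisting>
</para>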

<para>
It's important to note that the cost of an upper-level node includes
the cost of all its child nodes. It's also important to realize that
the cost only reflects things that the planner/optimizer cares about.
In particular, the cost does not consider the time spent transmitting
result tuples to the frontend --- which could be a pretty dominant
factor in the true elapsed time, but the planner ignores it because
it cannot change it by altering the plan. (Every correct plan will
output the same tuple set, we trust.)
</para>

<para>
Rows output is a little tricky because it is <emphasis>not</emphasis> the
number of rows
processed/scanned by the query --- it is usually less, reflecting the
estimated selectivity of any WHERE-clause constraints that are being
applied at this node. Ideally the top-level rows estimate will
approximate the number of rows actually returned, updated, or deleted
by the query.
</para>

<para>
Here are some examples (using the regress test database after a
vacuum analyze, and 7.2 development sources):

<programlisting>
regression=# explain select * from tenk1;
NOTICE: QUERY PLAN:

Seq Scan on tenk1 (cost=0.00..333.00 rows=10000 width=148)
</programlisting>
</para>

<para>
This is about as straightforward as it gets. If you do

<programlisting>
select * from pg_class where relname = 'tenk1';
</programlisting>

you will find out that <classname>tenk1</classname> has 233 disk
pages and 10000 tuples. So the cost is estimated at 233 page
reads, defined as 1.0 apiece, plus 10000 * <varname>cpu_tuple_cost</varname> which is
currently 0.01 (try <command>show cpu_tuple_cost</command>).
</para>
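
<para>
Putting those numbers together reproduces the quoted total (just arithmetic
on the figures above):

<programlisting>
233 page reads * 1.0           = 233.00
10000 tuples * cpu_tuple_cost  = 100.00
                                 ------
estimated total cost             333.00
</programlisting>
</para>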

<para>
Now let's modify the query to add a qualification clause:

<programlisting>
regression=# explain select * from tenk1 where unique1 < 1000;
NOTICE: QUERY PLAN:

Seq Scan on tenk1 (cost=0.00..358.00 rows=1003 width=148)
</programlisting>

The estimate of output rows has gone down because of the WHERE clause.
However, the scan will still have to visit all 10000 rows, so the cost
hasn't decreased; in fact it has gone up a bit to reflect the extra CPU
time spent checking the WHERE condition.
</para>
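
<para>
The increase is the charge for evaluating the WHERE clause once per row.
Roughly (assuming the default <varname>CPU_OPERATOR_COST</varname> of 0.0025;
your installation's setting may differ):

<programlisting>
233 page reads * 1.0              = 233.00
10000 tuples * cpu_tuple_cost     = 100.00
10000 tuples * cpu_operator_cost  =  25.00
                                    ------
estimated total cost                358.00
</programlisting>
</para>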

<para>
The actual number of rows this query would select is 1000, but the
estimate is only approximate. If you try to duplicate this experiment,
you will probably get a slightly different estimate; moreover, it will
change after each <command>ANALYZE</command> command, because the
statistics produced by <command>ANALYZE</command> are taken from a
randomized sample of the table.
</para>

<para>
Modify the query to restrict the qualification even more:

<programlisting>
regression=# explain select * from tenk1 where unique1 < 50;
NOTICE: QUERY PLAN:

Index Scan using tenk1_unique1 on tenk1 (cost=0.00..173.32 rows=47 width=148)
</programlisting>

and you will see that if we make the WHERE condition selective
enough, the planner will
eventually decide that an index scan is cheaper than a sequential scan.
This plan will only have to visit 50 tuples because of the index,
so it wins despite the fact that each individual fetch is more expensive
than reading a whole disk page sequentially.
</para>

<para>
Add another condition to the qualification:

<programlisting>
regression=# explain select * from tenk1 where unique1 < 50 and
regression-# stringu1 = 'xxx';
NOTICE: QUERY PLAN:

Index Scan using tenk1_unique1 on tenk1 (cost=0.00..173.44 rows=1 width=148)
</programlisting>

The added clause <literal>stringu1 = 'xxx'</literal> reduces the output-rows estimate,
but not the cost because we still have to visit the same set of tuples.
</para>

<para>
Let's try joining two tables, using the fields we have been discussing:

<programlisting>
regression=# explain select * from tenk1 t1, tenk2 t2 where t1.unique1 < 50
regression-# and t1.unique2 = t2.unique2;
NOTICE: QUERY PLAN:

Nested Loop (cost=0.00..269.11 rows=47 width=296)
  -> Index Scan using tenk1_unique1 on tenk1 t1
       (cost=0.00..173.32 rows=47 width=148)
  -> Index Scan using tenk2_unique2 on tenk2 t2
       (cost=0.00..2.01 rows=1 width=148)
</programlisting>
</para>

<para>
In this nested-loop join, the outer scan is the same index scan we had
in the example before last, and so its cost and row count are the same
because we are applying the <literal>unique1 < 50</literal> WHERE clause at that node.
The <literal>t1.unique2 = t2.unique2</literal> clause is not relevant yet, so it doesn't
affect the row count of the outer scan. For the inner scan, the unique2 value of the
current outer-scan tuple is plugged into the inner index scan
to produce an index qualification like
<literal>t2.unique2 = <replaceable>constant</replaceable></literal>. So we get the
same inner-scan plan and costs that we'd get from, say, <literal>explain select
* from tenk2 where unique2 = 42</literal>. The costs of the loop node are then set
on the basis of the cost of the outer scan, plus one repetition of the
inner scan for each outer tuple (47 * 2.01, here), plus a little CPU
time for join processing.
</para>
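
<para>
The arithmetic checks out against the figures shown in the plan above, with
the small remainder being the CPU charge for join processing:

<programlisting>
outer index scan                     = 173.32
47 repetitions of inner scan * 2.01  =  94.47
CPU time for join processing         =   1.32
                                       ------
total cost of the nested loop          269.11
</programlisting>
</para>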

<para>
In this example the loop's output row count is the same as the product
of the two scans' row counts, but that's not true in general, because
you can have WHERE clauses that mention both relations and
so can only be applied at the join point, not to either input scan.
For example, if we added <literal>WHERE ... AND t1.hundred < t2.hundred</literal>,
that would decrease the output row count of the join node, but not change
either input scan.
</para>

<para>
One way to look at variant plans is to force the planner to disregard
whatever strategy it thought was the winner, using the enable/disable
flags for each plan type. (This is a crude tool, but useful. See
also <xref linkend="explicit-joins">.)

<programlisting>
regression=# set enable_nestloop = off;
SET VARIABLE
regression=# explain select * from tenk1 t1, tenk2 t2 where t1.unique1 < 50
regression-# and t1.unique2 = t2.unique2;
NOTICE: QUERY PLAN:

Hash Join (cost=173.44..557.03 rows=47 width=296)
  -> Seq Scan on tenk2 t2
       (cost=0.00..333.00 rows=10000 width=148)
  -> Hash (cost=173.32..173.32 rows=47 width=148)
       -> Index Scan using tenk1_unique1 on tenk1 t1
            (cost=0.00..173.32 rows=47 width=148)
</programlisting>

This plan proposes to extract the 50 interesting rows of <classname>tenk1</classname>
using ye same olde index scan, stash them into an in-memory hash table,
and then do a sequential scan of <classname>tenk2</classname>, probing into the hash table
for possible matches of <literal>t1.unique2 = t2.unique2</literal> at each <classname>tenk2</classname> tuple.
The cost to read <classname>tenk1</classname> and set up the hash table is entirely start-up
cost for the hash join, since we won't get any tuples out until we can
start reading <classname>tenk2</classname>. The total time estimate for the join also
includes a hefty charge for CPU time to probe the hash table
10000 times. Note, however, that we are NOT charging 10000 times 173.32;
the hash table setup is only done once in this plan type.
</para>

<para>
It is worth noting that EXPLAIN results should not be extrapolated
to situations other than the one you are actually testing; for example,
results on a toy-sized table can't be assumed to apply to large tables.
The planner's cost estimates are not linear and so it may well choose
a different plan for a larger or smaller table. An extreme example
is that on a table that only occupies one disk page, you'll nearly
always get a sequential scan plan whether indexes are available or not.
The planner realizes that it's going to take one disk page read to
process the table in any case, so there's no value in expending additional
page reads to look at an index.
</para>

</sect1>

<sect1 id="explicit-joins">
<title>Controlling the Planner with Explicit JOINs</title>

<para>
Beginning with <productname>Postgres</productname> 7.1 it is possible
to control the query planner to some extent by using explicit JOIN
syntax. To see why this matters, we first need some background.
</para>

<para>
In a simple join query, such as
<programlisting>
SELECT * FROM a,b,c WHERE a.id = b.id AND b.ref = c.id;
</programlisting>
the planner is free to join the given tables in any order. For example,
it could generate a query plan that joins A to B, using the WHERE clause
a.id = b.id, and then joins C to this joined table, using the other
WHERE clause. Or it could join B to C and then join A to that result.
Or it could join A to C and then join them with B --- but that would
be inefficient, since the full Cartesian product of A and C would have
to be formed, there being no applicable WHERE clause to allow optimization
of the join.
(All joins in the <productname>Postgres</productname> executor happen
between two input tables, so it's necessary to build up the result in one
or another of these fashions.) The important point is that these different
join possibilities give semantically equivalent results but may have hugely
different execution costs. Therefore, the planner will explore all of them
to try to find the most efficient query plan.
</para>

<para>
When a query only involves two or three tables, there aren't many join
orders to worry about. But the number of possible join orders grows
exponentially as the number of tables expands. Beyond ten or so input
tables it's no longer practical to do an exhaustive search of all the
possibilities, and even for six or seven tables planning may take an
annoyingly long time. When there are too many input tables, the
<productname>Postgres</productname> planner will switch from exhaustive
search to a <firstterm>genetic</firstterm> probabilistic search
through a limited number of possibilities. (The switch-over threshold is
set by the <varname>GEQO_THRESHOLD</varname> run-time
parameter described in the <citetitle>Administrator's Guide</citetitle>.)
The genetic search takes less time, but it won't
necessarily find the best possible plan.
</para>
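
<para>
If you know that a particular many-way join deserves an exhaustive search,
the threshold can be raised for the current session (a sketch only; the
default value and the details of this parameter are described in the
<citetitle>Administrator's Guide</citetitle>):

<programlisting>
regression=# set geqo_threshold = 12;
</programlisting>
</para>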

<para>
When the query involves outer joins, the planner has much less freedom
than it does for plain (inner) joins. For example, consider
<programlisting>
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
Although this query's restrictions are superficially similar to the
previous example, the semantics are different because a row must be
emitted for each row of A that has no matching row in the join of B and C.
Therefore the planner has no choice of join order here: it must join
B to C and then join A to that result. Accordingly, this query takes
less time to plan than the previous query.
</para>

<para>
In <productname>Postgres</productname> 7.1, the planner treats all
explicit JOIN syntaxes as constraining the join order, even though
it is not logically necessary to make such a constraint for inner
joins. Therefore, although all of these queries give the same result:
<programlisting>
SELECT * FROM a,b,c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
the second and third take less time to plan than the first. This effect
is not worth worrying about for only three tables, but it can be a
lifesaver with many tables.
</para>

<para>
You do not need to constrain the join order completely in order to
cut search time, because it's OK to use JOIN operators in a plain
FROM list. For example,
<programlisting>
SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
</programlisting>
forces the planner to join A to B before joining them to other tables,
but doesn't constrain its choices otherwise. In this example, the
number of possible join orders is reduced by a factor of 5.
</para>

<para>
If you have a mix of outer and inner joins in a complex query, you
might not want to constrain the planner's search for a good ordering
of inner joins inside an outer join. You can't do that directly in the
JOIN syntax, but you can get around the syntactic limitation by using
subselects. For example,
<programlisting>
SELECT * FROM d LEFT JOIN
        (SELECT * FROM a, b, c WHERE ...) AS ss
        ON (...);
</programlisting>
Here, joining D must be the last step in the query plan, but the
planner is free to consider various join orders for A,B,C.
</para>

<para>
Constraining the planner's search in this way is a useful technique
both for reducing planning time and for directing the planner to a
good query plan. If the planner chooses a bad join order by default,
you can force it to choose a better order via JOIN syntax --- assuming
that you know of a better order, that is. Experimentation is recommended.
</para>

</sect1>

<sect1 id="populate">
<title>Populating a Database</title>

<para>
One may need to do a large number of table insertions when first
populating a database. Here are some tips and techniques for making that as
efficient as possible.
</para>

<sect2 id="disable-autocommit">
<title>Disable Auto-commit</title>

<para>
Turn off auto-commit and just do one commit at
the end. (In plain SQL, this means issuing <command>BEGIN</command>
at the start and <command>COMMIT</command> at the end. Some client
libraries may do this behind your back, in which case you need to
make sure the library does it when you want it done.)
If you allow each insertion to be committed separately,
<productname>Postgres</productname> is doing a lot of work for each
record added.
</para>
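
<para>
In plain SQL the pattern looks like this (a minimal sketch; the table and
values are placeholders):

<programlisting>
BEGIN;
INSERT INTO mytable VALUES (1, 'one');
INSERT INTO mytable VALUES (2, 'two');
-- ... many more insertions ...
COMMIT;
</programlisting>
</para>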

</sect2>

<sect2 id="populate-copy-from">
<title>Use COPY FROM</title>

<para>
Use <command>COPY FROM STDIN</command> to load all the records in one
command, instead of using
a series of <command>INSERT</command> commands. This reduces parsing,
planning, etc. overhead a great deal. If you do this then it's not necessary to fool
around with auto-commit, since it's only one command anyway.
</para>
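
<para>
For example (a sketch; <classname>mytable</classname> is a placeholder, the
data columns are separated by tab characters, and the data is terminated by
a line containing only a backslash and a period):

<programlisting>
COPY mytable FROM STDIN;
1	one
2	two
\.
</programlisting>
</para>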

</sect2>

<sect2 id="populate-rm-indexes">
<title>Remove Indexes</title>

<para>
If you are loading a freshly created table, the fastest way is to
create the table, bulk-load with <command>COPY</command>, then create any
indexes needed
for the table. Creating an index on pre-existing data is quicker than
updating it incrementally as each record is loaded.
</para>
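
<para>
In outline (table, file, and index names here are placeholders):

<programlisting>
CREATE TABLE mytable (id integer, val text);
COPY mytable FROM '/path/to/datafile';
CREATE INDEX mytable_id_idx ON mytable (id);
</programlisting>
</para>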

<para>
If you are augmenting an existing table, you can <command>DROP
INDEX</command>, load the table, then recreate the index. Of
course, the database performance for other users may be adversely
affected during the time that the index is missing. One should also
think twice before dropping UNIQUE indexes, since the error checking
afforded by the UNIQUE constraint will be lost while the index is missing.
</para>
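
<para>
For instance (again with placeholder names):

<programlisting>
DROP INDEX mytable_id_idx;
COPY mytable FROM '/path/to/more_data';
CREATE INDEX mytable_id_idx ON mytable (id);
</programlisting>
</para>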

</sect2>

<sect2 id="populate-analyze">
<title>ANALYZE Afterwards</title>

<para>
It's a good idea to run <command>ANALYZE</command> or <command>VACUUM
ANALYZE</command> anytime you've added or updated a lot of data,
including just after initially populating a table. This ensures that
the planner has up-to-date statistics about the table. With no statistics
or obsolete statistics, the planner may make poor choices of query plans,
leading to bad performance on queries that use your table.
</para>
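
<para>
Either form can name just the newly loaded table (<classname>mytable</classname>
being a placeholder), for example:

<programlisting>
VACUUM ANALYZE mytable;
</programlisting>
</para>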

</sect2>
</sect1>

</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode:sgml
sgml-omittag:nil
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-default-dtd-file:"./reference.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil
End:
-->