mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-10-02 16:41:16 +02:00
934 lines
34 KiB
Plaintext
934 lines
34 KiB
Plaintext
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/indices.sgml,v 1.30 2001/12/04 01:22:13 tgl Exp $ -->
|
|
|
|
<chapter id="indexes">
|
|
<title id="indexes-title">Indexes</title>
|
|
|
|
<indexterm zone="indexes">
|
|
<primary>indexes</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Indexes are a common way to enhance database performance. An index
|
|
allows the database server to find and retrieve specific rows much
|
|
faster than it could do without an index. But indexes also add
|
|
overhead to the database system as a whole, so they should be used
|
|
sensibly.
|
|
</para>
|
|
|
|
|
|
<sect1 id="indexes-intro">
|
|
<title>Introduction</title>
|
|
|
|
<para>
|
|
The classical example for the need of an index is if there is a
|
|
table similar to this:
|
|
<programlisting>
|
|
CREATE TABLE test1 (
|
|
id integer,
|
|
content varchar
|
|
);
|
|
</programlisting>
|
|
and the application requires a lot of queries of the form
|
|
<programlisting>
|
|
SELECT content FROM test1 WHERE id = <replaceable>constant</replaceable>;
|
|
</programlisting>
|
|
Ordinarily, the system would have to scan the entire
|
|
<structname>test1</structname> table row by row to find all
|
|
matching entries. If there are a lot of rows in
|
|
<structname>test1</structname> and only a few rows (possibly zero
|
|
or one) returned by the query, then this is clearly an inefficient
|
|
method. If the system were instructed to maintain an index on the
|
|
<structfield>id</structfield> column, then it could use a more
|
|
efficient method for locating matching rows. For instance, it
|
|
might only have to walk a few levels deep into a search tree.
|
|
</para>
|
|
|
|
<para>
|
|
A similar approach is used in most books of non-fiction: Terms and
|
|
concepts that are frequently looked up by readers are collected in
|
|
an alphabetic index at the end of the book. The interested reader
|
|
can scan the index relatively quickly and flip to the appropriate
|
|
page, and would not have to read the entire book to find the
|
|
interesting location. As it is the task of the author to
|
|
anticipate the items that the readers are most likely to look up,
|
|
it is the task of the database programmer to foresee which indexes
|
|
would be of advantage.
|
|
</para>
|
|
|
|
<para>
|
|
The following command would be used to create the index on the
|
|
<structfield>id</structfield> column, as discussed:
|
|
<programlisting>
|
|
CREATE INDEX test1_id_index ON test1 (id);
|
|
</programlisting>
|
|
The name <structname>test1_id_index</structname> can be chosen
|
|
freely, but you should pick something that enables you to remember
|
|
later what the index was for.
|
|
</para>
|
|
|
|
<para>
|
|
To remove an index, use the <command>DROP INDEX</command> command.
|
|
Indexes can be added to and removed from tables at any time.
|
|
</para>
|
|
|
|
<para>
|
|
Once the index is created, no further intervention is required: the
|
|
system will use the index when it thinks it would be more efficient
|
|
than a sequential table scan. But you may have to run the
|
|
<command>ANALYZE</command> command regularly to update
|
|
statistics to allow the query planner to make educated decisions.
|
|
Also read <xref linkend="performance-tips"> for information about
|
|
how to find out whether an index is used and when and why the
|
|
planner may choose to <emphasis>not</emphasis> use an index.
|
|
</para>
|
|
|
|
<para>
|
|
Indexes can benefit <command>UPDATE</command>s and
|
|
<command>DELETE</command>s with search conditions. Indexes can also be
|
|
used in join queries. Thus,
|
|
an index defined on a column that is part of a join condition can
|
|
significantly speed up queries with joins.
|
|
</para>
|
|
|
|
<para>
|
|
When an index is created, the system has to keep it synchronized with the
|
|
table. This adds overhead to data manipulation operations.
|
|
Therefore indexes that are non-essential or do not get used at all
|
|
should be removed. Note that a
|
|
query or data manipulation command can use at most one index
|
|
per table.
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-types">
|
|
<title>Index Types</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> provides several index types:
|
|
B-tree, R-tree, GiST, and Hash. Each index type is more appropriate for
|
|
a particular query type because of the algorithm it uses.
|
|
<indexterm>
|
|
<primary>indexes</primary>
|
|
<secondary>B-tree</secondary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>B-tree</primary>
|
|
<see>indexes</see>
|
|
</indexterm>
|
|
By
|
|
default, the <command>CREATE INDEX</command> command will create a
|
|
B-tree index, which fits the most common situations. In
|
|
particular, the <productname>PostgreSQL</productname> query optimizer
|
|
will consider using a B-tree index whenever an indexed column is
|
|
involved in a comparison using one of these operators:
|
|
|
|
<simplelist type="inline">
|
|
<member><literal><</literal></member>
|
|
<member><literal><=</literal></member>
|
|
<member><literal>=</literal></member>
|
|
<member><literal>>=</literal></member>
|
|
<member><literal>></literal></member>
|
|
</simplelist>
|
|
</para>
|
|
|
|
<para>
|
|
<indexterm>
|
|
<primary>indexes</primary>
|
|
<secondary>R-tree</secondary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>R-tree</primary>
|
|
<see>indexes</see>
|
|
</indexterm>
|
|
R-tree indexes are especially suited for spatial data. To create
|
|
an R-tree index, use a command of the form
|
|
<synopsis>
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING RTREE (<replaceable>column</replaceable>);
|
|
</synopsis>
|
|
The <productname>PostgreSQL</productname> query optimizer will
|
|
consider using an R-tree index whenever an indexed column is
|
|
involved in a comparison using one of these operators:
|
|
|
|
<simplelist type="inline">
|
|
<member><literal><<</literal></member>
|
|
<member><literal>&<</literal></member>
|
|
<member><literal>&></literal></member>
|
|
<member><literal>>></literal></member>
|
|
<member><literal>@</literal></member>
|
|
<member><literal>~=</literal></member>
|
|
<member><literal>&&</literal></member>
|
|
</simplelist>
|
|
(Refer to <xref linkend="functions-geometry"> about the meaning of
|
|
these operators.)
|
|
</para>
|
|
|
|
<para>
|
|
<indexterm>
|
|
<primary>indexes</primary>
|
|
<secondary>hash</secondary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>hash</primary>
|
|
<see>indexes</see>
|
|
</indexterm>
|
|
The query optimizer will consider using a hash index whenever an
|
|
indexed column is involved in a comparison using the
|
|
<literal>=</literal> operator. The following command is used to
|
|
create a hash index:
|
|
<synopsis>
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING HASH (<replaceable>column</replaceable>);
|
|
</synopsis>
|
|
<note>
|
|
<para>
|
|
Because of the limited utility of hash indexes, a B-tree index
|
|
should generally be preferred over a hash index. We do not have
|
|
sufficient evidence that hash indexes are actually faster than
|
|
B-trees even for <literal>=</literal> comparisons. Moreover,
|
|
hash indexes require coarser locks; see <xref
|
|
linkend="locking-indexes">.
|
|
</para>
|
|
</note>
|
|
</para>
|
|
|
|
<para>
|
|
The B-tree index is an implementation of Lehman-Yao
|
|
high-concurrency B-trees. The R-tree index method implements
|
|
standard R-trees using Guttman's quadratic split algorithm. The
|
|
hash index is an implementation of Litwin's linear hashing. We
|
|
mention the algorithms used solely to indicate that all of these
|
|
access methods are fully dynamic and do not have to be optimized
|
|
periodically (as is the case with, for example, static hash access
|
|
methods).
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-multicolumn">
|
|
<title>Multicolumn Indexes</title>
|
|
|
|
<indexterm zone="indexes-multicolumn">
|
|
<primary>indexes</primary>
|
|
<secondary>multicolumn</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
An index can be defined on more than one column. For example, if
|
|
you have a table of this form:
|
|
<programlisting>
|
|
CREATE TABLE test2 (
|
|
major int,
|
|
minor int,
|
|
name varchar
|
|
);
|
|
</programlisting>
|
|
(Say, you keep your <filename class="directory">/dev</filename>
|
|
directory in a database...) and you frequently make queries like
|
|
<programlisting>
|
|
SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> AND minor = <replaceable>constant</replaceable>;
|
|
</programlisting>
|
|
then it may be appropriate to define an index on the columns
|
|
<structfield>major</structfield> and
|
|
<structfield>minor</structfield> together, e.g.,
|
|
<programlisting>
|
|
CREATE INDEX test2_mm_idx ON test2 (major, minor);
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Currently, only the B-tree and GiST implementations support multicolumn
|
|
indexes. Up to 16 columns may be specified. (This limit can be
|
|
altered when building <productname>PostgreSQL</productname>; see the
|
|
file <filename>pg_config.h</filename>.)
|
|
</para>
|
|
|
|
<para>
|
|
The query optimizer can use a multicolumn index for queries that
|
|
involve the first <parameter>n</parameter> consecutive columns in
|
|
the index (when used with appropriate operators), up to the total
|
|
number of columns specified in the index definition. For example,
|
|
an index on <literal>(a, b, c)</literal> can be used in queries
|
|
involving all of <literal>a</literal>, <literal>b</literal>, and
|
|
<literal>c</literal>, or in queries involving both
|
|
<literal>a</literal> and <literal>b</literal>, or in queries
|
|
involving only <literal>a</literal>, but not in other combinations.
|
|
(In a query involving <literal>a</literal> and <literal>c</literal>
|
|
the optimizer might choose to use the index for
|
|
<literal>a</literal> only and treat <literal>c</literal> like an
|
|
ordinary unindexed column.)
|
|
</para>
|
|
|
|
<para>
|
|
Multicolumn indexes can only be used if the clauses involving the
|
|
indexed columns are joined with <literal>AND</literal>. For instance,
|
|
<programlisting>
|
|
SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> OR minor = <replaceable>constant</replaceable>;
|
|
</programlisting>
|
|
cannot make use of the index <structname>test2_mm_idx</structname>
|
|
defined above to look up both columns. (It can be used to look up
|
|
only the <structfield>major</structfield> column, however.)
|
|
</para>
|
|
|
|
<para>
|
|
Multicolumn indexes should be used sparingly. Most of the time,
|
|
an index on a single column is sufficient and saves space and time.
|
|
Indexes with more than three columns are almost certainly
|
|
inappropriate.
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-unique">
|
|
<title>Unique Indexes</title>
|
|
|
|
<indexterm zone="indexes-unique">
|
|
<primary>indexes</primary>
|
|
<secondary>unique</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Indexes may also be used to enforce uniqueness of a column's value,
|
|
or the uniqueness of the combined values of more than one column.
|
|
<synopsis>
|
|
CREATE UNIQUE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <optional>, ...</optional>);
|
|
</synopsis>
|
|
Currently, only B-tree indexes can be declared unique.
|
|
</para>
|
|
|
|
<para>
|
|
When an index is declared unique, multiple table rows with equal
|
|
indexed values will not be allowed. NULL values are not considered
|
|
equal.
|
|
</para>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> automatically creates unique
|
|
indexes when a table is declared with a unique constraint or a
|
|
primary key, on the columns that make up the primary key or unique
|
|
columns (a multicolumn index, if appropriate), to enforce that
|
|
constraint. A unique index can be added to a table at any later
|
|
time, to add a unique constraint.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
The preferred way to add a unique constraint to a table is
|
|
<literal>ALTER TABLE ... ADD CONSTRAINT</literal>. The use of
|
|
indexes to enforce unique constraints could be considered an
|
|
implementation detail that should not be accessed directly.
|
|
</para>
|
|
</note>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-functional">
|
|
<title>Functional Indexes</title>
|
|
|
|
<indexterm zone="indexes-functional">
|
|
<primary>indexes</primary>
|
|
<secondary>on functions</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
For a <firstterm>functional index</firstterm>, an index is defined
|
|
on the result of a function applied to one or more columns of a
|
|
single table. Functional indexes can be used to obtain fast access
|
|
to data based on the result of function calls.
|
|
</para>
|
|
|
|
<para>
|
|
For example, a common way to do case-insensitive comparisons is to
|
|
use the <function>lower</function> function:
|
|
<programlisting>
|
|
SELECT * FROM test1 WHERE lower(col1) = 'value';
|
|
</programlisting>
|
|
This query can use an index, if one has been
|
|
defined on the result of the <literal>lower(column)</literal>
|
|
operation:
|
|
<programlisting>
|
|
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The function in the index definition can take more than one
|
|
argument, but they must be table columns, not constants.
|
|
Functional indexes are always single-column (namely, the function
|
|
result) even if the function uses more than one input field; there
|
|
cannot be multicolumn indexes that contain function calls.
|
|
</para>
|
|
|
|
<tip>
|
|
<para>
|
|
The restrictions mentioned in the previous paragraph can easily be
|
|
worked around by defining a custom function to use in the index
|
|
definition that computes any desired result internally.
|
|
</para>
|
|
</tip>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-opclass">
|
|
<title>Operator Classes</title>
|
|
|
|
<para>
|
|
An index definition may specify an <firstterm>operator
|
|
class</firstterm> for each column of an index.
|
|
<synopsis>
|
|
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> <optional>, ...</optional>);
|
|
</synopsis>
|
|
The operator class identifies the operators to be used by the index
|
|
for that column. For example, a B-tree index on four-byte integers
|
|
would use the <literal>int4_ops</literal> class; this operator
|
|
class includes comparison functions for four-byte integers. In
|
|
practice the default operator class for the column's data type is
|
|
usually sufficient. The main point of having operator classes is
|
|
that for some data types, there could be more than one meaningful
|
|
ordering. For example, we might want to sort a complex-number data
|
|
type either by absolute value or by real part. We could do this by
|
|
defining two operator classes for the data type and then selecting
|
|
the proper class when making an index. There are also some
|
|
operator classes with special purposes:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
The operator classes <literal>box_ops</literal> and
|
|
<literal>bigbox_ops</literal> both support R-tree indexes on the
|
|
<literal>box</literal> data type. The difference between them is
|
|
that <literal>bigbox_ops</literal> scales box coordinates down,
|
|
to avoid floating point exceptions from doing multiplication,
|
|
addition, and subtraction on very large floating point
|
|
coordinates. If the field on which your rectangles lie is about
|
|
20 000 units square or larger, you should use
|
|
<literal>bigbox_ops</literal>.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The following query shows all defined operator classes:
|
|
<programlisting>
|
|
SELECT am.amname AS acc_method,
|
|
opc.opcname AS ops_name,
|
|
opr.oprname AS ops_comp
|
|
FROM pg_am am, pg_opclass opc, pg_amop amop, pg_operator opr
|
|
WHERE opc.opcamid = am.oid AND
|
|
amop.amopclaid = opc.oid AND
|
|
amop.amopopr = opr.oid
|
|
ORDER BY acc_method, ops_name, ops_comp
|
|
</programlisting>
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="keys">
|
|
<title id="keys-title">Keys</title>
|
|
|
|
<para>
|
|
<note>
|
|
<title>Author</title>
|
|
<para>
|
|
Written by Herouth Maoz (<email>herouth@oumail.openu.ac.il</email>).
|
|
This originally appeared on the User's Mailing List on 1998-03-02
|
|
in response to the question:
|
|
"What is the difference between PRIMARY KEY and UNIQUE constraints?".
|
|
</para>
|
|
</note>
|
|
</para>
|
|
|
|
<para>
|
|
<literallayout>
|
|
Subject: Re: [QUESTIONS] PRIMARY KEY | UNIQUE
|
|
|
|
What's the difference between:
|
|
|
|
PRIMARY KEY(fields,...) and
|
|
UNIQUE (fields,...)
|
|
|
|
- Is this an alias?
|
|
- If PRIMARY KEY is already unique, then why
|
|
is there another kind of key named UNIQUE?
|
|
</literallayout>
|
|
</para>
|
|
|
|
<para>
|
|
A primary key is the field(s) used to identify a specific row. For example,
|
|
Social Security numbers identifying a person.
|
|
</para>
|
|
|
|
<para>
|
|
A simply UNIQUE combination of fields has nothing to do with identifying
|
|
the row. It's simply an integrity constraint. For example, I have
|
|
collections of links. Each collection is identified by a unique number,
|
|
which is the primary key. This key is used in relations.
|
|
</para>
|
|
|
|
<para>
|
|
However, my application requires that each collection will also have a
|
|
unique name. Why? So that a human being who wants to modify a collection
|
|
will be able to identify it. It's much harder to know, if you have two
|
|
collections named <quote>Life Science</quote>, the one tagged 24433 is the one you
|
|
need, and the one tagged 29882 is not.
|
|
</para>
|
|
|
|
<para>
|
|
So, the user selects the collection by its name. We therefore make sure,
|
|
within the database, that names are unique. However, no other table in the
|
|
database relates to the collections table by the collection Name. That
|
|
would be very inefficient.
|
|
</para>
|
|
|
|
<para>
|
|
Moreover, despite being unique, the collection name does not actually
|
|
define the collection! For example, if somebody decided to change the name
|
|
of the collection from <quote>Life Science</quote> to <quote>Biology</quote>, it will still be the
|
|
same collection, only with a different name. As long as the name is unique,
|
|
that's OK.
|
|
</para>
|
|
|
|
<para>
|
|
So:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Primary key:
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
Is used for identifying the row and relating to it.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Is impossible (or hard) to update.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Should not allow NULLs.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Unique field(s):
|
|
<itemizedlist spacing="compact" mark="bullet">
|
|
<listitem>
|
|
<para>
|
|
Are used as an alternative access to the row.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Are updatable, so long as they are kept unique.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
NULLs are acceptable.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
As for why no non-unique keys are defined explicitly in standard
|
|
<acronym>SQL</acronym> syntax? Well, you
|
|
must understand that indexes are implementation-dependent.
|
|
<acronym>SQL</acronym> does not
|
|
define the implementation, merely the relations between data in the
|
|
database. <productname>PostgreSQL</productname> does allow
|
|
non-unique indexes, but indexes
|
|
used to enforce <acronym>SQL</acronym> keys are always unique.
|
|
</para>
|
|
|
|
<para>
|
|
Thus, you may query a table by any combination of its columns, despite the
|
|
fact that you don't have an index on these columns. The indexes are merely
|
|
an implementation aid that each <acronym>RDBMS</acronym> offers
|
|
you, in order to cause
|
|
commonly used queries to be done more efficiently.
|
|
Some <acronym>RDBMS</acronym> may give you
|
|
additional measures, such as keeping a key stored in main memory. They will
|
|
have a special command, for example
|
|
<synopsis>
|
|
CREATE MEMSTORE ON <replaceable>table</replaceable> COLUMNS <replaceable>cols</replaceable>
|
|
</synopsis>
|
|
(This is not an existing command, just an example.)
|
|
</para>
|
|
|
|
<para>
|
|
In fact, when you create a primary key or a unique combination of fields,
|
|
nowhere in the <acronym>SQL</acronym> specification does it say
|
|
that an index is created, nor that
|
|
the retrieval of data by the key is going to be more efficient than a
|
|
sequential scan!
|
|
</para>
|
|
|
|
<para>
|
|
So, if you want to use a combination of fields that is not unique as a
|
|
secondary key, you really don't have to specify anything - just start
|
|
retrieving by that combination! However, if you want to make the retrieval
|
|
efficient, you'll have to resort to the means your
|
|
<acronym>RDBMS</acronym> provider gives you
|
|
- be it an index, my imaginary <literal>MEMSTORE</literal> command, or an intelligent
|
|
<acronym>RDBMS</acronym>
|
|
that creates indexes without your knowledge based on the fact that you have
|
|
sent it many queries based on a specific combination of keys... (It learns
|
|
from experience).
|
|
</para>
|
|
</sect1>
|
|
|
|
|
|
<sect1 id="indexes-partial">
|
|
<title>Partial Indexes</title>
|
|
|
|
<indexterm zone="indexes-partial">
|
|
<primary>indexes</primary>
|
|
<secondary>partial</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
A <firstterm>partial index</firstterm> is an index built over a
|
|
subset of a table; the subset is defined by a conditional
|
|
expression (called the <firstterm>predicate</firstterm> of the
|
|
partial index). The index contains entries for only those table
|
|
rows that satisfy the predicate.
|
|
</para>
|
|
|
|
<para>
|
|
A major motivation for partial indexes is to avoid indexing common
|
|
values. Since a query searching for a common value (one that
|
|
accounts for more than a few percent of all the table rows) will not
|
|
use the index anyway, there is no point in keeping those rows in the
|
|
index at all. This reduces the size of the index, which will speed
|
|
up queries that do use the index. It will also speed up many table
|
|
update operations because the index does not need to be
|
|
updated in all cases. <xref linkend="indexes-partial-ex1"> shows a
|
|
possible application of this idea.
|
|
</para>
|
|
|
|
<example id="indexes-partial-ex1">
|
|
<title>Setting up a Partial Index to Exclude Common Values</title>
|
|
|
|
<para>
|
|
Suppose you are storing web server access logs in a database.
|
|
Most accesses originate from the IP range of your organization but
|
|
some are from elsewhere (say, employees on dial-up connections).
|
|
If your searches by IP are primarily for outside accesses,
|
|
you probably do not need to index the IP range that corresponds to your
|
|
organization's subnet.
|
|
</para>
|
|
|
|
<para>
|
|
Assume a table like this:
|
|
<programlisting>
|
|
CREATE TABLE access_log (
|
|
url varchar,
|
|
client_ip inet,
|
|
...
|
|
);
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
To create a partial index that suits our example, use a command
|
|
such as this:
|
|
<programlisting>
|
|
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
|
|
WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
A typical query that can use this index would be:
|
|
<programlisting>
|
|
SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
|
|
</programlisting>
|
|
A query that cannot use this index is:
|
|
<programlisting>
|
|
SELECT * FROM access_log WHERE client_ip = inet '192.168.100.23';
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
Observe that this kind of partial index requires that the common
|
|
values be predetermined. If the distribution of values is
|
|
inherent (due to the nature of the application) and static (not
|
|
changing over time), this is not difficult, but if the common values are
|
|
merely due to the coincidental data load this can require a lot of
|
|
maintenance work.
|
|
</para>
|
|
</example>
|
|
|
|
<para>
|
|
Another possibility is to exclude values from the index that the
|
|
typical query workload is not interested in; this is shown in <xref
|
|
linkend="indexes-partial-ex2">. This results in the same
|
|
advantages as listed above, but it prevents the
|
|
<quote>uninteresting</quote> values from being accessed via that
|
|
index at all, even if an index scan might be profitable in that
|
|
case. Obviously, setting up partial indexes for this kind of
|
|
scenario will require a lot of care and experimentation.
|
|
</para>
|
|
|
|
<example id="indexes-partial-ex2">
|
|
<title>Setting up a Partial Index to Exclude Uninteresting Values</title>
|
|
|
|
<para>
|
|
If you have a table that contains both billed and unbilled orders,
|
|
where the unbilled orders take up a small fraction of the total
|
|
table and yet those are the most-accessed rows, you can improve
|
|
performance by creating an index on just the unbilled rows. The
|
|
command to create the index would look like this:
|
|
<programlisting>
|
|
CREATE INDEX orders_unbilled_index ON orders (order_nr)
|
|
WHERE billed is not true;
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
A possible query to use this index would be
|
|
<programlisting>
|
|
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
|
|
</programlisting>
|
|
However, the index can also be used in queries that do not involve
|
|
<structfield>order_nr</> at all, e.g.,
|
|
<programlisting>
|
|
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
|
|
</programlisting>
|
|
This is not as efficient as a partial index on the
|
|
<structfield>amount</> column would be, since the system has to
|
|
scan the entire index. Yet, if there are relatively few unbilled
|
|
orders, using this partial index just to find the unbilled orders
|
|
could be a win.
|
|
</para>
|
|
|
|
<para>
|
|
Note that this query cannot use this index:
|
|
<programlisting>
|
|
SELECT * FROM orders WHERE order_nr = 3501;
|
|
</programlisting>
|
|
The order 3501 may be among the billed or among the unbilled
|
|
orders.
|
|
</para>
|
|
</example>
|
|
|
|
<para>
|
|
<xref linkend="indexes-partial-ex2"> also illustrates that the
|
|
indexed column and the column used in the predicate do not need to
|
|
match. <productname>PostgreSQL</productname> supports partial
|
|
indexes with arbitrary predicates, so long as only columns of the
|
|
table being indexed are involved. However, keep in mind that the
|
|
predicate must match the conditions used in the queries that
|
|
are supposed to benefit from the index. To be precise, a partial
|
|
index can be used in a query only if the system can recognize that
|
|
the query's WHERE condition mathematically <firstterm>implies</>
|
|
the index's predicate.
|
|
<productname>PostgreSQL</productname> does not have a sophisticated
|
|
theorem prover that can recognize mathematically equivalent
|
|
predicates that are written in different forms. (Not
|
|
only is such a general theorem prover extremely difficult to
|
|
create, it would probably be too slow to be of any real use.)
|
|
The system can recognize simple inequality implications, for example
|
|
<quote>x < 1</quote> implies <quote>x < 2</quote>; otherwise
|
|
the predicate condition must exactly match the query's WHERE condition
|
|
or the index will not be recognized to be usable.
|
|
</para>
|
|
|
|
<para>
|
|
A third possible use for partial indexes does not require the
|
|
index to be used in queries at all. The idea here is to create
|
|
a unique index over a subset of a table, as in <xref
|
|
linkend="indexes-partial-ex3">. This enforces uniqueness
|
|
among the rows that satisfy the index predicate, without constraining
|
|
those that do not.
|
|
</para>
|
|
|
|
<example id="indexes-partial-ex3">
|
|
<title>Setting up a Partial Unique Index</title>
|
|
|
|
<para>
|
|
Suppose that we have a table describing test outcomes. We wish
|
|
to ensure that there is only one <quote>successful</> entry for
|
|
a given subject and target combination, but there might be any number of
|
|
<quote>unsuccessful</> entries. Here is one way to do it:
|
|
<programlisting>
|
|
CREATE TABLE tests (subject text,
|
|
target text,
|
|
success bool,
|
|
...);
|
|
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
|
|
WHERE success;
|
|
</programlisting>
|
|
This is a particularly efficient way of doing it when there are few
|
|
successful trials and many unsuccessful ones.
|
|
</para>
|
|
</example>
|
|
|
|
<para>
|
|
Finally, a partial index can also be used to override the system's
|
|
query plan choices. It may occur that data sets with peculiar
|
|
distributions will cause the system to use an index when it really
|
|
should not. In that case the index can be set up so that it is not
|
|
available for the offending query. Normally,
|
|
<productname>PostgreSQL</> makes reasonable choices about index
|
|
usage (e.g., it avoids them when retrieving common values, so the
|
|
earlier example really only saves index size, it is not required to
|
|
avoid index usage), and grossly incorrect plan choices are cause
|
|
for a bug report.
|
|
</para>
|
|
|
|
<para>
|
|
Keep in mind that setting up a partial index indicates that you
|
|
know at least as much as the query planner knows, in particular you
|
|
know when an index might be profitable. Forming this knowledge
|
|
requires experience and understanding of how indexes in
|
|
<productname>PostgreSQL</> work. In most cases, the advantage of a
|
|
partial index over a regular index will not be much.
|
|
</para>
|
|
|
|
<para>
|
|
More information about partial indexes can be found in <xref
|
|
linkend="STON89b">, <xref linkend="OLSON93">, and <xref
|
|
linkend="SESHADRI95">.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="indexes-examine">
|
|
<title>Examining Index Usage</title>
|
|
|
|
<para>
|
|
Although indexes in <productname>PostgreSQL</> do not need
|
|
maintenance and tuning, it is still important to check
|
|
which indexes are actually used by the real-life query workload.
|
|
Examining index usage is done with the <command>EXPLAIN</> command;
|
|
its application for this purpose is illustrated in <xref
|
|
linkend="using-explain">.
|
|
</para>
|
|
|
|
<para>
|
|
It is difficult to formulate a general procedure for determining
|
|
which indexes to set up. There are a number of typical cases that
|
|
have been shown in the examples throughout the previous sections.
|
|
A good deal of experimentation will be necessary in most cases.
|
|
The rest of this section gives some tips for that.
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Always run <command>ANALYZE</command> first. This command
|
|
collects statistics about the distribution of the values in the
|
|
table. This information is required to guess the number of rows
|
|
returned by a query, which is needed by the planner to assign
|
|
realistic costs to each possible query plan. In absence of any
|
|
real statistics, some default values are assumed, which are
|
|
almost certain to be inaccurate. Examining an application's
|
|
index usage without having run <command>ANALYZE</command> is
|
|
therefore a lost cause.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Use real data for experimentation. Using test data for setting
|
|
up indexes will tell you what indexes you need for the test data,
|
|
but that is all.
|
|
</para>
|
|
|
|
<para>
|
|
It is especially fatal to use proportionally reduced data sets.
|
|
While selecting 1000 out of 100000 rows could be a candidate for
|
|
an index, selecting 1 out of 100 rows will hardly be, because the
|
|
100 rows will probably fit within a single disk page, and there
|
|
is no plan that can beat sequentially fetching 1 disk page.
|
|
</para>
|
|
|
|
<para>
|
|
Also be careful when making up test data, which is often
|
|
unavoidable when the application is not in production use yet.
|
|
Values that are very similar, completely random, or inserted in
|
|
sorted order will skew the statistics away from the distribution
|
|
that real data would have.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
When indexes are not used, it can be useful for testing to force
|
|
their use. There are run-time parameters that can turn off
|
|
various plan types (described in the <citetitle>Administrator's
|
|
Guide</citetitle>). For instance, turning off sequential scans
|
|
(<varname>enable_seqscan</>) and nested-loop joins
|
|
(<varname>enable_nestloop</>), which are the most basic plans,
|
|
will force the system to use a different plan. If the system
|
|
still chooses a sequential scan or nested-loop join then there is
|
|
probably a more fundamental problem for why the index is not
|
|
used, for example, the query condition does not match the index.
|
|
(What kind of query can use what kind of index is explained in
|
|
the previous sections.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If forcing index usage does use the index, then there are two
|
|
possibilities: Either the system is right and using the index is
|
|
indeed not appropriate, or the cost estimates of the query plans
|
|
are not reflecting reality. So you should time your query with
|
|
and without indexes. The <command>EXPLAIN ANALYZE</command>
|
|
command can be useful here.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If it turns out that the cost estimates are wrong, there are,
|
|
again, two possibilities. The total cost is computed from the
|
|
per-row costs of each plan node times the selectivity estimate of
|
|
the plan node. The costs of the plan nodes can be tuned with
|
|
run-time parameters (described in the <citetitle>Administrator's
|
|
Guide</citetitle>). An inaccurate selectivity estimate is due to
|
|
insufficient statistics. It may be possible to help this by
|
|
tuning the statistics-gathering parameters (see <command>ALTER
|
|
TABLE</command> reference).
|
|
</para>
|
|
|
|
<para>
|
|
If you do not succeed in adjusting the costs to be more
|
|
appropriate, then you may have to resort to forcing index usage
|
|
explicitly. You may also want to contact the
|
|
<productname>PostgreSQL</> developers to examine the issue.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</sect1>
|
|
</chapter>
|
|
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode:sgml
|
|
sgml-omittag:nil
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"./reference.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:("/usr/lib/sgml/catalog")
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
-->
|