<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.4 2006/09/18 12:11:36 teodor Exp $ -->
<chapter id="GIN">
<title>GIN Indexes</title>

<indexterm>
<primary>index</primary>
<secondary>GIN</secondary>
</indexterm>

<sect1 id="gin-intro">
<title>Introduction</title>

<para>
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
an index structure storing a set of (key, posting list) pairs, where
a <quote>posting list</quote> is the set of rows in which the key occurs.
Each row may contain many keys, so the same row can appear in many
posting lists.
</para>

<para>
It is generalized in the sense that a <acronym>GIN</acronym> index
does not need to be aware of the operation that it accelerates.
Instead, it uses custom strategies defined for particular data types.
</para>

<para>
One advantage of <acronym>GIN</acronym> is that it allows custom data
types, with the appropriate access methods, to be developed by
an expert in the domain of the data type rather than by a database expert.
<acronym>GiST</acronym> offers much the same advantage.
</para>

<para>
The <acronym>GIN</acronym>
implementation in <productname>PostgreSQL</productname> is primarily
maintained by Teodor Sigaev and Oleg Bartunov, and there is more
information on their
<ulink url="http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gin">website</ulink>.
</para>

</sect1>

<sect1 id="gin-extensibility">
<title>Extensibility</title>

<para>
The <acronym>GIN</acronym> interface has a high level of abstraction,
requiring the access method implementer to implement only the semantics of
the data type being accessed. The <acronym>GIN</acronym> layer itself
takes care of concurrency, logging, and searching the tree structure.
</para>

<para>
All it takes to get a <acronym>GIN</acronym> access method working
is to implement four user-defined methods, which define the behavior of
keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
with generality, code reuse, and a clean interface.
</para>

</sect1>

<sect1 id="gin-implementation">
<title>Implementation</title>

<para>
Internally, a <acronym>GIN</acronym> index consists of a B-tree constructed
over keys, where each key is an element of the indexed value
(an element of an array, for example) and where each tuple in a leaf page is
either a pointer to a B-tree over heap pointers (PT, posting tree), or a
list of heap pointers (PL, posting list) if the list is small enough.
</para>
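<para>
The key-to-posting-list layout described above can be sketched in plain C.
This is a toy model under stated assumptions, not the
<productname>PostgreSQL</productname> implementation: keys are integers, rows
are identified by small integers instead of heap pointers, a sorted array
stands in for the B-tree over keys, and every posting list is a simple array
(there is no posting tree). All names (<literal>ToyIndex</literal>,
<literal>toy_insert</literal>, <literal>toy_lookup</literal>) are hypothetical.
</para>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model: one entry per distinct key. */
typedef struct {
    int     key;    /* one element extracted from an indexed value */
    size_t  nrows;  /* number of entries in rows[] */
    int    *rows;   /* posting list: ids of the rows that contain key */
} PostingEntry;

typedef struct {
    size_t        nentries;
    PostingEntry *entries;  /* kept sorted by key; stands in for the B-tree */
} ToyIndex;

/* Binary search: position of key, or the position where it would be inserted. */
static size_t toy_find(const ToyIndex *idx, int key, int *found)
{
    size_t lo = 0, hi = idx->nentries;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (idx->entries[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *found = (lo < idx->nentries && idx->entries[lo].key == key);
    return lo;
}

/* Record that row contains key. Returns 0 on success, -1 if out of memory. */
int toy_insert(ToyIndex *idx, int row, int key)
{
    int found;
    size_t pos = toy_find(idx, key, &found);

    if (!found) {
        /* New key: open a slot at its sorted position. */
        PostingEntry *grown = realloc(idx->entries,
                                      (idx->nentries + 1) * sizeof *grown);
        if (grown == NULL)
            return -1;
        idx->entries = grown;
        memmove(&grown[pos + 1], &grown[pos],
                (idx->nentries - pos) * sizeof *grown);
        grown[pos].key = key;
        grown[pos].nrows = 0;
        grown[pos].rows = NULL;
        idx->nentries++;
    }

    /* Append the row to the key's posting list. */
    PostingEntry *entry = &idx->entries[pos];
    int *rows = realloc(entry->rows, (entry->nrows + 1) * sizeof *rows);
    if (rows == NULL)
        return -1;
    rows[entry->nrows++] = row;
    entry->rows = rows;
    return 0;
}

/* Return the posting list entry for key, or NULL if the key is absent. */
const PostingEntry *toy_lookup(const ToyIndex *idx, int key)
{
    int found;
    size_t pos = toy_find(idx, key, &found);

    return found ? &idx->entries[pos] : NULL;
}
```

<para>
Indexing one row means calling <literal>toy_insert</literal> once per key
of that row, which is why a single row can end up in many posting lists.
</para>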

<para>
There are four methods that an index operator class for
<acronym>GIN</acronym> must provide (prototypes are in pseudocode):
</para>

<variablelist>
<varlistentry>
<term>int compare( Datum a, Datum b )</term>
<listitem>
<para>
Compares keys (not indexed values!) and returns an integer less than
zero, zero, or greater than zero, indicating whether the first key is
less than, equal to, or greater than the second.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
<listitem>
<para>
Returns an array of the keys of the value to be indexed. The number of
returned keys must be stored in <literal>*nkeys</literal>.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>Datum* extractQuery(Datum query, uint32 *nkeys,
StrategyNumber n)</term>
<listitem>
<para>
Returns an array of the keys of the query to be executed. As with
extractValue, the number of returned keys must be stored in
<literal>*nkeys</literal>. n contains
the strategy number of the operation
(see <xref linkend="xindex-strategies">).
Depending on n, query may be of a different type.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>bool consistent( bool check[], StrategyNumber n, Datum query)</term>
<listitem>
<para>
Returns TRUE if the indexed value satisfies the query qualifier with
strategy n (or might satisfy it, if the operator is marked RECHECK in the
operator class).
Each element of the check array corresponds to one key of the query:
if (check[i] == TRUE), the i-th key of
the query is present in the indexed value.
</para>
</listitem>
</varlistentry>

</variablelist>
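<para>
To make the division of labor between the four methods concrete, here is a
minimal standalone sketch for integer arrays. It is only a model under stated
assumptions: plain C types replace <literal>Datum</literal>, the strategy
numbers <literal>TOY_OVERLAP</literal> and <literal>TOY_CONTAINS</literal>
are invented for this example, the number of query keys is passed to
<literal>toy_consistent</literal> explicitly instead of being derived from
the query, and none of the <literal>toy_</literal> functions exist in
<productname>PostgreSQL</productname>.
</para>

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Strategy numbers invented for this sketch only. */
enum { TOY_OVERLAP = 1, TOY_CONTAINS = 2 };

/* compare: orders two keys (single elements, not whole arrays). */
int toy_compare(int a, int b)
{
    return (a > b) - (a < b);
}

/*
 * extractValue: the keys of an indexed array are simply its elements.
 * Returns a malloc'd copy (the caller frees it) and stores the number
 * of keys in *nkeys.
 */
int *toy_extract_value(const int *value, uint32_t len, uint32_t *nkeys)
{
    int *keys = malloc((len ? len : 1) * sizeof *keys);

    if (keys == NULL) {
        *nkeys = 0;
        return NULL;
    }
    for (uint32_t i = 0; i < len; i++)
        keys[i] = value[i];
    *nkeys = len;
    return keys;
}

/*
 * extractQuery: for both strategies in this sketch the query is itself an
 * integer array, so its keys are extracted the same way.  A real operator
 * class may have to interpret the query differently for each strategy.
 */
int *toy_extract_query(const int *query, uint32_t len, uint32_t *nkeys,
                       int strategy)
{
    (void) strategy;
    return toy_extract_value(query, len, nkeys);
}

/*
 * consistent: check[i] is true if the i-th query key is present in the
 * indexed value.  Overlap needs at least one key present; contains needs
 * every query key present.
 */
bool toy_consistent(const bool check[], uint32_t nkeys, int strategy)
{
    uint32_t present = 0;

    for (uint32_t i = 0; i < nkeys; i++)
        if (check[i])
            present++;

    switch (strategy) {
    case TOY_OVERLAP:
        return present > 0;
    case TOY_CONTAINS:
        return present == nkeys;
    default:
        return false;
    }
}
```

<para>
In this model the index layer would look up each query key using
<literal>toy_compare</literal>, fill in the check array for a candidate row,
and leave the final decision to <literal>toy_consistent</literal>.
</para>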
</sect1>

<sect1 id="gin-tips">
<title>GIN tips and tricks</title>

<variablelist>
<varlistentry>
<term>Create vs insert</term>
<listitem>
<para>
In most cases, insertion into a <acronym>GIN</acronym> index is slow
due to the likelihood of many keys being inserted for each value.
So, for bulk insertions into a table it is advisable to drop the
<acronym>GIN</acronym> index and recreate it after the bulk insertion
has finished.
</para>
</listitem>
</varlistentry>

<varlistentry>
<term>gin_fuzzy_search_limit</term>
<listitem>
<para>
The primary goal of developing <acronym>GIN</acronym> indexes was
to support highly scalable full-text search in
<productname>PostgreSQL</productname>, and a full-text search often
returns a very large set of results. Reading that many
tuples from disk and sorting them can take a long time, which is
unacceptable in production. (Note that the index search itself is very
fast.)
</para>

<para>
Such queries usually contain very frequent words, so the results are not
very helpful. To facilitate execution of such queries,
<acronym>GIN</acronym> has a configurable soft upper limit on the size
of the returned set, determined by the
<varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by
default (no limit).
</para>

<para>
If a non-zero search limit is set, then the returned set is a subset of
the whole result set, chosen at random.
</para>
<para>
<quote>Soft</quote> means that the actual number of returned results could
differ slightly from the specified limit, depending on the query and the
quality of the system's random number generator.
</para>
</listitem>
</varlistentry>
</variablelist>

</sect1>

<sect1 id="gin-limit">
<title>Limitations</title>

<para>
<acronym>GIN</acronym> doesn't support full index scans because they would
be extremely inefficient: since there are often many keys per value,
each heap pointer would be returned several times.
</para>
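<para>
The cost can be estimated with a small calculation. The sketch below
assumes, for illustration only, that each row's keys are distinct, so a walk
over every posting list emits a row once per key it contains. The function
name <literal>toy_full_scan_emitted</literal> is hypothetical.
</para>

```c
#include <stddef.h>

/*
 * Number of heap pointers a walk over every posting list would emit,
 * given the number of distinct keys in each row.
 */
size_t toy_full_scan_emitted(const size_t *keys_per_row, size_t nrows)
{
    size_t emitted = 0;

    for (size_t i = 0; i < nrows; i++)
        emitted += keys_per_row[i];   /* one pointer per (row, key) pair */
    return emitted;
}
```

<para>
For three rows holding 3, 1, and 4 distinct keys, the walk would emit
8 pointers for only 3 rows.
</para>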

<para>
When extractQuery returns zero keys, <acronym>GIN</acronym> will emit an
error: for different opclasses and strategies the semantic meaning of a void
query may be different (for example, any array contains the void array,
but no array overlaps the void array), and <acronym>GIN</acronym> can't
suggest a reasonable answer.
</para>

<para>
<acronym>GIN</acronym> searches keys only by equality matching. This may
be improved in the future.
</para>
</sect1>

<sect1 id="gin-examples">
<title>Examples</title>

<para>
The <productname>PostgreSQL</productname> source distribution includes
<acronym>GIN</acronym> operator classes for one-dimensional arrays of all
internal types. The following
<filename>contrib</> modules also contain <acronym>GIN</acronym>
operator classes:
</para>

<variablelist>
<varlistentry>
<term>intarray</term>
<listitem>
<para>Enhanced support for int4[]</para>
</listitem>
</varlistentry>

<varlistentry>
<term>tsearch2</term>
<listitem>
<para>Support for inverted text indexing. This is much faster for very
large, mostly-static sets of documents.
</para>
</listitem>
</varlistentry>
</variablelist>

</sect1>

</chapter>