2010-09-20 22:08:53 +02:00
|
|
|
src/backend/access/gin/README
|
2008-03-20 18:55:15 +01:00
|
|
|
|
2006-05-02 13:28:56 +02:00
|
|
|
Gin for PostgreSQL
|
|
|
|
==================
|
|
|
|
|
|
|
|
Gin was sponsored by jfg://networks (http://www.jfg-networks.com/)
|
|
|
|
|
|
|
|
Gin stands for Generalized Inverted Index and should be considered as a genie,
|
|
|
|
not a drink.
|
|
|
|
|
|
|
|
Generalized means that the index does not know which operation it accelerates.
|
2010-11-23 21:27:50 +01:00
|
|
|
It instead works with custom strategies, defined for specific data types (read
|
|
|
|
"Index Method Strategies" in the PostgreSQL documentation). In that sense, Gin
|
2006-05-02 13:28:56 +02:00
|
|
|
is similar to GiST and differs from btree indices, which have predefined,
|
|
|
|
comparison-based operations.
|
|
|
|
|
2010-11-23 21:27:50 +01:00
|
|
|
An inverted index is an index structure storing a set of (key, posting list)
|
|
|
|
pairs, where 'posting list' is a set of documents in which the key occurs.
|
|
|
|
(A text document would usually contain many keys.) The primary goal of
|
2006-05-02 13:28:56 +02:00
|
|
|
Gin indices is support for highly scalable, full-text search in PostgreSQL.
|
|
|
|
|
|
|
|
Gin consists of a B-tree index constructed over entries (ET, entries tree),
|
|
|
|
where each entry is an element of the indexed value (element of array, lexeme
|
2010-11-23 21:27:50 +01:00
|
|
|
for tsvector) and where each tuple in a leaf page is either a pointer to a
|
|
|
|
B-tree over item pointers (PT, posting tree), or a list of item pointers
|
2006-05-02 13:28:56 +02:00
|
|
|
(PL, posting list) if the tuple is small enough.
|
|
|
|
|
2008-07-08 05:25:42 +02:00
|
|
|
Note: There is no delete operation for ET. The reason for this is that in
|
|
|
|
our experience, the set of distinct words in a large corpus changes very
|
2006-05-02 13:28:56 +02:00
|
|
|
rarely. This greatly simplifies the code and concurrency algorithms.
|
|
|
|
|
2010-11-23 21:27:50 +01:00
|
|
|
Gin comes with built-in support for one-dimensional arrays (eg. integer[],
|
2006-05-02 13:28:56 +02:00
|
|
|
text[]), but no support for NULL elements. The following operations are
|
|
|
|
available:
|
|
|
|
|
2006-09-10 02:29:35 +02:00
|
|
|
* contains: value_array @> query_array
|
|
|
|
* overlaps: value_array && query_array
|
|
|
|
* is contained by: value_array <@ query_array
|
2006-05-02 13:28:56 +02:00
|
|
|
|
|
|
|
Synopsis
|
|
|
|
--------
|
|
|
|
|
|
|
|
=# create index txt_idx on aa using gin(a);
|
|
|
|
|
|
|
|
Features
|
|
|
|
--------
|
|
|
|
|
|
|
|
* Concurrency
|
|
|
|
* Write-Ahead Logging (WAL). (Recoverability from crashes.)
|
|
|
|
* User-defined opclasses. (The scheme is similar to GiST.)
|
|
|
|
* Optimized index creation (Makes use of maintenance_work_mem to accumulate
|
|
|
|
postings in memory.)
|
2007-11-14 00:36:26 +01:00
|
|
|
* Text search support via an opclass
|
2006-05-02 13:28:56 +02:00
|
|
|
* Soft upper limit on the returned results set using a GUC variable:
|
|
|
|
gin_fuzzy_search_limit
|
|
|
|
|
|
|
|
Gin Fuzzy Limit
|
|
|
|
---------------
|
|
|
|
|
|
|
|
There are often situations when a full-text search returns a very large set of
|
|
|
|
results. Since reading tuples from the disk and sorting them could take a
|
2010-11-23 21:27:50 +01:00
|
|
|
lot of time, this is unacceptable for production. (Note that the search
|
2006-05-02 13:28:56 +02:00
|
|
|
itself is very fast.)
|
|
|
|
|
2010-11-23 21:27:50 +01:00
|
|
|
Such queries usually contain very frequent lexemes, so the results are not
|
|
|
|
very helpful. To facilitate execution of such queries Gin has a configurable
|
|
|
|
soft upper limit on the size of the returned set, determined by the
|
|
|
|
'gin_fuzzy_search_limit' GUC variable. This is set to 0 by default (no
|
2006-05-02 13:28:56 +02:00
|
|
|
limit).
|
|
|
|
|
|
|
|
If a non-zero search limit is set, then the returned set is a subset of the
|
|
|
|
whole result set, chosen at random.
|
|
|
|
|
|
|
|
"Soft" means that the actual number of returned results could slightly differ
|
2010-11-23 21:27:50 +01:00
|
|
|
from the specified limit, depending on the query and the quality of the
|
2006-05-02 13:28:56 +02:00
|
|
|
system's random number generator.
|
|
|
|
|
|
|
|
From experience, a value of 'gin_fuzzy_search_limit' in the thousands
|
|
|
|
(eg. 5000-20000) works well. This means that 'gin_fuzzy_search_limit' will
|
2010-11-23 21:27:50 +01:00
|
|
|
have no effect for queries returning a result set with less tuples than this
|
2006-05-02 13:28:56 +02:00
|
|
|
number.
|
|
|
|
|
|
|
|
Limitations
|
|
|
|
-----------
|
|
|
|
|
|
|
|
* No support for multicolumn indices
|
|
|
|
* Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
|
|
|
|
* Gin searches entries only by equality matching. This may be improved in
|
|
|
|
future.
|
|
|
|
* Gin doesn't support full scans of indices.
|
|
|
|
* Gin doesn't index NULL values.
|
|
|
|
|
|
|
|
Open Items
|
|
|
|
----------
|
|
|
|
|
|
|
|
We appreciate any comments, help and suggestions.
|
|
|
|
|
|
|
|
* Teach optimizer/executor that GIN is intrinsically clustered. i.e., it
|
|
|
|
always returns ItemPointer in ascending order.
|
|
|
|
* Tweak gincostestimate.
|
|
|
|
|
|
|
|
TODO
|
|
|
|
----
|
|
|
|
|
|
|
|
Nearest future:
|
|
|
|
|
|
|
|
* Opclasses for all types (no programming, just many catalog changes).
|
|
|
|
|
|
|
|
Distant future:
|
|
|
|
|
|
|
|
* Replace B-tree of entries to something like GiST
|
|
|
|
* Add multicolumn support
|
|
|
|
* Optimize insert operations (background index insertion)
|
|
|
|
|
|
|
|
Authors
|
|
|
|
-------
|
|
|
|
|
2010-11-23 21:27:50 +01:00
|
|
|
All work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov
|
2006-05-02 13:28:56 +02:00
|
|
|
(oleg@sai.msu.su).
|