diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index 24517402bc..c2bbad42bf 100644
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-
+
 GIN Indexes
@@ -14,8 +14,9 @@
 GIN stands for Generalized Inverted Index. It is an index structure storing a set of (key, posting list) pairs, where
- 'posting list' is a set of rows in which the key occurs. Each
- row may contain many keys.
+ a posting list is a set of rows in which the key occurs. Each
+ indexed value may contain many keys, so the same row ID may appear in
+ multiple posting lists.
@@ -45,7 +46,7 @@
 The GIN interface has a high level of abstraction,
- requiring the access method implementer to only implement the semantics of
+ requiring the access method implementer only to implement the semantics of
 the data type being accessed. The GIN layer itself takes care of concurrency, logging and searching the tree structure.
@@ -53,26 +54,14 @@
 All it takes to get a GIN access method working is to implement four user-defined methods, which define the behavior of
- keys in the tree. In short, GIN combines extensibility
- along with generality, code reuse, and a clean interface.
-
-
-
-
-
- Implementation
-
-
- Internally, GIN consists of a B-tree index constructed
- over keys, where each key is an element of the indexed value
- (element of array, for example) and where each tuple in a leaf page is
- either a pointer to a B-tree over heap pointers (PT, posting tree), or a
- list of heap pointers (PL, posting list) if the tuple is small enough.
+ keys in the tree and the relationships between keys, indexed values,
+ and indexable queries. In short, GIN combines
+ extensibility with generality, code reuse, and a clean interface.
- There are four methods that an index operator class for
- GIN must provide (prototypes are in pseudocode):
+ The four methods that an index operator class for
+ GIN must provide are:
@@ -80,9 +69,9 @@
 int compare(Datum a, Datum b)
- Compares keys (not indexed values!) and returns an integer less than
- zero, zero, or greater than zero, indicating whether the first key is
- less than, equal to, or greater than the second.
+ Compares keys (not indexed values!) and returns an integer less than
+ zero, zero, or greater than zero, indicating whether the first key is
+ less than, equal to, or greater than the second.
@@ -91,21 +80,26 @@
 Datum* extractValue(Datum inputValue, uint32 *nkeys)
- Returns an array of keys of value to be indexed, nkeys should
- contain the number of returned keys.
+ Returns an array of keys given a value to be indexed. The
+ number of returned keys must be stored into *nkeys.
- Datum* extractQuery(Datum query, uint32 nkeys,
- StrategyNumber n)
+ Datum* extractQuery(Datum query, uint32 *nkeys,
+ StrategyNumber n)
- Returns an array of keys of the query to be executed. n contains the
- strategy number of the operation (see ). Depending on n, query may be
- different type.
+ Returns an array of keys given a value to be queried; that is,
+ query is the value on the right-hand side of an
+ indexable operator whose left-hand side is the indexed column.
+ n is the strategy number of the operator within the
+ operator class (see ).
+ Often, extractQuery will need
+ to consult n to determine the data type of
+ query and the key values that need to be extracted.
+ The number of returned keys must be stored into *nkeys.
@@ -114,11 +108,16 @@
 bool consistent(bool check[], StrategyNumber n, Datum query)
- Returns TRUE if the indexed value satisfies the query qualifier with
- strategy n (or may satisfy in case of RECHECK mark in operator class).
- Each element of the check array is TRUE if the indexed value has a
- corresponding key in the query: if (check[i] == TRUE) the i-th key of
- the query is present in the indexed value.
+ Returns TRUE if the indexed value satisfies the query operator with
+ strategy number n (or may satisfy, if the operator is
+ marked RECHECK in the operator class). The check array has
+ the same length as the number of keys previously returned by
+ extractQuery for this query. Each element of the
+ check array is TRUE if the indexed value contains the
+ corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
+ extractQuery result array is present in the indexed value.
+ The original query datum (not the extracted key array!) is
+ passed in case the consistent method needs to consult it.
@@ -127,6 +126,19 @@
+
+ Implementation
+
+
+ Internally, a GIN index contains a B-tree index
+ constructed over keys, where each key is an element of the indexed value
+ (a member of an array, for example) and where each tuple in a leaf page is
+ either a pointer to a B-tree over heap pointers (PT, posting tree), or a
+ list of heap pointers (PL, posting list) if the list is small enough.
+
+
+
+
 GIN tips and tricks
@@ -134,44 +146,43 @@
 Create vs insert
-
- In most cases, insertion into a GIN index is slow
- due to the likelihood of many keys being inserted for each value.
- So, for bulk insertions into a table it is advisable to to drop the GIN
- index and recreate it after finishing bulk insertion.
-
+
+ In most cases, insertion into a GIN index is slow
+ due to the likelihood of many keys being inserted for each value.
+ So, for bulk insertions into a table it is advisable to drop the GIN
+ index and recreate it after finishing bulk insertion.
+
- gin_fuzzy_search_limit
+
-
- The primary goal of developing GIN indices was
- support for highly scalable, full-text search in
- PostgreSQL and there are often situations when
- a full-text search returns a very large set of results.
Since reading
- tuples from the disk and sorting them could take a lot of time, this is
- unacceptable for production. (Note that the index search itself is very
- fast.)
+
+ The primary goal of developing GIN indexes was
+ to create support for highly scalable, full-text search in
+ PostgreSQL, and there are often situations when
+ a full-text search returns a very large set of results. Moreover, this
+ often happens when the query contains very frequent words, so that the
+ large result set is not even useful. Since reading many
+ tuples from the disk and sorting them could take a lot of time, this is
+ unacceptable for production. (Note that the index search itself is very
+ fast.)
+
+
+ To facilitate controlled execution of such queries
+ GIN has a configurable soft upper limit on the size
+ of the returned set, the
+ gin_fuzzy_search_limit configuration parameter.
+ It is set to 0 (meaning no limit) by default.
+ If a non-zero limit is set, then the returned set is a subset of
+ the whole result set, chosen at random.
+
+
+ Soft means that the actual number of returned results
+ could differ slightly from the specified limit, depending on the query
+ and the quality of the system's random number generator.
-
- Such queries usually contain very frequent words, so the results are not
- very helpful. To facilitate execution of such queries
- GIN has a configurable soft upper limit of the size
- of the returned set, determined by the
- gin_fuzzy_search_limit GUC variable. It is set to 0 by
- default (no limit).
-
-
- If a non-zero search limit is set, then the returned set is a subset of
- the whole result set, chosen at random.
-
-
- Soft means that the actual number of returned results
- could slightly differ from the specified limit, depending on the query
- and the quality of the system's random number generator.
-
@@ -182,21 +193,30 @@
 Limitations
- GIN doesn't support full index scans due to their
- extreme inefficiency: because there are often many keys per value,
- each heap pointer will be returned several times.
+ GIN doesn't support full index scans: because there are
+ often many keys per value, each heap pointer would be returned many times,
+ and there is no easy way to prevent this.
 When extractQuery returns zero keys,
- GIN will emit an error: for different opclasses and
- strategies the semantic meaning of a void query may be different (for
- example, any array contains the void array, but they don't overlap the
- void array), and GIN can't suggest a reasonable answer.
+ GIN will emit an error. Depending on the operator,
+ a void query might match all, some, or none of the indexed values (for
+ example, every array contains the empty array, but does not overlap the
+ empty array), and GIN can't determine the correct
+ answer, nor produce a full-index-scan result if it could determine that
+ that was correct.
- GIN searches keys only by equality matching. This may
+ It is not an error for extractValue to return zero keys,
+ but in this case the indexed value will be unrepresented in the index.
+ This is another reason why full index scan is not useful — it would
+ miss such rows.
+
+
+
+ GIN searches keys only by equality matching. This may
 be improved in future.
@@ -206,12 +226,12 @@
 The PostgreSQL source distribution includes
- GIN classes for one-dimensional arrays of all internal
+ GIN classes for one-dimensional arrays of all internal
 types.
The following contrib modules also contain GIN
- operator classes:
+ operator classes:
-
+
 intarray
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index fd8bb7251e..1edceebd2d 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1,4 +1,4 @@
-
+
 Indexes
@@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id);
 PostgreSQL provides several index types:
- B-tree, Hash, GIN and GiST. Each index type uses a different
+ B-tree, Hash, GiST and GIN. Each index type uses a different
 algorithm that is best suited to different types of queries. By default, the CREATE INDEX command will create a B-tree index, which fits the most common situations.
@@ -247,8 +247,8 @@ CREATE INDEX name ON table
 GIN index
- GIN is a inverted index and it's usable for values which have more
- than one key, arrays for example. Like GiST, GIN may support
+ GIN indexes are inverted indexes which can handle values that contain more
+ than one key, arrays for example. Like GiST, GIN can support
 many different user-defined indexing strategies and the particular operators with which a GIN index can be used vary depending on the indexing strategy.
@@ -267,7 +267,8 @@ CREATE INDEX name ON table
 (See for the meaning of these operators.) Other GIN operator classes are available in the contrib
- tsearch2 and intarray modules. For more information see .
+ tsearch2 and intarray modules.
+ For more information see .
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 444839399e..a66dd3c4ee 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -1,4 +1,4 @@
-
+
 Interfacing Extensions To Indexes
@@ -243,15 +243,16 @@
- GIN indexes are similar to GiST's in flexibility: they don't have a fixed
- et of strategies. Instead, the consistency support routine
- interprets the strategy numbers accordingly with operator class
- definition. As an example, strategies of operator class over arrays
- is shown in .
+ GIN indexes are similar to GiST indexes in flexibility: they don't have a
+ fixed set of strategies. Instead the support routines of each operator
+ class interpret the strategy numbers according to the operator class's
+ definition. As an example, the strategy numbers used by the built-in
+ operator classes for arrays are
+ shown in .
- GIN Array's Strategies
+ GIN Array Strategies
@@ -388,36 +389,35 @@
 consistent - determine whether key satisfies the
- query qualifier
+ query qualifier
 1
- union - compute union of of a set of given keys
+ union - compute union of a set of keys
 2
- compress - computes a compressed representation of a key or value
- to be indexed
+ compress - compute a compressed representation of a key or value
+ to be indexed
 3
- decompress - computes a decompressed representation of a
- compressed key
+ decompress - compute a decompressed representation of a
+ compressed key
 4
 penalty - compute penalty for inserting new key into subtree
- with given subtree's key
+ with given subtree's key
 5
 picksplit - determine which entries of a page are to be moved
- to the new page and compute the union keys for resulting pages
+ to the new page and compute the union keys for resulting pages
 6
- equal - compare two keys and returns true if they are equal
-
+ equal - compare two keys and return true if they are equal
 7
@@ -441,23 +441,22 @@
- compare - Compare two keys and return an integer less than zero, zero, or
- greater than zero, indicating whether the first key is less than, equal to,
- or greater than the second.
-
+ compare - compare two keys and return an integer less than zero, zero,
+ or greater than zero, indicating whether the first key is less than,
+ equal to, or greater than the second
+
 1
- extractValue - extract keys from value to be indexed
+ extractValue - extract keys from a value to be indexed
 2
- extractQuery - extract keys from query
+ extractQuery - extract keys from a query condition
 3
- consistent - determine whether value matches by the
- query
+ consistent - determine whether value matches query condition
 4
@@ -822,12 +821,16 @@ CREATE OPERATOR CLASS polygon_ops
 STORAGE box;
- At present, only the GiST and GIN index method supports a
+ At present, only the GiST and GIN index methods support a
 STORAGE type that's different from the column data type.
- The GiST compress and decompress support
+ The GiST compress and decompress support
 routines must deal with data-type conversion when STORAGE
- is used. Functions named extractValue and extractQuery
- do conversation into internally used types for GIN.
+ is used. In GIN, the STORAGE type identifies the type of
+ the key values, which normally is different from the type
+ of the indexed column — for example, an operator class for
+ integer array columns might have keys that are just integers. The
+ GIN extractValue and extractQuery support
+ routines are responsible for extracting keys from indexed values.
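
Illustrative note, not part of the patch: the four GIN support methods described in the gin.sgml hunks can be sketched in plain C for an integer-array operator class, where each key is a single integer. This is only a simplified model under stated assumptions. The IntArray type, the two strategy numbers, and the build_check helper are hypothetical names invented for this sketch; real support functions take Datum arguments and use the PostgreSQL function-call interface, and this consistent takes the key count where the documented prototype takes the query datum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an indexed value or a query: an int array.
 * Real operator classes receive Datum arguments instead. */
typedef struct IntArray {
    const int *elems;
    uint32_t   len;
} IntArray;

/* Strategy numbers invented for this sketch only. */
enum { STRAT_OVERLAP = 1, STRAT_CONTAINS = 2 };

/* compare: orders two keys (not two indexed values). */
int key_compare(int a, int b)
{
    return (a > b) - (a < b);
}

/* extractValue: each array element becomes one key; the key count is
 * stored into *nkeys. The caller frees the returned array. */
int *extract_value(const IntArray *value, uint32_t *nkeys)
{
    assert(value != NULL && nkeys != NULL);
    int *keys = malloc(sizeof(int) * (value->len ? value->len : 1));
    if (keys == NULL) {
        *nkeys = 0;
        return NULL;
    }
    for (uint32_t i = 0; i < value->len; i++)
        keys[i] = value->elems[i];
    *nkeys = value->len;
    return keys;
}

/* extractQuery: in this sketch both strategies take an int array on the
 * right-hand side, so the strategy number does not change the extraction. */
int *extract_query(const IntArray *query, uint32_t *nkeys, int strategy)
{
    (void) strategy;
    return extract_value(query, nkeys);
}

/* Hypothetical stand-in for the index search: sets check[i] to true when
 * query key i occurs among the keys of the indexed value. */
void build_check(const int *vkeys, uint32_t nv,
                 const int *qkeys, uint32_t nq, bool check[])
{
    for (uint32_t i = 0; i < nq; i++) {
        check[i] = false;
        for (uint32_t j = 0; j < nv; j++)
            if (key_compare(qkeys[i], vkeys[j]) == 0)
                check[i] = true;
    }
}

/* consistent: overlap needs any query key present, contains needs all. */
bool consistent(const bool check[], uint32_t nkeys, int strategy)
{
    bool any = false, all = true;
    for (uint32_t i = 0; i < nkeys; i++) {
        any = any || check[i];
        all = all && check[i];
    }
    return strategy == STRAT_CONTAINS ? all : any;
}
```

For an indexed value {1, 5, 9} and a query {5, 7}, the check array comes out as {true, false}, so the overlap strategy reports a match and the contains strategy does not. With zero query keys this sketch would report "contains" as true and "overlap" as false, which mirrors the empty-array example in the Limitations hunk; the patch text says GIN itself raises an error in that case rather than answering.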