215 lines
6.1 KiB
Plaintext
215 lines
6.1 KiB
Plaintext
|
<sect1 id="pgtrgm">
|
||
|
<title>pg_trgm</title>
|
||
|
|
||
|
<indexterm zone="pgtrgm">
|
||
|
<primary>pgtrgm</primary>
|
||
|
</indexterm>
|
||
|
|
||
|
<para>
|
||
|
The <literal>pg_trgm</literal> module provides functions and index classes
|
||
|
for determining the similarity of text based on trigram matching.
|
||
|
</para>
|
||
|
|
||
|
<sect2>
|
||
|
<title>Trigram (or Trigraph)</title>
|
||
|
<para>
|
||
|
A trigram is a set of three consecutive characters taken
|
||
|
from a string. A string is considered to have two spaces
|
||
|
prefixed and one space suffixed when determining the set
|
||
|
of trigrams that comprise the string.
|
||
|
</para>
|
||
|
<para>
|
||
|
eg. The set of trigrams in the word "cat" is " c", " ca",
|
||
|
"at " and "cat".
|
||
|
</para>
|
||
|
</sect2>
|
||
|
|
||
|
<sect2>
|
||
|
<title>Public Functions</title>
|
||
|
<table>
|
||
|
<title><literal>pg_trgm</literal> functions</title>
|
||
|
<tgroup cols="2">
|
||
|
<thead>
|
||
|
<row>
|
||
|
<entry>Function</entry>
|
||
|
<entry>Description</entry>
|
||
|
</row>
|
||
|
</thead>
|
||
|
<tbody>
|
||
|
<row>
|
||
|
<entry><literal>real similarity(text, text)</literal></entry>
|
||
|
<entry>
|
||
|
<para>
|
||
|
Returns a number that indicates how closely matches the two
|
||
|
arguments are. A zero result indicates that the two words
|
||
|
are completely dissimilar, and a result of one indicates that
|
||
|
the two words are identical.
|
||
|
</para>
|
||
|
</entry>
|
||
|
</row>
|
||
|
<row>
|
||
|
<entry><literal>real show_limit()</literal></entry>
|
||
|
<entry>
|
||
|
<para>
|
||
|
Returns the current similarity threshold used by the '%'
|
||
|
operator. This in effect sets the minimum similarity between
|
||
|
two words in order that they be considered similar enough to
|
||
|
be misspellings of each other, for example.
|
||
|
</para>
|
||
|
</entry>
|
||
|
</row>
|
||
|
<row>
|
||
|
<entry><literal>real set_limit(real)</literal></entry>
|
||
|
<entry>
|
||
|
<para>
|
||
|
Sets the current similarity threshold that is used by the '%'
|
||
|
operator, and is returned by the show_limit() function.
|
||
|
</para>
|
||
|
</entry>
|
||
|
</row>
|
||
|
<row>
|
||
|
<entry><literal>text[] show_trgm(text)</literal></entry>
|
||
|
<entry>
|
||
|
<para>
|
||
|
Returns an array of all the trigrams of the supplied text
|
||
|
parameter.
|
||
|
</para>
|
||
|
</entry>
|
||
|
</row>
|
||
|
<row>
|
||
|
<entry>Operator: <literal>text % text (returns boolean)</literal></entry>
|
||
|
<entry>
|
||
|
<para>
|
||
|
The '%' operator returns TRUE if its two arguments have a similarity
|
||
|
that is greater than the similarity threshold set by set_limit(). It
|
||
|
will return FALSE if the similarity is less than the current
|
||
|
threshold.
|
||
|
</para>
|
||
|
</entry>
|
||
|
</row>
|
||
|
</tbody>
|
||
|
</tgroup>
|
||
|
</table>
|
||
|
</sect2>
|
||
|
|
||
|
<sect2>
|
||
|
<title>Public Index Operator Class</title>
|
||
|
<para>
|
||
|
The <literal>pg_trgm</literal> module comes with the
|
||
|
<literal>gist_trgm_ops</literal> index operator class that allows a
|
||
|
developer to create an index over a text column for the purpose
|
||
|
of very fast similarity searches.
|
||
|
</para>
|
||
|
<para>
|
||
|
To use this index, the '%' operator must be used and an appropriate
|
||
|
similarity threshold for the application must be set. Example:
|
||
|
</para>
|
||
|
<programlisting>
|
||
|
CREATE TABLE test_trgm (t text);
|
||
|
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
|
||
|
</programlisting>
|
||
|
<para>
|
||
|
At this point, you will have an index on the t text column that you
|
||
|
can use for similarity searching. Example:
|
||
|
</para>
|
||
|
<programlisting>
|
||
|
SELECT
|
||
|
t,
|
||
|
similarity(t, 'word') AS sml
|
||
|
FROM
|
||
|
test_trgm
|
||
|
WHERE
|
||
|
t % 'word'
|
||
|
ORDER BY
|
||
|
sml DESC, t;
|
||
|
</programlisting>
|
||
|
<para>
|
||
|
This will return all values in the text column that are sufficiently
|
||
|
similar to 'word', sorted from best match to worst. The index will
|
||
|
be used to make this a fast operation over very large data sets.
|
||
|
</para>
|
||
|
</sect2>
|
||
|
|
||
|
<sect2>
|
||
|
<title>Tsearch2 Integration</title>
|
||
|
<para>
|
||
|
Trigram matching is a very useful tool when used in conjunction
|
||
|
with a text index created by the Tsearch2 contrib module. (See
|
||
|
contrib/tsearch2)
|
||
|
</para>
|
||
|
<para>
|
||
|
The first step is to generate an auxiliary table containing all
|
||
|
the unique words in the Tsearch2 index:
|
||
|
</para>
|
||
|
<programlisting>
|
||
|
CREATE TABLE words AS SELECT word FROM
|
||
|
stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
|
||
|
</programlisting>
|
||
|
<para>
|
||
|
Where 'documents' is a table that has a text field 'bodytext'
|
||
|
that TSearch2 is used to search. The use of the 'simple' dictionary
|
||
|
with the to_tsvector function, instead of just using the already
|
||
|
existing vector is to avoid creating a list of already stemmed
|
||
|
words. This way, only the original, unstemmed words are added
|
||
|
to the word list.
|
||
|
</para>
|
||
|
<para>
|
||
|
Next, create a trigram index on the word column:
|
||
|
</para>
|
||
|
<programlisting>
|
||
|
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
|
||
|
</programlisting>
|
||
|
<para>
|
||
|
or
|
||
|
</para>
|
||
|
<programlisting>
|
||
|
CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
|
||
|
</programlisting>
|
||
|
<para>
|
||
|
Now, a <literal>SELECT</literal> query similar to the example above can be
|
||
|
used to suggest spellings for misspelled words in user search terms. A
|
||
|
useful extra clause is to ensure that the similar words are also
|
||
|
of similar length to the misspelled word.
|
||
|
</para>
|
||
|
<para>
|
||
|
<note>
|
||
|
<para>
|
||
|
Since the 'words' table has been generated as a separate,
|
||
|
static table, it will need to be periodically regenerated so that
|
||
|
it remains up to date with the word list in the Tsearch2 index.
|
||
|
</para>
|
||
|
</note>
|
||
|
</para>
|
||
|
</sect2>
|
||
|
|
||
|
<sect2>
|
||
|
<title>References</title>
|
||
|
<para>
|
||
|
Tsearch2 Development Site
|
||
|
<ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
|
||
|
</para>
|
||
|
<para>
|
||
|
GiST Development Site
|
||
|
<ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
|
||
|
</para>
|
||
|
</sect2>
|
||
|
|
||
|
<sect2>
|
||
|
<title>Authors</title>
|
||
|
<para>
|
||
|
Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
|
||
|
</para>
|
||
|
<para>
|
||
|
Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
|
||
|
</para>
|
||
|
<para>
|
||
|
Documentation: Christopher Kings-Lynne
|
||
|
</para>
|
||
|
<para>
|
||
|
This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
|
||
|
</para>
|
||
|
</sect2>
|
||
|
|
||
|
</sect1>
|
||
|
|