postgresql/doc/src/sgml/trgm.sgml

<sect1 id="pgtrgm">
 <title>pg_trgm</title>
 
 <indexterm zone="pgtrgm">
  <primary>pgtrgm</primary>
 </indexterm>

 <para>
  The <literal>pg_trgm</literal> module provides functions and index classes
  for determining the similarity of text based on trigram matching.
 </para>

 <sect2>
  <title>Trigram (or Trigraph)</title>
  <para>
   A trigram is a group of three consecutive characters taken
   from a string.  A string is considered to have two spaces
   prefixed and one space suffixed when determining the set
   of trigrams that comprise the string.
  </para>
  <para>
   For example, the set of trigrams in the word "cat" is "  c", " ca",
   "cat", and "at ".
  </para>
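  <para>
   As a quick check, the <literal>show_trgm</literal> function described
   below can be used to list these trigrams.  This is a minimal sketch that
   assumes the module is already installed in the current database:
  </para>
  <programlisting>
-- List the trigrams extracted from the word 'cat'
SELECT show_trgm('cat');
  </programlisting>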
 </sect2>

 <sect2>
  <title>Public Functions</title>
  <table>
   <title><literal>pg_trgm</literal> functions</title>
   <tgroup cols="2">
    <thead>
     <row>
      <entry>Function</entry>
      <entry>Description</entry>
     </row>
    </thead>
    <tbody>
     <row>
      <entry><literal>real similarity(text, text)</literal></entry>
      <entry>
       <para>
        Returns a number that indicates how closely the two
        arguments match.  A zero result indicates that the two words
        are completely dissimilar, and a result of one indicates that
        the two words are identical.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real show_limit()</literal></entry>
      <entry>
       <para>
        Returns the current similarity threshold used by the '%'
        operator.  This threshold is, in effect, the minimum similarity
        two words must have in order to be considered similar enough to
        be misspellings of each other, for example.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>real set_limit(real)</literal></entry>
      <entry>
       <para>
        Sets the current similarity threshold that is used by the '%'
        operator, and is returned by the show_limit() function.
       </para>
      </entry>
     </row>
     <row>
      <entry><literal>text[] show_trgm(text)</literal></entry>
      <entry>
       <para>
        Returns an array of all the trigrams of the supplied text
        parameter.
       </para>
      </entry>
     </row>
     <row>
      <entry>Operator: <literal>text % text (returns boolean)</literal></entry> 
      <entry>
       <para>
        The '%' operator returns TRUE if its two arguments have a similarity
        that is greater than the similarity threshold set by set_limit(). It
        will return FALSE if the similarity is less than the current
        threshold.
       </para>
      </entry>
     </row>
    </tbody>
   </tgroup>
  </table>
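  <para>
   The following sketch shows how these functions and the operator might be
   called together.  The sample words and the threshold value of 0.5 are
   arbitrary choices made for illustration, not recommended settings:
  </para>
  <programlisting>
-- Show the threshold currently used by the % operator
SELECT show_limit();

-- Change the threshold for the current session (0.5 is an assumed value)
SELECT set_limit(0.5);

-- Compute the similarity of two words as a number between 0 and 1
SELECT similarity('word', 'words');

-- Test whether the two words are similar under the current threshold
SELECT 'word' % 'words';
  </programlisting>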
 </sect2>

 <sect2>
  <title>Public Index Operator Class</title>
  <para>
   The <literal>pg_trgm</literal> module comes with the 
   <literal>gist_trgm_ops</literal> index operator class that allows a
   developer to create an index over a text column for the purpose
   of very fast similarity searches.
  </para>
  <para>
   To use this index, the '%' operator must be used and an appropriate
   similarity threshold for the application must be set. Example:
  </para>
  <programlisting>
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
  </programlisting>
  <para>
   At this point, you will have an index on the t text column that you
   can use for similarity searching. Example:
  </para>
  <programlisting>
SELECT
	t,
	similarity(t, 'word') AS sml
FROM
	test_trgm
WHERE
	t % 'word'
ORDER BY
	sml DESC, t;
  </programlisting>
  <para>
   This will return all values in the text column that are sufficiently
   similar to 'word', sorted from best match to worst.  The index will
   be used to make this a fast operation over very large data sets.
  </para>
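  <para>
   A minimal sketch of a complete session might look as follows.  The sample
   rows and the threshold of 0.4 are illustrative assumptions only; a
   suitable threshold depends on the application.  Note that on a table this
   small the planner may well prefer a sequential scan over the index.
  </para>
  <programlisting>
-- Populate the table with a few sample values (hypothetical data)
INSERT INTO test_trgm VALUES ('word'), ('words'), ('sword'), ('banana');

-- Choose the similarity threshold used by the % operator
SELECT set_limit(0.4);

-- Inspect the plan chosen for the similarity search
EXPLAIN
SELECT t, similarity(t, 'word') AS sml
FROM test_trgm
WHERE t % 'word'
ORDER BY sml DESC, t;
  </programlisting>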
 </sect2>

 <sect2>
  <title>Tsearch2 Integration</title>
  <para>
   Trigram matching is a very useful tool when used in conjunction
   with a text index created by the Tsearch2 contrib module (see
   contrib/tsearch2).
  </para>
  <para>
   The first step is to generate an auxiliary table containing all
   the unique words in the Tsearch2 index:
  </para>
  <programlisting>
CREATE TABLE words AS SELECT word FROM 
	stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
  </programlisting>
  <para>
   Here 'documents' is a table that has a text field 'bodytext'
   that Tsearch2 is used to search.  The 'simple' dictionary is used
   with the to_tsvector function, instead of just reusing the already
   existing vector, to avoid creating a list of already stemmed
   words.  This way, only the original, unstemmed words are added
   to the word list.
  </para>
  <para>
   Next, create a trigram index on the word column:
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
  </programlisting>
  <para>
   or
  </para>
  <programlisting>
CREATE INDEX words_idx ON words USING gin(word gin_trgm_ops);
  </programlisting>
  <para>
   Now, a <literal>SELECT</literal> query similar to the example above can be 
   used to suggest spellings for misspelled words in user search terms. A
   useful extra clause is to ensure that the similar words are also
   of similar length to the misspelled word.
  </para>
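  <para>
   Such a query could be sketched as follows.  The misspelled search term
   'serach' and the length tolerance of two characters are assumptions
   chosen for illustration; both would normally come from the application:
  </para>
  <programlisting>
-- Suggest spellings for the (hypothetical) misspelled term 'serach',
-- keeping only candidates whose length is within two characters of it
SELECT word, similarity(word, 'serach') AS sml
FROM words
WHERE word % 'serach'
  AND length(word) BETWEEN length('serach') - 2 AND length('serach') + 2
ORDER BY sml DESC, word;
  </programlisting>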
  <para>
   <note>
    <para>
     Since the 'words' table has been generated as a separate,
     static table, it will need to be periodically regenerated so that
     it remains up to date with the word list in the Tsearch2 index.
    </para>
   </note>
  </para>
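  <para>
   One possible way to regenerate the table is sketched below.  It assumes
   the same 'documents' table and 'bodytext' field as above, and it reuses
   the existing table so that the trigram index is kept:
  </para>
  <programlisting>
-- Rebuild the word list in a single transaction
BEGIN;
TRUNCATE words;
INSERT INTO words SELECT word FROM
	stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
COMMIT;
  </programlisting>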
 </sect2>

 <sect2>
  <title>References</title>
  <para>
   Tsearch2 Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
  </para>
  <para>
   GiST Development Site
   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
  </para>
 </sect2>

 <sect2>
  <title>Authors</title>
  <para>
   Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
  </para>
  <para>
   Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd., Russia
  </para>
  <para>
   Documentation: Christopher Kings-Lynne 
  </para>
  <para>
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
  </para>
 </sect2>

</sect1>