<chapter id="textsearch">
<title>Full Text Search</title>
<sect1 id="textsearch-intro">
<title>Introduction</title>
<para>
Full Text Searching (or just <firstterm>text search</firstterm>) allows
identifying documents that satisfy a <firstterm>query</firstterm>, and
optionally sorting them by relevance to the query. The most common search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
<varname>query</varname>. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document. Full text indexing can be done inside the
database or outside. Doing indexing inside the database allows easy access
to document metadata to assist in indexing and display.
</para>
<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions are
not sufficient because they cannot easily handle derived words,
e.g., <literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents which contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for <emphasis>any</emphasis> of them, but it is tedious and
error-prone (some words can have several thousand derivatives).
</para>
</listitem>
<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
</listitem>
<listitem>
<para>
They tend to be slow because they process all documents for every search and
there is no index support.
</para>
</listitem>
</itemizedlist>
<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>
<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is
useful to identify various lexemes, e.g. digits, words, complex words,
email addresses, so they can be processed differently. In principle
lexemes depend on the specific application but for an ordinary search it
is useful to have a predefined list of lexemes. <!-- add list of lexemes.
-->
</para>
</listitem>
<listitem>
<para>
<emphasis>Dictionaries</emphasis> allow the conversion of lexemes into
a <emphasis>normalized form</emphasis> so it is not necessary to enter
search words in a specific form.
</para>
</listitem>
<listitem>
<para>
<emphasis>Store</emphasis> preprocessed documents optimized for
searching. For example, represent each document as a sorted array
of lexemes. Along with lexemes it is desirable to store positional
information to use for <varname>proximity ranking</varname>, so that
a document which contains a more "dense" region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>
<para>
Dictionaries allow fine-grained control over how lexemes are created. With
dictionaries you can:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define "stop words" that should not be indexed.
</para>
</listitem>
<listitem>
<para>
Map synonyms to a single word using <application>ispell</>.
</para>
</listitem>
<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>ispell</> dictionary.
</para>
</listitem>
<listitem>
<para>
Map different variations of a word to a canonical form using
<application>snowball</> stemmer rules.
</para>
</listitem>
</itemizedlist>
<para>
A data type <type>tsvector</type> (<xref linkend="datatype-textsearch">)
is provided for storing preprocessed documents,
along with a type <type>tsquery</type> for representing textual
queries. Also, a full text search operator <literal>@@</literal> is defined
for these data types (<xref linkend="textsearch-searches">). Full text
searches can be accelerated using indexes (<xref
linkend="textsearch-indexes">).
</para>
<sect2 id="textsearch-document">
<title>What Is a <firstterm>Document</firstterm>?</title>
<indexterm zone="textsearch-document">
<primary>text search</primary>
<secondary>document</secondary>
</indexterm>
<para>
A document can be a simple text file stored in the file system. The full
text indexing engine can parse text files and store associations of lexemes
(words) with their parent document. Later, these associations are used to
search for documents which contain query words. In this case, the database
can be used to store the full text index and for executing searches, and
some unique identifier can be used to retrieve the document from the file
system.
</para>
<para>
A document can also be any textual database attribute or a combination
(concatenation), which in turn can be stored in various tables or obtained
dynamically. In other words, a document can be constructed from different
parts for indexing and it might not exist as a whole. For example:
<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;
SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE mid = did AND mid = 12;
</programlisting>
</para>
<note>
<para>
Actually, in the previous example queries, <literal>COALESCE</literal>
<!-- TODO make this a link? -->
should be used to prevent a <literal>NULL</literal> attribute from causing
a <literal>NULL</literal> result.
</para>
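<para>
For example, the first query above can be rewritten as:
<programlisting>
SELECT coalesce(title,'') || ' ' || coalesce(author,'') || ' ' ||
       coalesce(abstract,'') || ' ' || coalesce(body,'') AS document
FROM messages
WHERE mid = 12;
</programlisting>
</para>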
</note>
</sect2>
<sect2 id="textsearch-searches">
<title>Performing Searches</title>
<para>
Full text searching in <productname>PostgreSQL</productname> is based on
the operator <literal>@@</literal>, which tests whether a <type>tsvector</type>
(document) matches a <type>tsquery</type> (query). Also, this operator
supports <type>text</type> input, allowing explicit conversion of a text
string to <type>tsvector</type> to be skipped. The variants available
are:
<programlisting>
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
</programlisting>
</para>
<para>
The match operator <literal>@@</literal> returns <literal>true</literal> if
the <type>tsvector</type> matches the <type>tsquery</type>. It doesn't
matter which data type is written first:
<programlisting>
SELECT 'cat &amp; rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
t
SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
f
</programlisting>
</para>
<para>
The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>.
The form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
<xref linkend="functions-textsearch"> contains a complete list of full text
search functions and operators.
</para>
</sect2>
<sect2 id="textsearch-configurations">
<title>Configurations</title>
<indexterm zone="textsearch-configurations">
<primary>text search</primary>
<secondary>configurations</secondary>
</indexterm>
<para>
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g. parse based on more than just white space.
This functionality is controlled by <emphasis>configurations</>.
Fortunately, <productname>PostgreSQL</> comes with predefined
configurations for many languages. (<application>psql</>'s <command>\dF</>
shows all predefined configurations.)
</para>
<para>
During installation an appropriate configuration was selected and
<xref linkend="guc-default-text-search-config"> was set accordingly
in <filename>postgresql.conf</>. If you are using the same text search
configuration for the entire cluster you can use the value in
<filename>postgresql.conf</>. If you use different configurations across
the cluster but the same text search configuration for an entire database,
use <command>ALTER DATABASE ... SET</>. If not, you must set
<varname>default_text_search_config</varname> in each session. Many functions
also take an optional configuration name.
</para>
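<para>
For example (<literal>mydb</> here is a placeholder for your database name):
<programlisting>
ALTER DATABASE mydb SET default_text_search_config TO 'pg_catalog.english';
-- or, for the current session only:
SET default_text_search_config = 'pg_catalog.english';
</programlisting>
</para>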
</sect2>
</sect1>
<sect1 id="textsearch-tables">
<title>Tables and Indexes</title>
<para>
The previous section described how to perform full text searches using
constant strings. This section shows how to search table data, optionally
using indexes.
</para>
<sect2 id="textsearch-tables-search">
<title>Searching a Table</title>
<para>
It is possible to do a full text table search with no index. A simple query
to print the <literal>title</> of each row whose <literal>body</> contains
the word <literal>friend</> is:
<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('friend');
</programlisting>
</para>
<para>
The query above specifies the <literal>english</> configuration explicitly,
rather than relying on <xref linkend="guc-default-text-search-config">.
A more complex query is to
select the ten most recent documents which contain <literal>create</> and
<literal>table</> in the <literal>title</> or <literal>body</>:
<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('create &amp; table')
ORDER BY dlm DESC LIMIT 10;
</programlisting>
<literal>dlm</> is the last-modified date so we
used <command>ORDER BY dlm DESC LIMIT 10</> to get the ten most recent
matches. For clarity we omitted the <function>coalesce</function> function
which prevents the unwanted effect of <literal>NULL</literal>
concatenation.
</para>
</sect2>
<sect2 id="textsearch-tables-index">
<title>Creating Indexes</title>
<para>
We can create a <acronym>GIN</acronym> (<xref
linkend="textsearch-indexes">) index to speed up the search:
<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
</programlisting>
Notice that the 2-argument version of <function>to_tsvector</function> is
used. Only text search functions which specify a configuration name can
be used in expression indexes (<xref linkend="indexes-expressional">).
This is because the index contents must be unaffected by <xref
linkend="guc-default-text-search-config">. If they were affected, the
index contents might be inconsistent because different entries could
contain <type>tsvector</>s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
</para>
<para>
Because the two-argument version of <function>to_tsvector</function> was
used in the index above, only a query that uses the 2-argument
version of <function>to_tsvector</function> with the same configuration
name will use that index, i.e. <literal>WHERE 'a &amp; b' @@
to_tsvector('english', body)</> will use the index, but <literal>WHERE
'a &amp; b' @@ to_tsvector(body)</> and <literal>WHERE 'a &amp; b' @@
body::tsvector</> will not. This guarantees that an index will be used
only with the same configuration used to create the index rows.
</para>
<para>
It is possible to set up more complex expression indexes where the
configuration name is specified by another column, e.g.:
<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
</programlisting>
where <literal>config_name</> is a column in the <literal>pgweb</>
table. This allows mixed configurations in the same index while
recording which configuration was used for each index row.
</para>
<para>
Indexes can even concatenate columns:
<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
</programlisting>
</para>
<para>
A more complex case is to create a separate <type>tsvector</> column
to hold the output of <function>to_tsvector()</>. This example is a
concatenation of <literal>title</literal> and <literal>body</literal>,
with ranking information. We assign different labels to them to encode
information about the origin of each word:
<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearch_index tsvector;
UPDATE pgweb SET textsearch_index =
    setweight(to_tsvector('english', coalesce(title,'')), 'A') ||
    setweight(to_tsvector('english', coalesce(body,'')), 'D');
</programlisting>
Then we create a <acronym>GIN</acronym> index to speed up the search:
<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearch_index);
</programlisting>
After vacuuming, we are ready to perform a fast full text search:
<programlisting>
SELECT ts_rank_cd(textsearch_index, q) AS rank, title
FROM pgweb, to_tsquery('create &amp; table') q
WHERE q @@ textsearch_index
ORDER BY rank DESC LIMIT 10;
</programlisting>
It is necessary to create a trigger to keep the new <type>tsvector</>
column current whenever <literal>title</> or <literal>body</> changes.
Keep in mind that, just like with expression indexes, it is important to
specify the configuration name when creating text search data types
inside triggers so the column's contents are not affected by changes to
<varname>default_text_search_config</>.
</para>
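<para>
A minimal sketch of such a trigger in <application>PL/pgSQL</> (the
function and trigger names are illustrative):
<programlisting>
CREATE FUNCTION pgweb_update_textsearch() RETURNS trigger AS $$
BEGIN
    -- Rebuild the tsvector with an explicit configuration name, so the
    -- result does not depend on default_text_search_config.
    NEW.textsearch_index :=
        setweight(to_tsvector('english', coalesce(NEW.title,'')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.body,'')), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER textsearch_update BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE pgweb_update_textsearch();
</programlisting>
</para>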
</sect2>
</sect1>
<sect1 id="textsearch-controls">
<title>Additional Controls</title>
<para>
To implement full text searching there must be a function to create a
<type>tsvector</type> from a document and a <type>tsquery</type> from a
user query. Also, we need to return results in some order, i.e., we need
a function which compares documents with respect to their relevance to
the <type>tsquery</type>. Full text searching in
<productname>PostgreSQL</productname> provides support for all of these
functions.
</para>
<sect2 id="textsearch-parser">
<title>Parsing</title>
<para>
Full text searching in <productname>PostgreSQL</productname> provides
function <function>to_tsvector</function>, which converts a document to
the <type>tsvector</type> data type. More details are available in <xref
linkend="functions-textsearch-tsvector">, but for now consider a simple example:
<programlisting>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
to_tsvector
-----------------------------------------------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</programlisting>
</para>
<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>, the word <literal>rats</literal> became
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
ignored.
</para>
<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document (<literal>a fat cat sat on a mat - it ate a
fat rats</literal>) into words and corresponding types. The default parser
recognizes 23 types. Each word, depending on its type, passes through a
group of dictionaries (<xref linkend="textsearch-dictionaries">). At the
end of this step we obtain <emphasis>lexemes</emphasis>. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are treated as "stop words"
(<xref linkend="textsearch-stopwords">) and ignored since they occur too
frequently and have little informational value. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
The punctuation sign <literal>-</literal> was also ignored because its
type (<literal>Space symbols</literal>) is not indexed. All information
about the parser, dictionaries and what types of lexemes to index is
documented in the full text configuration section (<xref
linkend="textsearch-tables-configuration">). It is possible to have
several different configurations in the same database, and many predefined
system configurations are available for different languages. In our example
we used the default configuration <literal>english</literal> for the
English language.
</para>
<para>
As another example, below is the output from the <function>ts_debug</function>
function (<xref linkend="textsearch-debugging">), which shows all details
of the full text machinery:
<programlisting>
SELECT * FROM ts_debug('english','a fat cat sat on a mat - it ate a fat rats');
Alias | Description | Token | Dictionaries | Lexized token
-------+---------------+-------+--------------+----------------
lword | Latin word | a | {english} | english: {}
blank | Space symbols | | |
lword | Latin word | fat | {english} | english: {fat}
blank | Space symbols | | |
lword | Latin word | cat | {english} | english: {cat}
blank | Space symbols | | |
lword | Latin word | sat | {english} | english: {sat}
blank | Space symbols | | |
lword | Latin word | on | {english} | english: {}
blank | Space symbols | | |
lword | Latin word | a | {english} | english: {}
blank | Space symbols | | |
lword | Latin word | mat | {english} | english: {mat}
blank | Space symbols | | |
blank | Space symbols | - | |
lword | Latin word | it | {english} | english: {}
blank | Space symbols | | |
lword | Latin word | ate | {english} | english: {ate}
blank | Space symbols | | |
lword | Latin word | a | {english} | english: {}
blank | Space symbols | | |
lword | Latin word | fat | {english} | english: {fat}
blank | Space symbols | | |
lword | Latin word | rats | {english} | english: {rat}
(24 rows)
</programlisting>
</para>
<para>
Function <function>setweight()</function> is used to label the entries of a
<type>tsvector</type> with a given weight. The typical usage is to mark
entries coming from different parts of a document, perhaps by importance.
Later, this can be
used for ranking of search results in addition to positional information
(distance between query terms). If no ranking is required, positional
information can be removed from <type>tsvector</type> using the
<function>strip()</function> function to save space.
</para>
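<para>
A quick illustration of both functions:
<programlisting>
SELECT setweight(to_tsvector('english', 'fat cats'), 'A');
     setweight
-------------------
 'cat':2A 'fat':1A

SELECT strip(to_tsvector('english', 'fat cats'));
    strip
-------------
 'cat' 'fat'
</programlisting>
</para>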
<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) can
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function>. Here is the safe method for creating a
<type>tsvector</type> from a structured document:
<programlisting>
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A') ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</programlisting>
</para>
<para>
The following functions allow manual parsing control:
<variablelist>
<varlistentry>
<indexterm zone="textsearch-parser">
<primary>text search</primary>
<secondary>parse</secondary>
</indexterm>
<term>
<synopsis>
ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF <type>tokenout</type>
</synopsis>
</term>
<listitem>
<para>
Parses the given <replaceable>document</replaceable> and returns a series
of records, one for each token produced by parsing. Each record includes
a <varname>tokid</varname> giving its type and a <varname>token</varname>
which gives its content:
<programlisting>
SELECT * FROM ts_parse('default','123 - a number');
tokid | token
-------+--------
22 | 123
12 |
12 | -
1 | a
12 |
1 | number
</programlisting>
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm zone="textsearch-parser">
<primary>text search</primary>
<secondary>ts_token_type</secondary>
</indexterm>
<term>
<synopsis>
ts_token_type(<replaceable class="PARAMETER">parser</replaceable>) returns SETOF <type>tokentype</type>
</synopsis>
</term>
<listitem>
<para>
Returns a table which describes each kind of token the
<replaceable>parser</replaceable> might produce as output. For each token
type the table gives the <varname>tokid</varname> which the
<replaceable>parser</replaceable> uses to label each
<varname>token</varname> of that type, the <varname>alias</varname> which
names the token type, and a short <varname>description</varname>:
<programlisting>
SELECT * FROM ts_token_type('default');
tokid | alias | description
-------+--------------+-----------------------------------
1 | lword | Latin word
2 | nlword | Non-latin word
3 | word | Word
4 | email | Email
5 | url | URL
6 | host | Host
7 | sfloat | Scientific notation
8 | version | VERSION
9 | part_hword | Part of hyphenated word
10 | nlpart_hword | Non-latin part of hyphenated word
11 | lpart_hword | Latin part of hyphenated word
12 | blank | Space symbols
13 | tag | HTML Tag
14 | protocol | Protocol head
15 | hword | Hyphenated word
16 | lhword | Latin hyphenated word
17 | nlhword | Non-latin hyphenated word
18 | uri | URI
19 | file | File or path name
20 | float | Decimal notation
21 | int | Signed integer
22 | uint | Unsigned integer
23 | entity | HTML Entity
</programlisting>
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
</sect2>
<sect2 id="textsearch-ranking">
<title>Ranking Search Results</title>
<para>
Ranking attempts to measure how relevant documents are to a particular
query by inspecting the number of times each search word appears in the
document, and whether different search terms occur near each other. Full
text searching provides two predefined ranking functions which attempt to
produce a measure of how relevant a document is to the query. However,
the concept of relevancy is vague and very application-specific.
These functions try to take into account lexical, proximity, and structural
information. Different applications might require additional information
for ranking, e.g. document modification time.
</para>
<para>
The lexical part of ranking reflects how often the query terms appear in
the document, how close together they occur, and in what part of
the document they occur. Note that ranking functions that use positional
information will only work on unstripped tsvectors because stripped
tsvectors lack positional information.
</para>
<para>
The two ranking functions currently available are:
<variablelist>
<varlistentry>
<indexterm zone="textsearch-ranking">
<primary>text search</primary>
<secondary>ts_rank</secondary>
</indexterm>
<term>
<synopsis>
ts_rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[]</optional>, <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
</synopsis>
</term>
<listitem>
<para>
This ranking function offers the ability to weigh word instances more
heavily depending on how you have classified them. The weights specify
how heavily to weigh each category of word:
<programlisting>
{D-weight, C-weight, B-weight, A-weight}
</programlisting>
If no weights are provided,
then these defaults are used:
<programlisting>
{0.1, 0.2, 0.4, 1.0}
</programlisting>
Often weights are used to mark words from special areas of the document,
like the title or an initial abstract, and make them more or less important
than words in the document body.
</para>
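<para>
A minimal usage sketch (reusing the <literal>apod</> table and
<literal>textsearch</> column from the examples below):
<programlisting>
SELECT title, ts_rank(textsearch, to_tsquery('neutrino')) AS rnk
FROM apod
WHERE textsearch @@ to_tsquery('neutrino')
ORDER BY rnk DESC LIMIT 5;
</programlisting>
</para>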
</listitem>
</varlistentry>
<varlistentry>
<indexterm zone="textsearch-ranking">
<primary>text search</primary>
<secondary>ts_rank_cd</secondary>
</indexterm>
<term>
<synopsis>
ts_rank_cd(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
</synopsis>
</term>
<listitem>
<para>
This function computes the <emphasis>cover density</emphasis> ranking for
the given document vector and query, as described in Clarke, Cormack, and
Tudhope's "Relevance Ranking for One to Three Term Queries" in the
journal "Information Processing and Management", 1999.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
<para>
Since a longer document has a greater chance of containing a query term
it is reasonable to take into account document size, i.e. a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
specifies whether a document's length should impact its rank. The integer
option is a bit mask: several behaviors can be selected at once by
combining them with <literal>|</literal> (for example, <literal>2|4</literal>):
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
0 (the default) ignores the document length
</para>
</listitem>
<listitem>
<para>
1 divides the rank by 1 + the logarithm of the document length
</para>
</listitem>
<listitem>
<para>
2 divides the rank by the length itself
</para>
</listitem>
<listitem>
<para>
<!-- what is mean harmonic distance -->
4 divides the rank by the mean harmonic distance between extents
</para>
</listitem>
<listitem>
<para>
8 divides the rank by the number of unique words in the document
</para>
</listitem>
<listitem>
<para>
16 divides the rank by 1 + the logarithm of the number of unique words in the document
</para>
</listitem>
</itemizedlist>
</para>
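<para>
For example, to divide the rank both by the document length (option 2) and
by the mean harmonic distance between extents (option 4):
<programlisting>
SELECT title, ts_rank_cd(textsearch, query, 2|4) AS rnk
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
</programlisting>
</para>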
<para>
It is important to note that ranking functions do not use any global
information so it is impossible to produce a fair normalization to 1% or
100%, as sometimes required. However, a simple technique like
<literal>rank/(rank+1)</literal> can be applied. Of course, this is just
a cosmetic change, i.e., the ordering of the search results will not change.
</para>
<para>
Several examples are shown below; note that the second example uses
normalized ranking:
<programlisting>
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) AS rnk
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
title | rnk
-----------------------------------------------+----------
Neutrinos in the Sun | 3.1
The Sudbury Neutrino Detector | 2.4
A MACHO View of Galactic Dark Matter | 2.01317
Hot Gas and Dark Matter | 1.91171
The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953
Rafting for Solar Neutrinos | 1.9
NGC 4650A: Strange Galaxy and Dark Matter | 1.85774
Hot Gas and Dark Matter | 1.6123
Ice Fishing for Cosmic Neutrinos | 1.6
Weak Lensing Distorts the Universe | 0.818218
SELECT title, ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query)/
(ts_rank_cd('{0.1, 0.2, 0.4, 1.0}',textsearch, query) + 1) AS rnk
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rnk DESC LIMIT 10;
title | rnk
-----------------------------------------------+-------------------
Neutrinos in the Sun | 0.756097569485493
The Sudbury Neutrino Detector | 0.705882361190954
A MACHO View of Galactic Dark Matter | 0.668123210574724
Hot Gas and Dark Matter | 0.65655958650282
The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
Rafting for Solar Neutrinos | 0.655172410958162
NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637
Hot Gas and Dark Matter | 0.617195790024749
Ice Fishing for Cosmic Neutrinos | 0.615384618911517
Weak Lensing Distorts the Universe | 0.450010798361481
</programlisting>
</para>
<para>
The first argument in <function>ts_rank_cd</function> (<literal>'{0.1, 0.2,
0.4, 1.0}'</literal>) is an optional parameter which specifies the
weights for labels <literal>D</literal>, <literal>C</literal>,
<literal>B</literal>, and <literal>A</literal> used in function
<function>setweight</function>. These default values show that lexemes
labeled as <literal>A</literal> are ten times more important than ones
that are labeled with <literal>D</literal>.
</para>
<para>
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, this is almost impossible to avoid, because
the positional information ranking needs is available only in the
documents' <type>tsvector</>s, not in the index. Moreover, an index can be
lossy (a <acronym>GiST</acronym> index, for example) so it must recheck
documents to avoid false hits.
</para>
<para>
Note that the ranking functions above are only examples. You can write
your own ranking functions and/or combine additional factors to fit your
specific needs.
</para>
</sect2>
<sect2 id="textsearch-headline">
<title>Highlighting Results</title>
<indexterm zone="textsearch-headline">
<primary>text search</primary>
<secondary>headline</secondary>
</indexterm>
<para>
To present search results it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <productname>PostgreSQL</> full
text searching provides the function <function>ts_headline</function> that
implements such functionality.
</para>
<variablelist>
<varlistentry>
<term>
<synopsis>
ts_headline(<optional> <replaceable class="PARAMETER">config_name</replaceable> text</optional>, <replaceable class="PARAMETER">document</replaceable> text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">options</replaceable> text </optional>) returns text
</synopsis>
</term>
<listitem>
<para>
The <function>ts_headline</function> function accepts a document along with
a query, and returns one or more ellipsis-separated excerpts from the
document in which terms from the query are highlighted. The configuration
used to parse the document can be specified by its
<replaceable>config_name</replaceable>; if none is specified, the current
configuration is used.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
If an <replaceable>options</replaceable> string is specified it should
consist of a comma-separated list of one or more 'option=value' pairs.
The available options are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>StartSel</>, <literal>StopSel</literal>: the strings with which
query words appearing in the document should be delimited to distinguish
them from other excerpted words.
</para>
</listitem>
<listitem >
<para>
<literal>MaxWords</>, <literal>MinWords</literal>: the longest and
shortest allowed headlines, in words
</para>
</listitem>
<listitem>
<para>
<literal>ShortWord</literal>: this prevents your headline from beginning
or ending with a word which has this many characters or less. The default
value of three eliminates the English articles.
</para>
</listitem>
<listitem>
<para>
<literal>HighlightAll</literal>: boolean flag; if
<literal>true</literal> the whole document will be highlighted
</para>
</listitem>
</itemizedlist>
Any unspecified options receive these defaults:
<programlisting>
StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
</programlisting>
</para>
<para>
For example:
<programlisting>
SELECT ts_headline('a b c', 'c'::tsquery);
ts_headline
--------------
a b &lt;b&gt;c&lt;/b&gt;
SELECT ts_headline('a b c', 'c'::tsquery, 'StartSel=&lt;,StopSel=&gt;');
ts_headline
-------------
a b &lt;c&gt;
</programlisting>
</para>
<para>
<function>ts_headline</> uses the original document, not the
<type>tsvector</type>, so it can be slow and should be used with care.
A typical mistake is to call <function>ts_headline()</function> for
<emphasis>every</emphasis> matching document when only ten documents are
shown. <acronym>SQL</acronym> subselects can help here; below is an
example:
<programlisting>
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
      FROM apod, to_tsquery('stars') q
      WHERE ti @@ q
      ORDER BY rank DESC LIMIT 10) AS foo;
</programlisting>
</para>
<para>
Note that a cascaded drop of the <function>parser</function> function also
drops the full text search configuration
<replaceable>config_name</replaceable> that depends on it, and with it the
ability of <function>ts_headline</function> to use that configuration.
</para>
</sect2>
</sect1>
<sect1 id="textsearch-dictionaries">
<title>Dictionaries</title>
<para>
Dictionaries are used to eliminate words that should not be considered in a
search (<firstterm>stop words</>), and to <firstterm>normalize</> words so
that different derived forms of the same word will match. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the <type>tsvector</type> representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
</para>
<para>
Some examples of normalization:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Linguistic - ispell dictionaries try to reduce input words to a
normalized form; stemmer dictionaries remove word endings
</para>
</listitem>
<listitem>
<para>
Identical <acronym>URL</acronym> locations are identified and canonicalized:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
http://www.pgsql.ru/db/mw/index.html
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/mw/
</para>
</listitem>
<listitem>
<para>
http://www.pgsql.ru/db/../db/mw/index.html
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
Color names are replaced by their hexadecimal values, e.g.,
<literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
</para>
</listitem>
<listitem>
<para>
Remove some numeric fractional digits to reduce the range of possible
numbers, so <emphasis>3.14</emphasis>159265359,
<emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
after normalization if only two digits are kept after the decimal point.
</para>
</listitem>
</itemizedlist>
</para>
<para>
A dictionary is a <emphasis>program</emphasis> which accepts lexemes as
input and returns:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
an array of lexemes if the input lexeme is known to the dictionary
</para>
</listitem>
<listitem>
<para>
an empty array if the dictionary knows the lexeme, but it is a stop word
</para>
</listitem>
<listitem>
<para>
<literal>NULL</literal> if the dictionary does not recognize the input lexeme
</para>
</listitem>
</itemizedlist>
</para>
<para>
Full text searching provides predefined dictionaries for many languages,
and <acronym>SQL</acronym> commands to manipulate them. There are also
several predefined template dictionaries that can be used to create new
dictionaries by overriding their default parameters. Besides this, it is
possible to develop custom dictionaries using an <acronym>API</acronym>;
see the dictionary for integers (<xref
linkend="textsearch-rule-dictionary-example">) as an example.
</para>
<para>
The <literal>ALTER TEXT SEARCH CONFIGURATION ADD
MAPPING</literal> command binds specific types of lexemes and a set of
dictionaries to process them. (Mappings can also be specified as part of
configuration creation.) Each lexeme is processed by a stack of dictionaries
until some dictionary identifies it as a known word or it turns out to be
a stop word. If no dictionary recognizes a lexeme, it will be discarded
and not indexed. A general rule for configuring a stack of dictionaries
is to place the narrowest, most specific dictionary first, then the more
general dictionaries, finishing with a very general dictionary, like
the <application>snowball</> stemmer or <literal>simple</>, which
recognizes everything. For example, for an astronomy-specific search
(<literal>astro_en</literal> configuration) one could bind
<type>lword</type> (latin word) with a synonym dictionary of astronomical
terms, a general English dictionary and a <application>snowball</> English
stemmer:
<programlisting>
ALTER TEXT SEARCH CONFIGURATION astro_en
ADD MAPPING FOR lword WITH astrosyn, english_ispell, english_stem;
</programlisting>
</para>
<para>
Function <function>ts_lexize</function> can be used to test dictionaries,
for example:
<programlisting>
SELECT ts_lexize('english_stem', 'stars');
ts_lexize
-----------
{star}
(1 row)
</programlisting>
Also, the <function>ts_debug</function> function (<xref linkend="textsearch-debugging">)
can be used for this.
</para>
<sect2 id="textsearch-stopwords">
<title>Stop Words</title>
<para>
Stop words are words which are very common, appear in almost
every document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text contains
words like <literal>a</literal>, though it is useless to store them in an index.
However, stop words do affect the positions in <type>tsvector</type>,
which in turn, do affect ranking:
<programlisting>
SELECT to_tsvector('english','in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
</programlisting>
The gaps between positions 1-3 and 3-5 are because of stop words, so ranks
calculated for documents with and without stop words are quite different:
<programlisting>
SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','in the list of stop words'), to_tsquery('list &amp; stop'));
ts_rank_cd
------------
0.5
SELECT ts_rank_cd ('{1,1,1,1}', to_tsvector('english','list stop words'), to_tsquery('list &amp; stop'));
ts_rank_cd
------------
1
</programlisting>
</para>
<para>
It is up to the specific dictionary how it treats stop words. For example,
<literal>ispell</literal> dictionaries first normalize words and then
look at the list of stop words, while <literal>stemmers</literal>
first check the list of stop words. The reason for the different
behavior is an attempt to decrease possible noise.
</para>
<para>
Here is an example of a dictionary that returns the input word in lowercase,
or an empty array if it is a stop word; it also specifies the name
of a file of stop words. It uses the <literal>simple</> dictionary as
a template:
<programlisting>
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);
</programlisting>
Now we can test our dictionary:
<programlisting>
SELECT ts_lexize('public.simple_dict','YeS');
ts_lexize
-----------
{yes}
SELECT ts_lexize('public.simple_dict','The');
ts_lexize
-----------
{}
</programlisting>
</para>
<caution>
<para>
Most types of dictionaries rely on configuration files, such as files of stop
words. These files <emphasis>must</> be stored in UTF-8 encoding. They will
be translated to the actual database encoding, if that is different, when they
are read into the server.
</para>
</caution>
</sect2>
<sect2 id="textsearch-synonym-dictionary">
<title>Synonym Dictionary</title>
<para>
This dictionary template is used to create dictionaries which replace a
word with a synonym. Phrases are not supported (use the thesaurus
dictionary (<xref linkend="textsearch-thesaurus">) for that). A synonym
dictionary can be used to overcome linguistic problems, for example, to
prevent an English stemmer dictionary from reducing the word 'Paris' to
'pari'. It is enough to have a <literal>Paris paris</literal> line in the
synonym dictionary and put it before the <literal>english_stem</> dictionary:
<programlisting>
SELECT * FROM ts_debug('english','Paris');
Alias | Description | Token | Dictionaries | Lexized token
-------+-------------+-------+----------------+----------------------
lword | Latin word | Paris | {english_stem} | english_stem: {pari}
(1 row)
CREATE TEXT SEARCH DICTIONARY synonym
(TEMPLATE = synonym, SYNONYMS = my_synonyms);
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR lword WITH synonym, english_stem;
SELECT * FROM ts_debug('english','Paris');
Alias | Description | Token | Dictionaries | Lexized token
-------+-------------+-------+------------------------+------------------
lword | Latin word | Paris | {synonym,english_stem} | synonym: {paris}
(1 row)
</programlisting>
</para>
</sect2>
<sect2 id="textsearch-thesaurus">
<title>Thesaurus Dictionary</title>
<para>
A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
a collection of words which includes information about the relationships
of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
terms, etc.
</para>
<para>
Basically a thesaurus dictionary replaces all non-preferred terms by one
preferred term and, optionally, preserves them for indexing. Thesauruses
are used during indexing so any change in the thesaurus <emphasis>requires</emphasis>
reindexing. The current implementation of the thesaurus
dictionary is an extension of the synonym dictionary with added
<emphasis>phrase</emphasis> support. A thesaurus dictionary requires
a configuration file of the following format:
<programlisting>
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
</programlisting>
where the colon (<symbol>:</symbol>) symbol acts as a delimiter between
a phrase and its replacement.
</para>
<para>
A thesaurus dictionary uses a <emphasis>subdictionary</emphasis> (which
is defined in the dictionary's configuration) to normalize the input text
before checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails to
recognize a word. In that case, you should remove the use of the word or teach
the subdictionary about it. Use an asterisk (<symbol>*</symbol>) at the
beginning of an indexed word to skip the subdictionary. It is still required
that all sample words be known to the subdictionary.
</para>
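<para>
For example, using the <literal>thesaurus_astro</> entries shown later, a
line like this would keep the replacement from being passed through the
subdictionary:
<programlisting>
supernovae stars : *sn
</programlisting>
</para>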
<para>
The thesaurus dictionary looks for the longest match.
</para>
<para>
Stop words recognized by the subdictionary are replaced by a 'stop word
placeholder' to record their position. To break possible ties the thesaurus
uses the last definition. To illustrate this, consider a thesaurus (with
a <parameter>simple</parameter> subdictionary) with pattern
<replaceable>swsw</>, where <replaceable>s</> designates any stop word and
<replaceable>w</>, any known word:
<programlisting>
a one the two : swsw
the one a two : swsw2
</programlisting>
Words <literal>a</> and <literal>the</> are stop words defined in the
configuration of a subdictionary. The thesaurus considers <literal>the
one the two</literal> and <literal>that one then two</literal> as equal
and will use definition <replaceable>swsw2</>.
</para>
<para>
Like any normal dictionary, it can be assigned to specific lexeme types.
Since a thesaurus dictionary has the capability to recognize phrases it
must remember its state and interact with the parser. A thesaurus dictionary
uses these assignments to check if it should handle the next word or stop
accumulation. The thesaurus dictionary compiler must be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the <token>lword</token> lexeme, then a thesaurus dictionary
definition like <literal>one 7</literal> will not work since lexeme type
<token>digit</token> is not assigned to the thesaurus dictionary.
</para>
</sect2>
<sect2 id="textsearch-thesaurus-config">
<title>Thesaurus Configuration</title>
<para>
To define a new thesaurus dictionary one can use the thesaurus template.
For example:
<programlisting>
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);
</programlisting>
Here:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>thesaurus_simple</literal> is the thesaurus dictionary name
</para>
</listitem>
<listitem>
<para>
<literal>mythesaurus</literal> is the base name of the thesaurus file
(its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</>,
where <literal>$SHAREDIR</> means the installation shared-data directory,
often <filename>/usr/local/share</>).
</para>
</listitem>
<listitem>
<para>
<literal>pg_catalog.english_stem</literal> is the dictionary (Snowball
English stemmer) to use for thesaurus normalization. Notice that the
<literal>english_stem</> dictionary has its own configuration (for example,
stop words), which is not shown here.
</para>
</listitem>
</itemizedlist>
Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
and selected <literal>tokens</literal>, for example:
<programlisting>
ALTER TEXT SEARCH CONFIGURATION russian
ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_simple;
</programlisting>
</para>
</sect2>
<sect2 id="textsearch-thesaurus-examples">
<title>Thesaurus Example</title>
<para>
Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
which contains some astronomical word combinations:
<programlisting>
supernovae stars : sn
crab nebulae : crab
</programlisting>
Below we create a dictionary and bind some token types with
an astronomical thesaurus and an English stemmer:
<programlisting>
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);
ALTER TEXT SEARCH CONFIGURATION russian
ADD MAPPING FOR lword, lhword, lpart_hword WITH thesaurus_astro, english_stem;
</programlisting>
Now we can see how it works. Note that <function>ts_lexize</function> cannot
be used for testing the thesaurus (see description of
<function>ts_lexize</function>), but we can use
<function>plainto_tsquery</function> and <function>to_tsvector</function>
which accept <literal>text</literal> arguments, not lexemes:
<programlisting>
SELECT plainto_tsquery('supernova star');
plainto_tsquery
-----------------
'sn'
SELECT to_tsvector('supernova star');
to_tsvector
-------------
'sn':1
</programlisting>
In principle, one can use <function>to_tsquery</function> by quoting
the argument:
<programlisting>
SELECT to_tsquery('''supernova star''');
to_tsquery
------------
'sn'
</programlisting>
Notice that <literal>supernova star</literal> matches <literal>supernovae
stars</literal> in <literal>thesaurus_astro</literal> because we specified the
<literal>english_stem</literal> stemmer in the thesaurus definition.
</para>
<para>
To keep an original phrase in full text indexing just add it to the right part
of the definition:
<programlisting>
supernovae stars : sn supernovae stars
SELECT plainto_tsquery('supernova star');
plainto_tsquery
-----------------------------
'sn' &amp; 'supernova' &amp; 'star'
</programlisting>
</para>
</sect2>
<sect2 id="textsearch-ispell-dictionary">
<title>Ispell Dictionary</title>
<para>
The <application>Ispell</> template dictionary for full text allows the
creation of morphological dictionaries based on <ulink
url="http://ficus-www.cs.ucla.edu/geoff/ispell.html">Ispell</ulink>, which
supports a large number of languages. This dictionary tries to change an
input word to its normalized form. More modern spelling dictionaries are
also supported - <ulink
url="http://en.wikipedia.org/wiki/MySpell">MySpell</ulink>
(<productname>OpenOffice.org</> versions before 2.0.1) and <ulink
url="http://sourceforge.net/projects/hunspell">Hunspell</ulink>
(<productname>OpenOffice.org</> 2.0.2 and later). A large list of
dictionaries is available on the <ulink
url="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
Wiki</ulink>.
</para>
<para>
The <application>Ispell</> dictionary allows searches without bothering
about different linguistic forms of a word. For example, a search on
<literal>bank</literal> would return hits of all declensions and
conjugations of the search term <literal>bank</literal>, e.g.
<literal>banking</>, <literal>banked</>, <literal>banks</>,
<literal>banks'</>, and <literal>bank's</>.
<programlisting>
SELECT ts_lexize('english_ispell','banking');
ts_lexize
-----------
{bank}
SELECT ts_lexize('english_ispell','bank''s');
ts_lexize
-----------
{bank}
SELECT ts_lexize('english_ispell','banked');
ts_lexize
-----------
{bank}
</programlisting>
</para>
<para>
To create an ispell dictionary one should use the built-in
<literal>ispell</literal> template and specify several
parameters:
</para>
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
</programlisting>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
specify the names of the dictionary, affixes, and stop-words files.
</para>
<para>
Ispell dictionaries usually recognize a restricted set of words so they
should be used in conjunction with another broader dictionary; for
example, a stemming dictionary, which recognizes everything.
</para>
<para>
Ispell dictionaries support splitting compound words based on an
ispell dictionary; full text searching
in <productname>PostgreSQL</productname> supports this feature.
Notice that the affix file should specify a special flag using the
<literal>compoundwords controlled</literal> statement that marks dictionary
words that can participate in compound formation:
<programlisting>
compoundwords controlled z
</programlisting>
Several examples for the Norwegian language:
<programlisting>
SELECT ts_lexize('norwegian_ispell','overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
</programlisting>
</para>
<note>
<para>
<application>MySpell</> does not support compound words.
<application>Hunspell</> has sophisticated support for compound words. At
present, full text searching implements only the basic compound word
operations of Hunspell.
</para>
</note>
</sect2>
<sect2 id="textsearch-stemming-dictionary">
<title><application>Snowball</> Stemming Dictionary</title>
<para>
The <application>Snowball</> dictionary template is based on the project
of Martin Porter, inventor of the popular Porter stemming algorithm
for the English language (see the <ulink
url="http://snowball.tartarus.org">Snowball site</ulink> for more
information). The Snowball project supplies a large number of stemmers for
many languages. A Snowball dictionary requires a language parameter to
identify which stemmer to use, and optionally can specify a stopword file name.
For example, there is a built-in definition equivalent to
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball, Language = english, StopWords = english
);
</programlisting>
</para>
<para>
The <application>Snowball</> dictionary recognizes everything, so it is best
to place it at the end of the dictionary stack. It is useless to have it
before any other dictionary because a lexeme will never pass through it to
the next dictionary.
</para>
</sect2>
<sect2 id="textsearch-dictionary-testing">
<title>Dictionary Testing</title>
<para>
The <function>ts_lexize</> function facilitates dictionary testing:
<variablelist>
<varlistentry>
<indexterm zone="textsearch-dictionaries">
<primary>text search</primary>
<secondary>ts_lexize</secondary>
</indexterm>
<term>
<synopsis>
ts_lexize(<replaceable class="PARAMETER">dict_name</replaceable> text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
</synopsis>
</term>
<listitem>
<para>
Returns an array of lexemes if the input <replaceable>lexeme</replaceable>
is known to the dictionary <replaceable>dict_name</replaceable>, or an empty
array if the lexeme is known to the dictionary but is a stop word, or
<literal>NULL</literal> if it is an unknown word.
</para>
<programlisting>
SELECT ts_lexize('english_stem', 'stars');
ts_lexize
-----------
{star}
SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}
</programlisting>
</listitem>
</varlistentry>
</variablelist>
</para>
<note>
<para>
The <function>ts_lexize</function> function expects a
<replaceable>lexeme</replaceable>, not text. Below is an example:
<programlisting>
SELECT ts_lexize('thesaurus_astro','supernovae stars') is null;
?column?
----------
t
</programlisting>
The thesaurus dictionary <literal>thesaurus_astro</literal> does know
<literal>supernovae stars</literal>, but <function>ts_lexize</> fails since it
does not parse the input text, treating it instead as a single lexeme. Use
<function>plainto_tsquery</> and <function>to_tsvector</> to test thesaurus
dictionaries:
<programlisting>
SELECT plainto_tsquery('supernovae stars');
plainto_tsquery
-----------------
'sn'
</programlisting>
</para>
</note>
</sect2>
<sect2 id="textsearch-tables-configuration">
<title>Configuration Example</title>
<para>
A full text configuration specifies all options necessary to transform a
document into a <type>tsvector</type>: the parser breaks text into tokens,
and the dictionaries transform each token into a lexeme. Every call to
<function>to_tsvector()</function> and <function>to_tsquery()</function>
needs a configuration to perform its processing. To facilitate management
of full text searching objects, a set of <acronym>SQL</acronym> commands
is available, and there are several <application>psql</> commands which display information
about full text searching objects (<xref linkend="textsearch-psql">).
</para>
<para>
The configuration parameter
<xref linkend="guc-default-text-search-config">
specifies the name of the current default configuration, which is the
one used by text search functions when an explicit configuration
parameter is omitted.
It can be set in <filename>postgresql.conf</filename>, or set for an
individual session using the <command>SET</> command.
</para>
<para>
Several predefined text searching configurations are available in the
<literal>pg_catalog</literal> schema. If you need a custom configuration
you can create a new text searching configuration and modify it using SQL
commands.
</para>
<para>
New text searching objects are created in the current schema by default
(usually the <literal>public</literal> schema), but a schema-qualified
name can be used to create objects in the specified schema.
</para>
<para>
As an example, we will create a configuration
<literal>pg</literal> which starts as a duplicate of the
<literal>english</> configuration. To be safe, we do this in a transaction:
<programlisting>
BEGIN;
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = english );
</programlisting>
</para>
<para>
We will use a PostgreSQL-specific synonym list
and store it in <filename>share/tsearch_data/pg_dict.syn</filename>.
The file contents look like:
<programlisting>
postgres pg
pgsql pg
postgresql pg
</programlisting>
We define the dictionary like this:
<programlisting>
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict
);
</programlisting>
</para>
<para>
Then register the <productname>ispell</> dictionary
<literal>english_ispell</literal> using the <literal>ispell</literal> template:
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);
</programlisting>
</para>
<para>
Now modify mappings for Latin words for configuration <literal>pg</>:
<programlisting>
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR lword, lhword, lpart_hword
    WITH pg_dict, english_ispell, english_stem;
</programlisting>
</para>
<para>
We do not index or search some tokens:
<programlisting>
ALTER TEXT SEARCH CONFIGURATION pg
DROP MAPPING FOR email, url, sfloat, uri, float;
</programlisting>
</para>
<para>
Now, we can test our configuration:
<programlisting>
SELECT * FROM ts_debug('public.pg', '
PostgreSQL, the highly scalable, SQL compliant, open source object-relational
database management system, is now undergoing beta testing of the next
version of our software: PostgreSQL 8.3.
');
COMMIT;
</programlisting>
</para>
<para>
With the dictionaries and mappings set up, suppose we have a table
<literal>pgweb</literal> which contains 11239 documents from the
<productname>PostgreSQL</productname> web site. Only relevant columns
are shown:
<programlisting>
=&gt; \d pgweb
Table "public.pgweb"
Column | Type | Modifiers
-----------+-------------------+-----------
tid | integer | not null
path | character varying | not null
body | character varying |
title | character varying |
dlm | date |
</programlisting>
</para>
<para>
The next step is to set the session to use the new configuration, which was
created in the <literal>public</> schema:
<programlisting>
=&gt; \dF
List of fulltext configurations
 Schema | Name | Description
--------+------+-------------
 public | pg   |
SET default_text_search_config = 'public.pg';
SET
SHOW default_text_search_config;
default_text_search_config
----------------------------
public.pg
</programlisting>
</para>
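<para>
With the configuration in place, a simple search of the
<literal>pgweb</literal> table might look like this (a sketch; it parses
the <literal>body</literal> column on the fly, so no index is required):
<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('create &amp; table')
ORDER BY dlm DESC LIMIT 10;
</programlisting>
Because <varname>default_text_search_config</varname> is now
<literal>public.pg</literal>, the configuration argument of
<function>to_tsvector</function> and <function>to_tsquery</function> can
be omitted.
</para>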
</sect2>
</sect1>
<sect1 id="textsearch-indexes">
<title>GiST and GIN Index Types</title>
<indexterm zone="textsearch-indexes">
<primary>text search</primary>
<secondary>index</secondary>
</indexterm>
<para>
There are two kinds of indexes which can be used to speed up full text
operators (<xref linkend="textsearch-searches">).
Note that indexes are not mandatory for full text searching.
<variablelist>
<varlistentry>
<indexterm zone="textsearch-indexes">
<primary>text search</primary>
<secondary>GiST</secondary>
</indexterm>
<!--
<indexterm zone="textsearch-indexes">
<primary>GiST</primary>
</indexterm>
-->
<term>
<synopsis>
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gist(<replaceable>column</replaceable>);
</synopsis>
</term>
<listitem>
<para>
Creates a GiST (Generalized Search Tree)-based index.
The <replaceable>column</replaceable> can be of <type>tsvector</> or
<type>tsquery</> type.
</para>
</listitem>
</varlistentry>
<varlistentry>
<indexterm zone="textsearch-indexes">
<primary>text search</primary>
<secondary>GIN</secondary>
</indexterm>
<!--
<indexterm zone="textsearch-indexes">
<primary>GIN</primary>
</indexterm>
-->
<term>
<synopsis>
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING gin(<replaceable>column</replaceable>);
</synopsis>
</term>
<listitem>
<para>
Creates a GIN (Generalized Inverted Index)-based index.
The <replaceable>column</replaceable> must be of <type>tsvector</> type.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
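<para>
For example, to index the <literal>pgweb</literal> table from the
previous section, one approach (a sketch, assuming the column names shown
there) is to materialize a <type>tsvector</> column and index it:
<programlisting>
ALTER TABLE pgweb ADD COLUMN textsearch tsvector;
UPDATE pgweb SET textsearch =
    to_tsvector(coalesce(title,'') || ' ' || coalesce(body,''));
CREATE INDEX pgweb_textsearch_idx ON pgweb USING gin(textsearch);
</programlisting>
</para>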
<para>
A GiST index is <firstterm>lossy</firstterm>, meaning it is necessary
to check the actual table row to eliminate false matches.
<productname>PostgreSQL</productname> does this automatically; for
example, in the query plan below, the <literal>Filter:</literal>
line indicates the index output will be rechecked:
<programlisting>
EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using textsearch_gidx on apod (cost=0.00..12.29 rows=2 width=1469)
Index Cond: (textsearch @@ '''supernova'''::tsquery)
Filter: (textsearch @@ '''supernova'''::tsquery)
</programlisting>
GiST index lossiness happens because each document is represented in the
index by a fixed-length signature. The signature is generated by hashing
(crc32) each word into a single bit of an n-bit string; OR-ing the bits
of all words produces an n-bit document signature. Because of hashing
there is a chance that two different words hash to the same position,
which can result in a false hit. Signatures calculated for each document
in a collection are stored in an <literal>RD-tree</literal> (Russian Doll
tree), invented by Hellerstein, which is an adaptation of the
<literal>R-tree</literal> to sets. In our case the transitive containment
relation is realized by superimposed coding (Knuth, 1973) of signatures,
i.e., a parent is the result of OR-ing the bit-strings of all its
children. This is a second source of lossiness. Parents tend to become
full of <literal>1</>s (degenerate) and thus provide little selectivity.
Searching is performed by comparing a signature representing the query
with an <literal>RD-tree</literal> entry: if every <literal>1</> bit of
the query signature is also set in the entry's signature, the branch
probably matches the query, but if even one query bit is missing, the
branch can definitely be rejected.
</para>
<para>
Lossiness causes serious performance degradation, since random access to
<literal>heap</literal> records is slow; this limits the usefulness of
GiST indexes. The likelihood of false hits depends on several factors,
in particular the number of unique words, so using dictionaries to reduce
this number is recommended.
</para>
<para>
Actually, this is not the whole story. GiST indexes have an optimization
for storing small tsvectors (those under <literal>TOAST_INDEX_TARGET</literal>
bytes, i.e. 512 bytes). On leaf pages small tsvectors are stored unchanged,
while longer ones are represented by their signatures, which introduces
some lossiness. Unfortunately, the existing index API does not allow for
a return value to say whether it found an exact value (tsvector) or whether
the result needs to be checked. This is why the GiST index is
currently marked as lossy. We hope to improve this in the future.
</para>
<para>
GIN indexes are not lossy but their performance depends logarithmically on
the number of unique words.
</para>
<para>
There is one side-effect of the non-lossiness of a GIN index when using
query labels/weights, like <literal>'supernovae:a'</literal>. A GIN index
has all the information necessary to determine a match, so the heap is
not accessed. However, label information is not stored in the index,
so if the query involves label weights the heap must be visited.
Therefore, a special full text search operator <literal>@@@</literal>
was created which forces the heap to be read to check labels. GiST
indexes are lossy, so the heap is always read and no special operator
is needed. In the example below,
<literal>textsearch_idx</literal> is a GIN index:<!-- why isn't this
automatic -->
<programlisting>
EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
QUERY PLAN
------------------------------------------------------------------------
Index Scan using textsearch_idx on apod (cost=0.00..12.30 rows=2 width=1469)
Index Cond: (textsearch @@@ '''supernova'':A'::tsquery)
Filter: (textsearch @@@ '''supernova'':A'::tsquery)
</programlisting>
</para>
<para>
In choosing which index type to use, GiST or GIN, consider these differences:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
GIN index lookups are about three times faster than GiST
</para>
</listitem>
<listitem>
<para>
GIN indexes take about three times longer to build than GiST
</para>
</listitem>
<listitem>
<para>
GIN is about ten times slower to update than GiST
</para>
</listitem>
<listitem>
<para>
GIN indexes are two-to-three times larger than GiST
</para>
</listitem>
</itemizedlist>
</para>
<para>
In summary, <acronym>GIN</acronym> indexes are best for static data
because lookups are faster. For dynamic data, GiST indexes are
faster to update. Specifically, <acronym>GiST</acronym> indexes are very
good for dynamic data and fast if the number of unique words (lexemes) is
under 100,000, while <acronym>GIN</acronym> handles 100,000+ lexemes better
but is slower to update.
</para>
<para>
Partitioning of big collections and the proper use of GiST and GIN indexes
allows the implementation of very fast searches with online update.
Partitioning can be done at the database level using table inheritance
and <varname>constraint_exclusion</>, or at the application level by
distributing documents over servers and collecting search results using
the <filename>contrib/dblink</> extension module. The latter is possible
because ranking functions use only local information.
</para>
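<para>
A minimal sketch of database-level partitioning (the <literal>docs</>
tables and columns here are hypothetical):
<programlisting>
CREATE TABLE docs (tid integer, dlm date, textsearch tsvector);
CREATE TABLE docs_2007 (
    CHECK (dlm &gt;= '2007-01-01' AND dlm &lt; '2008-01-01')
) INHERITS (docs);
CREATE INDEX docs_2007_idx ON docs_2007 USING gin(textsearch);
-- with constraint_exclusion enabled, a search qualified by dlm
-- scans only the partitions whose constraints can match
SET constraint_exclusion = on;
SELECT tid FROM docs
WHERE textsearch @@ to_tsquery('supernovae') AND dlm &gt;= '2007-01-01';
</programlisting>
</para>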
</sect1>
<sect1 id="textsearch-limitations">
<title>Limitations</title>
<para>
The current limitations of Full Text Searching are:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>The length of each lexeme must be less than 2K bytes </para>
</listitem>
<listitem>
<para>The length of a <type>tsvector</type> (lexemes + positions) must be less than 1 megabyte </para>
</listitem>
<listitem>
<para>The number of lexemes must be less than 2<superscript>64</superscript> </para>
</listitem>
<listitem>
<para>Positional information must be non-negative and less than 16,383 </para>
</listitem>
<listitem>
<para>No more than 256 positions per lexeme </para>
</listitem>
<listitem>
<para>The number of nodes (lexemes + operations) in tsquery must be less than 32,768 </para>
</listitem>
</itemizedlist>
</para>
<para>
For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
contained 10,441 unique words, a total of 335,420 words, and the most frequent
word <quote>postgresql</> was mentioned 6,127 times in 655 documents.
</para>
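<para>
Statistics like these can be gathered with the
<function>ts_stat</function> function (a sketch, assuming a table of
<type>tsvector</> values such as the <literal>pgweb.textsearch</literal>
column from the earlier example); this lists the most frequent lexemes:
<programlisting>
SELECT word, ndoc, nentry
FROM ts_stat('SELECT textsearch FROM pgweb')
ORDER BY nentry DESC, word
LIMIT 5;
</programlisting>
</para>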
<!-- TODO we need to put a date on these numbers? -->
<para>
Another example &mdash; the <productname>PostgreSQL</productname> mailing list
archives contained 910,989 unique words with 57,491,343 lexemes in 461,020
messages.
</para>
</sect1>
<sect1 id="textsearch-psql">
<title><application>psql</> Support</title>
<para>
Information about full text searching objects can be obtained
in <literal>psql</literal> using a set of commands:
<synopsis>
\dF{,d,p}<optional>+</optional> <optional>PATTERN</optional>
</synopsis>
An optional <literal>+</literal> produces more details.
</para>
<para>
The optional parameter <literal>PATTERN</literal> should be the name of
a full text searching object, optionally schema-qualified. If
<literal>PATTERN</literal> is not specified then information about all
visible objects will be displayed. <literal>PATTERN</literal> can be a
regular expression and can apply <emphasis>separately</emphasis> to schema
names and object names. The following examples illustrate this:
<programlisting>
=&gt; \dF *fulltext*
List of fulltext configurations
 Schema |     Name     | Description
--------+--------------+-------------
 public | fulltext_cfg |
</programlisting>
<programlisting>
=&gt; \dF *.fulltext*
  List of fulltext configurations
  Schema  |     Name     | Description
----------+--------------+-------------
 fulltext | fulltext_cfg |
 public   | fulltext_cfg |
</programlisting>
</para>
<variablelist>
<varlistentry>
<term>\dF[+] [PATTERN]</term>
<listitem>
<para>
List full text searching configurations (add "+" for more detail)
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> full text configurations will be
displayed.
</para>
<para>
<programlisting>
=&gt; \dF russian
List of fulltext configurations
Schema | Name | Description
------------+---------+-----------------------------------
pg_catalog | russian | default configuration for Russian
=&gt; \dF+ russian
Configuration "pg_catalog.russian"
Parser name: "pg_catalog.default"
Token | Dictionaries
--------------+-------------------------
email | pg_catalog.simple
file | pg_catalog.simple
float | pg_catalog.simple
host | pg_catalog.simple
hword | pg_catalog.russian_stem
int | pg_catalog.simple
lhword | public.tz_simple
lpart_hword | public.tz_simple
lword | public.tz_simple
nlhword | pg_catalog.russian_stem
nlpart_hword | pg_catalog.russian_stem
nlword | pg_catalog.russian_stem
part_hword | pg_catalog.simple
sfloat | pg_catalog.simple
uint | pg_catalog.simple
uri | pg_catalog.simple
url | pg_catalog.simple
version | pg_catalog.simple
word | pg_catalog.russian_stem
</programlisting>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>\dFd[+] [PATTERN]</term>
<listitem>
<para>
List full text dictionaries (add "+" for more detail).
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> dictionaries will be displayed.
</para>
<para>
<programlisting>
=&gt; \dFd
List of fulltext dictionaries
Schema | Name | Description
------------+------------+-----------------------------------------------------------
pg_catalog | danish | Snowball stemmer for danish language
pg_catalog | dutch | Snowball stemmer for dutch language
pg_catalog | english | Snowball stemmer for english language
pg_catalog | finnish | Snowball stemmer for finnish language
pg_catalog | french | Snowball stemmer for french language
pg_catalog | german | Snowball stemmer for german language
pg_catalog | hungarian | Snowball stemmer for hungarian language
pg_catalog | italian | Snowball stemmer for italian language
pg_catalog | norwegian | Snowball stemmer for norwegian language
pg_catalog | portuguese | Snowball stemmer for portuguese language
pg_catalog | romanian | Snowball stemmer for romanian language
pg_catalog | russian | Snowball stemmer for russian language
pg_catalog | simple | simple dictionary: just lower case and check for stopword
pg_catalog | spanish | Snowball stemmer for spanish language
pg_catalog | swedish | Snowball stemmer for swedish language
pg_catalog | turkish | Snowball stemmer for turkish language
</programlisting>
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>\dFp[+] [PATTERN]</term>
<listitem>
<para>
List full text parsers (add "+" for more detail)
</para>
<para>
By default (without <literal>PATTERN</literal>), information about
all <emphasis>visible</emphasis> full text parsers will be displayed.
</para>
<para>
<programlisting>
=&gt; \dFp
List of fulltext parsers
Schema | Name | Description
------------+---------+---------------------
pg_catalog | default | default word parser
(1 row)
=&gt; \dFp+
Fulltext parser "pg_catalog.default"
Method | Function | Description
-------------------+---------------------------+-------------
Start parse | pg_catalog.prsd_start |
Get next token | pg_catalog.prsd_nexttoken |
End parse | pg_catalog.prsd_end |
Get headline | pg_catalog.prsd_headline |
Get lexeme's type | pg_catalog.prsd_lextype |
Token's types for parser "pg_catalog.default"
Token name | Description
--------------+-----------------------------------
blank | Space symbols
email | Email
entity | HTML Entity
file | File or path name
float | Decimal notation
host | Host
hword | Hyphenated word
int | Signed integer
lhword | Latin hyphenated word
lpart_hword | Latin part of hyphenated word
lword | Latin word
nlhword | Non-latin hyphenated word
nlpart_hword | Non-latin part of hyphenated word
nlword | Non-latin word
part_hword | Part of hyphenated word
protocol | Protocol head
sfloat | Scientific notation
tag | HTML Tag
uint | Unsigned integer
uri | URI
url | URL
version | VERSION
word | Word
(23 rows)
</programlisting>
</para>
</listitem>
</varlistentry>
</variablelist>
</sect1>
<sect1 id="textsearch-debugging">
<title>Debugging</title>
<para>
Function <function>ts_debug</function> allows easy testing of your full text searching
configuration.
</para>
<synopsis>
ts_debug(<optional><replaceable class="PARAMETER">config_name</replaceable></optional>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF ts_debug
</synopsis>
<para>
<function>ts_debug</> displays information about every token of
<replaceable class="PARAMETER">document</replaceable> as produced by the
parser and processed by the configured dictionaries using the configuration
specified by <replaceable class="PARAMETER">config_name</replaceable>.
</para>
<para>
The <type>ts_debug</type> row type is defined as:
<programlisting>
CREATE TYPE ts_debug AS (
"Alias" text,
"Description" text,
"Token" text,
"Dictionaries" regdictionary[],
"Lexized token" text
);
</programlisting>
</para>
<para>
For a demonstration of how the <function>ts_debug</function> function
works, we first create a <literal>public.english</literal> configuration
and an ispell dictionary for the English language. You can skip this step
and instead experiment with the standard <literal>english</literal>
configuration.
</para>
<programlisting>
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = english,
AffFile = english,
StopWords = english
);
ALTER TEXT SEARCH CONFIGURATION public.english
ALTER MAPPING FOR lword WITH english_ispell, english_stem;
</programlisting>
<programlisting>
SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
 Alias |  Description  |    Token    |                  Dictionaries                   |            Lexized token
-------+---------------+-------------+--------------------------------------------------+---------------------------------------
 lword | Latin word    | The         | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
 blank | Space symbols |             |                                                  |
 lword | Latin word    | Brightest   | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {bright}
 blank | Space symbols |             |                                                  |
 lword | Latin word    | supernovaes | {public.english_ispell,pg_catalog.english_stem} | pg_catalog.english_stem: {supernova}
(5 rows)
</programlisting>
<para>
In this example, the word <literal>Brightest</> was recognized by the
parser as a <literal>Latin word</literal> (alias <literal>lword</literal>)
and came through the dictionaries <literal>public.english_ispell</> and
<literal>pg_catalog.english_stem</literal>. It was recognized by
<literal>public.english_ispell</literal>, which reduced it to the word
<literal>bright</literal>. The word <literal>supernovaes</literal> is
unknown to the <literal>public.english_ispell</literal> dictionary, so it
was passed to the next dictionary and, fortunately, was recognized (in
fact, <literal>pg_catalog.english_stem</literal> is a stemming dictionary
and recognizes everything; that is why it was placed at the end of the
dictionary stack).
</para>
<para>
The word <literal>The</literal> was recognized by the
<literal>public.english_ispell</literal> dictionary as a stop word
(<xref linkend="textsearch-stopwords">) and will not be indexed.
</para>
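<para>
This can be confirmed by inspecting the resulting <type>tsvector</>
directly (a sketch; the output follows from the processing described
above &mdash; note that the stop word still counts for position
numbering):
<programlisting>
SELECT to_tsvector('public.english', 'The Brightest supernovaes');
       to_tsvector
--------------------------
 'bright':2 'supernova':3
</programlisting>
</para>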
<para>
You can always explicitly specify which columns you want to see:
<programlisting>
SELECT "Alias", "Token", "Lexized token"
FROM ts_debug('public.english','The Brightest supernovaes');
 Alias |    Token    |            Lexized token
-------+-------------+---------------------------------------
 lword | The         | public.english_ispell: {}
 blank |             |
 lword | Brightest   | public.english_ispell: {bright}
 blank |             |
 lword | supernovaes | pg_catalog.english_stem: {supernova}
(5 rows)
</programlisting>
</para>
</sect1>
<sect1 id="textsearch-rule-dictionary-example">
<title>Example of Creating a Rule-Based Dictionary</title>
<para>
The motivation for this example dictionary is to control the indexing of
integers (signed and unsigned) and, consequently, to minimize the number
of unique words, which greatly affects the performance of searching.
</para>
<para>
The dictionary accepts two options:
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
The <literal>MAXLEN</literal> parameter specifies the maximum length of a
number considered a 'good' integer. The default value is 6.
</para>
</listitem>
<listitem>
<para>
The <literal>REJECTLONG</literal> parameter specifies whether a 'long'
integer should be indexed or treated as a stop word. If
<literal>REJECTLONG</literal>=<literal>FALSE</literal> (default),
the dictionary returns the prefix of the integer, truncated to length
<literal>MAXLEN</literal>. If
<literal>REJECTLONG</literal>=<literal>TRUE</literal>, the dictionary
considers a long integer a stop word.
</para>
</listitem>
</itemizedlist>
</para>
<para>
A similar idea can be applied to the indexing of decimal numbers, for
example, in the <literal>DecDict</literal> dictionary. The dictionary
accepts two options: the <literal>MAXLENFRAC</literal> parameter specifies
the maximum length of the fractional part considered as a 'good' decimal.
The default value is 3. The <literal>REJECTLONG</literal> parameter
controls whether a decimal number with a 'long' fractional part should be indexed
or treated as a stop word. If
<literal>REJECTLONG</literal>=<literal>FALSE</literal> (default),
the dictionary returns the decimal number with its fractional part
truncated to <literal>MAXLENFRAC</literal> digits. If
<literal>REJECTLONG</literal>=<literal>TRUE</literal>, the dictionary
considers the number a stop word. Notice that
<literal>REJECTLONG</literal>=<literal>FALSE</literal> allows the indexing
of 'shortened' numbers and search results will contain documents with
shortened numbers.
</para>
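<para>
Such a decimal dictionary (hypothetical; it is not implemented by the
code below) would behave like this with the default
<literal>MAXLENFRAC</literal> of 3:
<programlisting>
SELECT ts_lexize('decdict', '3.14159');
 ts_lexize
-----------
 {3.141}
</programlisting>
</para>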
<para>
Examples for <literal>intdict</literal> (assuming the dictionary, built
below, is already installed):
<programlisting>
SELECT ts_lexize('intdict', '11234567890');
 ts_lexize
-----------
 {112345}
</programlisting>
</para>
<para>
Now, we want to ignore long integers:
<programlisting>
ALTER TEXT SEARCH DICTIONARY intdict (
MAXLEN = 6, REJECTLONG = TRUE
);
SELECT ts_lexize('intdict', '11234567890');
ts_lexize
-----------
{}
</programlisting>
</para>
<para>
Create a <filename>contrib/dict_intdict</> directory containing the files
<filename>dict_tmpl.c</>, <filename>Makefile</>, and
<filename>dict_intdict.sql.in</> shown below, then build and install the
module:
<programlisting>
$ make &amp;&amp; make install
$ psql DBNAME &lt; dict_intdict.sql
</programlisting>
</para>
<para>
This is the <filename>dict_tmpl.c</> file:
</para>
<programlisting>
#include "postgres.h"
#include "utils/builtins.h"
#include "fmgr.h"
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
#include "utils/ts_locale.h"
#include "utils/ts_public.h"
#include "utils/ts_utils.h"
typedef struct {
int maxlen;
bool rejectlong;
} DictInt;
PG_FUNCTION_INFO_V1(dinit_intdict);
Datum dinit_intdict(PG_FUNCTION_ARGS);
Datum
dinit_intdict(PG_FUNCTION_ARGS) {
DictInt *d = (DictInt*)malloc( sizeof(DictInt) );
Map *cfg, *pcfg;
text *in;
if (!d)
elog(ERROR, "No memory");
memset(d, 0, sizeof(DictInt));
/* Your INIT code */
/* defaults */
d-&gt;maxlen = 6;
d-&gt;rejectlong = false;
if (PG_ARGISNULL(0) || PG_GETARG_POINTER(0) == NULL) /* no options */
PG_RETURN_POINTER(d);
in = PG_GETARG_TEXT_P(0);
parse_keyvalpairs(in, &amp;cfg);
PG_FREE_IF_COPY(in, 0);
pcfg=cfg;
while (pcfg-&gt;key)
{
if (strcasecmp("MAXLEN", pcfg-&gt;key) == 0)
d-&gt;maxlen=atoi(pcfg-&gt;value);
else if ( strcasecmp("REJECTLONG", pcfg-&gt;key) == 0)
{
if ( strcasecmp("true", pcfg-&gt;value) == 0 )
d-&gt;rejectlong=true;
else if ( strcasecmp("false", pcfg-&gt;value) == 0)
d-&gt;rejectlong=false;
else
elog(ERROR,"Unknown value: %s =&gt; %s", pcfg-&gt;key, pcfg-&gt;value);
}
else
elog(ERROR,"Unknown option: %s =&gt; %s", pcfg-&gt;key, pcfg-&gt;value);
pfree(pcfg-&gt;key);
pfree(pcfg-&gt;value);
pcfg++;
}
pfree(cfg);
PG_RETURN_POINTER(d);
}
PG_FUNCTION_INFO_V1(dlexize_intdict);
Datum dlexize_intdict(PG_FUNCTION_ARGS);
Datum
dlexize_intdict(PG_FUNCTION_ARGS)
{
DictInt *d = (DictInt*)PG_GETARG_POINTER(0);
char *in = (char*)PG_GETARG_POINTER(1);
char *txt = pnstrdup(in, PG_GETARG_INT32(2));
    TSLexeme *res = palloc0(sizeof(TSLexeme) * 2); /* palloc0 zeroes the nvariant and flags fields */
    /* Your lexize code */
res[1].lexeme = NULL;
if (PG_GETARG_INT32(2) &gt; d-&gt;maxlen)
{
if (d-&gt;rejectlong)
{ /* stop, return void array */
pfree(txt);
res[0].lexeme = NULL;
}
else
{ /* cut integer */
txt[d-&gt;maxlen] = '\0';
res[0].lexeme = txt;
}
}
else
res[0].lexeme = txt;
PG_RETURN_POINTER(res);
}
</programlisting>
<para>
This is the <literal>Makefile</literal>:
<programlisting>
subdir = contrib/dict_intdict
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
MODULE_big = dict_intdict
OBJS = dict_tmpl.o
DATA_built = dict_intdict.sql
DOCS =
include $(top_srcdir)/contrib/contrib-global.mk
</programlisting>
</para>
<para>
This is the <literal>dict_intdict.sql.in</literal> file:
<programlisting>
SET default_text_search_config = 'english';
BEGIN;
CREATE OR REPLACE FUNCTION dinit_intdict(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C';
CREATE OR REPLACE FUNCTION dlexize_intdict(internal,internal,internal,internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C'
WITH (isstrict);
CREATE TEXT SEARCH TEMPLATE intdict_template (
LEXIZE = dlexize_intdict, INIT = dinit_intdict
);
CREATE TEXT SEARCH DICTIONARY intdict (
TEMPLATE = intdict_template,
MAXLEN = 6, REJECTLONG = false
);
COMMENT ON TEXT SEARCH DICTIONARY intdict IS 'Dictionary for Integers';
END;
</programlisting>
</para>
</sect1>
<sect1 id="textsearch-parser-example">
<title>Example of Creating a Parser</title>
<para>
The <acronym>SQL</acronym> command <literal>CREATE TEXT SEARCH PARSER</literal>
creates a parser for full text searching. In our example we will implement
a simple parser which recognizes space-delimited words and
has only two token types (3, word, Word; 12, blank, Space symbols). These
identifiers were chosen for compatibility with the default
<function>headline()</function> function, since we do not implement our own
version.
</para>
<para>
To implement a parser one needs to create a minimum of four functions.
</para>
<variablelist>
<varlistentry>
<term>
<synopsis>
START = <replaceable class="PARAMETER">start_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Initialize the parser. Arguments are a pointer to the parsed text and its
length.
</para>
<para>
Returns a pointer to the parser's internal state structure. Note that it
should be <function>malloc</>ed or <function>palloc</>ed in
<literal>TopMemoryContext</>. We call it <literal>ParserState</>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>
<synopsis>
GETTOKEN = <replaceable class="PARAMETER">gettoken_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Returns the next token.
Arguments are <literal>ParserState *, char **, int *</literal>.
</para>
<para>
This procedure will be called repeatedly until it returns a token type of
zero, indicating the end of the text.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>
<synopsis>
END = <replaceable class="PARAMETER">end_function</replaceable>,
</synopsis>
</term>
<listitem>
<para>
This void function will be called after parsing is finished; it frees the
resources allocated by the start function (the <literal>ParserState</>).
The argument is <literal>ParserState *</literal>.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>
<synopsis>
LEXTYPES = <replaceable class="PARAMETER">lextypes_function</replaceable>
</synopsis>
</term>
<listitem>
<para>
Returns an array containing the id, alias, and the description of the tokens
in the parser. See <structname>LexDescr</structname> in <filename>src/include/utils/ts_public.h</>.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
Below is the source code of our test parser, organized as a <filename>contrib</> module.
</para>
<para>
Testing:
<programlisting>
SELECT * FROM ts_parse('testparser','That''s my first own parser');
tokid | token
-------+--------
3 | That's
12 |
3 | my
12 |
3 | first
12 |
3 | own
12 |
3 | parser
SELECT to_tsvector('testcfg','That''s my first own parser');
to_tsvector
-------------------------------------------------
'my':2 'own':4 'first':3 'parser':5 'that''s':1
SELECT ts_headline('testcfg','Supernovae stars are the brightest phenomena in galaxies', to_tsquery('testcfg', 'star'));
headline
-----------------------------------------------------------------
Supernovae &lt;b&gt;stars&lt;/b&gt; are the brightest phenomena in galaxies
</programlisting>
</para>
<para>
This test parser is an example adapted from a tutorial by Valli, <ulink
url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html">parser
HOWTO</ulink>.
</para>
<para>
To compile the example just do:
<programlisting>
$ make
$ make install
$ psql regression &lt; test_parser.sql
</programlisting>
</para>
<para>
This is the <filename>test_parser.c</> file:
<programlisting>
#include "postgres.h"
#include "fmgr.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
/*
* types
*/
/* self-defined type */
typedef struct {
char * buffer; /* text to parse */
int len; /* length of the text in buffer */
int pos; /* position of the parser */
} ParserState;
/* copy-paste from wparser.h of tsearch2 */
typedef struct {
int lexid;
char *alias;
char *descr;
} LexDescr;
/*
* prototypes
*/
PG_FUNCTION_INFO_V1(testprs_start);
Datum testprs_start(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(testprs_getlexeme);
Datum testprs_getlexeme(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(testprs_end);
Datum testprs_end(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(testprs_lextype);
Datum testprs_lextype(PG_FUNCTION_ARGS);
/*
* functions
*/
Datum testprs_start(PG_FUNCTION_ARGS)
{
ParserState *pst = (ParserState *) palloc(sizeof(ParserState));
pst-&gt;buffer = (char *) PG_GETARG_POINTER(0);
pst-&gt;len = PG_GETARG_INT32(1);
pst-&gt;pos = 0;
PG_RETURN_POINTER(pst);
}
Datum testprs_getlexeme(PG_FUNCTION_ARGS)
{
ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
char **t = (char **) PG_GETARG_POINTER(1);
int *tlen = (int *) PG_GETARG_POINTER(2);
int type;
*tlen = pst-&gt;pos;
*t = pst-&gt;buffer + pst-&gt;pos;
/* check the position against the length before reading the buffer */
if (pst-&gt;pos &lt; pst-&gt;len &amp;&amp;
    (pst-&gt;buffer)[pst-&gt;pos] == ' ')
{
    /* blank type */
    type = 12;
    /* go to the next non-white-space character */
    while (pst-&gt;pos &lt; pst-&gt;len &amp;&amp;
           (pst-&gt;buffer)[pst-&gt;pos] == ' ')
        (pst-&gt;pos)++;
} else {
    /* word type */
    type = 3;
    /* go to the next white-space character */
    while (pst-&gt;pos &lt; pst-&gt;len &amp;&amp;
           (pst-&gt;buffer)[pst-&gt;pos] != ' ')
        (pst-&gt;pos)++;
}
*tlen = pst-&gt;pos - *tlen;
/* we are finished if (*tlen == 0) */
if (*tlen == 0)
type=0;
PG_RETURN_INT32(type);
}
Datum testprs_end(PG_FUNCTION_ARGS)
{
ParserState *pst = (ParserState *) PG_GETARG_POINTER(0);
pfree(pst);
PG_RETURN_VOID();
}
Datum testprs_lextype(PG_FUNCTION_ARGS)
{
/*
 * Remarks:
 * - we have to return the blank tokens so that headline generation works
 * - we use the same lexids as the default word parser, so that we can
 *   reuse its headline function
 */
LexDescr *descr = (LexDescr *) palloc(sizeof(LexDescr) * (2+1));
/* there are only two types in this parser */
descr[0].lexid = 3;
descr[0].alias = pstrdup("word");
descr[0].descr = pstrdup("Word");
descr[1].lexid = 12;
descr[1].alias = pstrdup("blank");
descr[1].descr = pstrdup("Space symbols");
descr[2].lexid = 0;
PG_RETURN_POINTER(descr);
}
</programlisting>
This is the <literal>Makefile</literal>:
<programlisting>
override CPPFLAGS := -I. $(CPPFLAGS)
MODULE_big = test_parser
OBJS = test_parser.o
DATA_built = test_parser.sql
DATA =
DOCS = README.test_parser
REGRESS = test_parser
ifdef USE_PGXS
PGXS := $(shell pg_config --pgxs)
include $(PGXS)
else
subdir = contrib/test_parser
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
</programlisting>
This is the <literal>test_parser.sql.in</literal> file:
<programlisting>
SET default_text_search_config = 'english';
BEGIN;
CREATE FUNCTION testprs_start(internal,int4)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_getlexeme(internal,internal,internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_end(internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE 'C' with (isstrict);
CREATE FUNCTION testprs_lextype(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE 'C' with (isstrict);
CREATE TEXT SEARCH PARSER testparser (
START = testprs_start,
GETTOKEN = testprs_getlexeme,
END = testprs_end,
LEXTYPES = testprs_lextype
);
CREATE TEXT SEARCH CONFIGURATION testcfg (PARSER = testparser);
ALTER TEXT SEARCH CONFIGURATION testcfg ADD MAPPING FOR word WITH simple;
END;
</programlisting>
</para>
</sect1>
</chapter>