Document filtering dictionaries in textsearch.sgml.
While at it, copy-edit the description of prefix-match marker support in synonym dictionaries, and clarify the description of the default unaccent dictionary a bit more.
parent
acac35adca
commit
9389ac8928
|
@ -1,4 +1,4 @@
|
||||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.58 2010/08/20 13:59:45 tgl Exp $ -->
|
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.59 2010/08/25 21:42:55 tgl Exp $ -->
|
||||||
|
|
||||||
<chapter id="textsearch">
|
<chapter id="textsearch">
|
||||||
<title>Full Text Search</title>
|
<title>Full Text Search</title>
|
||||||
|
@ -2064,6 +2064,14 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
|
||||||
(notice that one token can produce more than one lexeme)
|
(notice that one token can produce more than one lexeme)
|
||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
a single lexeme with the <literal>TSL_FILTER</> flag set, to replace
|
||||||
|
the original token with a new token to be passed to subsequent
|
||||||
|
dictionaries (a dictionary that does this is called a
|
||||||
|
<firstterm>filtering dictionary</>)
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
an empty array if the dictionary knows the token, but it is a stop word
|
an empty array if the dictionary knows the token, but it is a stop word
|
||||||
|
@ -2096,6 +2104,13 @@ SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.h
|
||||||
until some dictionary recognizes it as a known word. If it is identified
|
until some dictionary recognizes it as a known word. If it is identified
|
||||||
as a stop word, or if no dictionary recognizes the token, it will be
|
as a stop word, or if no dictionary recognizes the token, it will be
|
||||||
discarded and not indexed or searched for.
|
discarded and not indexed or searched for.
|
||||||
|
Normally, the first dictionary that returns a non-<literal>NULL</>
|
||||||
|
output determines the result, and any remaining dictionaries are not
|
||||||
|
consulted; but a filtering dictionary can replace the given word
|
||||||
|
with a modified word, which is then passed to subsequent dictionaries.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
The general rule for configuring a list of dictionaries
|
The general rule for configuring a list of dictionaries
|
||||||
is to place first the most narrow, most specific dictionary, then the more
|
is to place first the most narrow, most specific dictionary, then the more
|
||||||
general dictionaries, finishing with a very general dictionary, like
|
general dictionaries, finishing with a very general dictionary, like
|
||||||
|
@ -2112,6 +2127,16 @@ ALTER TEXT SEARCH CONFIGURATION astro_en
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
A filtering dictionary can be placed anywhere in the list, except at the
|
||||||
|
end where it'd be useless. Filtering dictionaries are useful to partially
|
||||||
|
normalize words to simplify the task of later dictionaries. For example,
|
||||||
|
a filtering dictionary could be used to remove accents from accented
|
||||||
|
letters, as is done by the
|
||||||
|
<link linkend="unaccent"><filename>contrib/unaccent</></link>
|
||||||
|
extension module.
|
||||||
|
</para>
|
||||||
|
|
||||||
<sect2 id="textsearch-stopwords">
|
<sect2 id="textsearch-stopwords">
|
||||||
<title>Stop Words</title>
|
<title>Stop Words</title>
|
||||||
|
|
||||||
|
@ -2296,63 +2321,6 @@ SELECT * FROM ts_debug('english', 'Paris');
|
||||||
</screen>
|
</screen>
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>
|
|
||||||
An asterisk (<literal>*</literal>) at the end of definition word indicates
|
|
||||||
that definition word is a prefix, and <function>to_tsquery()</function>
|
|
||||||
function will transform that definition to the prefix search format (see
|
|
||||||
<xref linkend="textsearch-parsing-queries">).
|
|
||||||
Notice that it is ignored in <function>to_tsvector()</function>.
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
|
|
||||||
<programlisting>
|
|
||||||
postgres pgsql
|
|
||||||
postgresql pgsql
|
|
||||||
postgre pgsql
|
|
||||||
gogle googl
|
|
||||||
indices index*
|
|
||||||
</programlisting>
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
|
||||||
Results:
|
|
||||||
<screen>
|
|
||||||
=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
|
|
||||||
=# SELECT ts_lexize('syn','indices');
|
|
||||||
ts_lexize
|
|
||||||
-----------
|
|
||||||
{index}
|
|
||||||
(1 row)
|
|
||||||
|
|
||||||
=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
|
|
||||||
=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
|
|
||||||
=# SELECT to_tsquery('tst','indices');
|
|
||||||
to_tsquery
|
|
||||||
------------
|
|
||||||
'index':*
|
|
||||||
(1 row)
|
|
||||||
|
|
||||||
=# SELECT 'indexes are very useful'::tsvector;
|
|
||||||
tsvector
|
|
||||||
---------------------------------
|
|
||||||
'are' 'indexes' 'useful' 'very'
|
|
||||||
(1 row)
|
|
||||||
|
|
||||||
=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
|
|
||||||
?column?
|
|
||||||
----------
|
|
||||||
t
|
|
||||||
(1 row)
|
|
||||||
|
|
||||||
=# SELECT to_tsvector('tst','indices');
|
|
||||||
to_tsvector
|
|
||||||
-------------
|
|
||||||
'index':1
|
|
||||||
(1 row)
|
|
||||||
</screen>
|
|
||||||
</para>
|
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
The only parameter required by the <literal>synonym</> template is
|
The only parameter required by the <literal>synonym</> template is
|
||||||
<literal>SYNONYMS</>, which is the base name of its configuration file
|
<literal>SYNONYMS</>, which is the base name of its configuration file
|
||||||
|
@ -2374,6 +2342,60 @@ indices index*
|
||||||
<literal>true</>, words and tokens are not folded to lower case,
|
<literal>true</>, words and tokens are not folded to lower case,
|
||||||
but are compared as-is.
|
but are compared as-is.
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
An asterisk (<literal>*</literal>) can be placed at the end of a synonym
|
||||||
|
in the configuration file. This indicates that the synonym is a prefix.
|
||||||
|
The asterisk is ignored when the entry is used in
|
||||||
|
<function>to_tsvector()</function>, but when it is used in
|
||||||
|
<function>to_tsquery()</function>, the result will be a query item with
|
||||||
|
the prefix match marker (see
|
||||||
|
<xref linkend="textsearch-parsing-queries">).
|
||||||
|
For example, suppose we have these entries in
|
||||||
|
<filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
|
||||||
|
<programlisting>
|
||||||
|
postgres pgsql
|
||||||
|
postgresql pgsql
|
||||||
|
postgre pgsql
|
||||||
|
gogle googl
|
||||||
|
indices index*
|
||||||
|
</programlisting>
|
||||||
|
Then we will get these results:
|
||||||
|
<screen>
|
||||||
|
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
|
||||||
|
mydb=# SELECT ts_lexize('syn','indices');
|
||||||
|
ts_lexize
|
||||||
|
-----------
|
||||||
|
{index}
|
||||||
|
(1 row)
|
||||||
|
|
||||||
|
mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
|
||||||
|
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
|
||||||
|
mydb=# SELECT to_tsvector('tst','indices');
|
||||||
|
to_tsvector
|
||||||
|
-------------
|
||||||
|
'index':1
|
||||||
|
(1 row)
|
||||||
|
|
||||||
|
mydb=# SELECT to_tsquery('tst','indices');
|
||||||
|
to_tsquery
|
||||||
|
------------
|
||||||
|
'index':*
|
||||||
|
(1 row)
|
||||||
|
|
||||||
|
mydb=# SELECT 'indexes are very useful'::tsvector;
|
||||||
|
tsvector
|
||||||
|
---------------------------------
|
||||||
|
'are' 'indexes' 'useful' 'very'
|
||||||
|
(1 row)
|
||||||
|
|
||||||
|
mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
|
||||||
|
?column?
|
||||||
|
----------
|
||||||
|
t
|
||||||
|
(1 row)
|
||||||
|
</screen>
|
||||||
|
</para>
|
||||||
</sect2>
|
</sect2>
|
||||||
|
|
||||||
<sect2 id="textsearch-thesaurus">
|
<sect2 id="textsearch-thesaurus">
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.6 2010/08/25 02:12:00 tgl Exp $ -->
|
<!-- $PostgreSQL: pgsql/doc/src/sgml/unaccent.sgml,v 1.7 2010/08/25 21:42:55 tgl Exp $ -->
|
||||||
|
|
||||||
<sect1 id="unaccent">
|
<sect1 id="unaccent">
|
||||||
<title>unaccent</title>
|
<title>unaccent</title>
|
||||||
|
@ -75,8 +75,10 @@
|
||||||
<para>
|
<para>
|
||||||
Running the installation script <filename>unaccent.sql</> creates a text
|
Running the installation script <filename>unaccent.sql</> creates a text
|
||||||
search template <literal>unaccent</> and a dictionary <literal>unaccent</>
|
search template <literal>unaccent</> and a dictionary <literal>unaccent</>
|
||||||
based on it, with default parameters. You can alter the
|
based on it. The <literal>unaccent</> dictionary has the default
|
||||||
parameters, for example
|
parameter setting <literal>RULES='unaccent'</>, which makes it immediately
|
||||||
|
usable with the standard <filename>unaccent.rules</> file.
|
||||||
|
If you wish, you can alter the parameter, for example
|
||||||
|
|
||||||
<programlisting>
|
<programlisting>
|
||||||
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
|
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
|
||||||
|
|
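Note on the lookup rule documented above: the dictionary chain with a filtering dictionary can be sketched in a few lines of Python. This is only an illustrative model, not PostgreSQL's implementation. The names `unaccent_like`, `simple_like` and `lexize_chain`, the tuple-based return convention, and the stop word list are all assumptions made for this sketch; the boolean in the tuple stands in for the `TSL_FILTER` flag.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# Hypothetical return convention for a toy dictionary:
#   None              -> token not recognized, try the next dictionary
#   ([], False)       -> token is a known stop word, discard it
#   (lexemes, False)  -> normal result, later dictionaries are not consulted
#   ([word], True)    -> filtering result (stand-in for TSL_FILTER):
#                        replace the token and keep going
Result = Optional[Tuple[List[str], bool]]
Dictionary = Callable[[str], Result]


def unaccent_like(token: str) -> Result:
    """Toy filtering dictionary: strip combining accents from the token."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if stripped == token:
        return None  # nothing to rewrite, so act as if the token is unknown
    return ([stripped], True)


def simple_like(token: str) -> Result:
    """Toy general dictionary: drop stop words, lower-case everything else."""
    stop_words = {"the", "a"}  # assumed list, for illustration only
    lowered = token.lower()
    if lowered in stop_words:
        return ([], False)
    return ([lowered], False)


def lexize_chain(token: str, dictionaries: List[Dictionary]) -> List[str]:
    """Consult dictionaries in order until one produces a final result."""
    current = token
    for dictionary in dictionaries:
        result = dictionary(current)
        if result is None:
            continue  # unknown to this dictionary
        lexemes, is_filter = result
        if is_filter and lexemes:
            current = lexemes[0]  # pass the modified word to later dictionaries
            continue
        return lexemes  # first non-filter result wins (may be empty)
    return []  # no dictionary recognized the token, so it is discarded


if __name__ == "__main__":
    print(lexize_chain("H\u00f4tel", [unaccent_like, simple_like]))
    print(lexize_chain("the", [unaccent_like, simple_like]))
```

In this model a filtering dictionary placed last never yields a lexeme by itself, which matches the remark in the new text that putting one at the end of the list would be useless.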