<sect1 id="unaccent">
 <title>unaccent</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</> removes accents (diacritic signs) from a lexeme.
  It is a filtering dictionary, which means that its output is always
  passed to the next dictionary (if any), unlike the standard behavior
  of dictionaries. Currently, it supports the most important accents from
  European languages.
 </para>

 <para>
  Limitation: the current implementation of the <filename>unaccent</>
  dictionary cannot be used as a normalizing dictionary for the
  <filename>thesaurus</filename> dictionary.
 </para>

 <sect2>
  <title>Configuration</title>

  <para>
   An <literal>unaccent</> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</> (where <literal>$SHAREDIR</> means
     the <productname>PostgreSQL</> installation's shared-data directory).
     Its name must end in <literal>.rules</> (which is not to be included in
     the <literal>RULES</> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents a pair, consisting of a character with an accent
     followed by the character without the accent.  The first is translated
     into the second.  For example:
<programlisting>
À A
Á A
Â A
Ã A
Ä A
Å A
Æ A
</programlisting>
    </para>
   </listitem>
  </itemizedlist>

  <para>
   See <filename>unaccent.rules</>, which is installed in
   <filename>$SHAREDIR/tsearch_data/</>, for an example.
  </para>
 </sect2>

 <sect2>
  <title>Usage</title>

  <para>
   Running the installation script creates a text search template
   <literal>unaccent</> and a dictionary <literal>unaccent</>
   based on it, with default parameters.  You can alter the
   parameters, for example

<programlisting>
=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

   or create new dictionaries based on the template.
  </para>
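
  <para>
   As a minimal sketch of creating a new dictionary from the template,
   consider the following.  The name <literal>my_unaccent</> is
   hypothetical, and the example assumes that a file
   <filename>my_rules.rules</> has already been placed in
   <filename>$SHAREDIR/tsearch_data/</>:

<programlisting>
-- my_unaccent is an illustrative name; my_rules.rules is assumed to exist
CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_rules');
</programlisting>

   Creating a separate dictionary leaves the default
   <literal>unaccent</> dictionary untouched, so both sets of rules
   remain available.
  </para>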

  <para>
   To test the dictionary, you can try

<programlisting>
=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
  </para>

  <para>
   Filtering dictionaries are useful for the correct operation of the
   <function>ts_headline</function> function.  For example:
<programlisting>
=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)
=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)

</programlisting>
  </para>
 </sect2>

 <sect2>
  <title>Function</title>

  <para>
   The <function>unaccent</> function removes accents (diacritic signs)
   from its argument string.  Basically, it is a wrapper around the
   <filename>unaccent</> dictionary.
  </para>

  <indexterm>
   <primary>unaccent</primary>
  </indexterm>

<synopsis>
unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>,
</optional> <replaceable class="PARAMETER">string</replaceable>)
returns <type>text</type>
</synopsis>

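  <para>
   For example, with the dictionary named explicitly and with the optional
   <replaceable class="PARAMETER">dictionary</replaceable> argument omitted:
<programlisting>
SELECT unaccent('unaccent','Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
  </para>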
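
  <para>
   Because <function>unaccent</> can be called like any other function,
   it can also be applied outside of text search, for instance for
   accent-insensitive matching.  The following is only a sketch: the table
   <literal>places</> and its contents are hypothetical, and it assumes
   that the default rules file is in use:

<programlisting>
-- hypothetical table, for illustration only
CREATE TABLE places (name text);
INSERT INTO places VALUES ('Hôtel de la Mer'), ('Hotel du Lac');

-- strip accents from the stored value before comparing, so that a pattern
-- typed without accents can match both spellings
SELECT name FROM places WHERE unaccent(name) LIKE 'Hotel%';
</programlisting>
  </para>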
 </sect2>

</sect1>