<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
<title>unaccent</title>

<indexterm zone="unaccent">
<primary>unaccent</primary>
</indexterm>

<para>
<filename>unaccent</filename> is a text search dictionary that removes accents
(diacritic signs) from lexemes.
It's a filtering dictionary, which means its output is
always passed to the next dictionary (if any), unlike the normal
behavior of dictionaries. This allows accent-insensitive processing
for full text search.
</para>

<para>
The current implementation of <filename>unaccent</filename> cannot be used as a
normalizing dictionary for the <filename>thesaurus</filename> dictionary.
</para>

<para>
This module is considered <quote>trusted</quote>, that is, it can be
installed by non-superusers who have <literal>CREATE</literal> privilege
on the current database.
</para>
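
<para>
As a minimal sketch, assuming the module's files are already present on the
server, a user with the required privilege could install it in the current
database with:
<programlisting>
-- Creates the unaccent template, dictionary, and function in this database.
CREATE EXTENSION unaccent;
</programlisting>
</para>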
<sect2>
<title>Configuration</title>

<para>
An <literal>unaccent</literal> dictionary accepts the following options:
</para>
<itemizedlist>
<listitem>
<para>
<literal>RULES</literal> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
the <productname>PostgreSQL</productname> installation's shared-data directory).
Its name must end in <literal>.rules</literal> (which is not to be included in
the <literal>RULES</literal> parameter).
</para>
</listitem>
</itemizedlist>
<para>
The rules file has the following format:
</para>
<itemizedlist>
<listitem>
<para>
Each line represents one translation rule, consisting of a character with
accent followed by a character without accent. The first is translated
into the second. For example,
<programlisting>
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
</programlisting>
The two characters must be separated by whitespace, and any leading or
trailing whitespace on a line is ignored.
</para>
</listitem>

<listitem>
<para>
Alternatively, if only one character is given on a line, instances of
that character are deleted; this is useful in languages where accents
are represented by separate characters.
</para>
</listitem>

<listitem>
<para>
Actually, each <quote>character</quote> can be any string not containing
whitespace, so <filename>unaccent</filename> dictionaries could be used for
other sorts of substring substitutions besides diacritic removal.
</para>
</listitem>

<listitem>
<para>
As with other <productname>PostgreSQL</productname> text search configuration files,
the rules file must be stored in UTF-8 encoding. The data is
automatically translated into the current database's encoding when
loaded. Any lines containing untranslatable characters are silently
ignored, so that rules files can contain rules that are not applicable in
the current encoding.
</para>
</listitem>
</itemizedlist>
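
<para>
To illustrate the format, here is a sketch of a small custom rules file. The
file name <filename>my_rules.rules</filename> and its contents are
hypothetical and are not shipped with the module. The first line replaces one
character with another, the second replaces one character with a two-character
string, and the third, which names only a single character (a spacing acute
accent), causes that character to be deleted:
<programlisting>
é        e
ß        ss
´
</programlisting>
</para>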

<para>
A more complete example, which is directly useful for most European
languages, can be found in <filename>unaccent.rules</filename>, which is installed
in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
module is installed. This rules file translates characters with accents
to the same characters without accents, and it also expands ligatures
into the equivalent series of simple characters (for example, Æ to
AE).
</para>
</sect2>

<sect2>
<title>Usage</title>

<para>
Installing the <literal>unaccent</literal> extension creates a text
search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
based on it. The <literal>unaccent</literal> dictionary has the default
parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
usable with the standard <filename>unaccent.rules</filename> file.
If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

or create new dictionaries based on the template.
</para>
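
<para>
For instance, a separate dictionary could be created from the template as
sketched below. The names <literal>my_unaccent</literal> and
<literal>my_rules</literal> are hypothetical; the example assumes a file
<filename>my_rules.rules</filename> exists in
<filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
-- Hypothetical dictionary name and rules file; adjust both to your setup.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);
</programlisting>
</para>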

<para>
To test the dictionary, you can try:
<programlisting>
mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
</para>

<para>
Here is an example showing how to insert the
<filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
</para>
</sect2>

<sect2>
<title>Functions</title>

<para>
The <function>unaccent()</function> function removes accents (diacritic signs) from
a given string. Basically, it's a wrapper around
<filename>unaccent</filename>-type dictionaries, but it can be used outside normal
text search contexts.
</para>

<indexterm>
<primary>unaccent</primary>
</indexterm>

<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>

<para>
If the <replaceable class="parameter">dictionary</replaceable> argument is
omitted, the text search dictionary named <literal>unaccent</literal> and
appearing in the same schema as the <function>unaccent()</function>
function itself is used.
</para>

<para>
For example:
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
</para>
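
<para>
Because <function>unaccent()</function> works on ordinary strings, it can also
be used for accent-insensitive comparisons outside text search. The following
is only a sketch: the table <literal>places</literal> and its column
<literal>name</literal> are hypothetical, and the default
<filename>unaccent.rules</filename> file is assumed to be in use:
<programlisting>
-- Compare two strings after stripping accents from both sides.
SELECT unaccent('Hôtel') = unaccent('Hotel');

-- Hypothetical table and column: match names regardless of accents.
SELECT name FROM places WHERE unaccent(name) ILIKE unaccent('hôtel%');
</programlisting>
</para>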
</sect2>
</sect1>