<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
<title>unaccent — a text search dictionary which removes diacritics</title>

<indexterm zone="unaccent">
<primary>unaccent</primary>
</indexterm>

<para>
<filename>unaccent</filename> is a text search dictionary that removes accents
(diacritic signs) from lexemes.
It's a filtering dictionary, which means its output is
always passed to the next dictionary (if any), unlike the normal
behavior of dictionaries. This allows accent-insensitive processing
for full text search.
</para>

<para>
The current implementation of <filename>unaccent</filename> cannot be used as a
normalizing dictionary for the <filename>thesaurus</filename> dictionary.
</para>

<para>
This module is considered <quote>trusted</quote>, that is, it can be
installed by non-superusers who have <literal>CREATE</literal> privilege
on the current database.
</para>
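
<para>
As a minimal sketch, assuming the module's files are already present in
the <productname>PostgreSQL</productname> installation, a user holding that
privilege could install it into the current database with:
<programlisting>
CREATE EXTENSION unaccent;
</programlisting>
</para>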

<sect2 id="unaccent-configuration">
<title>Configuration</title>

<para>
An <literal>unaccent</literal> dictionary accepts the following options:
</para>
<itemizedlist>
<listitem>
<para>
<literal>RULES</literal> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
the <productname>PostgreSQL</productname> installation's shared-data directory).
Its name must end in <literal>.rules</literal> (which is not to be included in
the <literal>RULES</literal> parameter).
</para>
</listitem>
</itemizedlist>
<para>
The rules file has the following format:
</para>
<itemizedlist>
<listitem>
<para>
Each line represents one translation rule, consisting of a character with
accent followed by a character without accent. The first is translated
into the second. For example,
<programlisting>
À A
Á A
Â A
Ã A
Ä A
Å A
Æ AE
</programlisting>
The two characters must be separated by whitespace, and any leading or
trailing whitespace on a line is ignored.
</para>
</listitem>

<listitem>
<para>
Alternatively, if only one character is given on a line, instances of
that character are deleted; this is useful in languages where accents
are represented by separate characters.
</para>
</listitem>

<listitem>
<para>
More precisely, each <quote>character</quote> can be any string not containing
whitespace, so <filename>unaccent</filename> dictionaries can also be used for
other sorts of substring substitutions besides diacritic removal.
</para>
</listitem>

<listitem>
<para>
Some characters, such as numeric symbols, may require whitespace in their
translation. In this case, the translated characters can be enclosed in
double quotes. To include a double quote in the translated characters,
escape it with a second double quote. For example:
<programlisting>
¼ " 1/4"
½ " 1/2"
¾ " 3/4"
“ """"
” """"
</programlisting>
</para>
</listitem>

<listitem>
<para>
As with other <productname>PostgreSQL</productname> text search configuration files,
the rules file must be stored in UTF-8 encoding. The data is
automatically translated into the current database's encoding when
loaded. Any lines containing untranslatable characters are silently
ignored, so that rules files can contain rules that are not applicable in
the current encoding.
</para>
</listitem>
</itemizedlist>

<para>
A more complete example, which is directly useful for most European
languages, can be found in <filename>unaccent.rules</filename>, which is installed
in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
module is installed. This rules file translates characters with accents
to the same characters without accents, and it also expands ligatures
into the equivalent series of simple characters (for example, Æ to
AE).
</para>
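
<para>
The format rules above can also be combined in a custom rules file. The
following is only an illustrative sketch: the file name
<filename>my_rules.rules</filename> and the particular rules are
hypothetical and are not part of the standard distribution. It shows, in
order, a plain translation, a translation to a multi-character string, a
deletion rule consisting of a single character, and a quoted translation
that contains whitespace:
<programlisting>
é e
ß ss
´
№ " No."
</programlisting>
Assuming the file is stored in <filename>$SHAREDIR/tsearch_data/</filename>
in UTF-8 encoding, it would be selected with
<literal>RULES='my_rules'</literal>.
</para>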
</sect2>

<sect2 id="unaccent-usage">
<title>Usage</title>

<para>
Installing the <literal>unaccent</literal> extension creates a text
search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
based on it. The <literal>unaccent</literal> dictionary has the default
parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
usable with the standard <filename>unaccent.rules</filename> file.
If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

or create new dictionaries based on the template.
</para>
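
<para>
Creating a separate dictionary leaves the default <literal>unaccent</literal>
dictionary untouched. In the following minimal sketch, the dictionary name
<literal>my_unaccent</literal> and the rules file
<filename>my_rules.rules</filename> are hypothetical, and the file is
assumed to exist in <filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_rules');
mydb=# SELECT ts_lexize('my_unaccent', 'Hôtel');
</programlisting>
The result of the second statement depends on the contents of the
hypothetical rules file.
</para>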

<para>
To test the dictionary, you can try:
<programlisting>
mydb=# SELECT ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
</para>

<para>
Here is an example showing how to insert the
<filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# SELECT to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# SELECT ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
</para>
</sect2>

<sect2 id="unaccent-functions">
<title>Functions</title>

<para>
The <function>unaccent()</function> function removes accents (diacritic signs) from
a given string. Basically, it's a wrapper around
<filename>unaccent</filename>-type dictionaries, but it can be used outside normal
text search contexts.
</para>

<indexterm>
<primary>unaccent</primary>
</indexterm>

<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>

<para>
If the <replaceable class="parameter">dictionary</replaceable> argument is
omitted, the text search dictionary named <literal>unaccent</literal> and
appearing in the same schema as the <function>unaccent()</function>
function itself is used.
</para>

<para>
For example:
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
</para>
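
<para>
Two further uses can be sketched as follows. The first call names the
dictionary with a schema qualification and assumes that the extension was
installed into schema <literal>public</literal>. The second applies
<function>unaccent()</function> to both sides of a comparison; the
table <literal>hotels</literal> and its column <literal>name</literal>
are hypothetical:
<programlisting>
SELECT unaccent('public.unaccent', 'Hôtel');
SELECT name FROM hotels WHERE unaccent(name) = unaccent('Hôtel de la Mer');
</programlisting>
Such a comparison ignores only the differences covered by the rules file
in use; it remains case-sensitive.
</para>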
</sect2>

</sect1>