mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-10-01 12:21:18 +02:00
eb67623c96
This allows these modules to be installed into a database without superuser privileges (assuming that the DBA or sysadmin has installed the module's files in the expected place). You only need CREATE privilege on the current database, which by default would be available to the database owner. The following modules are marked trusted: btree_gin btree_gist citext cube dict_int earthdistance fuzzystrmatch hstore hstore_plperl intarray isn jsonb_plperl lo ltree pg_trgm pgcrypto seg tablefunc tcn tsm_system_rows tsm_system_time unaccent uuid-ossp In the future we might mark some more modules trusted, but there seems to be no debate about these, and on the whole it seems wise to be conservative with use of this feature to start out with. Discussion: https://postgr.es/m/32315.1580326876@sss.pgh.pa.us
203 lines
6.2 KiB
Plaintext
<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
 <title>unaccent</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</filename> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries.  This allows accent-insensitive processing
  for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</filename> cannot be used as a
  normalizing dictionary for the <filename>thesaurus</filename> dictionary.
 </para>

 <para>
  This module is considered <quote>trusted</quote>, that is, it can be
  installed by non-superusers who have <literal>CREATE</literal> privilege
  on the current database.
 </para>

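 <para>
  As a minimal sketch of such an installation, assuming the module's files
  are already present on the server and that the current role has
  <literal>CREATE</literal> privilege on the current database:
<programlisting>
-- Run while connected to the database that should receive the module
CREATE EXTENSION unaccent;
</programlisting>
 </para>
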
 <sect2>
  <title>Configuration</title>

  <para>
   An <literal>unaccent</literal> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</literal> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
     the <productname>PostgreSQL</productname> installation's shared-data directory).
     Its name must end in <literal>.rules</literal> (which is not to be included in
     the <literal>RULES</literal> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents one translation rule, consisting of an accented
     character followed by its unaccented replacement.  The first is
     translated into the second.  For example,
<programlisting>
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
</programlisting>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </para>
   </listitem>

   <listitem>
    <para>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </para>
   </listitem>

   <listitem>
    <para>
     Actually, each <quote>character</quote> can be any string not containing
     whitespace, so <filename>unaccent</filename> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </para>
   </listitem>

   <listitem>
    <para>
     As with other <productname>PostgreSQL</productname> text search configuration files,
     the rules file must be stored in UTF-8 encoding.  The data is
     automatically translated into the current database's encoding when
     loaded.  Any lines containing untranslatable characters are silently
     ignored, so that rules files can contain rules that are not applicable in
     the current encoding.
    </para>
   </listitem>
  </itemizedlist>

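  <para>
   The format described above can be illustrated with a small hypothetical
   rules file.  The file name <filename>my_rules.rules</filename> and its
   contents are assumptions made for this sketch and are not shipped with
   the module.  The first two lines each replace one string with another,
   and the last line contains a single character, so every instance of
   that character is deleted:
<programlisting>
ß        ss
œ        oe
¨
</programlisting>
  </para>
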
  <para>
   A more complete example, which is directly useful for most European
   languages, can be found in <filename>unaccent.rules</filename>, which is installed
   in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
   module is installed.  This rules file translates characters with accents
   to the same characters without accents, and it also expands ligatures
   into the equivalent series of simple characters (for example, Æ to
   AE).
  </para>
 </sect2>

 <sect2>
  <title>Usage</title>

  <para>
   Installing the <literal>unaccent</literal> extension creates a text
   search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
   based on it.  The <literal>unaccent</literal> dictionary has the default
   parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
   usable with the standard <filename>unaccent.rules</filename> file.
   If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

   or create new dictionaries based on the template.
  </para>

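  <para>
   Creating a separate dictionary from the template might look like the
   following sketch.  The dictionary name <literal>my_unaccent</literal> and
   the rules file <filename>my_rules.rules</filename> are hypothetical; the
   file is assumed to exist already in
   <filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
-- New dictionary based on the unaccent template, with its own rules file
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- Check the new dictionary the same way as the standard one
SELECT ts_lexize('my_unaccent', 'Hôtel');
</programlisting>
   This leaves the standard <literal>unaccent</literal> dictionary untouched,
   which may be preferable to altering it when existing text search
   configurations already rely on it.
  </para>
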
  <para>
   To test the dictionary, you can try:
<programlisting>
mydb=# SELECT ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
  </para>

  <para>
   Here is an example showing how to insert the
   <filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# SELECT to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# SELECT ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
  </para>
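  <para>
   A configuration such as <literal>fr</literal> is typically applied to a
   table column.  The following sketch assumes a hypothetical table
   <literal>docs</literal> with a <type>text</type> column
   <literal>body</literal>; neither name comes from the module:
<programlisting>
-- Hypothetical table, for illustration only
CREATE TABLE docs (id serial PRIMARY KEY, body text);

-- Expression index built with the accent-insensitive configuration
CREATE INDEX docs_body_fr_idx ON docs
    USING gin (to_tsvector('fr', body));

-- The query repeats the indexed expression so that the index can be used
SELECT id, body
  FROM docs
 WHERE to_tsvector('fr', body) @@ to_tsquery('fr', 'Hotels');
</programlisting>
  </para>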
 </sect2>

 <sect2>
  <title>Functions</title>

  <para>
   The <function>unaccent()</function> function removes accents (diacritic signs) from
   a given string.  Basically, it's a wrapper around
   <filename>unaccent</filename>-type dictionaries, but it can be used outside normal
   text search contexts.
  </para>

  <indexterm>
   <primary>unaccent</primary>
  </indexterm>

<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>

  <para>
   If the <replaceable class="parameter">dictionary</replaceable> argument is
   omitted, the text search dictionary named <literal>unaccent</literal> and
   appearing in the same schema as the <function>unaccent()</function>
   function itself is used.
  </para>

  <para>
   For example:
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
  </para>
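  <para>
   Because <function>unaccent()</function> operates on ordinary
   <type>text</type> values, it can also support accent-insensitive
   comparisons outside text search.  A minimal sketch, assuming the default
   <literal>unaccent</literal> dictionary with the standard rules file, in
   which case the comparison is expected to be true:
<programlisting>
-- Compare two strings while ignoring both accents and letter case
SELECT lower(unaccent('Hôtel')) = lower(unaccent('hotel'));
</programlisting>
  </para>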
 </sect2>

</sect1>