unaccent

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes. It's a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search.

The current implementation of unaccent cannot be used as a normalizing dictionary for the thesaurus dictionary.

Configuration

An unaccent dictionary accepts the following options:

    RULES is the base name of the file containing the list of translation rules. This file must be stored in $SHAREDIR/tsearch_data/ (where $SHAREDIR means the PostgreSQL installation's shared-data directory). Its name must end in .rules (which is not to be included in the RULES parameter).

The rules file has the following format:

    Each line represents a pair, consisting of a character with accent followed by a character without accent. The first is translated into the second.

For example,

    À        A
    Á        A
    Â        A
    Ã        A
    Ä        A
    Å        A
    Æ        A

A more complete example, which is directly useful for most European languages, can be found in unaccent.rules, which is installed in $SHAREDIR/tsearch_data/ when the unaccent module is installed.

Usage

Installing the unaccent extension creates a text search template unaccent and a dictionary unaccent based on it. The unaccent dictionary has the default parameter setting RULES='unaccent', which makes it immediately usable with the standard unaccent.rules file. If you wish, you can alter the parameter, for example

    mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

or create new dictionaries based on the template.

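Creating a new dictionary from the template, rather than altering the default one, might look like the following sketch. The dictionary name my_unaccent and the rules file base name my_rules are hypothetical, and the statement assumes that a file named my_rules.rules already exists in $SHAREDIR/tsearch_data/.

```sql
-- Hypothetical names: my_unaccent (dictionary) and my_rules (rules file base name).
-- Assumes the unaccent extension is installed and that
-- $SHAREDIR/tsearch_data/my_rules.rules exists.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- The new dictionary can then be checked like the default one.
SELECT ts_lexize('my_unaccent', 'Hôtel');
```

Keeping a custom rule set in its own dictionary leaves the default unaccent dictionary untouched, so configurations that already rely on it keep their behavior.
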
To test the dictionary, you can try:

    mydb=# select ts_lexize('unaccent','Hôtel');
     ts_lexize
    -----------
     {Hotel}
    (1 row)

Here is an example showing how to insert the unaccent dictionary into a text search configuration:

    mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
    mydb=# ALTER TEXT SEARCH CONFIGURATION fr
            ALTER MAPPING FOR hword, hword_part, word
            WITH unaccent, french_stem;
    mydb=# select to_tsvector('fr','Hôtels de la Mer');
        to_tsvector
    -------------------
     'hotel':1 'mer':4
    (1 row)

    mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
     ?column?
    ----------
     t
    (1 row)

    mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
          ts_headline
    ------------------------
     <b>Hôtel</b> de la Mer
    (1 row)

Functions

The unaccent() function removes accents (diacritic signs) from a given string. Basically, it's a wrapper around the unaccent dictionary, but it can be used outside normal text search contexts.

    unaccent([dictionary, ] string) returns text

When the dictionary argument is omitted, the unaccent dictionary is used. For example:

    SELECT unaccent('unaccent', 'Hôtel');
    SELECT unaccent('Hôtel');
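
Because unaccent() operates on plain strings, it can also support accent-insensitive comparisons outside of full text search. The following is a minimal sketch rather than a recommended pattern: the sample names and the places alias are made up for illustration, and it assumes that the unaccent extension is installed with the standard unaccent.rules file.

```sql
-- Hypothetical sample data supplied inline via VALUES; no table is required.
-- Assumes the unaccent extension is installed with the standard rules file.
SELECT name
FROM (VALUES ('Hôtel de la Mer'),
             ('Hotel California'),
             ('Motel 6')) AS places(name)
-- Strip accents, then fold case, so 'Hôtel' and 'Hotel' compare alike.
WHERE lower(unaccent(name)) LIKE 'hotel%';
```

With the standard rules, the first two names should match and the third should not.
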