postgresql/contrib/unaccent/sql/unaccent.sql

/*
 * This test must be run in a database with UTF-8 encoding,
 * because other encodings don't support all the characters used.
 */

SELECT getdatabaseencoding() <> 'UTF8'
       AS skip_test \gset
\if :skip_test
\quit
\endif

CREATE EXTENSION unaccent;

SET client_encoding TO 'UTF8';

SELECT unaccent('foobar');
SELECT unaccent('ёлка');
SELECT unaccent('ЁЖИК');
SELECT unaccent('˃˖˗˜');
SELECT unaccent('À');  -- Remove combining diacritical 0x0300
SELECT unaccent('℃℉'); -- degree signs
SELECT unaccent('℗'); -- sound recording copyright

SELECT unaccent('unaccent', 'foobar');
SELECT unaccent('unaccent', 'ёлка');
SELECT unaccent('unaccent', 'ЁЖИК');
SELECT unaccent('unaccent', '˃˖˗˜');
SELECT unaccent('unaccent', 'À');
SELECT unaccent('unaccent', '℃℉');
SELECT unaccent('unaccent', '℗');

SELECT ts_lexize('unaccent', 'foobar');
SELECT ts_lexize('unaccent', 'ёлка');
SELECT ts_lexize('unaccent', 'ЁЖИК');
SELECT ts_lexize('unaccent', '˃˖˗˜');
SELECT ts_lexize('unaccent', 'À');
SELECT ts_lexize('unaccent', '℃℉');
SELECT ts_lexize('unaccent', '℗');

-- Controversial case.  Black-Letter Capital H (U+210C) is translated by
-- Latin-ASCII.xml as 'x', but it should be 'H'.
SELECT unaccent('ℌ');
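
-- Illustration only, not part of the upstream test file: a minimal sketch
-- that spells out the code points behind the combining-diacritic test above.
-- It assumes a UTF-8 database (guaranteed by the skip_test guard) and the
-- default standard_conforming_strings = on, so that U&'...' escapes work.
SELECT unaccent(U&'A\0300');  -- 'A' followed by U+0300 COMBINING GRAVE ACCENT
SELECT unaccent(U&'\00C0');   -- precomposed U+00C0 LATIN CAPITAL LETTER A WITH GRAVE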

Fix regression tests of unaccent to work without UTF8 support

The tests of unaccent rely on UTF8 characters, and unlike any other test
suite in the tree (fuzzystrmatch, citext, hstore, etc.), they would fail
if run on a database that does not support UTF8 encoding.

This commit fixes the tests of unaccent so that they are skipped when run
on a database without UTF8 support, using the same method as the other
test suites: \if, getdatabaseencoding() and an alternate output file.

This has been broken for a long time, but nobody has complained about
it, so no backpatch is done.  The failure can be reproduced with
something like REGRESS_OPTS="--no-locale --encoding=sql_ascii".  To
defend against that, this module's Makefile and meson.build enforced a
UTF8 encoding without locales, but that offered no protection against
options given by REGRESS_OPTS.  This switch also makes this regression
test suite more consistent with all the others.

Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZIq1HUnIV2ksW85x@paquier.xyz
Date: 2023-07-04 01:05:00 +02:00

Unaccent dictionary.

Date: 2009-08-18 12:34:39 +02:00

Print the actual DB encoding in the unaccent regression test.

This is to help make it more obvious what the problem is, if the
encoding isn't what the test expects.

Date: 2009-08-18 18:00:50 +02:00

Convert unaccent tests to UTF-8

This makes it easier to add new tests that are specific to Unicode
features.  The files were previously in KOI8-R.

Discussion: https://www.postgresql.org/message-id/8506.1545111362@sss.pgh.pa.us
Date: 2019-01-02 18:36:05 +01:00

Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml

This required an update of the Python script generating the rules,
as its format changed in release 29.  This release also added new
punctuation and symbols, and a new set of rules has been generated
to include them.  The way to find the newest versions of Latin-ASCII
is also documented more clearly.

Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org
Date: 2019-01-10 06:10:21 +01:00

Add combining characters to unaccent.rules.

Strip certain classes of combining characters, so that accents encoded
this way are removed.

Author: Hugh Ranalli
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f%40postgresql.org
Date: 2019-02-01 15:23:01 +01:00

Simplify a bit the special rules generating unaccent.rules

As noted by Thomas Munro, CLDR 36 has added SOUND RECORDING COPYRIGHT
(U+2117), and we use CLDR 41, so this can be removed from the set of
special cases.

The set of regression tests is expanded for degree signs, which are two
of the special cases, and for a fancy case with U+210C in Latin-ASCII.xml
that we discovered when looking into what could be done for Cyrillic
characters (this last part is material for a future patch, not tackled
yet).

While at it, some of the assertions of generate_unaccent_rules.py are
expanded to report the codepoint on which a failure is found, which is
useful for debugging.

Extracted from a larger patch by the same author.

Author: Przemysław Sztoch
Discussion: https://postgr.es/m/8478da0d-3b61-d24f-80b4-ce2f5e971c60@sztoch.pl
Date: 2022-07-05 09:17:51 +02:00