Fix regression tests of unaccent to work without UTF8 support
The tests of unaccent rely on UTF8 characters, and unlike any other test
suite in the tree (fuzzystrmatch, citext, hstore, etc.), they would fail
if run on a database that does not support UTF8 encoding.
This commit fixes the tests of unaccent so that they are skipped when run
on a database without UTF8 support, using the same method as the other
test suites: \if, getdatabaseencoding() and an alternate output file.
This has been broken for a long time, but nobody has complained about
that either, so no backpatch is done. This can be reproduced with
something like REGRESS_OPTS="--no-locale --encoding=sql_ascii", for
instance. To defend against that, this module's Makefile and
meson.build enforced a UTF8 encoding without locales, but it did not
offer protection for options given by REGRESS_OPTS. This switch makes
this regression test suite more consistent with all the others, as
well.
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZIq1HUnIV2ksW85x@paquier.xyz
2023-07-04 01:05:00 +02:00
/*
 * This test must be run in a database with UTF-8 encoding,
 * because other encodings don't support all the characters used.
 */
SELECT getdatabaseencoding() <> 'UTF8'
       AS skip_test \gset
\if :skip_test
\quit
\endif
2011-02-14 02:06:41 +01:00
CREATE EXTENSION unaccent;
2019-01-02 18:36:05 +01:00
SET client_encoding TO 'UTF8';
2009-08-18 12:34:39 +02:00
SELECT unaccent('foobar');
unaccent
----------
foobar
(1 row)

2019-01-02 18:36:05 +01:00
SELECT unaccent('ёлка');
2009-08-18 12:34:39 +02:00
unaccent
----------
2019-01-02 18:36:05 +01:00
елка
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-02 18:36:05 +01:00
SELECT unaccent('ЁЖИК');
2009-08-18 12:34:39 +02:00
unaccent
----------
2019-01-02 18:36:05 +01:00
ЕЖИК
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-10 06:10:21 +01:00
SELECT unaccent('˃˖˗˜');
unaccent
----------
>+-~
(1 row)

2019-02-01 15:23:01 +01:00
SELECT unaccent('À'); -- Remove combining diacritical 0x0300
unaccent
----------
A
(1 row)

2022-07-05 09:17:51 +02:00
SELECT unaccent('℃℉'); -- degree signs
unaccent
----------
°C°F
(1 row)

SELECT unaccent('℗'); -- sound recording copyright
unaccent
----------
(P)
(1 row)

unaccent: Add support for quoted translated characters
As reported in bug #18057, the extension unaccent strips from its rule
file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols. The cause is that
all whitespace before and after the origin and target characters is
discarded (this limitation is documented).
This commit makes it possible to use quotes around target characters,
so that whitespace can be considered part of the target characters. Some
target characters contain a double quote; these require an extra double
quote.
The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to generate unaccent.rules and a
couple of tests are added for numeric symbols. While working on this
patch, I have implemented a fake rule file to test the parsing logic
implemented, which is not included here as it would just consume extra
cycles in the tests, and it requires the manipulation of an installation
tree to be able to work correctly.
As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately. The idea to use double quotes as escaped
characters comes from Tom Lane.
Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org
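The quoting rule described in the commit message above can be illustrated with a small sketch. This is not the extension's actual C parser: the function name parse_rule_line and the use of whitespace as the field separator are assumptions made for illustration only. It shows how a quoted target keeps its leading whitespace and how a doubled double quote stands for one literal double quote, which matches the test results for '1½' and '〝' in this file.

```python
from typing import Optional, Tuple


def parse_rule_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one rule line into (source, target).

    Hypothetical helper, not part of PostgreSQL. Returns None for an
    empty line, and an empty target when only a source is given.
    """
    text = line.strip()
    if not text:
        return None
    parts = text.split(None, 1)  # source, then the rest of the line
    source = parts[0]
    if len(parts) == 1:
        return source, ""  # a lone character is simply removed
    rest = parts[1].strip()
    if rest.startswith('"'):
        # Quoted target: must be closed, inner quotes must be doubled.
        if len(rest) < 2 or not rest.endswith('"'):
            raise ValueError("unterminated quoted target: %r" % line)
        inner = rest[1:-1]
        if '"' in inner.replace('""', ''):
            raise ValueError("unescaped double quote in target: %r" % line)
        return source, inner.replace('""', '"')
    # Unquoted target: taken as-is (this sketch does not validate it).
    return source, rest


if __name__ == "__main__":
    # Assumed rule lines, modelled on the expected test output.
    for rule in ['\u00bd\t" 1/2"', '\u301d\t""""', 'x\ty']:
        print(repr(parse_rule_line(rule)))
```

With these assumed rule lines, the vulgar fraction maps to a target that begins with a space, which is why unaccent('1½') yields '1 1/2' rather than '11/2'.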
2023-09-20 05:29:36 +02:00
SELECT unaccent('1½'); -- math expression with whitespace
unaccent
----------
1 1/2
(1 row)

SELECT unaccent('〝'); -- quote
unaccent
----------
"
(1 row)

2009-08-18 12:34:39 +02:00
SELECT unaccent('unaccent', 'foobar');
unaccent
----------
foobar
(1 row)

2019-01-02 18:36:05 +01:00
SELECT unaccent('unaccent', 'ёлка');
2009-08-18 12:34:39 +02:00
unaccent
----------
2019-01-02 18:36:05 +01:00
елка
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-02 18:36:05 +01:00
SELECT unaccent('unaccent', 'ЁЖИК');
2009-08-18 12:34:39 +02:00
unaccent
----------
2019-01-02 18:36:05 +01:00
ЕЖИК
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-10 06:10:21 +01:00
SELECT unaccent('unaccent', '˃˖˗˜');
unaccent
----------
>+-~
(1 row)

2019-02-01 15:23:01 +01:00
SELECT unaccent('unaccent', 'À');
unaccent
----------
A
(1 row)

2022-07-05 09:17:51 +02:00
SELECT unaccent('unaccent', '℃℉');
unaccent
----------
°C°F
(1 row)

SELECT unaccent('unaccent', '℗');
unaccent
----------
(P)
(1 row)

2023-09-20 05:29:36 +02:00
SELECT unaccent('unaccent', '1½');
unaccent
----------
1 1/2
(1 row)

SELECT unaccent('unaccent', '〝');
unaccent
----------
"
(1 row)

2009-08-18 12:34:39 +02:00
SELECT ts_lexize('unaccent', 'foobar');
ts_lexize
-----------

(1 row)

2019-01-02 18:36:05 +01:00
SELECT ts_lexize('unaccent', 'ёлка');
2009-08-18 12:34:39 +02:00
ts_lexize
-----------
2019-01-02 18:36:05 +01:00
{елка}
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-02 18:36:05 +01:00
SELECT ts_lexize('unaccent', 'ЁЖИК');
2009-08-18 12:34:39 +02:00
ts_lexize
-----------
2019-01-02 18:36:05 +01:00
{ЕЖИК}
2009-08-18 12:34:39 +02:00
(1 row)

2019-01-10 06:10:21 +01:00
SELECT ts_lexize('unaccent', '˃˖˗˜');
ts_lexize
-----------
{>+-~}
(1 row)

2019-02-01 15:23:01 +01:00
SELECT ts_lexize('unaccent', 'À');
ts_lexize
-----------
{A}
(1 row)

2022-07-05 09:17:51 +02:00
SELECT ts_lexize('unaccent', '℃℉');
ts_lexize
-----------
{°C°F}
(1 row)

SELECT ts_lexize('unaccent', '℗');
ts_lexize
-----------
{(P)}
(1 row)

2023-09-20 05:29:36 +02:00
SELECT ts_lexize('unaccent', '1½');
ts_lexize
-----------
{"1 1/2"}
(1 row)

SELECT ts_lexize('unaccent', '〝');
ts_lexize
-----------
{"\""}
(1 row)

2022-07-05 09:17:51 +02:00
-- Controversial case. Black-Letter Capital H (U+210C) is translated by
-- Latin-ASCII.xml as 'x', but it should be 'H'.
SELECT unaccent('ℌ');
unaccent
----------
x
(1 row)