Commit Graph

11 Commits

Author SHA1 Message Date
Michael Paquier 59f47fb98d unaccent: Add support for quoted translated characters
As reported in bug #18057, the extension unaccent removes in its rule
file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols.  This is caused by
the fact that all whitespaces before and after the origin and target
characters are all discarded (this limitation is documented).

This commit makes possible the use of quotes around target characters,
so as whitespaces can be considered part of target characters.  Some
target characters use a double quote, these require an extra double
quote.

The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to generate unaccent.rules and a
couple of tests are added for numeric symbols.  While working on this
patch, I have implemented a fake rule file to test the parsing logic
implemented, which is not included here as it would just consume extra
cycles in the tests, and it requires the manipulation of an installation
tree to be able to work correctly.

As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately.  The idea to use double quotes as escaped
characters comes from Tom Lane.

Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org
2023-09-20 12:29:36 +09:00
Michael Paquier 44e73a498c Fix regression tests of unaccent to work without UTF8 support
The tests of unaccent rely on UTF8 characters, and unlike any other test
suite in the tree (fuzzystrmatch, citext, hstore, etc.), they would fail
if run on a database that does not support UTF8 encoding.

This commit fixes the tests of unaccent so as these are skipped when run
on a database without UTF8 support, using the same method as the other
test suits based on \if, getdatabaseencoding() and an alternate output
file.

This has been broken for a long time, but nobody has complained about
that either, so no backpatch is done.  This can be reproduced with
something like REGRESS_OPTS="--no-locale --encoding=sql_ascii", for
instance.  To defend against that, this module's Makefile and
meson.build enforced a UTF8 encoding without locales, but it did not
offer protection for options given by REGRESS_OPTS.  This switch makes
this regression test suite more consistent with all the others, as
well.

Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZIq1HUnIV2ksW85x@paquier.xyz
2023-07-04 08:05:00 +09:00
Jeff Davis f413941f41 Fix t_isspace(), etc., when datlocprovider=i and datctype=C.
Check whether the datctype is C to determine whether t_isspace() and
related functions use isspace() or iswspace().

Previously, t_isspace() checked whether the database default collation
was C; which is incorrect when the default collation uses the ICU
provider.

Discussion: https://postgr.es/m/79e4354d9eccfdb00483146a6b9f6295202e7890.camel@j-davis.com
Reviewed-by: Peter Eisentraut
Backpatch-through: 15
2023-03-17 12:08:46 -07:00
Jeff Davis 27b62377b4 Use ICU by default at initdb time.
If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
Reviewed-by: Peter Eisentraut
2023-03-09 10:52:41 -08:00
Michael Paquier e3dd7c06e6 Simplify a bit the special rules generating unaccent.rules
As noted by Thomas Munro, CLDR 36 has added SOUND RECORDING COPYRIGHT
(U+2117), and we use CLDR 41, so this can be removed from the set of
special cases.

The set of regression tests is expanded for degree signs, which are two
of the special cases, and a fancy case with U+210C in Latin-ASCII.xml
that we have discovered about when diving into what could be done for
Cyrillic characters (this last part is material for a future patch, not
tackled yet).

While on it, some of the assertions of generate_unaccent_rules.py are
expanded to report the codepoint on which a failure is found, something
useful for debugging.

Extracted from a larger patch by the same author.

Author: Przemysław Sztoch
Discussion: https://postgr.es/m/8478da0d-3b61-d24f-80b4-ce2f5e971c60@sztoch.pl
2022-07-05 16:17:51 +09:00
Thomas Munro 456e3718e7 Add combining characters to unaccent.rules.
Strip certain classes of combining characters, so that accents encoded
this way are removed.

Author: Hugh Ranalli
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f%40postgresql.org
2019-02-01 15:23:01 +01:00
Michael Paquier e1c1d5444e Update unaccent rules with release 34 of CLDR for Latin-ASCII.xml
This has required an update of the python script generating the rules,
as its format has changed in release 29.  This release has also added
new punctuation and symbols, and a new set of rules has been generated
to include them.  The way to find newest versions of Latin-ASCII gets
also more clearly documented.

Author: Hugh Ranalli, Michael Paquier
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org
2019-01-10 14:10:21 +09:00
Peter Eisentraut b6f3649bba Convert unaccent tests to UTF-8
This makes it easier to add new tests that are specific to Unicode
features.  The files were previously in KOI8-R.

Discussion: https://www.postgresql.org/message-id/8506.1545111362@sss.pgh.pa.us
2019-01-02 18:36:05 +01:00
Tom Lane 629b3af27d Convert contrib modules to use the extension facility.
This isn't fully tested as yet, in particular I'm not sure that the
"foo--unpackaged--1.0.sql" scripts are OK.  But it's time to get some
buildfarm cycles on it.

sepgsql is not converted to an extension, mainly because it seems to
require a very nonstandard installation process.

Dimitri Fontaine and Tom Lane
2011-02-13 22:54:49 -05:00
Tom Lane 4b98b613f6 Print the actual DB encoding in the unaccent regression test.
This is to help make it more obvious what the problem is, if the
encoding isn't what the test expects.
2009-08-18 16:00:50 +00:00
Teodor Sigaev 92e05bc6a5 Unaccent dictionary. 2009-08-18 10:34:39 +00:00