postgresql/src/test/regress/expected/unicode.out

SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif
SELECT U&'\0061\0308bc' <> U&'\00E4bc' COLLATE "C" AS sanity_check;
 sanity_check 
--------------
 t
(1 row)

SELECT unicode_version() IS NOT NULL;
 ?column? 
----------
 t
(1 row)

SELECT unicode_assigned(U&'abc');
 unicode_assigned 
------------------
 t
(1 row)

SELECT unicode_assigned(U&'abc\+10FFFF');
 unicode_assigned 
------------------
 f
(1 row)

SELECT normalize('');
 normalize 
-----------
 
(1 row)

SELECT normalize(U&'\0061\0308\24D1c') = U&'\00E4\24D1c' COLLATE "C" AS test_default;
 test_default 
--------------
 t
(1 row)

SELECT normalize(U&'\0061\0308\24D1c', NFC) = U&'\00E4\24D1c' COLLATE "C" AS test_nfc;
 test_nfc 
----------
 t
(1 row)

SELECT normalize(U&'\00E4bc', NFC) = U&'\00E4bc' COLLATE "C" AS test_nfc_idem;
 test_nfc_idem 
---------------
 t
(1 row)

SELECT normalize(U&'\00E4\24D1c', NFD) = U&'\0061\0308\24D1c' COLLATE "C" AS test_nfd;
 test_nfd 
----------
 t
(1 row)

SELECT normalize(U&'\0061\0308\24D1c', NFKC) = U&'\00E4bc' COLLATE "C" AS test_nfkc;
 test_nfkc 
-----------
 t
(1 row)

SELECT normalize(U&'\00E4\24D1c', NFKD) = U&'\0061\0308bc' COLLATE "C" AS test_nfkd;
 test_nfkd 
-----------
 t
(1 row)

SELECT "normalize"('abc', 'def');  -- run-time error
ERROR:  invalid normalization form: def
SELECT U&'\00E4\24D1c' IS NORMALIZED AS test_default;
 test_default 
--------------
 t
(1 row)

SELECT U&'\00E4\24D1c' IS NFC NORMALIZED AS test_nfc;
 test_nfc 
----------
 t
(1 row)

SELECT num, val,
    val IS NFC NORMALIZED AS NFC,
    val IS NFD NORMALIZED AS NFD,
    val IS NFKC NORMALIZED AS NFKC,
    val IS NFKD NORMALIZED AS NFKD
FROM
  (VALUES (1, U&'\00E4bc'),
          (2, U&'\0061\0308bc'),
          (3, U&'\00E4\24D1c'),
          (4, U&'\0061\0308\24D1c'),
          (5, '')) vals (num, val)
ORDER BY num;
 num | val | nfc | nfd | nfkc | nfkd 
-----+-----+-----+-----+------+------
   1 | äbc | t   | f   | t    | f
   2 | äbc | f   | t   | f    | t
   3 | äⓑc | t   | f   | f    | f
   4 | äⓑc | f   | t   | f    | f
   5 |     | t   | t   | t    | t
(5 rows)

SELECT is_normalized('abc', 'def');  -- run-time error
ERROR:  invalid normalization form: def
Add SQL functions for Unicode normalization

This adds SQL expressions NORMALIZE() and IS NORMALIZED to convert and
check Unicode normal forms, per SQL standard.

To support fast IS NORMALIZED tests, we pull in a new data file
DerivedNormalizationProps.txt from Unicode and build a lookup table
from that, using techniques similar to ones already used for other
Unicode data.  make update-unicode will keep it up to date.  We only
build and use these tables for the NFC and NFKC forms, because they
are too big for NFD and NFKD and the improvement is not significant
enough there.

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com

2020-03-26 08:14:00 +01:00
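
The expected results of the IS ... NORMALIZED query in the listing can be cross-checked outside PostgreSQL. The sketch below is not part of the commit: it assumes Python 3.8 or later and its standard `unicodedata` module, whose bundled Unicode data may be a different version from the one PostgreSQL was built with. The helper names `normalized_flags` and `as_row` are hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# The five test values from the regression query, keyed by their "num" column.
VALUES = {
    1: "\u00e4bc",          # precomposed a-umlaut
    2: "a\u0308bc",         # a + combining diaeresis
    3: "\u00e4\u24d1c",     # precomposed a-umlaut + circled small b
    4: "a\u0308\u24d1c",    # decomposed a-umlaut + circled small b
    5: "",                  # empty string
}


def normalized_flags(value):
    """Return, for each normal form, whether value is already in that form."""
    return {form: unicodedata.is_normalized(form, value) for form in FORMS}


def as_row(num):
    """Render the flags for one test value the way psql prints booleans."""
    flags = normalized_flags(VALUES[num])
    return " | ".join("t" if flags[form] else "f" for form in FORMS)


if __name__ == "__main__":
    for num in sorted(VALUES):
        print(num, "|", as_row(num))
```

Each printed row should agree with the nfc, nfd, nfkc and nfkd columns of the expected output for the same `num`.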

Additional unicode primitive functions.

Introduce unicode_version(), icu_unicode_version(), and
unicode_assigned().

The latter requires introducing a new lookup table for the Unicode
General Category, which is generated along with the other Unicode
lookup tables.

Discussion: https://postgr.es/m/CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
Reviewed-by: Peter Eisentraut

2023-11-02 06:47:06 +01:00
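
A rough analogue of the two unicode_assigned() tests can be written against the General Category, which is the property this commit adds a lookup table for. The sketch below is an illustration in Python, not PostgreSQL's implementation: it assumes that "assigned" means every code point has a General Category other than Cn, and the function name `unicode_assigned_sketch` is hypothetical.

```python
import unicodedata


def unicode_assigned_sketch(text):
    """True if no character in text has General Category Cn (unassigned)."""
    return all(unicodedata.category(ch) != "Cn" for ch in text)


if __name__ == "__main__":
    # Counterpart of "SELECT unicode_version() IS NOT NULL".
    print(unicodedata.unidata_version is not None)
    # Counterparts of the two unicode_assigned() tests.
    print(unicode_assigned_sketch("abc"))
    print(unicode_assigned_sketch("abc\U0010FFFF"))
```

U+10FFFF is a noncharacter, so its category is Cn and the second call reports the string as not fully assigned, matching the `f` in the expected output.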

Fix buffer overrun in unicode string normalization with empty input

PostgreSQL 13 and newer versions are directly impacted by that through
the SQL function normalize(), which would cause a call of this function
to write one byte past its allocation if using in input an empty
string after recomposing the string with NFC and NFKC.  Older versions
(v10~v12) are not directly affected by this problem as the only code
path using normalization is SASLprep in SCRAM authentication that
forbids the case of an empty string, but let's make the code more
robust anyway there so as any out-of-core callers of this function are
covered.

The solution chosen to fix this issue is simple, with the addition of a
fast-exit path if the decomposed string is found as empty.  This would
only happen for an empty string as at its lowest level a codepoint
would be decomposed as itself if it has no entry in the decomposition
table or if it has a decomposition size of 0.

Some tests are added to cover this issue in v13~.  Note that an empty
string has always been considered as normalized (grammar "IS NF[K]{C,D}
NORMALIZED", through the SQL function is_normalized()) for all the
operations allowed (NFC, NFD, NFKC and NFKD) since this feature has
been introduced as of 2991ac5.  This behavior is unchanged but some
tests are added in v13~ to check after that.

I have also checked "make normalization-check" in src/common/unicode/,
while on it (works in 13~, and breaks in older stable branches
independently of this commit).

The release notes should just mention this commit for v13~.

Reported-by: Matthijs van der Vleuten
Discussion: https://postgr.es/m/17277-0c527a373794e802@postgresql.org
Backpatch-through: 10

2021-11-11 07:00:59 +01:00
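
The control flow described in this commit (decompose first, leave early if the decomposed string is empty, recompose only for NFC and NFKC) can be sketched as follows. This is an illustration only, not the C code in PostgreSQL: Python cannot overrun a buffer, both steps are delegated to the standard `unicodedata` module, and `normalize_sketch` and `_DECOMPOSED_FORM` are hypothetical names.

```python
import unicodedata

# Map each target form to the decomposition it starts from.
_DECOMPOSED_FORM = {"NFC": "NFD", "NFD": "NFD", "NFKC": "NFKD", "NFKD": "NFKD"}


def normalize_sketch(text, form="NFC"):
    """Decompose, then recompose only if the requested form is composed."""
    if form not in _DECOMPOSED_FORM:
        raise ValueError(f"invalid normalization form: {form}")
    decomposed = unicodedata.normalize(_DECOMPOSED_FORM[form], text)
    # Fast exit: an empty decomposition leaves nothing to recompose.
    if not decomposed:
        return ""
    if form in ("NFD", "NFKD"):
        return decomposed
    # Recomposition step, delegated to the standard library here.
    return unicodedata.normalize(form, decomposed)


if __name__ == "__main__":
    print(repr(normalize_sketch("")))
    print(normalize_sketch("a\u0308\u24d1c", "NFKC") == "\u00e4bc")
```

With the guard in place, an empty input returns before the recomposition step for every form, which is the case the added `SELECT normalize('')` test exercises.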